Best Practices for Migrating Large Flat Files from S3 to RDS MariaDB?

0

We are in the process of importing data. The data will be provided as flat files or csv and stored in an Amazon S3 bucket. Each file is expected to be approximately 2GB in size with around 200k records. Our target database is an RDS MariaDB instance.

Given the size and format of the data, we anticipate potential challenges in the migration process, especially concerning data transformation before importing into MariaDB.

I've considered using AWS services like Lambda, Glue, and DMS. While Lambda has limitations with large files, Glue seems more suitable for ETL processes. DMS, on the other hand, is primarily designed for database-to-database migrations, and I'm unsure about its efficiency and suitability for our flat file to database scenario.

I'd appreciate insights or recommendations on:

  1. The most efficient and scalable AWS service or combination of services for this migration.
  2. Any potential pitfalls or challenges we should be aware of.
  3. Best practices or experiences from those who've undertaken similar migrations.

Thank you in advance for your guidance!

1 Answer
1
Accepted Answer

Generally speaking, I would say that the most flexible and reliable solution, specifically for extract, transform and load operations, is AWS Glue. It is purpose built for this function. That said, it is more work than using DMS, and if DMS can meet your business requirements, I would suggest you give it a try first.

DMS, on the other hand, is primarily designed for database-to-database migrations, and I'm unsure about its efficiency and suitability for our flat file to database scenario.

Respectfully, I don't think this is an accurate characterization of what DMS is designed for. The product teams at AWS built DMS so that it could move data from a source to a target, in one-time fashion, or with ongoing replication. DMS added S3 as a source in March of 2019. This is not some infrequently used scenario for DMS, it is one of its primary use cases, heavily used, and it is a solid choice for this scenario based on what you shared so far. DMS can even do simple transformations in flight, but if you need extensive changes, that goes beyond DMS. I have personally configured DMS to move data from S3 with tables that have hundreds of millions of rows, and it doesn't break a sweat.

Pros and Cons of Using DMS in this Scenario:

Pros

  • Ease of Setup: DMS offers a relatively straightforward setup for such migrations.
  • Transformation Rules: Allows you to apply transformation rules to modify your source data before it reaches the target.

Cons

  • Limited Transformation: DMS might offer limited transformation capabilities compared to a full-fledged ETL tool like AWS Glue.
  • Error Handling: Handling errors might be slightly more complicated in DMS compared to Glue.

I did have a question though... why are you moving the data from S3 at all?

If you need to query the data that is stored in S3, you can do so via Athena with any standard analytics suite, or via Redshift using Redshift Spectrum. There is a very real misconception that a relational database is going to smoke the performance of flattened data in S3 for your BI teams. If you are trying to build a data warehouse, data mart, or simply setup a dashboard (or dashboards) for this data, you may be creating unnecessary work for yourself. More data lakes are built on S3 than with any other technology in the world. Further, no database is going to give you the reliability of S3.

Optimizing Query Performance

If you have the ability to control the movement of data into S3, you can set it up so that the data is partitioned, and stored as a columnar based file format like parquet files. If you can't, you can setup an action so every time a file lands in the S3 bucket, it is converted to parquet, partitioned (e.g., by date), and then stored in a secondary location. Using something like AWS Lake Formation (https://aws.amazon.com/lake-formation/faqs/), will provide you a framework for using S3 and its related services. This can provide a simple to manage solution, with better access control options, and better data governance capabilities baked in.

Again this all comes down to what your use case is, and what technologies you want to use, but hopefully this helps.

profile picture
answered 8 months ago
profile picture
EXPERT
reviewed 8 months ago
  • Thanks for the insightful response! We're migrating an external database to our AWS RDS MariaDB, which serves as our transactional database. A third party will provide data in CSV, which we'll need to transform and import. We're also considering importing the raw data into a staging schema and handling transformations database-side (ELT). This will be a one-time task, with expected downtime on both ends. In this context, any thoughts on using mysqlimport?

  • Hi Ozzy, thanks! This additional detail helps a lot, as a transactional database, RDS MariaDB is an excellent choice. I think the key will be how much transformation is needed. If you need to add columns with timestamps, split columns, skip columns, append or prepend data, or even do some basic work to normalize columns, you can handle this all with DMS transformations in flight as the data moves. If the transformations required are simple joins that can be processed via SQL, a simple CREATE TABLE AS option of course works (https://www.techonthenet.com/mariadb/tables/create_table_as.php) and is not a lot of work. You could start by creating a view, make sure that meets your needs, and then create the table(s) and drop the original migrated table(s). You also have many more options with increasing effort to be able to transform the data database-side. If you need to do really robust transformations (e.g., very complex joins and processing complex data transformations step-by-step), then I think your instinct to use a true ETL solution like Glue, using something like Spark jobs, may be your best option, but likely is the most effort to setup. If it is just a bit more complexity than you want to handle database-side, you might consider the visual solution (Glue Studio) that can help you get started quickly, though you often have to dig into the jobs it creates for you and add nuance to get to the output you want. Best of luck to you!

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions