Glue Best Practices with CDC Data into S3


In my case, I have data coming from a relational database (which stores data directly from the product application), and AWS DMS sends files into S3 using CDC logs.

Usually I load the analytics database with a simple truncate-and-insert model, with the ETL process handled by AWS Glue scripts. However, I think this process (reading all the data and then inserting it) costs more than it needs to for the job at hand.

I'm trying to understand whether I can use job bookmarks to retrieve only the new files and then perform an upsert into the analytics database (it has to be an upsert, since it's not a log table).
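To illustrate, here is a rough sketch of the bookmark-enabled read I have in mind (the database and table names are placeholders, and the job would have to run with --job-bookmark-option job-bookmark-enable):

```python
# Rough sketch: Glue job that reads only unprocessed S3 files via job bookmarks.
# "cdc_db" / "cdc_table" are placeholder Data Catalog names for the DMS output.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx is what the bookmark uses to track which files were already processed
new_cdc = glue_context.create_dynamic_frame.from_catalog(
    database="cdc_db",
    table_name="cdc_table",
    transformation_ctx="new_cdc",
)

# ... upsert logic here ...

job.commit()  # advances the bookmark so the next run skips the files read above
```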

I thought it would be good to use a Data Catalog table to represent the analytics table and bring in the new data with job bookmarks, so I can join them in an upsert-style transformation and then truncate and insert into the final table (reducing the time spent reading files in the S3 folder).
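The upsert-style step I have in mind would be something like the sketch below (the key column id, the updated_at timestamp, and the handling of the DMS Op flag are assumptions about my schema; new_cdc is the bookmark read from the sketch above):

```python
# Rough sketch of the upsert-style transformation: union the new CDC rows with the
# current analytics table and keep only the latest row per key, dropping deletes.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

current_df = glue_context.create_dynamic_frame.from_catalog(
    database="analytics_db", table_name="analytics_table"
).toDF()
new_df = new_cdc.toDF()

# Align schemas; the current table has no DMS "Op" column, so it is filled with null
# (allowMissingColumns needs Spark 3.1+, i.e. Glue 3.0 or later)
merged = current_df.unionByName(new_df, allowMissingColumns=True)

# Keep the most recent version of each key and drop rows whose last CDC op was a delete
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
latest = (
    merged.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)
    .filter((F.col("Op").isNull()) | (F.col("Op") != "D"))
    .drop("rn", "Op")
)
# "latest" would then replace the contents of the final analytics table.
```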

It's important to note that I don't need near real-time data, so my main goal is to reduce DPU hours as much as I can.

Is this a good idea? Is there any other option for this case that would reduce DPU hours in my Glue environment?

1 Answer
Accepted Answer

You don't need a catalog table to use bookmarks.
If you intend to make updates, you'll probably need a data lake format (DLF) such as Apache Iceberg or Apache Hudi, which support merge operations; otherwise you would have to regenerate the whole table, which impacts your users. Beyond that, performance depends on how you partition/index the destination table and how many files are affected by each update; the less frequently you run the updates, the more efficient the job will be.
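As a rough illustration (not a definitive implementation), a merge against an Iceberg table from a Glue Spark job could look like the following; the table name, key column, and the cdc_updates view are placeholders, and the job needs the usual Iceberg configuration (for example --datalake-formats iceberg and a Spark catalog named glue_catalog):

```python
# Illustrative only: merge new/changed rows into an Iceberg table with Spark SQL.
# cdc_latest is a hypothetical DataFrame prepared earlier in the job (newest CDC row
# per key, with the same columns as the target; DMS delete records handled separately).
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

cdc_latest.createOrReplaceTempView("cdc_updates")

spark.sql("""
    MERGE INTO glue_catalog.analytics_db.analytics_table AS t
    USING cdc_updates AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

The merge rewrites only the data files that contain matching keys rather than the whole table, which is why the partitioning of the destination table and how often you run the updates drive the cost.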

AWS EXPERT · answered 5 months ago
