Glue Best Practices with CDC Data into S3

0

In my study case, I have data coming from a relational database (which stores data directly from the product application) and it sends files into S3 by using AWS DMS with CDC logs;

Usually, I manage data with the simple truncate and insert model into the analytics database, having the ETL proccess made by AWS Glue Scripts. But I think that this process (consuming all the data and then inserting it) can cost me more than I need for this job to be solved;

Since then, I´m trying to understand if there is an option which I can use jobs booksmark to retrieve just the new files and then realize an upsert into the analytics databases! (it is important to be an upsert since it´s not a log table);

I thought it would be good to use a data catalog table to represent the analytics table and bring the new data with jobs booksmark, so I can join them in a upsert-style transformation and then truncate and insert into the final table (reducing time to read the files in the S3 Folder);

It´s important to notice that I don´t need near real-time data in my company, so my biggest deal is to reduce DPU hours as much as I can!

Is this a good idea? Is there any other option considering this case to reduce my DPU hours in my glue environment?

1 Antwort
1
Akzeptierte Antwort

You don't need a catalog table to use bookmarks.
If you intend to make updates, you'll probably need a DLF like Iceberg or Hudi, which allow merge operations. Otherwise you would need to regenerate the whole table and impact users. Then to improve performance it depends how you partition/index the destination table and how many files are affected by the updates, also the less frequent updates the more efficient it will be.

profile pictureAWS
EXPERTE
beantwortet vor 6 Monaten

Du bist nicht angemeldet. Anmelden um eine Antwort zu veröffentlichen.

Eine gute Antwort beantwortet die Frage klar, gibt konstruktives Feedback und fördert die berufliche Weiterentwicklung des Fragenstellers.

Richtlinien für die Beantwortung von Fragen