Glue Best Practices with CDC Data into S3


In my case, data comes from a relational database (which stores data directly from the product application) and is delivered as files into S3 by AWS DMS with CDC logs.

Usually, I load data into the analytics database with a simple truncate-and-insert model, with the ETL process handled by AWS Glue scripts. But I think this approach (reading all the data and then reinserting it) costs more than it needs to.

So I'm trying to understand whether I can use job bookmarks to retrieve only the new files and then perform an upsert into the analytics database (it has to be an upsert, since it's not an append-only log table).

My idea was to use a Data Catalog table to represent the analytics table and bring in the new data with job bookmarks, so I can join them in an upsert-style transformation and then truncate and insert into the final table (reducing the time spent reading files in the S3 folder).

It's important to note that I don't need near real-time data, so my main goal is to reduce DPU hours as much as possible.

Is this a good idea? Are there other options for this case that would reduce DPU hours in my Glue environment?

1 Answer

Accepted Answer

You don't need a catalog table to use bookmarks.
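Job bookmarks work against an S3 path read with from_options, as long as you pass a transformation_ctx and call job.commit(). A minimal sketch (bucket, prefix, and format are placeholders for your DMS output):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # bookmark state is tracked per job

# Read only the files that previous runs have not processed yet.
# The transformation_ctx is the key the bookmark state is stored under.
new_files = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-dms-bucket/cdc/orders/"]},
    format="parquet",
    transformation_ctx="orders_cdc_source",
)

# ... upsert / merge logic goes here ...

job.commit()  # advances the bookmark so these files are skipped on the next run
```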
If you intend to make updates, you'll probably need a data lake framework such as Iceberg or Hudi, which support merge operations; otherwise you would have to regenerate the whole table and impact users. Performance then depends on how you partition/index the destination table and how many files are affected by the updates; the less frequently you apply updates, the more efficient each run will be.
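If you pick Iceberg, the upsert itself can be a Spark SQL MERGE run from the same job. A rough sketch, continuing from the code above and assuming an existing Iceberg table glue_catalog.analytics.orders keyed on order_id, a job configured for Iceberg (e.g. --datalake-formats iceberg plus the Spark catalog settings), and DMS output that carries the usual op column (I/U/D); all names here are placeholders:

```python
from pyspark.sql import functions as F

changes = new_files.toDF()

# Split the CDC batch: rows flagged "D" are deletes, everything else is an upsert.
# Dropping the op column afterwards lets UPDATE SET * / INSERT * match the target schema.
deletes = changes.filter(F.col("op") == "D").drop("op")
upserts = changes.filter(F.col("op") != "D").drop("op")

deletes.createOrReplaceTempView("orders_deletes")
upserts.createOrReplaceTempView("orders_upserts")

spark = glue_context.spark_session

# Remove deleted keys from the target table.
spark.sql("""
    MERGE INTO glue_catalog.analytics.orders AS t
    USING orders_deletes AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN DELETE
""")

# Upsert the remaining changes.
spark.sql("""
    MERGE INTO glue_catalog.analytics.orders AS t
    USING orders_upserts AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```

Note that if a single bookmarked batch can contain several changes for the same key, you should first deduplicate and keep only the latest row per key, otherwise MERGE will fail on multiple source matches.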

AWS
EXPERT
answered 6 months ago
