In my case, data comes from a relational database (which stores data directly from the product application) and is delivered to S3 as files by AWS DMS using CDC logs.
Usually, I manage data with a simple truncate-and-insert model into the analytics database, with the ETL process handled by AWS Glue scripts. However, I suspect this process (reading all the data and then reinserting it) costs more than it needs to.
So I've been trying to figure out whether I can use job bookmarks to retrieve only the new files and then perform an upsert into the analytics database (it's important that this is an upsert, since it's not a log table).
I thought it would be good to use a Data Catalog table to represent the analytics table and bring in the new data with job bookmarks, so I can join them in an upsert-style transformation and then truncate and insert into the final table (reducing the time spent reading files in the S3 folder).
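To make the upsert-style transformation concrete, here is a minimal sketch of the merge logic in plain Python (in a Glue job this would be a PySpark join or window over the bookmarked DynamicFrame and the existing table; the column names `id`, `op`, and `ts` are hypothetical placeholders, with `op` standing in for the DMS-style I/U/D change flag):

```python
def upsert(existing, changes):
    """Apply CDC change rows to the existing snapshot, keyed by 'id'.

    `changes` rows carry a DMS-style 'op' flag: 'I' (insert), 'U' (update),
    'D' (delete). The latest change per key (by 'ts') wins.
    """
    merged = {row["id"]: row for row in existing}

    # Keep only the most recent change per primary key.
    latest = {}
    for row in sorted(changes, key=lambda r: r["ts"]):
        latest[row["id"]] = row

    # Apply deletes, and overwrite/insert everything else.
    for key, row in latest.items():
        if row["op"] == "D":
            merged.pop(key, None)
        else:
            merged[key] = {k: v for k, v in row.items() if k != "op"}

    return list(merged.values())
```

The result is the full upserted snapshot, which could then be written back with the truncate-and-insert step I described, so only the new CDC files are actually read from S3.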
It's important to note that I don't need near real-time data at my company, so my main goal is to reduce DPU hours as much as possible.
Is this a good idea? Are there other options for this case that would reduce DPU hours in my Glue environment?