Data Inconsistency in Datalake: Glue Job Bookmark Issue

Question

Hello, I am relatively new to Glue and encountering some challenges with Glue ETL.

Our setup involves a datalake that retrieves data from a backend database as its source. This datalake is subsequently queried with Athena and partitioned for a portal.

However, we are experiencing inconsistency in the data fetched from the source due to the Glue job bookmark feature. This feature overlooks any new input in a field after the completion of the ETL job run. For instance, if a user checks in at 10 pm and checks out at 1 am, and the job runs at midnight, it records the check-in time. Upon the subsequent job run, it disregards the checkout time from the source database because the check-in ID (used as the bookmark field) has been bookmarked by Glue.

This inconsistency is causing significant issues in the data presented on the portal, particularly with the absence of checkout time records in the table.

I would greatly appreciate a more accommodating solution to address this issue, as it has significantly impacted a major business operation.

Thank you for your attention to this matter.

Answer

I'm guessing you are reading the source over JDBC, since you mention a bookmark field; JDBC bookmarks work very differently from S3 bookmarks.  
By design, Glue relies on the bookmark field you specify and only picks up rows whose value in that field has advanced past the last bookmarked value.  
This means that if you change a column in a row that has already been ingested, you also have to update the bookmark field (for instance, with an update trigger that refreshes a timestamp column) so that the row gets ingested again. You then need to deal with the resulting duplicates/updates at the destination.
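
To make the difference concrete, here is a minimal plain-Python sketch of the idea. It is not Glue code: `incremental_extract`, `upsert`, `run_scenario`, the column names (`checkin_id`, `updated_at`) and the dates are all hypothetical, and the bookmark is modelled simply as "the highest value seen so far". It only illustrates why a bookmark on the check-in ID misses the later checkout, while a bookmark on a timestamp that the update refreshes picks it up.

```python
from datetime import datetime


def incremental_extract(rows, bookmark_key, last_bookmark):
    """Return rows whose bookmark_key is beyond last_bookmark, plus the new bookmark."""
    new_rows = [
        r for r in rows
        if last_bookmark is None or r[bookmark_key] > last_bookmark
    ]
    new_bookmark = max((r[bookmark_key] for r in new_rows), default=last_bookmark)
    return new_rows, new_bookmark


def upsert(destination, rows, key="checkin_id", version="updated_at"):
    """Keep one row per key at the destination, preferring the newest version."""
    for r in rows:
        current = destination.get(r[key])
        if current is None or r[version] >= current[version]:
            destination[r[key]] = dict(r)  # copy so later source edits don't leak in
    return destination


def run_scenario(bookmark_key):
    """Simulate two job runs with a checkout happening between them."""
    # Hypothetical source row: user checked in at 10 pm, no checkout yet.
    source = [{
        "checkin_id": 1,
        "checkin": "22:00",
        "checkout": None,
        "updated_at": datetime(2024, 1, 1, 22, 0),
    }]
    destination, bookmark = {}, None

    # Run 1 (midnight): the check-in row is ingested and bookmarked.
    rows, bookmark = incremental_extract(source, bookmark_key, bookmark)
    upsert(destination, rows)

    # 1 am: the user checks out; an update trigger refreshes updated_at.
    source[0]["checkout"] = "01:00"
    source[0]["updated_at"] = datetime(2024, 1, 2, 1, 0)

    # Run 2: only rows beyond the bookmark are read again.
    rows, bookmark = incremental_extract(source, bookmark_key, bookmark)
    upsert(destination, rows)
    return destination


by_id = run_scenario("checkin_id")         # bookmark on the check-in ID
by_timestamp = run_scenario("updated_at")  # bookmark on the refreshed timestamp
print(by_id[1]["checkout"], by_timestamp[1]["checkout"])
```

In an actual Glue job the bookmark column for a JDBC source is chosen through the `jobBookmarkKeys` connection option, so the equivalent change would be pointing it at the timestamp column instead of the ID. The upsert step stands in for whatever deduplication you do at the destination.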
