How to rewind a job bookmark programmatically

1

I am using AWS Glue to read files from S3 and migrate them to a database. The same script runs for 30-40 tables; the S3 path and table name change dynamically, driven by a CSV file I pass in, so that I don't have to create that many jobs. Each data source being read has its own dedicated transformation_ctx property. When the job runs again, it picks up where each table was last read.

The problem arises when the load for any one table fails. The files for that table have already been read but the write did not happen, so on the next run I would lose the data that was read but not written for that table. These are the options I have come up with:

1. Make the entire job fail if any table load fails.
2. Send an email notification for the failed table (so that I can troubleshoot), rewind the bookmark for that table, and continue processing the remaining tables.

I have not been able to implement the second option, which I prefer because I don't want to stop the other tables from being written. I would like to rewind the bookmark in code, or reprocess the files, for the failed table only, not for all tables.

Is there any other way to achieve this?

asked 3 years ago, 2.4K views
2 Answers
0

I had a similar requirement. Below is how I managed to make it work:

You will have to change your code slightly and use only the AWS CLI to reset the bookmarks. When you call job.init(), instead of passing the default JOB_NAME argument, pass a unique name for each table you are processing, for example whatever you set as the transformation_ctx, since that is already unique. Glue will then create a separate bookmark for each table under the name you set.
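A minimal sketch of that per-table loop, in Python. It assumes that Glue accepts repeated job.init() / job.commit() calls within a single run and that bookmark state is persisted only on commit; the approach above relies on both, and neither is verified here. The names bookmark_name, process_tables, load_table and notify are hypothetical; in a real script, job would be an awsglue.job.Job instance.

```python
def bookmark_name(job_name, table_name):
    """Build a hypothetical per-table bookmark name, e.g. 'myjob_orders'."""
    return f"{job_name}_{table_name}"


def process_tables(job, args, tables, load_table, notify):
    """Load each table under its own bookmark name; return the failed tables.

    job        -- object with init(name, args) and commit(); in Glue this
                  would be awsglue.job.Job(glue_context)
    args       -- resolved job arguments, must contain "JOB_NAME"
    load_table -- hypothetical callable that reads and writes one table
    notify     -- hypothetical callable (table, exception) for alerting
    """
    failed = []
    for table in tables:
        # Initialise the job under a table-specific name instead of JOB_NAME.
        job.init(bookmark_name(args["JOB_NAME"], table), args)
        try:
            load_table(table)
        except Exception as exc:
            # Skip commit so this table's bookmark is not advanced
            # (assumption: state is persisted only on commit).
            failed.append(table)
            notify(table, exc)
            continue
        job.commit()
    return failed
```

The returned list of failed tables could feed the email notification from option 2, while the remaining tables are still written and committed.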

As you might guess, resetting the bookmark in the Glue console will not work, because the console assumes the bookmark's job name is the original job name shown there. To reset the bookmark to a particular job run, use the CLI and set the --job-name parameter to the name you passed in the script.

Example: aws glue reset-job-bookmark --job-name <job_name_in_script> --run-id jr_xxxxxxxxxxx
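The same reset can probably be done from code rather than the CLI. This sketch wraps the boto3 Glue client's reset_job_bookmark call; the helper name reset_table_bookmark is hypothetical. Like the CLI example, it assumes Glue will accept the per-table name set in the script even though it is not the job's real name.

```python
def reset_table_bookmark(glue_client, bookmark_job_name, run_id=None):
    """Reset the bookmark stored under the name passed to job.init().

    glue_client       -- a boto3 Glue client, e.g. boto3.client("glue")
    bookmark_job_name -- the per-table name used in the script
    run_id            -- optional job run ID to rewind to; omitted when None
    """
    params = {"JobName": bookmark_job_name}
    if run_id is not None:
        params["RunId"] = run_id
    return glue_client.reset_job_bookmark(**params)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   reset_table_bookmark(boto3.client("glue"), "<job_name_in_script>", "jr_xxxxxxxxxxx")
```

Passing the client in as an argument keeps the helper easy to call from the failure handler for a single table and easy to exercise with a stub.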

answered a year ago
-1

You can only rewind a job bookmark to a previous job run: https://docs.aws.amazon.com/cli/latest/reference/glue/reset-job-bookmark.html Since multiple tables are processed in a single job, this would mean reprocessing data for all of the tables, even those where the issue didn't occur. The first option seems better: make the entire job fail if even one table load fails.

AWS
answered 3 years ago
