S3 job bookmarks implementation


Does AWS provide implementation details for the S3 bookmarking logic in Glue?

I have a bucket with tens of thousands of partitions (year, month, day, device_id), and each file inside a partition holds a number of events.

When I run a job, how does the bookmarking logic call into the S3 APIs to determine which files need to be processed? I understand that it uses ListObjects or ListObjectsV2 and checks the last modified time of each file, but my concern is: when there are millions of files, how does Glue optimize this listing behaviour?

I would have thought that perhaps it uses the objectCount or recordCount properties of each partition to check first whether there are new objects to be processed before calling ListObjects, but I just ran some tests and confirmed that this does not occur. I.e., if I upload a file to S3 and re-run the job without running the crawler, it still picks up the new files (which have not yet been seen by the crawler, nor added as aggregate metadata to the partition properties).

1 Answer
Accepted Answer

AWS does not publish implementation details for S3 job bookmarks. That said, the AWS documentation does provide some guidance that is helpful to keep in mind when working with bookmarks.

For Amazon S3 input sources, AWS Glue job bookmarks check the last modified time of the objects to verify which objects need to be reprocessed. If your input source data has been modified since your last job run, the files are reprocessed when you run the job again.
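The documented behaviour amounts to a timestamp filter over the object listing. A minimal sketch of that concept in plain Python (hypothetical helper names; this is an illustration of the documented behaviour, not Glue's actual implementation):

```python
from datetime import datetime, timezone

def files_to_process(listing, bookmark_ts):
    """Conceptual bookmark filter: keep only objects whose LastModified
    is newer than the stored bookmark timestamp.
    (Hypothetical helper; not AWS Glue's actual implementation.)"""
    return [obj["Key"] for obj in listing if obj["LastModified"] > bookmark_ts]

# Example: a simplified listing as ListObjectsV2 might return it
listing = [
    {"Key": "year=2023/month=01/day=01/device_id=a/part-0.json",
     "LastModified": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"Key": "year=2023/month=01/day=02/device_id=a/part-0.json",
     "LastModified": datetime(2023, 1, 2, tzinfo=timezone.utc)},
]
bookmark = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
print(files_to_process(listing, bookmark))  # only the day=02 object
```

This also explains the test result in the question: the filter depends only on each object's modification time from the listing, not on any catalog metadata such as objectCount or recordCount.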

These are some best practices to keep in mind for Bookmarks with S3 as a source:

  • Ensure that job bookmarks are enabled in the job properties and that maximum concurrency is set to 1.
  • The job must call job.init() and job.commit(), with the job name in the arguments (job.init(args['JOB_NAME'], args)).
  • Use glueContext and a transformation_ctx to enable bookmarks, especially on the source; using sparkContext alone will not enable bookmarks.
  • The best-practices documentation suggests reading from the Data Catalog (after running a crawler) rather than using the from_options() method to read S3 files directly.

Use a catalog table with bookmarks for better partition management. Bookmarks work for data sources read either from the Data Catalog or via from_options(). However, it is difficult to add or remove partitions with the from_options() approach. Using a catalog table populated by crawlers provides better automation for tracking newly added partitions and gives you the flexibility to select particular partitions with a pushdown predicate.

  • The best-practices documentation also suggests using the AWS Glue S3 file lister by setting useS3ListImplementation, i.e.
  1. from_catalog(database = "database", table_name = "table", additional_options = {'useS3ListImplementation': True}, transformation_ctx = "datasource0") OR
  2. from_options(connection_type = "s3", connection_options = {"paths": ["s3://input_path"], "useS3ListImplementation": True, "recurse": True}, format = "json")
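Putting the practices above together, a bookmark-enabled job reading via the catalog might look like the sketch below. The database, table, and predicate values are placeholders, and the script only runs inside the AWS Glue job runtime:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)  # required for bookmark state to load

# Read via the Data Catalog; transformation_ctx enables the bookmark,
# and push_down_predicate limits which partitions are listed at all.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="database",    # placeholder
    table_name="table",     # placeholder
    push_down_predicate="year='2023' AND month='01'",  # placeholder
    additional_options={"useS3ListImplementation": True},
    transformation_ctx="datasource0",
)

# ... transforms and sinks go here ...

job.commit()  # persists the updated bookmark state
```

The push_down_predicate is the lever the answer alludes to: it prunes partitions before any S3 listing happens, which is the main way to keep listing cost down on tables with tens of thousands of partitions.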

Use the AWS Glue Amazon S3 file lister for large datasets. A bookmark lists all files under each input partition and then filters them, so if there are too many files under a single partition the driver can run out of memory. The AWS Glue Amazon S3 file lister avoids holding the entire listing in memory at once.
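The OOM risk comes from materializing the whole listing on the driver. Conceptually, the file lister's fix is to stream the paginated listing and filter incrementally rather than building one giant in-memory list. A rough pure-Python analogy (hypothetical names; not Glue internals):

```python
def stream_keys(pages):
    """Yield object records page by page instead of collecting every
    key into one list (analogy for paginated ListObjectsV2 results)."""
    for page in pages:
        for obj in page:
            yield obj

def count_new(pages, bookmark_ts):
    # Filter lazily: peak memory is bounded by one page, not the full listing.
    return sum(1 for obj in stream_keys(pages) if obj["ts"] > bookmark_ts)

pages = [
    [{"key": "a", "ts": 1}, {"key": "b", "ts": 5}],
    [{"key": "c", "ts": 9}],
]
print(count_new(pages, 4))  # → 2 objects modified after the bookmark
```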

answered 2 months ago
  • Thanks for the answer ananthtm. I had been hoping that there were options available for optimizing the bookmarking, but I may have to continue using Databricks' Auto Loader instead until AWS evolves their solution to be a little more efficient. Thanks again for the detailed info!
