AWS does not publish implementation details for S3 job bookmarks. That said, the AWS documentation does provide some information that is helpful to keep in mind when implementing bookmarks.
For Amazon S3 input sources, AWS Glue job bookmarks check the last modified time of the objects to verify which objects need to be reprocessed. If your input source data has been modified since your last job run, the files are reprocessed when you run the job again.
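Conceptually, this behaves like an incremental filter on object timestamps. Here is a minimal plain-Python sketch of that idea (this is an illustration, not Glue's actual implementation; the object keys, timestamps, and the `last_run` bookmark value are all made up):

```python
from datetime import datetime, timezone

# Hypothetical listing of S3 objects with their last-modified timestamps
# (illustrative data, not a real S3 API call).
objects = [
    {"Key": "events/part-0000.json", "LastModified": datetime(2023, 5, 1, tzinfo=timezone.utc)},
    {"Key": "events/part-0001.json", "LastModified": datetime(2023, 6, 15, tzinfo=timezone.utc)},
    {"Key": "events/part-0002.json", "LastModified": datetime(2023, 7, 20, tzinfo=timezone.utc)},
]

# Timestamp persisted by the previous job run (the "bookmark").
last_run = datetime(2023, 6, 1, tzinfo=timezone.utc)

# Only objects modified after the previous run are reprocessed.
to_process = [o["Key"] for o in objects if o["LastModified"] > last_run]
```

With the sample data above, only the two objects modified after June 1 would be picked up on the next run.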
These are some best practices to keep in mind for Bookmarks with S3 as a source:
- Ensure that the job bookmark option is enabled in the job properties and that maximum concurrency is set to 1
- The job must call `job.init()` and `job.commit()` and pass the job name in the arguments: `job.init(args['JOB_NAME'], args)`
- Use `glueContext` and a `transformation_ctx` (especially on the source) to enable bookmarks; reading through `sparkContext` directly will not track bookmark state
- From the best-practices documentation, note that AWS suggests reading from a Data Catalog table (populated by a crawler) rather than using the `from_options()` method to read S3 files directly:
Use a catalog table with bookmarks for better partition management. Bookmarks work both for data sources from the Data Catalog and from options. However, it's difficult to remove or add partitions with the from-options approach. Using a catalog table with crawlers provides better automation to track newly added partitions and gives you the flexibility to select particular partitions with a pushdown predicate.
- The best-practices documentation also suggests using the Amazon S3 file lister, i.e. `useS3ListImplementation`:

```python
from_catalog(database="database", table_name="table", additional_options={'useS3ListImplementation': True}, transformation_ctx="datasource0")
```

or

```python
from_options(connection_type="s3", connection_options={"paths": ["s3://input_path"], "useS3ListImplementation": True, "recurse": True}, format="json")
```
Use the AWS Glue Amazon S3 file lister for large datasets. A bookmark lists all files under each input partition and then filters them, so if there are too many files under a single partition the driver can run out of memory. Use the AWS Glue Amazon S3 file lister to avoid listing all files in memory at once.
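To see why the in-memory listing matters, here is a toy, non-Glue sketch contrasting an eager listing (every key materialized at once, the driver-OOM scenario) with a lazy, page-by-page listing, which is similar in spirit to what a paginated S3 file lister does (the key names and page size are made up for illustration):

```python
from typing import Iterator, List

def eager_listing(num_files: int) -> List[str]:
    # Materializes every key at once; memory grows with the partition size.
    return [f"s3://bucket/partition/file-{i}.json" for i in range(num_files)]

def lazy_listing(num_files: int, page_size: int = 1000) -> Iterator[List[str]]:
    # Yields one page of keys at a time, so memory stays bounded by page_size
    # instead of by the total number of files in the partition.
    for start in range(0, num_files, page_size):
        yield [f"s3://bucket/partition/file-{i}.json"
               for i in range(start, min(start + page_size, num_files))]

# Process pages one at a time instead of holding all keys in memory.
total = 0
for page in lazy_listing(5000, page_size=1000):
    total += len(page)
```

The lazy variant processes the same 5,000 keys, but never holds more than one page of them at a time, which is the property that keeps the driver from running out of memory on very large partitions.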
Thanks for the answer, ananthtm. I had been hoping there were options available for optimizing the bookmarking, but I may have to continue using Databricks' Auto Loader instead until AWS evolves their solution to be a little more efficient. Thanks again for the detailed info!