The absence of temporary file creation is likely due to an incorrect implementation of the Glue bookmark feature. In your current configuration, you're attempting to read from an S3 source while specifying database and table details. Additionally, you are passing the jobBookmarkKeys and jobBookmarkKeysSortOrder parameters, which are not applicable for S3 data sources.
When reading from an S3 source, Glue bookmarks automatically utilise the last modified timestamp of files to determine whether they have been processed in previous runs. Therefore, explicit specification of bookmark keys is unnecessary.
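To make the idea concrete, here is a conceptual sketch in plain Python of filtering files by last-modified time. It is only an illustration of the principle described above, not Glue's internal algorithm; the function name `select_unprocessed`, the bucket name, and the timestamps are all made up for the example.

```python
from datetime import datetime, timezone


def select_unprocessed(files, last_run_time):
    """Return the paths modified after the previous run, in sorted order.

    `files` maps an S3 path to its last-modified timestamp.
    Simplified illustration only; Glue's real bookkeeping is more involved.
    """
    return sorted(
        path for path, modified in files.items() if modified > last_run_time
    )


# Hypothetical listing of an S3 prefix.
files = {
    "s3://example-bucket/data/day1.csv": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "s3://example-bucket/data/day2.csv": datetime(2024, 1, 2, tzinfo=timezone.utc),
    "s3://example-bucket/data/day3.csv": datetime(2024, 1, 3, tzinfo=timezone.utc),
}

# Pretend the previous run finished after day2.csv was written.
last_run_time = datetime(2024, 1, 2, tzinfo=timezone.utc)

new_files = select_unprocessed(files, last_run_time)
print(new_files)
```

In this sketch only day3.csv is selected, because it is the only file modified after the recorded run time.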
Recommended Implementation:
To correctly implement Glue bookmarks with an S3 data source, please consider the following approaches:
- Reading directly from S3 source:
    # Create a dynamic frame from the S3 source
    datasource0 = glueContext.create_dynamic_frame.from_options(
        connection_type='s3',
        connection_options={
            'paths': [f's3://{s3_bucket}/{s3_prefix}'],
            'recurse': True
        },
        format='csv',
        format_options={'withHeader': True, 'separator': ','},
        transformation_ctx='datasource0'
    )
- Reading from the Glue Data Catalog:
    # Create a dynamic frame from the Glue Data Catalog pointing to the S3 data source.
    AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(
        database="db_name",
        table_name="table_name",
        transformation_ctx="AWSGlueDataCatalog_node"
    )
By implementing either of these approaches, you should observe the expected behaviour of Glue bookmarks, including the creation of temporary files during processing. For details on Glue job bookmarks, see the following documentation:
https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
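Bookmark state is only saved when the job is run with bookmarks enabled (`--job-bookmark-option job-bookmark-enable`), the script calls `job.init` at the start and `job.commit` at the end, and each source has a `transformation_ctx`. The following is a minimal sketch of that skeleton, assuming a Data Catalog source as above. The function names `run_with_bookmarks` and `main`, the `output_path` job argument, the Parquet output format, and the `s3_sink` context name are assumptions for illustration.

```python
import sys


def run_with_bookmarks(glue_context, job, job_name, job_args, output_path):
    """Read with a bookmark-aware source, write to S3, then commit the bookmark."""
    # job.init must run before any bookmarked read.
    job.init(job_name, job_args)

    # transformation_ctx identifies the bookmark state for this source.
    frame = glue_context.create_dynamic_frame.from_catalog(
        database="db_name",
        table_name="table_name",
        transformation_ctx="AWSGlueDataCatalog_node",
    )

    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet",
        transformation_ctx="s3_sink",
    )

    # job.commit persists the bookmark so the next run skips processed data.
    job.commit()
    return frame


def main(argv):
    # These imports only resolve inside the AWS Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    # "output_path" is a hypothetical job argument, passed as --output_path.
    args = getResolvedOptions(argv, ["JOB_NAME", "output_path"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    run_with_bookmarks(glue_context, job, args["JOB_NAME"], args, args["output_path"])


# In a Glue job script, finish with: main(sys.argv)
```

If `job.commit` is never reached, the bookmark is not advanced, and the next run reads the same data again.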
@Sujeet thanks for the reply. I am still a bit confused, sorry. I have an "aks" table in an Oracle database. New data is inserted into this "aks" table daily with a new modtime. Using the job bookmark in Glue, I want to implement incremental load logic on this Oracle table so that it creates files in the S3 bucket containing only the new data. That's why I tried to use the modtime field as the job bookmark key.
When I use create_dynamic_frame.from_catalog, it creates files for the entire "aks" table in the S3 bucket.
How can I achieve this? The few AWS articles I have seen on job bookmarks give examples of loading CSV files, not of loading from an Oracle database table. A solution based on my example would be really helpful for better understanding. Thanks again.
Your ETL job that copies data to S3 is exporting the entire dataset on each run, rather than just the new or updated records. This suggests the Glue job bookmark may not be functioning as expected for this table.
When reading from the Oracle database, there are a few ways to specify job bookmark keys.
If job bookmark keys have not been explicitly defined, AWS Glue uses the primary key column as the bookmark key by default, provided that it increases or decreases sequentially with no gaps.
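For the "aks" table from the question, one way to set the keys explicitly is to pass `jobBookmarkKeys` and `jobBookmarkKeysSortOrder` through `additional_options` when reading from the Data Catalog. This is a sketch only: the helper names `bookmark_options` and `read_new_rows`, the `oracle_aks_source` context name, and the database and table names in the usage comment are hypothetical. The key must match the column name as it appears in the catalog table, and a user-defined key such as modtime should be strictly increasing (or decreasing) for new rows.

```python
def bookmark_options(keys, sort_order="asc"):
    """Build the additional_options dict for a JDBC bookmark read."""
    if sort_order not in ("asc", "desc"):
        raise ValueError("sort_order must be 'asc' or 'desc'")
    return {
        "jobBookmarkKeys": list(keys),
        "jobBookmarkKeysSortOrder": sort_order,
    }


def read_new_rows(glue_context, database, table_name):
    """Read only rows beyond the stored bookmark, keyed on modtime."""
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        # modtime is the column from the question; adjust to the catalog's spelling.
        additional_options=bookmark_options(["modtime"]),
        # Keep this name stable between runs; bookmark state is stored under it.
        transformation_ctx="oracle_aks_source",
    )


# Usage inside a Glue job, between job.init(...) and job.commit():
#   frame = read_new_rows(glueContext, "db_name", "aks")
```

The job still has to run with bookmarks enabled and call `job.commit()` at the end; otherwise each run reads the whole table again.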
For more details on configuring job bookmarking with a JDBC source, refer to the following AWS documentation:
These articles provide useful information on how Glue job bookmarking works with JDBC sources.