Job Bookmark creating null temp file in AWS Glue

I have the below Python script where I have enabled job bookmarks and also provided a path in the temp directory for the bookmark files. The problem is that it creates a bookmark JSON file which is empty. I don't understand what the issue in my Python script might be. I have looked at a few AWS articles but could not work out the issue in the code.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import datetime


args = getResolvedOptions(sys.argv, ['target_BucketName', 'JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

outputbucketname = args['target_BucketName']

timestamp = datetime.datetime.now().strftime("%Y%m%d")
filename = f"aks{timestamp}"
output_path = f"{outputbucketname}/{filename}"


# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1712075257312 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    database="obsdatachecks",
    table_name="_obst_rw_omuc_baag__obst1_aks",
    additional_options={"jobBookmarkKeys": ["modtime"], "jobBookmarkKeysSortOrder": "desc"},
    transformation_ctx="AWSGlueDataCatalog_node1712075257312"
)

# Script generated for node Amazon S3
AmazonS3_node1712075284688 = glueContext.write_dynamic_frame.from_options(
    frame=AWSGlueDataCatalog_node1712075257312,
    connection_type="s3",
    format="csv",
    format_options={"separator": "|"},
    connection_options={"path": output_path, "compression": "gzip", "partitionKeys": []},
    transformation_ctx="AmazonS3_node1712075284688"
)


job.commit()
RahulD
asked 25 days ago

1 Answer

The empty bookmark file is most likely the result of an incorrect use of the Glue job bookmark feature. In your current configuration, you are attempting to read from an S3 source while also specifying database and table details. Additionally, you are passing the jobBookmarkKeys and jobBookmarkKeysSortOrder parameters, which do not apply to S3 data sources.

When reading from an S3 source, Glue bookmarks automatically utilise the last modified timestamp of files to determine whether they have been processed in previous runs. Therefore, explicit specification of bookmark keys is unnecessary.

Recommended Implementation:

To correctly implement Glue bookmarks with an S3 data source, please consider the following approaches:

  • Reading directly from an S3 source:
# Create a dynamic frame from the S3 source 
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={
        'paths': [f's3://{s3_bucket}/{s3_prefix}'],
        'recurse': True
    },
    format='csv',
    format_options={'withHeader': True, 'separator': ','},
    transformation_ctx='datasource0'
)
  • Reading from the Glue Data Catalog:
# Create a dynamic frame from the Glue Data Catalog pointing to s3 data source.
AWSGlueDataCatalog_node = glueContext.create_dynamic_frame.from_catalog(
    database="db_name",
    table_name="table_name",
    transformation_ctx="AWSGlueDataCatalog_node"
)

By implementing either of these approaches, you should observe the expected behaviour of Glue bookmarks, including the creation of temporary files during processing. For details on Glue job bookmarks, see the following documentation:

https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

answered 25 days ago
  • @Sujeet thanks for the reply. I am still a bit confused, sorry. Actually, I have an "aks" table in an Oracle database. New data is inserted into this "aks" table daily with a new modtime. Using a job bookmark in Glue, I want to implement incremental-load logic on this Oracle table so that it creates files in the S3 bucket with only the new data. That's why I tried to use the modtime field as the job bookmark key.

    When I use create_dynamic_frame.from_catalog, it creates files for the entire "aks" table in the S3 bucket.

    How can I achieve this? The AWS articles I have seen on job bookmarks use CSV file loading as the example, not loading from an Oracle database table. If you can provide a solution using my example, it would be really helpful for better understanding. Thanks again.

  • Your ETL job that copies data to S3 is exporting the entire dataset each time, rather than just new or updated records. This suggests the Glue bookmarking feature is not functioning as expected for this table.

    When reading from the Oracle database, there are a few ways to specify job bookmark keys.

    1. When using create_dynamic_frame.from_catalog(), you can specify jobBookmarkKeys and jobBookmarkKeysSortOrder using the additional_options parameter.
    2. When using create_dynamic_frame.from_options(), you can specify them using the connection_options parameter. Both options are sketched below.
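
    For example, here is a minimal sketch of option 1 using the database, table, and modtime key from your question (the "asc" sort order is an assumption; ascending modtime makes new rows sort after already-processed ones):

    # Read the Oracle-backed catalog table with explicit bookmark keys
    datasource = glueContext.create_dynamic_frame.from_catalog(
        database="obsdatachecks",
        table_name="_obst_rw_omuc_baag__obst1_aks",
        additional_options={
            "jobBookmarkKeys": ["modtime"],
            "jobBookmarkKeysSortOrder": "asc"  # assumption: ascending modtime
        },
        transformation_ctx="datasource"
    )

    And a sketch of option 2, reading directly over JDBC (the connection values here are placeholders, not taken from your setup):

    # Read the Oracle table directly, passing bookmark keys in connection_options
    datasource = glueContext.create_dynamic_frame.from_options(
        connection_type="oracle",
        connection_options={
            "url": "jdbc:oracle:thin://@db_host:1521/service_name",  # placeholder
            "user": "db_user",          # placeholder
            "password": "db_password",  # placeholder
            "dbtable": "aks",
            "jobBookmarkKeys": ["modtime"],
            "jobBookmarkKeysSortOrder": "asc"  # assumption: ascending modtime
        },
        transformation_ctx="datasource"
    )

    In both cases, keep the job.init() and job.commit() calls you already have; bookmark state is only persisted when job.commit() runs.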

    If job bookmark keys have not been explicitly defined, AWS Glue will by default use the primary key column as the bookmark key, provided it increases or decreases monotonically with no gaps; in other words, the primary key column must be sequentially ordered for bookmarking to be reliable.

    For more details on configuring job bookmarking with a JDBC source, refer to the job bookmarks documentation linked in the answer above; it also covers how bookmarking works with JDBC sources.
