AWS Glue creates full data from Oracle source to S3 target every time, even when a job bookmark is enabled


I have the Python script below in an AWS Glue job. For the incremental load logic, I have set the Job bookmark option to Enable and also provided a temporary path.

I have "btb" table in oracle database. The daily data inserted in this "btb" table with new modtime. Using job bookmark in glue, I want to perform incremental load logic on this oracle table such that it should create files in s3 bucket with the new data only. Thats why i tried to use modtime field as job bookmark key.

When I use create_dynamic_frame.from_catalog, it creates files for the entire "btb" table in the S3 bucket every time, but I want only files with the new data to be created. I have looked at a few AWS job bookmark articles but could not find any help.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import datetime


args = getResolvedOptions(sys.argv, ['target_BucketName', 'JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

outputbucketname = args['target_BucketName']

timestamp = datetime.datetime.now().strftime("%Y%m%d")
filename = f"btb{timestamp}"
output_path = f"{outputbucketname}/{filename}"


# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1712075257312 = glueContext.create_dynamic_frame.from_catalog(
    database="checks", 
    table_name="_bst_rw_omuc_baag__bst1_btb",
    additional_options = {"jobBookmarkKeys":["modtime"],"jobBookmarkKeysSortOrder":"desc"}, 
    transformation_ctx="AWSGlueDataCatalog_node1712075257312")

# Script generated for node Amazon S3
AmazonS3_node1712075284688 = glueContext.write_dynamic_frame.from_options(
    frame=AWSGlueDataCatalog_node1712075257312,
    connection_type="s3",
    format="csv",
    format_options={"separator": "|"},
    connection_options={
        "path": output_path,
        "compression": "gzip",
        "partitionKeys": [],
    },
    transformation_ctx="AmazonS3_node1712075284688",
)


job.commit()
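
Note that job bookmarks only take effect when the run itself has bookmarks enabled; job.init()/job.commit() then read and advance the state. As a minimal sketch (the job name below is hypothetical), a run with bookmarks enabled can be started with boto3 like this:

import boto3

glue = boto3.client("glue")
glue.start_job_run(
    JobName="btb-incremental-export",  # hypothetical job name; use your own
    Arguments={
        # Without this argument (or "Enable" in the console job settings),
        # job.init()/job.commit() will not persist any bookmark state.
        "--job-bookmark-option": "job-bookmark-enable",
    },
)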
RahulD
asked 25 days ago
1 Answer

Hello,

From the code snippet below, I can see that you are using both the jobBookmarkKeys and jobBookmarkKeysSortOrder parameters:

additional_options = {"jobBookmarkKeys":["modtime"],"jobBookmarkKeysSortOrder":"desc"}, 

Please ensure that the bookmark key meets the criteria outlined here for JDBC sources:

  1. For each table, AWS Glue uses one or more columns as bookmark keys to determine new and processed data. The bookmark keys combine to form a single compound key.
  2. AWS Glue by default uses the primary key as the bookmark key, provided that it is sequentially increasing or decreasing (with no gaps).
  3. You can specify the columns to use as bookmark keys in your AWS Glue script. For more information about using Job bookmarks in AWS Glue scripts, see Using job bookmarks.
  4. AWS Glue doesn't support using columns with case-sensitive names as job bookmark keys.

I would recommend ensuring that the column used as the bookmark key has sequentially increasing values (like a serial number), and avoiding a string column as the jobBookmarkKeys, so that job bookmarks work as expected. Refer to the following re:Post article as well.
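
To illustrate point 3 above, here is a minimal sketch of specifying a bookmark key in the script. It reuses the glueContext already created in your script, and "btb_id" is a hypothetical sequentially increasing numeric column; substitute your table's actual key:

# Minimal sketch: bookmark key on a numeric, sequentially increasing column.
# "btb_id" is hypothetical; glueContext comes from the script above.
btb_source = glueContext.create_dynamic_frame.from_catalog(
    database="checks",
    table_name="_bst_rw_omuc_baag__bst1_btb",
    additional_options={
        "jobBookmarkKeys": ["btb_id"],
        "jobBookmarkKeysSortOrder": "asc",
    },
    # Keep this identifier stable across runs; renaming it resets the
    # bookmark state stored for this source.
    transformation_ctx="btb_source",
)

Also keep in mind that the transformation_ctx string is the key under which the bookmark state is stored, so it must stay identical from one run to the next.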

In order for me to troubleshoot further by taking a look at the back-end logs, please feel free to open a support case with AWS using the following link, including the sanitized script and the job run ID, and we would be happy to help.

AWS
SUPPORT ENGINEER
answered 24 days ago
