Glue job bookmark with more than one source


Hello guys, I'm working with Glue for the first time. I wrote a PySpark script that reads from RDS and loads into S3, and I noticed that the bookmark only works properly for the first table. Is there a limit of one bookmark per job? Example script below:

# Script  table1
table1_node = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="rds_table1",
    transformation_ctx="table1_node ",
    additional_options = {"jobBookmarkKeys":["ID"], "jobBookmarkKeysSortOrder":"asc"}
)

# Script table2
table2_node = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="rds_table2",
    transformation_ctx="table2_node ",
    additional_options = {"jobBookmarkKeys":["id"], "jobBookmarkKeysSortOrder":"asc"}
)
# Script table1 to S3
table1_node = glueContext.write_dynamic_frame.from_options(
    frame=df_dyf_req,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3:///",
        "partitionKeys": ["year", "month"],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="req_node",
)
# Script table2 to S3
table2_node = glueContext.write_dynamic_frame.from_options(
    frame=df_dyf_act,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3:///",
        "partitionKeys": ["year", "month"],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="act_node",
)
job.commit()

thanks guys

lebuosi
asked 7 months ago, 146 views
1 Answer
Accepted Answer

Hello,

In general, when job bookmarks are used with a JDBC data source, they capture only new rows, not changes to existing rows. This is because, even with a composite key, job bookmarks do not retrieve rows whose key value is lower than the previously processed bookmark value. This limitation is described in the following documentation.

https://docs.aws.amazon.com/glue/latest/dg/glue-troubleshooting-errors.html#error-job-bookmarks-reprocess-data

If you are using a relational database (a JDBC connection) for the input source, job bookmarks work only if the table's primary keys are in sequential order. Job bookmarks work for new rows, but not for updated rows. That is because job bookmarks look for the primary keys, which already exist. [1]

https://aws.amazon.com/blogs/big-data/load-data-incrementally-and-optimized-parquet-writer-with-aws-glue/

When you access relational databases using a JDBC connection…Job bookmarks can capture only newly added rows [2]

Note that this limitation only applies to JDBC sources and not source tables in S3.
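
To make the JDBC limitation concrete, here is a toy model in plain Python. It is not Glue's implementation; it only assumes, as the quoted documentation suggests, that a run reads rows whose key is above the last stored bookmark value. The function name `read_new_rows` and the sample table are made up for illustration.

```python
# Toy model only: this is NOT how Glue is implemented. It just mimics the
# documented behaviour for JDBC sources with an ascending bookmark key.

def read_new_rows(table, last_key):
    """Return rows whose key is above the stored bookmark, plus the new bookmark."""
    new_rows = [row for row in table if row["id"] > last_key]
    new_bookmark = max((row["id"] for row in new_rows), default=last_key)
    return new_rows, new_bookmark


# Hypothetical source table with a sequential primary key.
table = [{"id": 1, "status": "open"}, {"id": 2, "status": "open"}]

# First run: nothing processed yet, so every row is read.
first_run, bookmark = read_new_rows(table, last_key=0)

# Between runs: row 1 is updated in place and row 3 is inserted.
table[0]["status"] = "closed"
table.append({"id": 3, "status": "open"})

# Second run: only id 3 is above the bookmark; the update to id 1 is missed.
second_run, bookmark = read_new_rows(table, last_key=bookmark)
print([row["id"] for row in second_run])
```

In this model the second run returns only the inserted row, which is why updated rows need a different mechanism (for example a full reload) if they must reach S3.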

In the shared script you have used "jobBookmarkKeys":["ID"] for one table and "jobBookmarkKeys":["id"] for the other. Please check whether column names are case sensitive on your JDBC source, and whether "ID" and "id" are treated as the same key there.

Since bookmark state is tracked per transformation_ctx, please try using a different, unique name for each transformation_ctx and rerun your Glue job.
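
One way to keep the context names unique is to derive them from the table names and check them before the job runs. This is only a sketch: `bookmark_read_args` and the `read_` prefix are hypothetical, while the keyword names match the `from_catalog` call in the question. It also strips whitespace, since the contexts in the posted script end with a space ("table1_node "); whether that space affects bookmarks is not confirmed here.

```python
def bookmark_read_args(database, table_name, bookmark_key):
    """Build keyword arguments for one bookmarked read (hypothetical helper)."""
    return {
        "database": database,
        "table_name": table_name,
        # Derived from the table name and stripped of stray whitespace.
        "transformation_ctx": f"read_{table_name}".strip(),
        "additional_options": {
            "jobBookmarkKeys": [bookmark_key],
            "jobBookmarkKeysSortOrder": "asc",
        },
    }


reads = [
    bookmark_read_args("db", "rds_table1", "ID"),
    bookmark_read_args("db", "rds_table2", "id"),
]

# Fail early if two reads would share the same bookmark context.
contexts = [r["transformation_ctx"] for r in reads]
if len(set(contexts)) != len(contexts):
    raise ValueError(f"duplicate transformation_ctx values: {contexts}")

# Inside the Glue job, each entry would be passed on like this:
# frame = glueContext.create_dynamic_frame.from_catalog(**reads[0])
```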

Thanks!

AWS
SUPPORT ENGINEER
answered 7 months ago
