Glue Bookmark

Hello guys, I'm working with Glue for the first time. When I run a PySpark script that reads from RDS and loads into S3, I noticed that only the first table's bookmark works correctly. Is there a limitation of one bookmark per job? Example script below:

# Script  table1
table1_node = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="rds_table1",
    transformation_ctx="table1_node ",
    additional_options = {"jobBookmarkKeys":["ID"], "jobBookmarkKeysSortOrder":"asc"}
)

# Script table2
table2_node = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="rds_table2",
    transformation_ctx="table2_node ",
    additional_options = {"jobBookmarkKeys":["id"], "jobBookmarkKeysSortOrder":"asc"}
)
# Script table1 to S3
table1_node = glueContext.write_dynamic_frame.from_options(
    frame=df_dyf_req,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3:///",
        "partitionKeys": ["year", "month"],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="req_node",
)
# Script table2 to S3
table2_node = glueContext.write_dynamic_frame.from_options(
    frame=df_dyf_act,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3:///",
        "partitionKeys": ["year", "month"],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="act_node",
)
job.commit()

thanks guys

lebuosi
Asked 7 months ago · 155 views
1 Answer
Accepted Answer

Hello,

In general, when using job bookmarks with a JDBC data source, one limitation is that bookmarks capture only new rows, not changes to existing rows. This is because, even with a composite key, job bookmarks will not retrieve values for a key that are lower than the previously processed bookmark value. This limitation is described in the following documentation.

https://docs.aws.amazon.com/glue/latest/dg/glue-troubleshooting-errors.html#error-job-bookmarks-reprocess-data

If you are using a relational database (a JDBC connection) for the input source, job bookmarks work only if the table's primary keys are in sequential order. Job bookmarks work for new rows, but not for updated rows. That is because job bookmarks look for the primary keys, which already exist. [1]

https://aws.amazon.com/blogs/big-data/load-data-incrementally-and-optimized-parquet-writer-with-aws-glue/

When you access relational databases using a JDBC connection…Job bookmarks can capture only newly added rows [2]

Note that this limitation only applies to JDBC sources and not source tables in S3.
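As a rough mental model (an illustrative sketch, not Glue's actual implementation), the bookmark behaves like a stored high-water mark on the key column: each run picks up only rows whose key exceeds the last committed maximum, which is why inserts are captured but updates to rows with existing IDs are skipped.

```python
# Toy simulation of a job-bookmark "high-water mark" on a sequential key.
# Illustrative only -- NOT AWS Glue's real implementation.

def incremental_read(rows, bookmark):
    """Return rows whose ID exceeds the stored bookmark, plus the new bookmark."""
    new_rows = [r for r in rows if r["ID"] > bookmark]
    new_bookmark = max((r["ID"] for r in new_rows), default=bookmark)
    return new_rows, new_bookmark

# First run: all rows are new, so both are picked up.
table = [{"ID": 1, "val": "a"}, {"ID": 2, "val": "b"}]
batch1, bm1 = incremental_read(table, bookmark=0)

# Second run: row 1 was UPDATED and row 3 was INSERTED.
table = [{"ID": 1, "val": "a-updated"}, {"ID": 2, "val": "b"}, {"ID": 3, "val": "c"}]
batch2, bm2 = incremental_read(table, bm1)
# Only ID 3 is picked up; the update to ID 1 is silently missed.
```

This is the behavior the documentation describes: the filter is on key order alone, so any change that does not advance the key past the bookmark is invisible to the next run.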

From the shared script, I can see you have used "jobBookmarkKeys":["ID"] and "jobBookmarkKeys":["id"]. Please verify how your JDBC source handles case sensitivity for these column names; depending on the source, "id" and "ID" may be treated as the same key, so make sure each key matches the actual column name.

Since bookmark state is tracked per transformation_ctx, please make sure each source and sink uses a distinct, consistent transformation_ctx name, then retry your Glue job.
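The point about distinct names can be sketched with a toy bookmark store keyed by transformation_ctx (an assumption used for illustration; Glue persists real bookmark state in its own service, keyed by job name and context): if two sources accidentally share the same context name, they overwrite each other's progress.

```python
# Toy per-context bookmark store. Illustrative only -- not Glue's
# persistence layer, just a model of why context names must be unique.

bookmark_store = {}

def commit_bookmark(ctx, max_key):
    """Record the highest processed key value for a given transformation_ctx."""
    bookmark_store[ctx] = max_key

def last_bookmark(ctx):
    """Fetch the stored high-water mark for a transformation_ctx (0 on first run)."""
    return bookmark_store.get(ctx, 0)

# Distinct context names keep each table's progress separate.
commit_bookmark("table1_node", 120)
commit_bookmark("table2_node", 45)

# A duplicated context name silently clobbers the other table's state.
commit_bookmark("table1_node", 7)  # e.g. a second source reusing "table1_node"
```

With separate names, each table resumes from its own key; with a shared name, the last writer wins and the other table reprocesses or skips rows unpredictably.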

Thanks!

AWS
Support Engineer
Answered 7 months ago
