Glue Bookmark
Hello guys, I'm working with Glue for the first time. When I run a PySpark script that reads from RDS and loads into S3, I noticed that only the first table's bookmark works correctly. Is there a limitation of one bookmark per job? Example script below:

# Script  table1
table1_node = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="rds_table1",
    transformation_ctx="table1_node",
    additional_options={"jobBookmarkKeys": ["ID"], "jobBookmarkKeysSortOrder": "asc"}
)

# Script table2
table2_node = glueContext.create_dynamic_frame.from_catalog(
    database="db",
    table_name="rds_table2",
    transformation_ctx="table2_node",
    additional_options={"jobBookmarkKeys": ["id"], "jobBookmarkKeysSortOrder": "asc"}
)
# Script table1 to S3
table1_node = glueContext.write_dynamic_frame.from_options(
    frame=df_dyf_req,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3:///",
        "partitionKeys": ["year", "month"],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="req_node",
)
# Script table2 to S3
table2_node = glueContext.write_dynamic_frame.from_options(
    frame=df_dyf_act,
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3:///",
        "partitionKeys": ["year", "month"],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="act_node",
)
job.commit()

thanks guys

lebuosi
Asked 7 months ago · 154 views
1 Answer
1
Accepted Answer

Hello,

In general, when using job bookmarks with a JDBC data source, a limitation is that bookmarks capture only new rows, not changes to existing rows. This is because, even with a composite key, job bookmarks will not retrieve values for a key that are lower than the previously processed bookmark value. Please see this limitation as described in the following documentation:

https://docs.aws.amazon.com/glue/latest/dg/glue-troubleshooting-errors.html#error-job-bookmarks-reprocess-data

If you are using a relational database (a JDBC connection) for the input source, job bookmarks work only if the table's primary keys are in sequential order. Job bookmarks work for new rows, but not for updated rows. That is because job bookmarks look for the primary keys, which already exist. [1]

https://aws.amazon.com/blogs/big-data/load-data-incrementally-and-optimized-parquet-writer-with-aws-glue/

When you access relational databases using a JDBC connection…Job bookmarks can capture only newly added rows [2]

Note that this limitation applies only to JDBC sources, not to source tables in S3.
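To make the limitation concrete, here is a minimal pure-Python simulation of the bookmark behavior described above (not actual Glue internals — the real bookmark state is managed by the Glue service; the function and data here are hypothetical):

```python
# Simulate a job bookmark on a sequential primary key: Glue stores the
# highest key value seen, and on the next run only reads rows whose key
# is greater than that stored value.

def read_with_bookmark(rows, bookmark):
    """Return rows with id > bookmark, plus the advanced bookmark value."""
    new_rows = [r for r in rows if r["id"] > bookmark]
    new_bookmark = max([bookmark] + [r["id"] for r in new_rows])
    return new_rows, new_bookmark

# First run: all rows are new.
table = [{"id": 1, "val": "a"}, {"id": 2, "val": "b"}]
run1, bookmark = read_with_bookmark(table, bookmark=0)
print(len(run1), bookmark)  # 2 rows read, bookmark advances to 2

# Second run: row 1 was UPDATED in place and row 3 was inserted.
table = [{"id": 1, "val": "a-updated"}, {"id": 2, "val": "b"}, {"id": 3, "val": "c"}]
run2, bookmark = read_with_bookmark(table, bookmark)
print([r["id"] for r in run2])  # only [3]; the update to id=1 is missed
```

The updated row is silently skipped because its key (1) is below the stored bookmark (2) — exactly the "new rows, not updated rows" behavior in the docs quoted above.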

From the shared script, I can see you have used "jobBookmarkKeys":["ID"] and "jobBookmarkKeys":["id"]. Please check whether column names are case-sensitive on your JDBC source; if they are not, "ID" and "id" are treated as the same key.

Since bookmark state is tracked per transformation_ctx, please make sure each node uses a distinct transformation_ctx name, then retry your Glue job.
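To illustrate why the transformation_ctx name matters, here is a hypothetical sketch modeling bookmark state as a dict keyed by that name (the real state lives in the Glue service; this is only a model):

```python
# Model bookmark state as a dict keyed by transformation_ctx. If two
# reads share the same ctx name they overwrite each other's bookmark;
# unique names keep each source's progress independent.

bookmarks = {}  # ctx name -> last processed key value

def read_source(ctx, rows, key="id"):
    last = bookmarks.get(ctx, 0)
    new_rows = [r for r in rows if r[key] > last]
    bookmarks[ctx] = max([last] + [r[key] for r in new_rows])
    return new_rows

t1 = [{"id": 1}, {"id": 2}, {"id": 3}]
t2 = [{"id": 1}, {"id": 2}]

# Distinct ctx names: each table tracks its own high-water mark.
read_source("table1_node", t1)
read_source("table2_node", t2)
print(bookmarks)  # {'table1_node': 3, 'table2_node': 2}

# Shared ctx name: table2's read starts from table1's bookmark (3)
# and returns nothing, even though its rows were never processed.
bookmarks.clear()
read_source("shared_ctx", t1)
print(read_source("shared_ctx", t2))  # [] -> rows silently skipped
```

This is why the second source can appear to "lose" its bookmark: its rows are being filtered against state advanced by another node.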

Thanks!

AWS
Support Engineer
Answered 7 months ago
