Glue Bookmarking not working

0

Hello guys. I'm new to Glue, but I need to build a job that extracts from an RDS database and loads into S3. The script works fine, but I need an incremental load, because every time I start the job all the data is processed, and I only need new or modified rows. What am I doing wrong? With --job-bookmark-option set to enable, the job doesn't load any data, and if I change it to disable then all the data is processed. My PoC script is below:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

sys.argv += ["--JOB_NAME", "glue_job"]
sys.argv += ["--job-bookmark-option", "job-bookmark-enable"]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# -------------------
node = glueContext.create_dynamic_frame.from_catalog(
    database="db_rds",
    table_name="rds_req",
    transformation_ctx="node",
    additional_options = {"jobBookmarkKeys":["column"], "jobBookmarkKeysSortOrder":"asc"}
)
# -----------------
# Script generated for node Amazon S3
AmazonS3_node = glueContext.write_dynamic_frame.from_options(
    frame=node,  # the DynamicFrame read from the catalog above
    connection_type="s3",
    format="glueparquet",
    connection_options={
        "path": "s3://",
        "partitionKeys": ["year", "month"],
    },
    format_options={"compression": "snappy"},
    transformation_ctx="AmazonS3_node",
)
job.commit()
lebuosi
asked 7 months ago · 389 views
2 Answers
1
Accepted Answer

Hello,

When job bookmarks are enabled, the job keeps track of which rows have been processed using a column (or columns) specified as the job bookmark key. If no bookmark key is specified, Glue uses the primary key as the bookmark key by default [1].

If no bookmark key is specified, the primary key must be sequentially increasing or decreasing with no gaps. If the bookmark keys are user-defined, they must be strictly monotonically increasing or decreasing; gaps are permitted [1].

Please verify that the bookmark key column in the source table meets these criteria. If the column designated as the bookmark key does not, not all of the data may be read, particularly on subsequent runs.

For JDBC sources, the following rules apply:

1. For each table, AWS Glue uses one or more columns as bookmark keys to determine new and processed data. The bookmark keys combine to form a single compound key.

2. By default, AWS Glue uses the primary key as the bookmark key, provided that it is sequentially increasing or decreasing (with no gaps).

3. You can specify the columns to use as bookmark keys in your AWS Glue script. For more information about using job bookmarks in AWS Glue scripts, see Using job bookmarks.

4. AWS Glue doesn't support using columns with case-sensitive names as job bookmark keys.
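Conceptually, a job bookmark works like a stored high-water mark on the bookmark key: each run reads only rows whose key exceeds the value persisted by the previous run, then advances the mark. The following pure-Python sketch illustrates that idea with hypothetical data; it is not the Glue API (Glue stores this state internally per transformation_ctx):

```python
# Illustrative sketch of the high-water-mark idea behind job bookmarks.
# Hypothetical rows and function names; not part of the awsglue library.

def incremental_read(rows, bookmark_key, last_value):
    """Return rows whose bookmark key exceeds the stored mark,
    plus the new high-water mark to persist for the next run."""
    new_rows = [r for r in rows if r[bookmark_key] > last_value]
    new_mark = max((r[bookmark_key] for r in new_rows), default=last_value)
    return new_rows, new_mark

rows = [{"id": 1}, {"id": 2}, {"id": 3}]

# First run: everything above the initial mark (0) is "new".
batch, mark = incremental_read(rows, "id", 0)     # all 3 rows, mark = 3

# Second run with no new source data: nothing is re-processed.
batch2, mark2 = incremental_read(rows, "id", mark)  # empty batch, mark stays 3
```

Note that this scheme only picks up rows with a key greater than the mark: rows modified in place without their key changing are skipped, which is why the monotonicity requirement on the bookmark key matters for capturing changes.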

You can find an example for a JDBC source in the documentation below [2].

[1]https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html

[2]https://docs.aws.amazon.com/glue/latest/dg/programming-etl-connect-bookmarks.html

Thank you !

AWS
SUPPORT ENGINEER
answered 7 months ago
  • Thanks for the answer. I'm using an ID column as the bookmark key, but when the bookmark is enabled the job doesn't pick up any new or modified data. If I change it to disabled, the job picks up new data, but every run processes all the data again, so I end up with all the data duplicated multiple times. How can I deal with this?

0

Hi,

For me, bookmarking didn't work at all either, so to get an upsert working for S3 I ended up building a script using the Delta Lake package from this location: https://mvnrepository.com/artifact/io.delta/delta-storage

More info on that: https://dev.to/awscommunity-asean/making-your-data-lake-acid-compliant-using-aws-glue-and-delta-lake-gk9 and here: https://dev.to/awscommunity-asean/sql-based-inserts-deletes-and-upserts-in-s3-using-aws-glue-3-0-and-delta-lake-42f0

But now we are moving away from it, because we are going to store our data in Redshift instead of S3, and I found out (yesterday) that it supports upsert functionality out of the box.
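The Delta Lake MERGE used in the linked posts boils down to key-based upsert semantics: rows that match on the key are overwritten, and unmatched rows are appended. A minimal pure-Python sketch of that behavior (hypothetical data and function names, not the Delta API):

```python
# Sketch of MERGE-style upsert semantics, keyed on a primary key column.
# Illustrative only; Delta Lake implements this at the storage layer.

def upsert(target, updates, key):
    """Merge `updates` into `target` by `key`: matching rows are
    replaced, new rows are appended."""
    by_key = {row[key]: row for row in target}
    for row in updates:
        by_key[row[key]] = row
    return list(by_key.values())

table = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
changes = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
merged = upsert(table, changes, "id")
# merged holds ids 1, 2 (value updated to "B"), and 3
```

This is also why plain bookmark-based appends to S3 can leave duplicates for modified rows: without a merge step keyed on the primary key, a re-read of a changed row lands as a second copy instead of replacing the first.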

René

Rene
answered 7 months ago
