Is it possible to add a bookmark to a Glue script connecting DynamoDB to S3?


Hello,

I am writing a Glue script to transfer a table from DynamoDB to an S3 bucket. I put the necessary config into the code, enabled the bookmark in Job Details, and ran the script three times, but found triple the quantity of items in S3, so the bookmark failed. Is it because I have set something up wrong here? Thanks in advance.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.sql import SQLContext
from awsglue.dynamicframe import DynamicFrame

args = getResolvedOptions(
    sys.argv, 
    [
        "JOB_NAME",
        "raw_bucket", 
        "dataset_folder"
    ])
    
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

raw_bucket = args["raw_bucket"]
dataset_folder = args["dataset_folder"]

node_ddb_table1 = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    transformation_ctx="node_ddb_table1",
    connection_options={"dynamodb.input.tableName": "ddb-table-1",
        "dynamodb.throughput.read.percent": "0.2",
        "dynamodb.splits": "2"
    },
    additional_options={"jobBookmarkKeys":["id"], "jobBookmarkKeysSortOrder":"asc", "mergeSchema": "true"}
)
df = node_ddb_sit_planet_payment_merchant.toDF()

dyf = DynamicFrame.fromDF(df, glueContext, "dyf")
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": f"s3://{raw_bucket}/dynamodb/node_ddb_table1_bookmarked/"},
    format="parquet",
    format_options={
        "separator": ","
    },
    transformation_ctx="datasink1"
)
job.commit()
asked 2 months ago · 157 views
3 Answers
Accepted Answer

Hi,

According to the AWS documentation, Glue job bookmarks are not available for DynamoDB sources; they only work with JDBC data sources and some Amazon S3 sources. That is why every run re-reads the full table, and you end up with triple the items after three runs.
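
For contrast, here is a minimal sketch of where jobBookmarkKeys does take effect: a JDBC source read through the Data Catalog. The database and table names below are hypothetical, not from your job:

# Bookmarks track state per transformation_ctx, so the ctx string must stay stable between runs.
jdbc_node = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",        # hypothetical catalog database
    table_name="my_jdbc_table",    # hypothetical table backed by a JDBC connection
    transformation_ctx="jdbc_node",
    additional_options={
        "jobBookmarkKeys": ["id"],
        "jobBookmarkKeysSortOrder": "asc",
    },
)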

EXPERT
answered 2 months ago

I think you should change node_ddb_sit_planet_payment_merchant.toDF() to node_ddb_table1.toDF() to fix this issue. Also, double-check that the job role has the necessary permissions to write to the specified S3 path.
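
For clarity, the corrected lines, using the DynamicFrame actually defined earlier in the script, would be:

# Reference the frame created above instead of the undefined variable.
df = node_ddb_table1.toDF()
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")

Note the round trip through a DataFrame is a no-op here, so you could also pass node_ddb_table1 straight to write_dynamic_frame.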

AWS
answered 2 months ago

As stated, since Glue doesn't support bookmarks for DynamoDB, you can create your own bookmark. All you need is an attribute such as your id field or a datetime field; an epoch (timestamp) field is probably best. Then have an index on the DynamoDB table so you can query for values greater than the value from the last run. As part of your Glue job, store the last processed value in S3 (in JSON format, for example) and read that value back at the start of your Glue script; see the sketch below.
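
A minimal sketch of that bookkeeping, assuming items carry a numeric updated_at epoch attribute and using a hypothetical bookmark file at s3://<raw_bucket>/bookmarks/ddb-table-1.json (raw_bucket and node_ddb_table1 come from the script in the question). For simplicity this filters the scanned DynamicFrame instead of querying an index, so it prevents duplicate writes but still reads the whole table:

import json
import boto3
from awsglue.transforms import Filter

s3 = boto3.client("s3")
bookmark_key = "bookmarks/ddb-table-1.json"  # hypothetical location

# Read the watermark left by the previous run; default to 0 on the first run.
try:
    obj = s3.get_object(Bucket=raw_bucket, Key=bookmark_key)
    last_seen = json.loads(obj["Body"].read())["last_updated_at"]
except s3.exceptions.NoSuchKey:
    last_seen = 0

# Keep only items written after the previous run.
new_items = Filter.apply(
    frame=node_ddb_table1,
    f=lambda row: row["updated_at"] > last_seen,
)

# ... write new_items to S3 exactly as in the original script ...

# Persist the new high-water mark for the next run.
max_seen = new_items.toDF().agg({"updated_at": "max"}).collect()[0][0]
if max_seen is not None:
    s3.put_object(
        Bucket=raw_bucket,
        Key=bookmark_key,
        Body=json.dumps({"last_updated_at": max_seen}),
    )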

answered 2 months ago
