Is it possible to add a bookmark to a Glue script connecting DynamoDB to S3?


Hello

I am writing a Glue script to transfer a table from DynamoDB to an S3 bucket. I put the necessary configuration into the code, enabled the bookmark in Job Details, and ran the script three times, but found triple the quantity of items in S3, so the bookmark failed. Is it because I have set something up wrong here? Thanks in advance.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.sql import SQLContext
from awsglue.dynamicframe import DynamicFrame

args = getResolvedOptions(
    sys.argv, 
    [
        "JOB_NAME",
        "raw_bucket", 
        "dataset_folder"
    ])
    
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

raw_bucket = args["raw_bucket"]
dataset_folder = args["dataset_folder"]

node_ddb_table1 = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    transformation_ctx="node_ddb_table1",
    connection_options={"dynamodb.input.tableName": "ddb-table-1",
        "dynamodb.throughput.read.percent": "0.2",
        "dynamodb.splits": "2"
    },
    additional_options={"jobBookmarkKeys":["id"], "jobBookmarkKeysSortOrder":"asc", "mergeSchema": "true"}
)
df = node_ddb_sit_planet_payment_merchant.toDF()

dyf = DynamicFrame.fromDF(df, glueContext, "dyf")
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": f"s3://{raw_bucket}/dynamodb/node_ddb_table1_bookmarked/"},
    format="parquet",
    format_options={
        "separator": ","
    },
    transformation_ctx="datasink1"
)
job.commit()
posted 2 months ago, 166 views
3 Answers
Accepted Answer

Hi,

According to the AWS documentation, Glue job bookmarks are not available for DynamoDB; they are supported only for JDBC data sources and some Amazon S3 sources.
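For comparison, bookmarks do work on an S3 source when a transformation_ctx is set. A minimal sketch, with a hypothetical bucket and path:

node_s3_source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-example-bucket/input/"]},
    format="json",
    transformation_ctx="node_s3_source",  # bookmark state is keyed on this name
)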

EXPERT
answered 2 months ago

I think you should change node_ddb_sit_planet_payment_merchant.toDF() to node_ddb_table1.toDF() to fix this issue. Also, double-check that the job has the necessary permissions to write to the specified S3 path.
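For clarity, the corrected line is:

# Reference the DynamicFrame actually created above, not the undefined name
df = node_ddb_table1.toDF()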

AWS
answered 2 months ago

As stated, since Glue doesn't support bookmarks for DynamoDB, you can create your own bookmark. All you need is an attribute, like your id field or a datetime field (an epoch timestamp field is probably best), and an index on the DynamoDB table that lets you query for values greater than the value from the last run. As part of your Glue job, you would store the last processed value in S3 (in JSON format, for example) and read that value at the start of your Glue script; see the sketch below.
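A minimal sketch of that read-filter-update cycle, reusing the df and raw_bucket variables from the question's script; the updated_at epoch attribute and the bookmark file location are assumptions for illustration:

import json
import boto3
from botocore.exceptions import ClientError
from pyspark.sql import functions as F

s3_client = boto3.client("s3")
bookmark_key = "dynamodb/bookmarks/node_ddb_table1.json"  # hypothetical location

# Read the last processed value at the start of the script; default to 0 on the first run.
try:
    obj = s3_client.get_object(Bucket=raw_bucket, Key=bookmark_key)
    last_processed = json.loads(obj["Body"].read())["last_updated_at"]
except ClientError:
    last_processed = 0

# Keep only items newer than the last run ("updated_at" is an assumed epoch attribute).
df_new = df.filter(F.col("updated_at") > last_processed)

# ... write df_new to S3 as in the original script ...

# After a successful write, persist the new high-water mark back to S3.
max_row = df_new.agg(F.max("updated_at")).collect()[0]
if max_row[0] is not None:
    s3_client.put_object(
        Bucket=raw_bucket,
        Key=bookmark_key,
        Body=json.dumps({"last_updated_at": max_row[0]}),
    )

For simplicity this filters after the full table read; for a large table you would instead query the index directly for items with updated_at greater than last_processed, as described above.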

answered 2 months ago
