Is it possible to add a job bookmark to a Glue script connecting DynamoDB to S3?


Hello,

I am writing a Glue script to transfer a table from DynamoDB to an S3 bucket. I put the necessary configuration into the code, enabled the bookmark in Job Details, and ran the script three times, but found triple the number of items in S3, so the bookmark failed. Did I set something up wrong here? Thanks in advance.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.sql import SQLContext
from awsglue.dynamicframe import DynamicFrame

args = getResolvedOptions(
    sys.argv, 
    [
        "JOB_NAME",
        "raw_bucket", 
        "dataset_folder"
    ])
    
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

raw_bucket = args["raw_bucket"]
dataset_folder = args["dataset_folder"]

node_ddb_table1 = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    transformation_ctx="node_ddb_table1 ",
    connection_options={"dynamodb.input.tableName": "ddb-table-1",
        "dynamodb.throughput.read.percent": "0.2",
        "dynamodb.splits": "2"
    },
    additional_options={"jobBookmarkKeys":["id"], "jobBookmarkKeysSortOrder":"asc", "mergeSchema": "true"}
)
df = node_ddb_sit_planet_payment_merchant.toDF()

dyf = DynamicFrame.fromDF(df, glueContext, "dyf")
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": f"s3://{raw_bucket}/dynamodb/node_ddb_table1_bookmarked/"},
    format="parquet",
    format_options={
        "separator": ","
    },
    transformation_ctx="datasink1"
)
job.commit()
asked 2 months ago · 165 views
3 Answers
Accepted Answer

Hi,

According to the AWS documentation, Glue job bookmarks are not available for DynamoDB; they are implemented only for JDBC data sources and some Amazon S3 sources.
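For comparison, here is a minimal sketch of a source where bookmarks do apply, assuming a hypothetical S3 path. With bookmarks enabled on the job, Glue keys the bookmark state to the transformation_ctx name:

node_s3_source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="json",
    connection_options={"paths": ["s3://my-bucket/incoming/"]},  # hypothetical path
    transformation_ctx="node_s3_source"  # bookmark state is tracked under this name
)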

EXPERT
answered 2 months ago

I think you should change node_ddb_sit_planet_payment_merchant.toDF() to node_ddb_table1.toDF() to fix this issue; node_ddb_sit_planet_payment_merchant is never defined in the script. Also double-check that the job role has the necessary permissions to write to the specified S3 path.
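For reference, the corrected line, assuming the rest of the script stays as posted:

df = node_ddb_table1.toDF()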

AWS
answered 2 months ago

As stated, since Glue doesn't support bookmarks for DynamoDB, you can create your own bookmark. All you need to do is pick an attribute, like your id field or a datetime field (an epoch timestamp is probably best), and have an index on the DynamoDB table so you can query for values greater than the value from the last run. As part of your Glue job, store the last processed value in S3 (in JSON format, for example) and read that value at the start of the script. A sketch of this pattern follows.
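A minimal sketch of this pattern, reusing node_ddb_table1 from the question's script; the bookmark location (s3://my-bucket/bookmarks/ddb-table-1.json) and the epoch attribute name (updated_at) are illustrative assumptions:

import json
import boto3

s3 = boto3.client("s3")
BOOKMARK_BUCKET = "my-bucket"                    # hypothetical bucket
BOOKMARK_KEY = "bookmarks/ddb-table-1.json"      # hypothetical key

def read_bookmark():
    # Return the last processed epoch, or 0 on the very first run.
    try:
        obj = s3.get_object(Bucket=BOOKMARK_BUCKET, Key=BOOKMARK_KEY)
        return json.loads(obj["Body"].read())["last_updated_at"]
    except s3.exceptions.NoSuchKey:
        return 0

def write_bookmark(value):
    # Persist the highest epoch seen in this run for the next run to read.
    s3.put_object(
        Bucket=BOOKMARK_BUCKET,
        Key=BOOKMARK_KEY,
        Body=json.dumps({"last_updated_at": value}),
    )

last_seen = read_bookmark()
# Keep only items newer than the stored bookmark.
df = node_ddb_table1.toDF().filter(f"updated_at > {last_seen}")
if df.count() > 0:
    new_last = df.agg({"updated_at": "max"}).collect()[0][0]
    # ... convert back to a DynamicFrame and write to S3 as in the original script ...
    write_bookmark(int(new_last))

This sketch filters after reading the whole table; with an index keyed on updated_at, querying DynamoDB directly for values greater than the bookmark would avoid rescanning old items, as described above.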

answered 2 months ago
