Glue - Predicate pushdown with DynamoDB

Hello,

    dynamo_df = glueContext.create_dynamic_frame.from_options(
        connection_type="dynamodb",
        connection_options={
            "dynamodb.input.tableName": lkup_table,
            "dynamodb.throughput.read.percent": "1.0",
            "dynamodb.splits": "100",
        },
    )

It seems Glue is loading the entire DynamoDB table (lkup_table). If I add a filter, such as dynamo_df.filter(col('case') == '1234'), Spark (1) first loads the entire table into the DataFrame and (2) then filters out the non-matching records, which is inefficient. Is there any way to add predicate pushdown so that the complete table is not loaded into the DataFrame (dynamo_df)? Please suggest.

asked a year ago, 364 views
2 answers

Unfortunately, DynamoDB does not support predicate pushdown syntax. It's a NoSQL database, so to apply the filter the entire table would need to be read regardless.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html

If this is a one-time read, you can consider the export to S3 capability. If you intend to read continuously, you may prefer to read the table directly to get more up-to-date data.
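
A minimal sketch of the export-based read mentioned above, assuming the Glue DynamoDB export connector and its `dynamodb.export` options. The table ARN, bucket, and prefix are placeholders, the helper names `build_export_options` and `read_case` are hypothetical, and the export path assumes point-in-time recovery is enabled on the table. The filter still runs in Spark after the load; the difference is that the data comes from an S3 export rather than from a scan that consumes the table's read capacity.

```python
def build_export_options(table_arn, bucket, prefix):
    """Connection options for reading a DynamoDB table through its S3 export."""
    return {
        "dynamodb.export": "ddb",        # read via DynamoDB export to S3, not a table scan
        "dynamodb.tableArn": table_arn,  # placeholder ARN supplied by the caller
        "dynamodb.s3.bucket": bucket,    # placeholder bucket where the export is staged
        "dynamodb.s3.prefix": prefix,    # placeholder key prefix inside that bucket
        "dynamodb.unnestDDBJson": True,  # flatten the DynamoDB JSON structure
    }


def read_case(glue_context, options, case_id):
    """Load the exported table, then filter in Spark (nothing is pushed into DynamoDB)."""
    dyf = glue_context.create_dynamic_frame.from_options(
        connection_type="dynamodb",
        connection_options=options,
    )
    df = dyf.toDF()  # convert the DynamicFrame to a Spark DataFrame
    return df.filter(df["case"] == case_id)
```

Inside a Glue job this might be called as `read_case(glueContext, build_export_options(table_arn, bucket, prefix), "1234")`, with the three arguments set to your own values.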

AWS EXPERT, answered a year ago
Accepted answer

Predicate pushdown exists for S3 tables, but unfortunately not for DynamoDB.
What you can do is minimize the performance hit (and cost) by using the new S3 export for DynamoDB.
Check this blog: https://aws.amazon.com/blogs/big-data/accelerate-amazon-dynamodb-data-access-in-aws-glue-jobs-using-the-new-aws-glue-dynamodb-elt-connector/
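
For contrast, this is roughly what the pushdown that does exist for S3-backed tables looks like. It is only a sketch: the helper names are hypothetical, and it assumes the data has already been written to S3 partitioned by some column and registered in the Glue Data Catalog. The predicate prunes partitions before they are read; it does not help with a non-partition column such as `case`.

```python
def partition_predicate(column, value):
    """Build a predicate string on a partition column (hypothetical helper)."""
    return f"{column} == '{value}'"


def read_partition(glue_context, database, table_name, column, value):
    """Read only the matching partitions of an S3-backed catalog table."""
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,      # hypothetical catalog database name
        table_name=table_name,  # hypothetical catalog table name
        push_down_predicate=partition_predicate(column, value),
    )
```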

AWS EXPERT, answered a year ago
AWS EXPERT, reviewed a year ago