Glue - Predicate pushdown with DynamoDB


Hello,

    dynamo_df = glueContext.create_dynamic_frame.from_options(
        connection_type="dynamodb",
        connection_options={
            "dynamodb.input.tableName": lkup_table,
            "dynamodb.throughput.read.percent": "1.0",
            "dynamodb.splits": "100",
        },
    )

It seems Glue loads the entire DynamoDB table (lkup_table). If I add a filter, such as dynamo_df.filter(col('case') == '1234'), Spark (1) first loads the entire table into the DataFrame and (2) then filters out the records, which is inefficient. Is there any way to add predicate pushdown that avoids loading the complete table into the DataFrame (dynamo_df)? Please suggest.

Asked 1 year ago, 364 views
2 Answers

Unfortunately, DynamoDB does not support predicate pushdown syntax, as it's a NoSQL database, and to apply the filter the entire table would need to be read regardless.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html

If this is a one-time read, you can consider the export to S3 capability, but if you intend to read continuously, you may just want to read the table directly to get more up-to-date data.

AWS Expert, answered 1 year ago
Accepted Answer

Predicate pushdown is available for S3 tables but unfortunately not for DynamoDB.
What you can do is minimize the performance hit (and cost) by using the new S3 export for DynamoDB.
See this blog: https://aws.amazon.com/blogs/big-data/accelerate-amazon-dynamodb-data-access-in-aws-glue-jobs-using-the-new-aws-glue-dynamodb-elt-connector/
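For illustration, here is a rough sketch of what the export-based read might look like. It assumes the connection option keys documented for the Glue DynamoDB export connector ("dynamodb.export", "dynamodb.tableArn", "dynamodb.s3.bucket", "dynamodb.s3.prefix", "dynamodb.s3.bucketOwner", "dynamodb.unnestDDBJson"). The helper names build_export_options and is_target_case are hypothetical, and the ARN, bucket, and prefix are placeholders. The export path also requires point-in-time recovery to be enabled on the table. The filter still runs after the data is loaded, but the load comes from the S3 export rather than a scan that consumes the table's read capacity.

```python
def build_export_options(table_arn, bucket, prefix, bucket_owner=None):
    """Build connection options for reading a DynamoDB table via S3 export.

    Hypothetical helper; the keys follow the Glue DynamoDB export
    connector options, the values are supplied by the caller.
    """
    options = {
        "dynamodb.export": "ddb",           # use the export connector instead of a table scan
        "dynamodb.tableArn": table_arn,     # export takes the table ARN, not the table name
        "dynamodb.s3.bucket": bucket,       # bucket that receives the export
        "dynamodb.s3.prefix": prefix,       # key prefix for the exported files
        "dynamodb.unnestDDBJson": True,     # flatten the DynamoDB JSON structure
    }
    if bucket_owner is not None:
        # Only needed when the bucket belongs to another account.
        options["dynamodb.s3.bucketOwner"] = bucket_owner
    return options


def is_target_case(record, case_id="1234"):
    """Predicate for DynamicFrame.filter; 'case' is the attribute from the question."""
    return record["case"] == case_id


# Inside a Glue job (glueContext is provided by the job, so this part
# is left commented out here):
#
# dynamo_df = glueContext.create_dynamic_frame.from_options(
#     connection_type="dynamodb",
#     connection_options=build_export_options(
#         table_arn="arn:aws:dynamodb:<region>:<account-id>:table/<table-name>",
#         bucket="<bucket-name>",
#         prefix="<prefix>",
#     ),
# )
# filtered = dynamo_df.filter(f=is_target_case)
```

Note that DynamicFrame.filter takes a function applied to each record, so a Spark column expression such as col('case') == '1234' would only work after converting with toDF().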

AWS Expert, answered 1 year ago
AWS Expert, reviewed 1 year ago
