How does data retrieval work with a DynamoDB Glue Catalog table?


Hi,

I've configured a Glue Catalog table and crawler that indexes a DynamoDB table, and want to use this Glue catalog table to query the DDB table in a Glue Job (using the 'glueContext.create_dynamic_frame_from_catalog()' function).

My question is, what operation against the DDB table is performed in this situation to populate the resulting DynamicFrame which I can then access/query in my Glue job? Is a 'SCAN' operation performed against the DDB table to retrieve all of the data to populate the dynamic frame, or are 'QUERY' operations performed lazily against the DDB table once I've defined any filter criteria for my DynamicFrame?

As a follow-up question, if I need to use a DDB table within a Glue Job to perform lookups for specific primary keys, what is the recommended/most cost-efficient way to do this? Should I use a Glue catalog table as I'm doing now, or should I query the DDB table directly using Boto3?

For information, my Glue Job is using PySpark.

cgddrd
Asked 2 years ago · 2988 views
1 Answer
Accepted Answer

When you use a Glue crawler to crawl your table, Glue performs a Scan and reads the first 1 MB of the table (known as data sampling) to infer the schema. If your table contains items whose schemas differ wildly, you should disable data sampling and let the crawler Scan the entire table.

When reading the table using glueContext.create_dynamic_frame_from_catalog(), a full table Scan is performed, parallelized according to dynamodb.splits, and the results are read into the DynamicFrame. This does consume read capacity from your table; if the table is used by other applications, you can limit the rate at which Glue scans by setting dynamodb.throughput.read.percent (a value between 0.1 and 1.5) so it doesn't consume all the available capacity. Any filtering criteria are applied only after the data has been read in whole.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb
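For illustration, here is a minimal sketch of how those options could be passed when reading from the catalog. The database and table names are placeholders, and it assumes the connection options can be supplied via additional_options:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the DynamoDB-backed catalog table; names below are placeholders
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",        # Glue Data Catalog database (placeholder)
    table_name="my_ddb_table",     # catalog table created by the crawler (placeholder)
    additional_options={
        "dynamodb.splits": "8",                     # number of parallel Scan segments
        "dynamodb.throughput.read.percent": "0.5",  # cap capacity consumption
    },
)

# Filters run only after the full Scan has populated the DynamicFrame
filtered = dyf.filter(lambda row: row["status"] == "ACTIVE")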

You also now have the option to read the data without consuming capacity: Glue first calls the DynamoDB ExportTableToPointInTime API to export the table to S3, then reads the export. PITR must be enabled on your DynamoDB table. Any subsequent reads from the catalog will be stale until you perform another export, so this is useful for one-time jobs or jobs that run infrequently.

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",                   # use the export (PITR) connector instead of Scan
        "dynamodb.tableArn": "<test_source>",       # ARN of the source DynamoDB table
        "dynamodb.s3.bucket": "<bucket name>",      # bucket that receives the export
        "dynamodb.s3.prefix": "<bucket prefix>",    # prefix for the exported data
        "dynamodb.s3.bucketOwner": "<account_id>",  # account that owns the bucket
    }
)

if I need to use a DDB table within a Glue Job to perform lookups for specific primary keys, what is the recommended/most cost-efficient way to do this? Should I use a Glue catalog table as I'm doing now, or should I query the DDB table directly using Boto3?

Unfortunately the Glue DynamoDB connector does not provide the ability to query on specific keys; you would be forced to use boto3, which is not distributed.
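If you only need a handful of key lookups, a sketch of calling DynamoDB directly with boto3 from within the Glue job might look like the following. The table name and key attribute names ("orders", "pk") are placeholders, and note that these calls run on the driver (or inside a UDF) rather than being distributed by Glue:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")  # placeholder table name

# Single-item lookup by primary key; consumes capacity for one item only,
# unlike a full table Scan
item = table.get_item(Key={"pk": "customer#123"}).get("Item")

# Batch lookup for up to 100 keys per request
response = dynamodb.batch_get_item(
    RequestItems={
        "orders": {"Keys": [{"pk": "customer#123"}, {"pk": "customer#456"}]}
    }
)
items = response["Responses"]["orders"]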

AWS
EXPERT
Answered 2 years ago
  • This is very useful - thanks.
