How does data retrieval work with a DynamoDB Glue Catalog table?

Hi,

I've configured a Glue Catalog table and crawler that indexes a DynamoDB table, and want to use this Glue catalog table to query the DDB table in a Glue Job (using the 'glueContext.create_dynamic_frame_from_catalog()' function).

My question is, what operation against the DDB table is performed in this situation to populate the resulting DynamicFrame which I can then access/query in my Glue job? Is a 'SCAN' operation performed against the DDB table to retrieve all of the data to populate the dynamic frame, or are 'QUERY' operations performed lazily against the DDB table once I've defined any filter criteria for my DynamicFrame?

As a follow-up question, if I need to use a DDB table within a Glue Job to perform lookups for specific primary keys, what is the recommended/most cost-efficient way to do this? Should I use a Glue catalog table as I'm doing now, or should I query the DDB table directly using Boto3?

For information, my Glue Job is using PySpark.

cgddrd
asked 3 months ago · 223 views
1 Answer
Accepted Answer

When you use a Glue Crawler to crawl your table, Glue performs a Scan and reads the first 1 MB of the table (known as data sampling) to infer the schema. If your table's schema varies widely across items, you should disable data sampling and let the crawler Scan the entire table.
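
As a hedged sketch of disabling sampling, the Glue CreateCrawler API exposes `scanAll` and `scanRate` on a DynamoDB target; the crawler name, role, database, and table path below are placeholders, not values from the original post:

```python
def crawler_ddb_target(table_path: str, scan_all: bool = True, scan_rate: float = 0.5) -> dict:
    # scanAll=True makes the crawler read the whole table instead of sampling;
    # scanRate caps the read capacity units consumed per second.
    return {"Path": table_path, "scanAll": scan_all, "scanRate": scan_rate}

# Usage inside a script with AWS credentials (names are placeholders):
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(
#     Name="ddb-crawler",
#     Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
#     DatabaseName="my_database",
#     Targets={"DynamoDBTargets": [crawler_ddb_target("my_ddb_table")]},
# )
```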

When you read the table using glueContext.create_dynamic_frame_from_catalog(), a full table Scan is performed in parallel (with the degree of parallelism set by dynamodb.splits) and the results are read into the DynamicFrame. This consumes read capacity from your table; if the table is used by other applications, you can limit the rate at which Glue Scans by setting dynamodb.throughput.read.percent (valid range 0.1-1.5) so the job doesn't consume all the available capacity. Any filtering criteria are applied only after the data has been read in whole.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb
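
As a minimal sketch of that catalog read with a capped scan rate (the database/table names and option values here are placeholders, not from the original post):

```python
def ddb_read_options(splits: int, read_percent: float) -> dict:
    """Build the additional_options dict for a Glue DynamoDB catalog read."""
    # Per the Glue docs, dynamodb.throughput.read.percent accepts 0.1 - 1.5.
    if not 0.1 <= read_percent <= 1.5:
        raise ValueError("dynamodb.throughput.read.percent must be in [0.1, 1.5]")
    return {
        "dynamodb.splits": str(splits),                         # parallel scan segments
        "dynamodb.throughput.read.percent": str(read_percent),  # share of table read capacity
    }

# Inside a Glue job (requires the awsglue runtime; names are placeholders):
# dyf = glueContext.create_dynamic_frame_from_catalog(
#     database="my_database",
#     table_name="my_ddb_table",
#     additional_options=ddb_read_options(splits=8, read_percent=0.5),
# )
```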

You also now have the option to read the data without consuming read capacity: Glue first calls the DynamoDB ExportTableToPointInTime API and exports the table to S3. Point-in-time recovery (PITR) must be enabled on your DynamoDB table. Any subsequent reads from the catalog will be stale until you run another export, so this is most useful for one-time jobs, or jobs that run infrequently.

# Export-based read: Glue triggers ExportTableToPointInTime and reads the
# exported data from S3 instead of scanning the table directly.
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",
        "dynamodb.tableArn": "<test_source>",
        "dynamodb.s3.bucket": "<bucket name>",
        "dynamodb.s3.prefix": "<bucket prefix>",
        "dynamodb.s3.bucketOwner": "<account_id>",
    }
)

> if I need to use a DDB table within a Glue Job to perform lookups for specific primary keys, what is the recommended/most cost-efficient way to do this? Should I use a Glue catalog table as I'm doing now, or should I query the DDB table directly using Boto3?

Unfortunately, the Glue DynamoDB connector does not provide the ability to query on specific keys; you would need to use boto3 directly, which runs on the driver and is not distributed across Spark executors.
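
For a modest number of keys, a hedged sketch of the boto3 route (the table name and key attribute below are hypothetical) could batch the lookups with BatchGetItem, which accepts at most 100 keys per request:

```python
def chunk(items, size=100):
    # BatchGetItem accepts at most 100 keys per request, so split the key list.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def lookup_items(table_name, keys):
    """Fetch specific items by primary key.

    'keys' is a list of DynamoDB-typed key dicts, e.g.
    [{"pk": {"S": "user#1"}}, ...]. UnprocessedKeys are not retried
    here; production code should retry them with backoff.
    """
    import boto3  # imported lazily so the chunking helper runs without AWS

    client = boto3.client("dynamodb")
    results = []
    for batch in chunk(keys):
        resp = client.batch_get_item(RequestItems={table_name: {"Keys": batch}})
        results.extend(resp["Responses"].get(table_name, []))
    return results
```

Note this runs on the Spark driver; if the set of keys is very large, joining it against the scanned DynamicFrame may end up cheaper than many individual calls.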

answered 3 months ago
  • This is very useful - thanks.
