When you use a Glue Crawler to crawl your table, Glue scans the table and reads only the first 1 MB (known as data sampling) to infer the schema. If your table's schema varies widely across items, you should disable data sampling and let the crawler scan the entire table.
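As a rough sketch (the crawler name, role, database, and table below are placeholders), disabling sampling corresponds to setting scanAll on the crawler's DynamoDB target, for example when creating the crawler with boto3:

import boto3

glue = boto3.client("glue")

# Placeholder names throughout; adjust to your environment.
glue.create_crawler(
    Name="ddb-full-scan-crawler",
    Role="GlueCrawlerRole",
    DatabaseName="my_catalog_db",
    Targets={
        "DynamoDBTargets": [
            {
                "Path": "my-ddb-table",  # DynamoDB table name
                "scanAll": True,         # disable data sampling and scan the entire table
                "scanRate": 0.5,         # optional: limit the read capacity the crawler uses
            }
        ]
    },
)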
When you read the table with glueContext.create_dynamic_frame_from_catalog(), Glue performs a full table scan in parallel, with the degree of parallelism defined by dynamodb.splits, and loads the results into a DynamicFrame. This consumes read capacity from your table; if the table is also used by other applications, you can limit the rate at which Glue scans it with dynamodb.throughput.read.percent (valid values 0.1 to 1.5) so it doesn't consume all the available capacity. Any filtering criteria can only be applied after the entire table has been read.
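For illustration (the catalog database and table names are placeholders), both settings are passed as additional options on the read, and any filtering is applied to the resulting DynamicFrame afterwards:

# Assumes a standard Glue job setup; database and table names are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

dyf = glue_context.create_dynamic_frame_from_catalog(
    database="my_catalog_db",
    table_name="my_ddb_table",
    additional_options={
        "dynamodb.splits": "8",                     # number of parallel scan segments
        "dynamodb.throughput.read.percent": "0.5",  # use roughly half of the table's read capacity
    },
)

# Filtering only happens after the full scan has been read in.
active_only = dyf.filter(lambda rec: rec["status"] == "ACTIVE")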
You also now have the option to read the data without consuming table capacity: Glue first calls the DynamoDB ExportTableToPointInTime API and exports the table to S3, then reads from the export. Point-in-time recovery (PITR) must be enabled on your DynamoDB table. Any subsequent reads from the catalog will be stale until you run another export, so this is most useful for one-time or infrequent jobs. For example:
dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",                   # use the export connector instead of scanning
        "dynamodb.tableArn": "<test_source>",       # ARN of the source DynamoDB table
        "dynamodb.s3.bucket": "<bucket name>",      # bucket that receives the export
        "dynamodb.s3.prefix": "<bucket prefix>",    # prefix within that bucket
        "dynamodb.s3.bucketOwner": "<account_id>",  # bucket owner, needed for cross-account access
    },
)
If I need to use a DDB table within a Glue job to perform lookups for specific primary keys, what is the recommended/most cost-efficient way to do this? Should I use a Glue catalog table as I'm doing now, or should I query the DDB table directly using Boto3?
Unfortunately, the Glue DynamoDB connector does not provide the ability to query on specific keys, so you would need to use boto3, which is not distributed.
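A minimal sketch of the boto3 approach, assuming a table whose partition key is named pk (the table name and key values are placeholders); note that this runs on the driver and is not distributed:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-ddb-table")  # placeholder table name

# Single-key lookup: consumes capacity only for the item that is read.
item = table.get_item(Key={"pk": "customer#123"}).get("Item")

# For a handful of keys, batch_get_item reads up to 100 items per call.
response = dynamodb.batch_get_item(
    RequestItems={
        "my-ddb-table": {
            "Keys": [{"pk": "customer#123"}, {"pk": "customer#456"}]
        }
    }
)
items = response["Responses"]["my-ddb-table"]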
This is very useful - thanks.