How does data retrieval work with a DynamoDB Glue Catalog table?


Hi,

I've configured a Glue Catalog table and crawler that indexes a DynamoDB table, and want to use this Glue catalog table to query the DDB table in a Glue Job (using the 'glueContext.create_dynamic_frame_from_catalog()' function).

My question is, what operation against the DDB table is performed in this situation to populate the resulting DynamicFrame which I can then access/query in my Glue job? Is a 'SCAN' operation performed against the DDB table to retrieve all of the data to populate the dynamic frame, or are 'QUERY' operations performed lazily against the DDB table once I've defined any filter criteria for my DynamicFrame?

As a follow-up question, if I need to use a DDB table within a Glue Job to perform lookups for specific primary keys, what is the recommended/most cost-efficient way to do this? Should I use a Glue catalog table as I'm doing now, or should I query the DDB table directly using Boto3?

For information, my Glue Job is using PySpark.

cgddrd
Asked 2 years ago · 3,000 views
1 Answer
Accepted Answer

When you use a Glue Crawler to crawl your table, Glue performs a Scan of the table and reads the first 1 MB (known as data sampling) to infer the schema. If your table's schema varies wildly across items, you should disable data sampling and let the crawler Scan the entire table.

When reading the table using glueContext.create_dynamic_frame_from_catalog(), a full table Scan is performed, parallelized across the number of segments set by dynamodb.splits, and the results are read into the DynamicFrame. This consumes read capacity from your table, so if the table is also serving other applications you can cap the rate at which Glue Scans with dynamodb.throughput.read.percent (valid values are 0.1 to 1.5) so the job doesn't consume all of the available capacity. Any filter criteria are applied only after the data has been read in its entirety.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb
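
For illustration, here is a minimal sketch of tuning those options when reading through the catalog; the database and table names and the "status" attribute used in the filter are placeholders, not values from the original post:

# Minimal sketch: "my_db", "my_ddb_table" and "status" are hypothetical names
dyf = glue_context.create_dynamic_frame_from_catalog(
    database="my_db",
    table_name="my_ddb_table",
    additional_options={
        "dynamodb.splits": "8",                     # number of parallel Scan segments
        "dynamodb.throughput.read.percent": "0.5",  # cap at ~50% of the table's read capacity
    },
)

# Filtering only happens after the full Scan has been loaded into the frame
active_only = dyf.filter(lambda record: record["status"] == "ACTIVE")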

You also now have the option to read the data without consuming read capacity: Glue first calls the DynamoDB ExportTableToPointInTime API to export the table to S3 and then reads the exported files. Point-in-time recovery (PITR) must be enabled on your DynamoDB table. Any subsequent reads from the catalog will be stale until you run another export. This can be useful for one-time jobs, or jobs that run infrequently.

dyf = glue_context.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.export": "ddb",                   # use the export (ExportTableToPointInTime) connector
        "dynamodb.tableArn": "<test_source>",       # ARN of the source DynamoDB table
        "dynamodb.s3.bucket": "<bucket name>",      # S3 bucket that receives the export
        "dynamodb.s3.prefix": "<bucket prefix>",    # prefix within the bucket for the exported files
        "dynamodb.s3.bucketOwner": "<account_id>",  # account ID of the bucket owner
    }
)
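
Keep in mind that while the export connector consumes no read capacity units, the ExportTableToPointInTime call is billed separately per GB exported (plus S3 storage for the exported files), so it tends to pay off for large tables rather than small, frequent reads.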

if I need to use a DDB table within a Glue Job to perform lookups for specific primary keys, what is the recommended/most cost-efficient way to do this? Should I use a Glue catalog table as I'm doing now, or should I query the DDB table directly using Boto3?

Unfortunately, the Glue DynamoDB connector does not provide the ability to query on specific keys; you would be forced to use boto3, which is not distributed.
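
If you only need a handful of items, calling DynamoDB directly from the job with boto3 is usually the cheaper option, since GetItem/BatchGetItem consume capacity only for the items you fetch instead of the whole table. A minimal sketch, assuming a hypothetical table named "my-table" with a single partition key "pk" (adjust the key schema to match your table):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")  # hypothetical table name

# Single-item lookup by primary key
item = table.get_item(Key={"pk": "customer#123"}).get("Item")

# Up to 100 keys per BatchGetItem request
response = dynamodb.batch_get_item(
    RequestItems={
        "my-table": {
            "Keys": [{"pk": "customer#123"}, {"pk": "customer#456"}],
        }
    }
)
items = response["Responses"]["my-table"]

These calls are not distributed across Spark executors the way a DynamicFrame read is, but for point lookups of specific keys they avoid paying for a full table Scan.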

AWS
Expert
Answered 2 years ago
  • This is very useful - thanks.
