Skip to content

Glue crawler chokes on DynamoDB table which has an attribute of type "map" with dynamic keys

0

We have a DynamoDB table which has an attribute named CustomAttributes of data type "map" that uses dynamic keys (i.e. we use it as a hash map). Here's a made-up DynamoDB item to show what it might look like:

{
  "CustomAttributes": {
    "AttrId1": [...],
    "AttrId2": [...]
  }
}

Any given DynamoDB item has a small number of keys but it seems like the Glue crawler tries to compute the column type by taking the union of all the keys it sees across (a sample of) the items it processes.

The issue is that as the number of possible keys grows (which it does, since it's dynamic) eventually the Glue crawler fails with

com.amazonaws.services.glue.model.ValidationException: Value at 'table.storageDescriptor.columns.40.member.type' failed to satisfy constraint: Member must have length less than or equal to 131072

Can this be worked around with the Glue crawler? Do we have to come up with a different design altogether?

P.S. Note that using the Athena DynamoDB connector without a Glue crawler-produced schema doesn't work, as the connector only looks at a small number of items to infer the schema and thus fails to infer the schema, as not all items have all columns (i.e. we'd see a failure like "there's no column CreatedAt" when we rely on inference instead of the Glue crawler).

asked a year ago210 views
1 Answer
0

The issue you're experiencing is related to how AWS Glue crawlers handle DynamoDB tables with dynamic map attributes. When a DynamoDB table has a map attribute with dynamic keys (like your CustomAttributes), the crawler attempts to create a schema that represents all possible keys it encounters. As the number of unique keys grows, the schema becomes increasingly complex, eventually exceeding Glue's constraints.

There are a few approaches you can take to work around this limitation:

  1. Use a custom JSON classifier: If you're planning to export the DynamoDB data to S3 for querying with Athena, you could create a custom JSON classifier that handles the nested structure more efficiently.

  2. Use the Athena DynamoDB connector: To query DynamoDB directly from Athena, you'll need to use the Athena DynamoDB connector. This connector uses Lambda functions to query data in DynamoDB and may handle complex map structures differently than the Glue crawler.

  3. Modify your data model: Consider restructuring your data to use a more consistent schema. Instead of dynamic keys in a map, you could use an array of key-value pairs with a consistent structure.

  4. Use relationalize or unnest transformations: If you're processing this data in a Glue ETL job, you can use the relationalize function to convert nested structures into separate tables, or the unnest option to convert nested fields into top-level objects.

  5. Export to S3 in a flattened format: You could create a Glue ETL job that exports your DynamoDB data to S3 in a more Athena-friendly format, flattening the dynamic attributes into a more queryable structure.

Each of these approaches has trade-offs in terms of query performance, data freshness, and implementation complexity. The best solution depends on your specific use case and requirements.
Sources
Process JSON files in AWS Glue | AWS re:Post
Getting error while querying DynamoDB in Athena using Glue Crawler | AWS re:Post
HIVE_UNKNOWN_ERROR: java.net.URISyntaxException: Relative path in absolute URI: arn:aws:dynamodb:us-east-1:<account-id>:table/mytable | AWS re:Post
AWS Glue type systems - AWS Glue

answered a year ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.