
Creating AWS Glue tables as metadata mapping layer for DynamoDB items without using Athena connector


Problem Description: I need to create a metadata mapping layer in AWS Glue for DynamoDB items without crawling the entire table. The goal is to enable querying DynamoDB through Athena while not knowing item attributes in advance.

Requirements:

  • Two Glue tables: one for the item_type metadata (the DynamoDB item that defines a type), and one for the actual items of that type (item_type_item)
  • Must map DynamoDB items' structure without exporting/crawling entire table
  • Should enable Athena queries using this metadata

What I've Tried:

  1. Creating Glue tables with boto3:

client.create_table(
    DatabaseName=database,
    TableInput={
        "Name": table_name,
        "StorageDescriptor": {
            "Location": "my_ddb_table",
            "Columns": columns,
        },
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "classification": "dynamodb",
            "typeOfData": "table",
            "rangeKey": "sk",
            "hashKey": "pk",
        },
    },
)

Athena queries return: Error: HIVE_UNSUPPORTED_FORMAT: Unable to create input format

I even tried passing

"StorageDescriptor": {
    "Columns": {},
    "Location": f"arn:aws:dynamodb:{region}:{account_id}:table/{dynamodb_table}"
},

Tables are associated with Glue Database but not available for queries in Athena.

  2. Tried using Glue crawler:
  • Crawls entire DynamoDB table
  • Problem: Returns all items without distinguishing between item_type and actual items
  • Cannot pre-define attribute structure

Question: Is it possible to create Glue tables that map DynamoDB item structures without:

  • Using the Athena DynamoDB connector
  • Crawling the entire table
  • Exporting data to S3

The Athena connector would be used later for actual queries, but first I need this metadata mapping layer. How can I achieve this?

Technical Context:

  • Using AWS Lambda with boto3
  • DynamoDB table uses single-table design
  • Need to maintain mapping between item_type definition and actual items of item_type.
2 Answers

Creating AWS Glue tables as a metadata mapping layer for DynamoDB items without using the Athena connector or crawling the entire table is challenging, but there are a few approaches you can consider:

  1. Manual Table Creation: Instead of using a crawler or exporting data, you can manually create Glue tables that represent the structure of your DynamoDB items. This approach requires you to know the structure of your items in advance, but it doesn't involve scanning the entire table.

    You can create two tables:

    • One for item_type metadata
    • Another for item_type_item data

    When creating these tables, you'll need to define the schema based on your knowledge of the DynamoDB table structure. This method allows you to create a metadata mapping layer without crawling or exporting data.

  2. Sample-based Schema Inference: If you don't know the exact structure of all items, you could implement a custom solution that samples a small subset of items from your DynamoDB table. Use this sample to infer the schema and create Glue tables accordingly. This approach doesn't require crawling the entire table but gives you an approximation of the item structure.

  3. DynamoDB Streams: If your DynamoDB table has streams enabled, you could create a Lambda function that processes the stream events. This function could analyze the structure of new or modified items and update the Glue table definitions accordingly. This method allows you to build and maintain your metadata mapping layer incrementally without scanning the entire table.

  4. Custom Metadata Management: Implement a custom solution where you maintain metadata about your DynamoDB items separately. This could be in another DynamoDB table or a different data store. When you add or modify items in your main table, update this metadata store. Then use this metadata to create and update your Glue tables.
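The sample-based inference in approach 2 could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names `sample_items` and `infer_columns`, the type mapping, and the fallback to `string` for maps, lists, and conflicting types are all assumptions. The partition key name `pk` is taken from the question, and `client` is expected to be a boto3 DynamoDB client passed in by the caller.

```python
from typing import Any, Dict, List

# Simplified, assumed mapping from DynamoDB type descriptors to Glue column types.
DDB_TO_GLUE = {
    "S": "string",
    "N": "double",
    "BOOL": "boolean",
    "B": "binary",
    "SS": "array<string>",
    "NS": "array<double>",
}


def sample_items(client: Any, table: str, pk_value: str, limit: int = 25) -> List[Dict]:
    """Read at most `limit` items from one partition instead of scanning the table."""
    response = client.query(
        TableName=table,
        KeyConditionExpression="pk = :pk",
        ExpressionAttributeValues={":pk": {"S": pk_value}},
        Limit=limit,
    )
    return response.get("Items", [])


def infer_columns(items: List[Dict]) -> List[Dict[str, str]]:
    """Union the attributes seen in the sample into Glue column definitions."""
    seen: Dict[str, str] = {}
    for item in items:
        for name, typed_value in item.items():
            ddb_type = next(iter(typed_value))  # e.g. "S" in {"S": "abc"}
            # Types not in the mapping (M, L, NULL, BS) fall back to string here.
            glue_type = DDB_TO_GLUE.get(ddb_type, "string")
            # If an attribute appears with conflicting types, widen it to string.
            if name in seen and seen[name] != glue_type:
                glue_type = "string"
            seen[name] = glue_type
    return [{"Name": name, "Type": glue_type} for name, glue_type in sorted(seen.items())]
```

The resulting list has the shape Glue expects for `StorageDescriptor["Columns"]`. Because it is built from a sample, attributes that only occur on unsampled items will be missing, so treat the result as an approximation.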

For any of these approaches, the Glue tables need the parameters that mark them as DynamoDB-backed: set "classification" to "dynamodb" and provide the "hashKey" and "rangeKey" values. Note that these parameters only describe the source table. On their own they do not make the data readable by Athena, which is consistent with the HIVE_UNSUPPORTED_FORMAT error in the question.
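As a rough sketch of how those parameters fit together, the helper below assembles a `TableInput` dictionary using the same keys as the question's `create_table` call. The function name `build_table_input` is hypothetical, and the code only builds the dictionary; it does not call AWS.

```python
from typing import Dict, List, Optional


def build_table_input(
    name: str,
    columns: List[Dict[str, str]],
    location: str,
    hash_key: str,
    range_key: Optional[str] = None,
) -> Dict:
    """Assemble a Glue TableInput dict that carries DynamoDB key metadata."""
    parameters = {
        "classification": "dynamodb",
        "typeOfData": "table",
        "hashKey": hash_key,
    }
    # Tables without a sort key simply omit rangeKey.
    if range_key:
        parameters["rangeKey"] = range_key
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {"Columns": columns, "Location": location},
        "Parameters": parameters,
    }
```

The result could then be passed as `TableInput` to the boto3 Glue client's `create_table`, alongside `DatabaseName`. Keeping the builder separate from the API call makes it easy to generate one definition per item type.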

Remember that while these approaches can help you create a metadata mapping layer, they may not provide real-time accuracy if your DynamoDB table structure changes frequently. You'll need to implement additional processes to keep your Glue tables up-to-date with any structural changes in your DynamoDB items.

Lastly, note that querying DynamoDB through Athena without the Athena DynamoDB connector is unlikely to work: Athena natively queries data stored in Amazon S3, and the connector exists specifically to handle DynamoDB's data model. If you're creating this metadata layer to enable Athena queries, you will most likely still need the connector for the actual query execution.
Sources
Creating tables - AWS Glue
DynamoDB connections - AWS Glue
DynamoDBTarget - AWS Glue

answered 7 days ago

Creating a metadata mapping layer for DynamoDB items in AWS Glue without scanning the entire table or exporting data can be challenging, especially with a single-table design. Below are some approaches to tackle this efficiently:

  1. Selective Queries for Metadata: For scenarios where the item attributes aren't known beforehand, you can infer the schema from a sample of items:
  • Query Specific Partitions: Use the Query API to retrieve items for known partition key values (e.g., item_type). A Scan with a filter also works, but the filter is applied after the items are read, so it still consumes read capacity for everything scanned.
  • Limit Results: Use the Limit parameter to control the number of items retrieved, minimizing resource usage.
  • Schema Inference: Extract the attributes from these sampled items to create a flexible metadata structure that Glue can use.
  2. Manual Table Creation: If you're familiar with your DynamoDB table structure, manually creating Glue tables may be the best option. You can define two tables:

a) Metadata Table: Maps item_type to its attributes. Example Python code for creating a metadata table with boto3:

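import boto3

glue_client = boto3.client("glue")

# Placeholder values; replace with your own.
region = "us-east-1"
account_id = "123456789012"
dynamodb_table = "my_ddb_table"
attributes = ["name", "description"]  # attribute names known for this item_type

glue_client.create_table(
    DatabaseName="your_database",
    TableInput={
        "Name": "item_type_metadata",
        "StorageDescriptor": {
            # item_type plus one string column per known attribute
            "Columns": [{"Name": "item_type", "Type": "string"}] +
                       [{"Name": attr, "Type": "string"} for attr in attributes],
            "Location": f"arn:aws:dynamodb:{region}:{account_id}:table/{dynamodb_table}"
        },
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "dynamodb", "typeOfData": "table"}
    }
)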

b) Item Table: Represents the actual items and references the metadata table. This approach is straightforward when the schema is static or evolves predictably.

  3. Enable Athena Queries: If you encounter errors like HIVE_UNSUPPORTED_FORMAT when querying DynamoDB through Athena, ensure the Glue table is configured correctly:
  • Validate that the Glue table's schema matches the DynamoDB data structure.
  • Keep in mind that this error usually means Athena has no input format it can use for the table, which is expected when the table points at DynamoDB rather than S3 and the query does not go through the connector.
  4. DynamoDB Streams for Real-Time Updates: For tables whose structure changes frequently, DynamoDB Streams is an effective solution:
  • Create a Lambda function to process stream events and analyze new or modified items.
  • Use this Lambda to incrementally update the metadata in your Glue table. This ensures the metadata layer stays up-to-date without scanning the table repeatedly.
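The Streams approach could look something like the sketch below, which inspects a stream batch and reports attributes that the Glue table does not yet have as columns. The function name `new_columns_from_stream` and the choice to type every new column as `string` are assumptions for illustration. The code also assumes the stream view type includes new images (NEW_IMAGE or NEW_AND_OLD_IMAGES), since otherwise the records carry no `NewImage`.

```python
from typing import Dict, List


def new_columns_from_stream(event: Dict, existing: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return columns for attributes in the stream batch that the table lacks."""
    known = {col["Name"] for col in existing}
    added: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        # REMOVE events carry no NewImage, so only inserts and updates matter.
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        new_image = record.get("dynamodb", {}).get("NewImage", {})
        for name in new_image:
            if name not in known:
                known.add(name)
                # Assumed simplification: every new attribute becomes a string column.
                added.append({"Name": name, "Type": "string"})
    return added
```

A Lambda handler would call this with the incoming event and the table's current columns, then write the combined column list back with the Glue `update_table` API only when the returned list is non-empty, which avoids a Glue call on every batch.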

Considerations:

  • These methods provide approximate schema representations and may need periodic updates if the data structure changes frequently.
  • While these approaches help create a metadata mapping layer, Athena natively queries data in Amazon S3, so running the actual queries against DynamoDB calls for the Athena DynamoDB connector.
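Once the connector is registered as a data source, a query against it might be started as in the sketch below. The function name `start_connector_query` is hypothetical, and the catalog name, database, and S3 output location are values you would supply. `athena` is expected to be a boto3 Athena client passed in by the caller.

```python
from typing import Any


def start_connector_query(athena: Any, catalog: str, database: str, sql: str, output_s3: str) -> str:
    """Start an Athena query against a named catalog and return its execution ID."""
    response = athena.start_query_execution(
        QueryString=sql,
        # The catalog is the name the connector was registered under in Athena.
        QueryExecutionContext={"Catalog": catalog, "Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return response["QueryExecutionId"]
```

The returned ID can be polled with `get_query_execution` until the query finishes, after which `get_query_results` returns the rows.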

By combining these strategies, you can create and maintain a metadata mapping layer for your DynamoDB data and then query that data through Athena with the connector. For more specific help, you can also reach out to AWS Support for tailored guidance.

AWS
answered 4 days ago
