Multiple Athena tables from a single S3 source exported from a DynamoDB single-table design


I have a DynamoDB table designed to store several types of data relationships (single-table design). It stores "Projects", and also "tasks", "updates on projects", "updates on tasks", "users", and several other entities. After some time I need to export the data to S3 so I can perform advanced queries on it with Athena.

My question is: all the Athena tutorials talk about creating only one table from the data in an S3 bucket, and they assume that all the data items have the same structure (e.g., the same JSON). In my case I'm exporting multiple items with different structures; a "Project" has different attributes than a "Task". From Athena it would be great to have two or more tables, for example one table for "Projects" and another for "tasks" that includes a project ID, so I can run a join query like "give me all the tasks that belong to a specific project", and another table for events so I can see all the updates on a specific task, just like in a relational (ERD) database.

Is this possible to achieve with Athena? If yes, how can it be done? By creating multiple tables from the same S3 data source and specifying different attributes? And how do I tell Athena to look only at the specific items that describe, for example, a "task" when creating the "task" table?

thanks

mvp
asked 2 months ago
2 Answers

Hope this can help you.

  • You can export each entity type (projects, tasks, events, etc.) separately from DynamoDB to different S3 files/folders, preserving the type.
  • When creating the Athena tables, define the schema and partition structure specific to each type. For example, create one table for projects with its relevant columns and another for tasks.
  • The entity types should contain references to each other, such as a task containing a project ID.
  • You can then run join queries in Athena between the tables to retrieve related data. For example, get all the tasks for a project by joining on project ID (see the sketch after this list).
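
As a rough sketch of the second and fourth bullets, the snippet below defines an external table over a per-entity S3 prefix and then joins projects and tasks on project ID. Everything in it is a placeholder or assumption: the "analytics" database, the bucket names, the column layout of the per-entity JSON exports, and the existence of a similarly created analytics.projects table.

```
import boto3

athena = boto3.client("athena")

# Placeholder DDL: one external table per entity type, each pointing at the
# S3 prefix that entity was exported to. A matching "projects" table is
# assumed to exist already.
create_tasks = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.tasks (
  task_id string,
  project_id string,
  title string,
  status string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-export-bucket/tasks/'
"""

# Example join: all tasks that belong to one project.
tasks_for_project = """
SELECT p.project_id, p.name AS project_name, t.task_id, t.title
FROM analytics.projects p
JOIN analytics.tasks t ON t.project_id = p.project_id
WHERE p.project_id = 'PROJECT#123'
"""

for query in (create_tasks, tasks_for_project):
    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={
            "OutputLocation": "s3://my-export-bucket/athena-query-results/"
        },
    )
```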

To export from DynamoDB to S3, use the "dynamodb export" command, specifying the table, S3 path, and other options.
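
As a minimal sketch, the export can also be started from code via the DynamoDB ExportTableToPointInTime API (it requires point-in-time recovery to be enabled on the table). The table ARN, bucket, and prefix below are placeholders.

```
import boto3

dynamodb = boto3.client("dynamodb")

# Kick off a full-table export to S3. Placeholder ARN and bucket names;
# point-in-time recovery must already be enabled on the table.
response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/app-single-table",
    S3Bucket="my-export-bucket",
    S3Prefix="dynamodb-exports/",
    ExportFormat="DYNAMODB_JSON",
)
print(response["ExportDescription"]["ExportStatus"])
```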

EXPERT
answered 2 months ago
  • Thanks @Giovanni. I checked the DynamoDB API and couldn't find any "export" command. I did find the "export-table-to-point-in-time" command, but it doesn't allow a selective export of DynamoDB items. Another option I'm looking at is DynamoDB Streams: every time an item is created or updated, I could insert the item into an S3 bucket with a specific prefix, depending on whether it is a project, a task, a project event, etc. But that doesn't seem like an efficient solution. I also thought about a Kinesis stream, but I don't need real-time updates to the S3 bucket.


You're looking to flatten your DynamoDB table in S3. Short answer: yes, it is possible. However, with Athena you would have to manually create tables for each entity using queries, and then run more queries to propagate any changes made in DynamoDB to your S3 dataset. Instead, I recommend Glue ETL jobs to avoid having to run queries each time. Let's break down how you can achieve this.

A single table for all of your data entities is a common design pattern in DynamoDB. DynamoDB has an Export to S3 feature, which requires you to enable point-in-time recovery backups on your DynamoDB table. Just exporting to build your S3 tables will limit you to only one table in your Glue Data Catalog per DynamoDB table. Hence, you would need additional processing, either Athena queries or Glue ETL jobs on top of the exported DynamoDB table, to achieve what you want.
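
To give a sense of the Athena-query route, a CTAS statement could carve a per-entity table out of the exported data. Everything below is a placeholder: the "analytics" database, the "ddb_export" table produced by cataloging the export, the "entity_type" attribute, the column names, and the S3 locations. A real DynamoDB export stores items as DynamoDB JSON, so the SELECT list may need to drill into the nested item struct rather than flat columns.

```
import boto3

athena = boto3.client("athena")

# CTAS sketch: build a "tasks" table from the single exported table by
# filtering on a hypothetical entity_type attribute.
ctas_tasks = """
CREATE TABLE analytics.tasks_from_export
WITH (format = 'PARQUET',
      external_location = 's3://my-analytics-bucket/athena/tasks/') AS
SELECT task_id, project_id, title, status
FROM analytics.ddb_export
WHERE entity_type = 'TASK'
"""

athena.start_query_execution(
    QueryString=ctas_tasks,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={
        "OutputLocation": "s3://my-analytics-bucket/athena-query-results/"
    },
)
```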

Here is an outline of the steps to use Glue ETL jobs (a minimal PySpark sketch follows the list):

  1. Create a Glue job with your source being your DynamoDB table. Behind the scenes, this step uses the same DynamoDB Export to S3 feature to get your data over to S3 first. You will need to specify a database name, table name, and the S3 location to which you want your table to be exported. This S3 location can be different from the final S3 location you will be using for the tables you want to run your queries on.
  2. Add a step to either use Glue transformations or Spark transformations to split up your table by entities. Separate dataframes should be created for each entity you want tables for.
  3. Include any additional processing such as cleaning strings or renaming columns.
  4. Add a step for a Glue S3 sink, or use Spark's write with overwrite mode, to export the new dataframes as tables in S3 so there is no duplicated data.
  5. Add new tables in your Glue Data Catalog and point them to where you're saving the exported dataframes.
  6. Run your Glue job.
  7. Query your newly created S3 tables from Athena.
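
Here is a minimal PySpark sketch of steps 2 through 4, assuming the exported single-table data is already readable from S3 as JSON and that a hypothetical entity_type attribute (or sort-key prefix) identifies each item. All paths, column names, and the cleanup shown are placeholders, not the exact job described above.

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Read the exported single-table data from S3 (placeholder path and format).
raw = spark.read.json("s3://my-export-bucket/dynamodb-exports/data/")

# Step 2: split the single table into one dataframe per entity, filtering on
# a hypothetical entity_type attribute.
projects = raw.filter(raw.entity_type == "PROJECT")
tasks = raw.filter(raw.entity_type == "TASK")

# Step 3: light additional processing, e.g. renaming generic key columns.
tasks = (tasks
         .withColumnRenamed("pk", "project_id")
         .withColumnRenamed("sk", "task_id"))

# Step 4: write each entity to its own S3 prefix, overwriting so the dataset
# is not duplicated on each run.
projects.write.mode("overwrite").parquet("s3://my-analytics-bucket/projects/")
tasks.write.mode("overwrite").parquet("s3://my-analytics-bucket/tasks/")
```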

If your DynamoDB table is very large and you need your S3 dataset to be highly consistent with it, I would not recommend this approach, because your Glue jobs can become costly when the entire dataset is rewritten every time. Instead, I would recommend an event-driven approach with DynamoDB Streams or Kinesis Data Streams, and only update your S3 dataset as changes are made to your DynamoDB table.
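
For a sense of what the event-driven route could look like, here is a minimal sketch of a Lambda handler attached to a DynamoDB stream that writes each changed item to an entity-specific S3 prefix. The bucket name, the entity_type attribute, and the pk/sk key layout are assumptions about the single-table design, not a prescribed implementation.

```
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-analytics-bucket"  # placeholder


def handler(event, context):
    """Write each inserted or modified item to an entity-specific S3 prefix."""
    for record in event.get("Records", []):
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        item = record["dynamodb"]["NewImage"]  # DynamoDB-JSON formatted item
        # Hypothetical attribute that identifies the entity in the
        # single-table design (e.g. PROJECT, TASK, TASK_EVENT).
        entity_type = item["entity_type"]["S"].lower()
        pk = item["pk"]["S"]
        sk = item["sk"]["S"]
        key = f"{entity_type}/{pk}/{sk}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(item))
```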

AWS
EXPERT
answered 18 days ago
