Hope this can help you.
- You can export each entity type (projects, tasks, events etc.) separately from DynamoDB to different S3 files/folders preserving the type.
- When creating the Athena tables, define the schema and partition structure specific to each type. For example create one table for projects with relevant columns, another for tasks.
- The entity types should contain references to each other, such as a task containing its project ID.
- You can then run join queries in Athena between the tables to retrieve related data. For example, get all tasks for a project by joining on project ID.
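As a sketch, the join described above could look like the query below, held as a Python string for use with the Athena API. The table and column names (`projects`, `tasks`, `project_id`, and so on) are illustrative assumptions, not from your schema:

```python
# Hypothetical Athena SQL joining per-entity tables; all table and
# column names here are assumptions for illustration.
GET_TASKS_FOR_PROJECT = """
SELECT p.project_id, p.project_name, t.task_id, t.task_status
FROM projects p
JOIN tasks t ON t.project_id = p.project_id
WHERE p.project_id = ?
"""

def render(query: str, project_id: str) -> str:
    """Substitute the single positional parameter. Athena supports
    parameterized queries; this local substitution is only so the
    sketch can be inspected without an AWS connection."""
    return query.replace("?", f"'{project_id}'")
```

You would run the rendered (or parameterized) query through the Athena console or its `StartQueryExecution` API.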
To export from DynamoDB to S3, use the `export-table-to-point-in-time` command in the AWS CLI, specifying the table ARN, S3 bucket, and other options.
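A minimal sketch of kicking off that export with boto3 (the SDK counterpart of the CLI command). The table ARN, bucket, and prefix below are placeholders, and PITR must already be enabled on the table:

```python
# Sketch: start a DynamoDB export to S3 via
# export_table_to_point_in_time. ARN/bucket/prefix are placeholders.

def export_params(table_arn: str, bucket: str, prefix: str) -> dict:
    """Build the keyword arguments for export_table_to_point_in_time."""
    return {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "S3Prefix": prefix,
        "ExportFormat": "DYNAMODB_JSON",  # or "ION"
    }

if __name__ == "__main__":
    import boto3  # assumes AWS credentials are configured

    client = boto3.client("dynamodb")
    resp = client.export_table_to_point_in_time(
        **export_params(
            "arn:aws:dynamodb:us-east-1:123456789012:table/AppTable",
            "my-export-bucket",
            "exports/app-table/",
        )
    )
    # The export runs asynchronously; status starts as IN_PROGRESS.
    print(resp["ExportDescription"]["ExportStatus"])
```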
You're looking to flatten your DynamoDB tables in S3. Short answer: yes, it is possible; however, with Athena alone you would have to manually create tables for each entity using queries, and run more queries to propagate changes made in DynamoDB to your S3 dataset. Instead, I recommend Glue ETL jobs to avoid having to run queries each time. Let's break down how you can achieve this.
A single table for all of your data entities is a common design pattern in DynamoDB. DynamoDB has an Export to S3 feature, which requires you to enable point-in-time recovery (PITR) on your DynamoDB table. Just exporting, however, limits you to one table in your Glue Data Catalog per DynamoDB table. Hence, you would need to either run Athena queries or use Glue ETL jobs on top of the exported DynamoDB table, with additional processing, to achieve what you want.
Here are the steps outlined to use Glue ETL jobs:
- Create a Glue job with your source being your DynamoDB table. Behind the scenes, this step uses the same DynamoDB Export to S3 feature to get your data over to S3 first. You will need to specify a database name, table name, and S3 location to where you want your table to be exported to. This S3 location can be different than the final S3 location you will be using for the tables you want to run your queries on.
- Add a step to either use Glue transformations or Spark transformations to split up your table by entities. Separate dataframes should be created for each entity you want tables for.
- Include any additional processing such as cleaning strings or renaming columns.
- Add a step for a Glue S3 sink, or use Spark's `write` with overwrite mode to export the new dataframes to S3 so there is no duplicated data.
- Add new tables in your Glue Data Catalog pointing to where you're saving the exported dataframes.
- Run your Glue job.
- Query your newly created S3 tables from Athena.
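The split-and-write steps above can be sketched as the Spark portion of such a job. This assumes each item carries an `entity_type` attribute and that the exported data has already been read into a DataFrame; bucket paths and attribute names are illustrative:

```python
# Sketch of the split-by-entity step of the Glue job. Assumes each
# item has an "entity_type" attribute; paths/names are illustrative.

ENTITY_TYPES = ["project", "task", "event"]

def output_path(base: str, entity: str) -> str:
    """S3 prefix for one entity's table (one folder per entity)."""
    return f"{base.rstrip('/')}/{entity}s/"

if __name__ == "__main__":
    from pyspark.sql import SparkSession  # requires a Spark/Glue runtime

    spark = SparkSession.builder.getOrCreate()
    # Read the table previously exported from DynamoDB.
    df = spark.read.parquet("s3://my-bucket/ddb-export/")

    for entity in ENTITY_TYPES:
        (df.filter(df.entity_type == entity)
           # mode("overwrite") replaces prior output, avoiding duplicates
           .write.mode("overwrite")
           .parquet(output_path("s3://my-bucket/tables", entity)))
```

Each output folder then backs one Glue Data Catalog table that Athena can query.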
If your DynamoDB table is very large and you need your S3 dataset to stay highly consistent with it, I would not recommend this approach, because your Glue jobs can become costly when the entire dataset is rewritten every time. Instead, I would recommend an event-driven approach with DynamoDB Streams or Kinesis data streams, updating your S3 dataset only as changes are made to your DynamoDB table.
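A rough sketch of that event-driven approach: a Lambda function subscribed to the table's stream writes each changed item to an S3 prefix chosen by its entity type. The bucket name and the `entity_type`/`id` attributes are assumptions about your item shape:

```python
# Sketch of a Lambda handler for the event-driven approach: each
# DynamoDB Streams record lands under an S3 prefix for its entity
# type. Bucket name and item attribute names are assumptions.
import json

def s3_key(record: dict) -> str:
    """Derive the object key from a stream record's new image."""
    image = record["dynamodb"]["NewImage"]
    entity = image["entity_type"]["S"]   # e.g. "task"
    item_id = image["id"]["S"]
    return f"{entity}s/{item_id}.json"

def handler(event, context):
    import boto3  # available in the Lambda runtime
    s3 = boto3.client("s3")
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            s3.put_object(
                Bucket="my-entity-bucket",  # placeholder bucket
                Key=s3_key(record),
                Body=json.dumps(record["dynamodb"]["NewImage"]),
            )
```

The stream must be configured with a view type that includes new images (e.g. `NEW_IMAGE`) for this to work.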
Thanks @Giovanni. I checked the DynamoDB API and I couldn't find any "export" command. I did find the "export-table-to-point-in-time" command, but it doesn't allow a selective export of DynamoDB items. Another option I'm looking at is to use DynamoDB Streams: every time an item is created or updated, I could insert the item into an S3 bucket with a specific prefix, depending on whether it is a project, task, project event, etc. But that doesn't seem like an efficient solution. I also thought about a Kinesis stream, but I don't need real-time updates on the S3 bucket.