
Sync DynamoDB to S3


What is the best way to sync my DynamoDB tables to S3, so that I can perform serverless 'big data' queries using Athena? The data must be kept in sync without any intervention. The frequency of sync would depend on the cost, ideally daily but perhaps weekly.

I have had this question a long time. I will cover what I have considered, and why I don't like the options.

1) AWS Glue Elastic Views. This sounds like it would do the job with no code, but it was announced 18 months ago and there have been no updates since. It's not generally available, and there is no information on when it might be.

2) Use DynamoDB native backup, following this blog: https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/. I already use this method for one-off data transfers that I kick off manually and then configure in Athena. I have two issues with this option. The first is that, to my knowledge, the export cannot be scheduled natively. The blog suggests using the CLI to kick off exports, and I assume the writer intends for the CLI to be scheduled on a cron job somewhere; I don't run any servers for this, though I imagine I could do it via a scheduled Lambda with an SDK. The second issue is that the export path in S3 always includes a unique export ID. This means I can't point the Athena table at a static location and simply switch over to the new data after each scheduled export. Perhaps I could write another Lambda to move the data to a static location once the export has finished, but it seems a shame to have to do so much work, and I've not seen that covered anywhere.
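For what it's worth, option 2 can be automated without any servers: an EventBridge schedule (e.g. `rate(1 day)`) invoking a small Lambda that calls the ExportTableToPointInTime API, writing under a dated prefix. A minimal sketch, where the table ARN, bucket, and prefix names are placeholders I've made up:

```python
from datetime import datetime, timezone

# Placeholder identifiers -- substitute your own table ARN and bucket.
TABLE_ARN = "arn:aws:dynamodb:eu-west-1:123456789012:table/my-table"
BUCKET = "my-analytics-bucket"

def export_prefix(now=None):
    """Build a dated S3 prefix so each scheduled export lands in a
    predictable place (DynamoDB still appends its export ID below it)."""
    now = now or datetime.now(timezone.utc)
    return f"exports/my-table/dt={now:%Y-%m-%d}"

def lambda_handler(event, context):
    # Imported lazily so the helper above stays testable without the SDK.
    import boto3
    ddb = boto3.client("dynamodb")
    resp = ddb.export_table_to_point_in_time(
        TableArn=TABLE_ARN,
        S3Bucket=BUCKET,
        S3Prefix=export_prefix(),
        ExportFormat="DYNAMODB_JSON",
    )
    return resp["ExportDescription"]["ExportArn"]
```

This only solves the scheduling half; the unique-export-ID issue still needs a follow-up step, such as a second Lambda that copies the finished export's objects to a static prefix, or treating each dated prefix as a new partition.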

3) I can use Data Pipeline as described in https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html, but that page is more about backing data up than making it accessible to Athena.

I feel like this use case must be so common, and yet none of the ideas I've seen online are really complete. I was wondering if anyone had any ideas or experiences that would be useful here?

2 Answers

There are several options available to you. You can use:

  • DynamoDB Streams >> Lambda >> S3
    • This lacks partitioning and aggregation, which can be fixed by introducing Firehose as in the next option.
  • DynamoDB Streams >> Lambda >> Firehose >> S3
    • This allows you to aggregate and partition the data with keys of your choice, but is not as native as the next option, which removes the Lambda.
  • Kinesis Data Streams >> Firehose >> S3
    • This gives you aggregated and partitioned data, as well as data retention of up to one year. This blog and this blog are useful, and this would be my preferred approach.
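To make the last option concrete: you enable Kinesis Data Streams as a streaming destination on the table, then point a Firehose delivery stream at that stream, using dynamic partitioning to lay out the S3 prefixes for Athena. A sketch of the wiring, where the table and stream names are invented:

```python
def s3_partition_prefix(keys):
    """Build a Firehose dynamic-partitioning S3 prefix from a list of
    partition key names, e.g. year=.../month=.../ in Hive style."""
    return "/".join(f"{k}=!{{partitionKeyFromQuery:{k}}}" for k in keys) + "/"

def enable_streaming(table_name, stream_arn):
    # Imported lazily so the prefix helper is testable without the SDK.
    import boto3
    ddb = boto3.client("dynamodb")
    # Route the table's change records into the Kinesis stream; a Firehose
    # delivery stream reading from it then batches records into S3.
    ddb.enable_kinesis_streaming_destination(
        TableName=table_name, StreamArn=stream_arn
    )

# Example (placeholder ARN):
# enable_streaming("my-table",
#     "arn:aws:kinesis:eu-west-1:123456789012:stream/my-table-stream")
```

The prefix helper produces the `!{partitionKeyFromQuery:...}` namespace that Firehose dynamic partitioning expects, so crawled partitions line up with Athena partition columns.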

Depending on how often you read the data, you could also use the Athena DynamoDB connector, which lets you run Athena SQL queries directly against your table.
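If you go the connector route, queries run through Athena's federated catalog rather than against files in S3. A sketch, where the catalog, database, table, and results-bucket names are assumptions for illustration:

```python
def count_query(table):
    """Build a simple row-count query against the federated table."""
    return f'SELECT COUNT(*) FROM "{table}"'

def run_count(table="my-table", catalog="dynamodb", database="default"):
    # Lazy import keeps the query builder testable without the SDK.
    import boto3
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=count_query(table),
        QueryExecutionContext={"Catalog": catalog, "Database": database},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return resp["QueryExecutionId"]
```

Bear in mind the connector reads the live table, so heavy analytical scans consume DynamoDB read capacity; that is why it suits infrequent reads.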

answered 15 days ago
EXPERT
reviewed 14 days ago

The simplest way is to use DynamoDB Streams into Kinesis Data Firehose. On the S3 folders you can run an AWS Glue crawler, and then define a view in Athena to simplify access to the Item object you get from the crawler.
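As a concrete example of the view step: the crawled data typically surfaces each record as a single `item` struct in DynamoDB-JSON shape (`.s` for strings, `.n` for numbers), so the view flattens it into plain columns. The table, view, and column names here are invented for illustration; the DDL can be run from the console or submitted via the StartQueryExecution API:

```python
# Hypothetical source table and columns (pk, total) -- adjust to your schema.
CREATE_VIEW_SQL = """
CREATE OR REPLACE VIEW my_table_view AS
SELECT
    item.pk.s                    AS pk,
    CAST(item.total.n AS double) AS total
FROM my_table_raw
"""

def create_view(workgroup="primary"):
    # Lazy import so the SQL above can be imported without the SDK.
    import boto3
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=CREATE_VIEW_SQL,
        WorkGroup=workgroup,
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return resp["QueryExecutionId"]
```

Downstream queries then read `my_table_view` and never have to know about the nested DynamoDB-JSON layout.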

See here for some details and this blog post.

You can extend this flow to allow real-time processing with Kinesis Analytics.

answered 16 days ago
EXPERT
reviewed 14 days ago
