Sync DynamoDB to S3
What is the best way to sync my DynamoDB tables to S3, so that I can perform serverless 'big data' queries using Athena? The data must be kept in sync without any intervention. The frequency of sync would depend on the cost, ideally daily but perhaps weekly.
I've had this question for a long time. Below I'll cover the options I've considered and why I don't like them.
1) AWS Glue Elastic Views. Sounds like this will do the job with no code, but it was announced 18 months ago and there have been no updates since. It's not generally available, and there is no information on when it might be.
2) Use DynamoDB native export following this blog https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/. I actually already use this method for one-off data transfers that I kick off manually and then configure in Athena. I have two issues with this option. The first is that, to my knowledge, the export cannot be scheduled natively. The blog suggests using the CLI to kick off exports, and I assume the writer intends for the CLI to be scheduled on a cron job somewhere. I don't run any servers for this, though I imagine I could do it via a scheduled Lambda with an SDK. The second issue is that the export path in S3 always includes a unique export ID. This means I can't point the Athena table at a static location and simply switch over to the new data after a scheduled export. Perhaps I could write another Lambda to move the data to a static location after the export has finished, but it seems a shame to have to do so much work, and I've not seen that covered anywhere.
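For what it's worth, the "scheduled Lambda with an SDK" idea can be sketched in a few lines of boto3. This is a minimal, hypothetical example: the table ARN and bucket name are placeholders, and the date-stamped prefix is one way to keep successive exports from colliding (it does not solve the static-location problem on its own).

```python
# Hypothetical scheduled Lambda (triggered by an EventBridge cron rule)
# that kicks off a DynamoDB point-in-time export to S3.
import datetime


def build_export_request(table_arn, bucket, prefix="exports"):
    """Build the kwargs for export_table_to_point_in_time, with a
    date-stamped S3 prefix so successive exports land in distinct paths."""
    today = datetime.date.today().isoformat()
    return {
        "TableArn": table_arn,
        "S3Bucket": bucket,
        "S3Prefix": f"{prefix}/{today}",
        "ExportFormat": "DYNAMODB_JSON",
    }


def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime
    client = boto3.client("dynamodb")
    req = build_export_request(
        "arn:aws:dynamodb:us-east-1:123456789012:table/MyTable",  # placeholder
        "my-athena-data-bucket",                                  # placeholder
    )
    return client.export_table_to_point_in_time(**req)
```

A second Lambda listening for the export's completion would still be needed to copy the output to a static prefix, which is exactly the extra work described above.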
3) I can use Data Pipeline as described in https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/DynamoDBPipeline.html. However, that page is more about backing data up than making it accessible to Athena.
I feel like this use case must be so common, and yet none of the ideas I've seen online are really complete. I was wondering if anyone had any ideas or experiences that would be useful here?
There are several options available to you. You can use:
- DynamoDB Streams >> Lambda >> S3
  - This lacks partitioning and aggregation, which the next option fixes by introducing Firehose
- DynamoDB Streams >> Lambda >> Firehose >> S3
  - This lets you aggregate and partition the data with keys of your choice, but is not as native as the next option, which removes the Lambda
- Kinesis Data Streams >> Firehose >> S3
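The glue code for the Streams >> Lambda >> Firehose path can be sketched roughly as follows. This is an assumption-laden sketch: the delivery stream name is a placeholder, and the deserializer only handles the common DynamoDB attribute types.

```python
# Hypothetical Lambda for the Streams >> Lambda >> Firehose path: flatten
# each stream record's NewImage into a JSON line for Firehose delivery to S3.
import json


def deserialize(attr):
    """Convert a DynamoDB-typed attribute ({'S': 'x'}, {'N': '1'}, ...)
    into a plain Python value. Covers the common scalar/collection types."""
    (t, v), = attr.items()
    if t == "S":
        return v
    if t == "N":
        return float(v) if "." in v else int(v)
    if t == "BOOL":
        return v
    if t == "NULL":
        return None
    if t == "L":
        return [deserialize(i) for i in v]
    if t == "M":
        return {k: deserialize(i) for k, i in v.items()}
    return v  # pass binary/set types through unchanged


def stream_records_to_lines(event):
    """Turn a DynamoDB Streams event into newline-delimited JSON payloads."""
    lines = []
    for rec in event.get("Records", []):
        image = rec["dynamodb"].get("NewImage")
        if image:  # REMOVE events carry no NewImage; skip them
            item = {k: deserialize(v) for k, v in image.items()}
            lines.append(json.dumps(item) + "\n")
    return lines


def lambda_handler(event, context):
    import boto3  # available in the Lambda runtime
    firehose = boto3.client("firehose")
    records = [{"Data": line.encode()} for line in stream_records_to_lines(event)]
    if records:
        firehose.put_record_batch(
            DeliveryStreamName="my-delivery-stream",  # placeholder
            Records=records,
        )
```

Note that this only captures inserts and updates; handling deletes (keeping S3 in sync with removals) needs extra logic.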
Depending on how often you read the data, you could also use the DynamoDB Athena Connector, which lets you run Athena SQL queries directly against your table.
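Once the connector is deployed and registered as a data source, a federated query can be submitted with boto3. A minimal sketch, assuming a catalog registered as "ddb" and placeholder table, database, and results-bucket names:

```python
# Hypothetical submission of a federated Athena query through a deployed
# DynamoDB connector. Catalog/database/table/output names are assumptions.
def build_query_params(table, catalog="ddb", database="default",
                       output="s3://my-athena-results/"):
    """Assemble start_query_execution kwargs for a federated query."""
    return {
        "QueryString": f'SELECT * FROM "{catalog}"."{database}"."{table}" LIMIT 10',
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "ResultConfiguration": {"OutputLocation": output},
    }


def run_query(table):
    import boto3  # assumes AWS credentials are configured
    athena = boto3.client("athena")
    return athena.start_query_execution(**build_query_params(table))
```

Keep in mind the connector queries the live table, so frequent or heavy Athena queries will consume DynamoDB read capacity.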
The simplest way is to use DynamoDB Streams into Kinesis Data Firehose. On the S3 folders, you can run an AWS Glue crawler, and then define a view in Athena to simplify access to the Item object the crawler produces.
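The "view" step might look something like the following sketch, submitted via boto3. The view, table, database, and column names here are all assumptions about your schema; the point is just to unwrap the nested struct into flat columns.

```python
# Hypothetical Athena DDL that flattens a crawled "item" struct into plain
# columns, plus a helper to submit it. All names are placeholders.
VIEW_SQL = """
CREATE OR REPLACE VIEW mytable_flat AS
SELECT
    item.id    AS id,
    item.price AS price
FROM "mydb"."mytable_raw"
"""


def create_view(sql=VIEW_SQL):
    import boto3  # assumes AWS credentials are configured
    athena = boto3.client("athena")
    return athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "mydb"},                # placeholder
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
```

Downstream queries can then target the flat view instead of digging into the struct each time.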
You can extend this flow to allow real-time processing with Kinesis Analytics.