We have a few blog posts that help with bulk updating live tables:
- https://aws.amazon.com/blogs/database/bulk-update-amazon-dynamodb-tables-with-aws-step-functions/
- https://aws.amazon.com/blogs/database/backfilling-an-amazon-dynamodb-time-to-live-ttl-attribute-with-amazon-emr/
- https://aws.amazon.com/blogs/database/part-2-backfilling-an-amazon-dynamodb-time-to-live-attribute-using-amazon-emr/
Regardless of the approach you take, the most important thing is to rate limit your traffic. Do the bulk update during off-peak hours to limit any impact on production traffic.
While there isn't a blog post for the way I would usually handle this, my approach is to use a rate-limiting library such as Google Guava's RateLimiter and throttle your throughput based on the consumed capacity returned by each request. For example, if you rate limit to 900 WCU, you can estimate how long the job will take: 25 billion items at roughly 900 updates per second works out to around 320 days, so plan your capacity headroom or parallelism accordingly.
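Here is a minimal sketch of that pattern using Guava's RateLimiter and the AWS SDK for Java v2. The table name MyTable, the key pk, the new attribute newColumn, and the loadKeysFromExport helper are all placeholders, not part of your setup:

```java
import com.google.common.util.concurrent.RateLimiter;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ReturnConsumedCapacity;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemResponse;

import java.util.List;
import java.util.Map;

public class RateLimitedBackfill {

    private static final double MAX_WCU_PER_SECOND = 900.0;

    public static void main(String[] args) {
        DynamoDbClient dynamoDb = DynamoDbClient.create();
        // Guava's RateLimiter hands out permits at a steady rate; treat 1 permit = 1 WCU.
        RateLimiter limiter = RateLimiter.create(MAX_WCU_PER_SECOND);

        // Hypothetical helper: item keys read from the S3 export.
        List<Map<String, AttributeValue>> keysToUpdate = loadKeysFromExport();

        for (Map<String, AttributeValue> key : keysToUpdate) {
            // Pay at least one permit before issuing the write.
            limiter.acquire(1);

            UpdateItemResponse response = dynamoDb.updateItem(UpdateItemRequest.builder()
                    .tableName("MyTable")
                    .key(key)
                    .updateExpression("SET newColumn = :v")
                    .expressionAttributeValues(Map.of(
                            ":v", AttributeValue.builder().s("computed-value").build()))
                    .returnConsumedCapacity(ReturnConsumedCapacity.TOTAL)
                    .build());

            // Feed the actual consumed WCU back into the limiter so items larger
            // than 1 KB slow the loop down proportionally.
            double consumed = response.consumedCapacity().capacityUnits();
            if (consumed > 1) {
                limiter.acquire((int) Math.ceil(consumed) - 1); // one permit already paid above
            }
        }
    }

    private static List<Map<String, AttributeValue>> loadKeysFromExport() {
        // Placeholder: in practice, stream keys from the exported data in S3.
        return List.of();
    }
}
```

Feeding the returned capacityUnits back into the limiter keeps the effective write rate at or below 900 WCU even when item sizes vary.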
Given the size and production nature of your DynamoDB table, the recommended approach would be a combination of options 1 and 2, with some modifications to ensure minimal impact on production traffic and data integrity.
Here's the recommended approach:
- Use the existing daily export to S3 as your data source. This avoids directly scanning the production DynamoDB table, which could impact performance.
- Use AWS Glue to read the exported data from S3, calculate the new column, and prepare the updates.
- Instead of writing directly back to DynamoDB or S3, use AWS Step Functions with a Distributed Map state to control the bulk update process. This allows for better control over the update rate and minimizes impact on the production table.
- Within the Step Functions workflow, use Lambda functions to process batches of updates. These Lambda functions should use the DynamoDB UpdateItem API with conditional writes to ensure that you're not overwriting any data that may have changed since the export (see the sketch after this list).
- Implement rate limiting and error handling in your Step Functions workflow to control the pace of updates and handle any issues that arise.
- Monitor the process closely using CloudWatch metrics to ensure it's not affecting production traffic.
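Here is a minimal sketch of the conditional-write step, assuming each record in the batch carries the item's partition key, a lastModified value captured in the export, and the precomputed newColumn value. The table, attribute, and field names are placeholders, and the exact condition depends on what your export actually captures:

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;

import java.util.List;
import java.util.Map;

/**
 * Processes one batch from the Step Functions Distributed Map.
 * Each record is assumed to contain "pk", "lastModified", and "newColumn" fields.
 */
public class BackfillBatchHandler implements RequestHandler<List<Map<String, String>>, String> {

    private final DynamoDbClient dynamoDb = DynamoDbClient.create();

    @Override
    public String handleRequest(List<Map<String, String>> batch, Context context) {
        int applied = 0;
        int skipped = 0;

        for (Map<String, String> record : batch) {
            try {
                dynamoDb.updateItem(UpdateItemRequest.builder()
                        .tableName("MyTable")
                        .key(Map.of("pk", AttributeValue.builder().s(record.get("pk")).build()))
                        .updateExpression("SET newColumn = :v")
                        // Only write if the backfill hasn't already populated the attribute
                        // and the item hasn't changed since the export.
                        .conditionExpression("attribute_not_exists(newColumn) AND lastModified = :exported")
                        .expressionAttributeValues(Map.of(
                                ":v", AttributeValue.builder().s(record.get("newColumn")).build(),
                                ":exported", AttributeValue.builder().s(record.get("lastModified")).build()))
                        .build());
                applied++;
            } catch (ConditionalCheckFailedException e) {
                // The item changed after the export (or was already backfilled); skip it
                // and let a later reconciliation pass handle it.
                skipped++;
            }
        }
        return "applied=" + applied + " skipped=" + skipped;
    }
}
```

Whether you condition on a lastModified timestamp, a version number, or simply attribute_not_exists(newColumn) depends on your data model; the important part is that a conditional failure is treated as a skip to reconcile later, not as an error that fails the whole batch.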
This approach offers several advantages:
- It uses the existing export, avoiding additional scans of the production table.
- It leverages AWS Glue for the heavy lifting of data processing.
- It provides fine-grained control over the update process through Step Functions.
- It uses conditional writes to prevent overwriting newer data.
- It allows for easy pause and resume of the update process if needed.
Remember to thoroughly test this process on a smaller dataset or a copy of your table before running it on your production data. Also, ensure you have a recent backup of your table before starting the update process.
Given that the new column won't be immediately used, you have the flexibility to run this process over an extended period, further minimizing any potential impact on production traffic. You can adjust the rate of updates based on the current load on your DynamoDB table.
Lastly, consider enabling DynamoDB Streams on your table if you haven't already. This will allow you to capture changes made to the table during the update process, which could be useful for verification or if you need to make any corrections later.
Sources
- Bulk update Amazon DynamoDB tables with AWS Step Functions | AWS Database Blog
- How to transform the large volume of data in DynamoDB table | AWS re:Post
- Guidance for Incremental Data Exports on AWS