Recommended way to scan and update large DDB table

I have a large DynamoDB (DDB) table with 7 TB of data and 25 billion rows. This is a production table. I need to scan the table and add/update a column on each row. The table has a daily export to S3. I am considering the following options:

  1. AWS Glue reads the data from S3, calculates the new column, and writes the updated data back to S3. On write to S3, a Lambda is triggered that updates the DDB table. Concurrency is controlled using the Lambda.
  2. AWS Glue reads the data from S3, calculates the new column, and writes the updated data to DDB using boto3 (this will avoid overwriting data).
  3. An EC2 instance scans the table, computes the update, and writes the data back to DDB.

What would be the recommended approach to avoid affecting production traffic during this update? The column being updated will not be used immediately; another CM will be run later to start using it.

2 Answers
Accepted Answer

We have a few blog posts that help with bulk updating live tables:

  1. https://aws.amazon.com/blogs/database/bulk-update-amazon-dynamodb-tables-with-aws-step-functions/
  2. https://aws.amazon.com/blogs/database/backfilling-an-amazon-dynamodb-time-to-live-ttl-attribute-with-amazon-emr/
  3. https://aws.amazon.com/blogs/database/part-2-backfilling-an-amazon-dynamodb-time-to-live-attribute-using-amazon-emr/

Regardless of the approach you take, the most important thing is to rate limit your traffic. Do the bulk update during off-peak hours to limit any impact on production traffic.

While there is no blog post for the way I would usually handle such a task, my approach is to use a rate-limiting library such as Google Guava's RateLimiter and throttle throughput based on the consumed capacity returned by each request. For example, if you rate limit to 900 WCU, you can estimate how long the job will take: 25 billion items at roughly 900 updates per second works out to about 27.8 million seconds, or a little over 320 days.
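
As a rough sketch of that pattern in Python (since the question already mentions boto3), the loop below throttles on the capacity DynamoDB reports it consumed. The table name, key attribute names, and the new_col attribute are placeholders, and a simple token bucket stands in for Guava's RateLimiter:

```python
import time
import boto3


class CapacityLimiter:
    """Minimal token-bucket limiter: block until enough WCU budget is available."""

    def __init__(self, wcu_per_second):
        self.rate = wcu_per_second
        self.tokens = wcu_per_second
        self.last = time.monotonic()

    def acquire(self, units):
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, capped at one second's worth of burst.
            self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= units:
                self.tokens -= units
                return
            time.sleep((units - self.tokens) / self.rate)


# Placeholder table and key names -- adapt to your schema.
table = boto3.resource("dynamodb").Table("my-prod-table")
limiter = CapacityLimiter(wcu_per_second=900)


def backfill_item(pk, sk, new_value):
    resp = table.update_item(
        Key={"pk": pk, "sk": sk},
        UpdateExpression="SET new_col = :v",
        ExpressionAttributeValues={":v": new_value},
        ReturnConsumedCapacity="TOTAL",
    )
    # Throttle on the capacity DynamoDB says this write actually consumed.
    consumed = float(resp["ConsumedCapacity"]["CapacityUnits"])
    limiter.acquire(consumed)
```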

AWS EXPERT · answered 9 days ago · reviewed 2 days ago

Given the size and production nature of your DynamoDB table, the recommended approach is a combination of options 1 and 2, with some modifications to minimize the impact on production traffic and protect data integrity.

Here's the recommended approach:

  1. Use the existing daily export to S3 as your data source. This avoids directly scanning the production DynamoDB table, which could impact performance.

  2. Use AWS Glue to read the exported data from S3, calculate the new column, and prepare the updates.

  3. Instead of writing directly back to DynamoDB or S3, use AWS Step Functions with a Distributed Map state to control the bulk update process. This allows for better control over the update rate and minimizes impact on the production table.

  4. Within the Step Functions workflow, use Lambda functions to process batches of updates. These Lambda functions should use the DynamoDB UpdateItem API with conditional writes to ensure that you're not overwriting any data that may have changed since the export (see the sketch after this list).

  5. Implement rate limiting and error handling in your Step Functions workflow to control the pace of updates and handle any issues that arise.

  6. Monitor the process closely using CloudWatch metrics to ensure it's not affecting production traffic.
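
As a rough illustration of the conditional write in step 4, here is a minimal boto3 sketch. The table name, key shape, the last_modified attribute assumed to be carried in the export, and the new_col attribute are all assumptions to adapt to your schema; in the Step Functions workflow, this logic would sit inside the Lambda that processes each batch:

```python
import boto3
from botocore.exceptions import ClientError

# Placeholder names: table "my-prod-table", a "last_modified" attribute assumed to
# exist on each item (and in the S3 export), and the new attribute "new_col".
table = boto3.resource("dynamodb").Table("my-prod-table")


def apply_backfill(key, new_value, exported_last_modified):
    """Set new_col only if the item has not changed since the S3 export."""
    try:
        table.update_item(
            Key=key,
            UpdateExpression="SET new_col = :v",
            ConditionExpression="last_modified = :seen",
            ExpressionAttributeValues={
                ":v": new_value,
                ":seen": exported_last_modified,
            },
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            # Item changed after the export: skip it, or recompute from the live
            # item and retry, depending on your correctness requirements.
            return False
        raise
    return True
```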

This approach offers several advantages:

  • It uses the existing export, avoiding additional scans of the production table.
  • It leverages AWS Glue for the heavy lifting of data processing.
  • It provides fine-grained control over the update process through Step Functions.
  • It uses conditional writes to prevent overwriting newer data.
  • It allows for easy pause and resume of the update process if needed.

Remember to thoroughly test this process on a smaller dataset or a copy of your table before running it on your production data. Also, ensure you have a recent backup of your table before starting the update process.

Given that the new column won't be immediately used, you have the flexibility to run this process over an extended period, further minimizing any potential impact on production traffic. You can adjust the rate of updates based on the current load on your DynamoDB table.

Lastly, consider enabling DynamoDB Streams on your table if you haven't already. This will allow you to capture changes made to the table during the update process, which could be useful for verification or if you need to make any corrections later.
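
If Streams are not enabled yet, it is a single API call; the table name below is a placeholder, and NEW_AND_OLD_IMAGES is just one reasonable choice of view type:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Capture both old and new images of changed items so the backfill can be
# verified or corrected afterwards.
dynamodb.update_table(
    TableName="my-prod-table",  # placeholder
    StreamSpecification={
        "StreamEnabled": True,
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
)
```
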
Sources
Bulk update Amazon DynamoDB tables with AWS Step Functions | AWS Database Blog
How to transform the large volume of data in DynamoDB table | AWS re:Post
Guidance for Incremental Data Exports on AWS

answered 10 days ago
