Need to clean up stale data older than 90 days from a DynamoDB table while keeping cost as low as possible


Use Case: Clean up old telemetry data, older than 90 days based on a timestamp, from a very big DynamoDB table

**Data Size: 750 GB (Ireland + Ohio + Beijing)**

**Item Count: 5,000 million (5 billion) records**

Sample Data:

```json
{
  "deviceid": "a00a940c-b804-4806-a14f-7c713964fc65",
  "timestamp": 1661521317,
  "notifications": false,
  "payload": [
    { "n": "temprature", "v": "20" },
    { "n": "humidity", "v": "91" }
  ]
}
```

Goal: Clean up stale data older than 90 days from the DynamoDB table while keeping cost as low as possible

asked 2 years ago · 230 views
2 Answers

Hi,

I'm not sure why you state 98% of the data, as that only adds complexity to the process. Would deleting 100% of the data which exceeds 90 days not meet your requirements?

Your timestamp attribute seems to meet the TTL criteria: it's an epoch time in seconds and of type Number. For that reason, you can simply enable TTL on your table and DynamoDB will delete all the items exceeding 90 days free of charge.
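For reference, a minimal sketch of enabling TTL with boto3, assuming a table named `telemetry` in the Ireland region (both names are placeholders for your own setup):

```python
import boto3

# Placeholder table name and region; adjust to your environment.
dynamodb = boto3.client("dynamodb", region_name="eu-west-1")

# Tell DynamoDB which attribute (epoch seconds, type Number) to treat as the TTL.
dynamodb.update_time_to_live(
    TableName="telemetry",
    TimeToLiveSpecification={
        "Enabled": True,
        "AttributeName": "timestamp",
    },
)
```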


Otherwise this is the other option:

As your main priority seems to be keeping the cost as low as possible, it is best to include rate limiting on the client side to ensure the process doesn't over-consume capacity, which would increase costs (a rough sketch follows the steps below):

  1. Scan the table; Parallel Scan is preferable for large tables. Use a FilterExpression to only return items which exceed 90 days.
  2. Rate limit the Scan so that it does not consume large amounts of capacity; this can be done by adding a sleep() to your logic.
  3. Delete the items returned in each page of the Scan.
  4. Checkpoint the LastEvaluatedKey locally for each page you delete; you want to be able to pick up where you left off in case the application crashes for any reason.
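
A rough Python/boto3 sketch of those steps, assuming the table is named `telemetry`, keyed on `deviceid` (partition key) and `timestamp` (sort key), and located in Ireland; the page size and sleep interval are illustrative only:

```python
import json
import time

import boto3

# Assumptions for this sketch: table "telemetry", key = deviceid + timestamp.
TABLE_NAME = "telemetry"
CHECKPOINT_FILE = "scan_checkpoint.json"
CUTOFF = int(time.time()) - 90 * 24 * 3600  # epoch seconds, 90 days ago

dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
table = dynamodb.Table(TABLE_NAME)

def load_checkpoint():
    """Resume from the last completed page if a previous run crashed."""
    try:
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return None

scan_kwargs = {
    # "timestamp" is a reserved word in DynamoDB, hence the #ts alias.
    "FilterExpression": "#ts < :cutoff",
    "ExpressionAttributeNames": {"#ts": "timestamp"},
    "ExpressionAttributeValues": {":cutoff": CUTOFF},
    "ProjectionExpression": "deviceid, #ts",  # only the key attributes are needed
    "Limit": 500,  # small pages keep each request's consumed capacity low
}

checkpoint = load_checkpoint()
if checkpoint:
    scan_kwargs["ExclusiveStartKey"] = checkpoint

while True:
    page = table.scan(**scan_kwargs)

    # Delete the stale items returned in this page.
    with table.batch_writer() as batch:
        for item in page["Items"]:
            batch.delete_item(
                Key={"deviceid": item["deviceid"], "timestamp": item["timestamp"]}
            )

    last_key = page.get("LastEvaluatedKey")
    if not last_key:
        break

    # Checkpoint the LastEvaluatedKey so the job can pick up where it left off.
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(last_key, f, default=int)  # default=int handles Decimal values

    scan_kwargs["ExclusiveStartKey"] = last_key
    time.sleep(1)  # crude client-side rate limiting
```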

Going forward it is advisable to add a TTL attribute to your tables so DynamoDB will delete the items for free on your behalf.

AWS
EXPERT
answered 2 years ago
  • Hi @Leeroy Hannigan, in our use case the timestamp shows the time when the data was added to the DB, so if we add TTL on timestamp, all the data will be deleted.

    So the only way is to scan all the data and add a TTL on the data which is older than 90 days.

    Would 100% of the data which exceeds 90 days not meet your requirements? Yes, what I mean here is that only 2% of the data is new and the remaining 98% is stale data which needs to be cleaned up.

  • OK, so your timestamp uses the creation date. In that case you would need to use a process to delete it. Another approach would be to export the data to S3, use a Glue ETL job to drop the 98% of stale data and output the result to S3, then use DynamoDB import from S3 to create a new table with only the needed data (rough sketch below).
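
    A rough PySpark sketch of the Glue filtering step, assuming the default DynamoDB JSON export format and hypothetical bucket paths:

    ```python
    import time
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("drop-stale-telemetry").getOrCreate()

    cutoff = int(time.time()) - 90 * 24 * 3600  # keep only items newer than 90 days

    # DynamoDB export-to-S3 writes gzipped JSON lines, one {"Item": {...}} per line,
    # with attribute values in DynamoDB JSON (e.g. {"N": "1661521317"}).
    exported = spark.read.json("s3://my-export-bucket/AWSDynamoDB/<export-id>/data/")

    fresh = exported.filter(col("Item.timestamp.N").cast("long") >= cutoff)

    # Write the remaining ~2% back out as line-delimited DynamoDB JSON so it can be
    # used with DynamoDB "import from S3" to build the new table.
    fresh.write.mode("overwrite").json("s3://my-import-bucket/fresh-telemetry/")
    ```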


I'll point out it would be ~50x more cost effective to write the 2% of the data you want to keep to a new table rather than delete the 98% you don't (a delete consumes write capacity per item just like a put, so deleting 98% of the items costs roughly 49x as much write capacity as copying 2%). After the migration, just delete the original table (zero cost). If you want to accept writes while the scan/copy migration process is happening, you can bifurcate all writes to both tables until the migration process finishes (see the sketch below).
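
A minimal sketch of that bifurcated write, assuming hypothetical table names `telemetry` and `telemetry-v2` with the same key schema:

```python
import boto3

# Hypothetical table names; both tables must share the same key schema.
dynamodb = boto3.resource("dynamodb", region_name="eu-west-1")
old_table = dynamodb.Table("telemetry")
new_table = dynamodb.Table("telemetry-v2")

def put_telemetry(item):
    """Bifurcated write: keep both tables in sync while the copy is in flight."""
    old_table.put_item(Item=item)
    new_table.put_item(Item=item)
```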

AWS
answered 2 years ago
