Hi,
I'm not sure why you state 98% of the data, as that only adds complexity to the process. Would deleting 100% of the data older than 90 days not meet your requirements?
Your `timestamp` attribute seems to meet the TTL criteria: it's an epoch time in seconds and of type Number. For that reason, you can simply enable TTL on your table and DynamoDB will delete all items older than 90 days free of charge.
Otherwise, here is the other option.
As your main priority seems to be keeping cost as low as possible, it is best to include rate limiting on the client side so the process doesn't over-consume capacity, which would increase costs:
- Scan the table; a Parallel Scan is preferable for large tables. Use a `FilterExpression` to return only items older than 90 days.
- Rate limit the Scan so that it does not consume large amounts of capacity; this can be done by adding a `sleep()` to your logic.
- Delete the items returned in each page of the `Scan`.
- Checkpoint the `LastEvaluatedKey` locally for each page you delete, so you can pick up where you left off in case the application crashes for any reason.
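The steps above can be sketched with boto3 roughly as follows. The table name, partition key name (`pk`), and checkpoint file are hypothetical placeholders; the sketch assumes a string partition key (a numeric key would need a `Decimal`-aware JSON encoder for the checkpoint):

```python
import json
import os
import time


TABLE_NAME = "my_table"              # hypothetical table name
KEY_ATTR = "pk"                      # hypothetical partition key (string type assumed)
CHECKPOINT_FILE = "scan_checkpoint.json"


def cutoff_epoch(now=None, days=90):
    """Epoch seconds for 'days' ago; items with a timestamp below this are stale."""
    now = int(time.time()) if now is None else now
    return now - days * 24 * 60 * 60


def purge_stale_items():
    import boto3  # assumed installed (pip install boto3)

    table = boto3.resource("dynamodb").Table(TABLE_NAME)

    scan_kwargs = {
        # 'timestamp' is a DynamoDB reserved word, so it needs an alias.
        "FilterExpression": "#ts < :cutoff",
        "ExpressionAttributeNames": {"#ts": "timestamp"},
        "ExpressionAttributeValues": {":cutoff": cutoff_epoch()},
        "ProjectionExpression": KEY_ATTR,  # only keys are needed for deletes
    }

    # Resume from the checkpoint if a previous run crashed mid-scan.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            scan_kwargs["ExclusiveStartKey"] = json.load(f)

    while True:
        page = table.scan(**scan_kwargs)
        with table.batch_writer() as batch:
            for item in page["Items"]:
                batch.delete_item(Key={KEY_ATTR: item[KEY_ATTR]})
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        # Checkpoint after each deleted page, then rate limit.
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump(last_key, f)
        scan_kwargs["ExclusiveStartKey"] = last_key
        time.sleep(1)  # crude rate limit; tune to your table's capacity
```

Note that the `FilterExpression` does not reduce the read capacity the Scan consumes (filtering happens after the read), which is why the `sleep()` still matters.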
Going forward it is advisable to add a TTL attribute to your tables so DynamoDB will delete the items for free on your behalf.
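A minimal sketch of stamping a TTL attribute at write time. The attribute name `expireAt` is hypothetical; it must match whatever attribute you enable as the table's TTL attribute, and its value must be epoch seconds:

```python
import time


def with_ttl(item, days=90, now=None):
    """Return a copy of 'item' with an 'expireAt' epoch-seconds TTL attribute.

    'expireAt' is a placeholder name; DynamoDB deletes the item (for free,
    within roughly 48 hours) once this attribute's time has passed.
    """
    now = int(time.time()) if now is None else now
    out = dict(item)
    out["expireAt"] = now + days * 24 * 60 * 60
    return out
```

You would then pass the result to `put_item` as usual, and enable TTL once on the table (for example with the `update-time-to-live` CLI command or the console), pointing it at the same attribute name.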
I'll point out it would be roughly 50x more cost effective to write the 2% of the data to a new table rather than delete 98% of it: writes and deletes each consume write capacity per item, so touching ~2% of the items costs about 1/49th of touching ~98%. After the migration, just delete the original table (zero cost). If you want to accept writes while the scan/copy migration is happening, you can bifurcate all writes to both tables until the migration finishes.
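A sketch of the copy-only-the-fresh-data migration, assuming the same epoch-seconds `timestamp` attribute; `src_table` and `dst_table` would be boto3 `Table` resources, and the page size and sleep are tuning placeholders:

```python
import time


def copy_recent_items(src_table, dst_table, cutoff):
    """Copy only items with timestamp >= cutoff (epoch seconds) to dst_table."""
    scan_kwargs = {
        "FilterExpression": "#ts >= :cutoff",
        "ExpressionAttributeNames": {"#ts": "timestamp"},
        "ExpressionAttributeValues": {":cutoff": cutoff},
    }
    while True:
        page = src_table.scan(**scan_kwargs)
        with dst_table.batch_writer() as batch:
            for item in page["Items"]:
                batch.put_item(Item=item)
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return
        scan_kwargs["ExclusiveStartKey"] = last_key
        time.sleep(1)  # rate limit between pages
```

During the bifurcated-write window, the application would simply `put_item` each incoming write to both tables until this copy finishes.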
Hi @Leeroy Hannigan, in our use case `timestamp` shows the time the data was added to the DB, so if we use that attribute as the TTL, all the data will eventually be deleted.
So the only way is to scan all the data and add a TTL on the items that are older than 90 days.
> Would 100% of the data which exceeds 90 days not meet your requirements?

Yes, what I mean here is that only 2% of the data is new and the remaining 98% is stale data which needs to be cleaned up.
Ok, so your `timestamp` records the creation date. In that case you would need a process to delete the stale items. One other approach would be to export the table to S3, use a Glue ETL job to drop the 98% of stale data and write the result back to S3, then use DynamoDB Import from S3 to create a new table containing only the needed data.
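The Glue step boils down to a filter on the same epoch-seconds `timestamp` attribute. A minimal sketch of the predicate, with the Glue-side invocation shown in comments since `awsglue` only exists inside the Glue runtime (the frame and variable names are hypothetical):

```python
import time


def cutoff_epoch(now=None, days=90):
    """Epoch seconds for 'days' ago; rows older than this get dropped."""
    now = int(time.time()) if now is None else now
    return now - days * 24 * 60 * 60


def keep_row(row, cutoff):
    """Keep only rows whose epoch-seconds 'timestamp' is within the window."""
    return row["timestamp"] >= cutoff


# Inside the Glue ETL job:
# from awsglue.transforms import Filter
# fresh_frame = Filter.apply(frame=source_frame,
#                            f=lambda row: keep_row(row, cutoff_epoch()))
```

The filtered frame would then be written to S3 in a format DynamoDB Import from S3 accepts (for example DynamoDB JSON), and the import creates the new table without consuming any write capacity.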