How to transform a large volume of data in a DynamoDB table

0

I have a lot of data in a DynamoDB table and I would like to update the records without scanning each one. What is the best way to handle this scenario? Is it wise to use AWS Glue on top of DynamoDB to transform the records and then update the same target DynamoDB table?

Ram
asked 2 months ago · 108 views
2 Answers
2

You have to Scan the items one way or another. If you want to avoid the Scan only to avoid impacting production traffic, you can use the Export to S3 feature, which AWS Glue can read natively. This only saves on capacity consumption, not cost, so if cost is your reason for not Scanning, I suggest just Scanning, as it's the most cost-effective way.
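
If you do go the Scan route, a segmented (parallel) Scan is the usual way to work through a large table. Here is a minimal sketch with boto3; the table name and segment count are placeholders, and the per-item transform is left as a stub:

import boto3
from concurrent.futures import ThreadPoolExecutor

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')  # placeholder table name

TOTAL_SEGMENTS = 4  # tune to table size and available read capacity

def scan_segment(segment):
    # Each worker scans only its own segment of the table.
    kwargs = {'Segment': segment, 'TotalSegments': TOTAL_SEGMENTS}
    while True:
        response = table.scan(**kwargs)
        for item in response['Items']:
            pass  # transform / update the item here
        if 'LastEvaluatedKey' not in response:
            break
        kwargs['ExclusiveStartKey'] = response['LastEvaluatedKey']

with ThreadPoolExecutor(max_workers=TOTAL_SEGMENTS) as executor:
    list(executor.map(scan_segment, range(TOTAL_SEGMENTS)))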

My only concern with Glue is that by default it does an overwrite operation: if you read at time T1 and then transform and write back to DynamoDB at time T3, any updates made at time T2 will be lost.

One way to overcome that is to use Glue to read the data and, instead of using the Spark write, make Boto3 UpdateItem calls in a distributed manner, so you only update values rather than overwriting whole items. This also lets you put conditions on your writes: accept the update only if nothing has been written since you read the item.
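
A minimal sketch of that conditional-write pattern with boto3 (the table name, key, and attribute names are placeholders, and the "old" value is whatever you captured when you read the item):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource('dynamodb').Table('my-table')  # placeholder table name

def conditional_update(item_id, new_value, value_read_at_scan_time):
    # Apply the transformed value only if the attribute still holds the
    # value we saw when we read the item; otherwise skip it.
    try:
        table.update_item(
            Key={'id': item_id},
            UpdateExpression='SET attr1 = :new',
            ConditionExpression='attr1 = :old',
            ExpressionAttributeValues={
                ':new': new_value,
                ':old': value_read_at_scan_time,
            },
        )
    except ClientError as e:
        if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
            return False  # item changed since we read it; leave it alone
        raise
    return True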

AWS
EXPERT
answered 2 months ago
  • Adding to Lee's excellent response, consider the possibility of having your regular writes/updates transform the data over time. If you can adjust your reads to accept the data in old or new form, then you can just start making all your regular application writes use the new form. In some cases this approach will mean that you don't have to do the bulk update - and at the very least it will reduce the number of writes required to do that "backfill" transformation.
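
A rough sketch of that lazy-migration idea, assuming the change is reshaping one attribute (the attribute names and the transform itself are placeholders):

def transform(old_value):
    # Placeholder for whatever conversion the new format needs.
    return old_value

def read_value(raw_item):
    # Accept both shapes on read: prefer the new attribute, and fall back
    # to converting the old one on the fly.
    if 'new_attr' in raw_item:
        return raw_item['new_attr']
    return transform(raw_item['old_attr'])

def write_value(table, item_id, value):
    # Every regular application write stores the new shape, so the table
    # gradually converges to the new format without a bulk backfill.
    table.update_item(
        Key={'id': item_id},
        UpdateExpression='SET new_attr = :v REMOVE old_attr',
        ExpressionAttributeValues={':v': value},
    )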

-1

Using AWS Glue to transform and update records in DynamoDB would work, but it may not be the most efficient approach for large datasets. A few other options to consider:

  • Use DynamoDB Streams to capture item updates and writes to the table. Have a Lambda function process the stream records and apply the necessary transformations before writing the items to the new target table (a sketch of this appears after the code example below).
  • Use AWS Data Pipeline to automate a periodic data migration job. It can fetch records from the source table, transform them as needed, and write to the target table in batches.
  • Export the table data to S3 using DynamoDB Export, then use AWS Glue to perform ETL on the exported files. The transformed data can then be written back to the target DynamoDB table.
  • For simple attribute updates, use the DynamoDB UpdateItem API directly without scanning the whole table. Note that BatchWriteItem only supports puts and deletes, so parallel updates are made with individual UpdateItem calls, for example:

import boto3

table = boto3.resource('dynamodb').Table('my-table')  # placeholder table name

# Update a single attribute on one item, identified by its primary key.
table.update_item(
    Key={'id': item_id},
    UpdateExpression='SET attr1 = :val1',
    ExpressionAttributeValues={':val1': new_value}
)
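
As a rough sketch of the Streams-plus-Lambda option from the first bullet, assuming the stream emits new images and the transformed items land in a hypothetical table named 'target-table' (the transform itself is a placeholder):

import boto3
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()
target = boto3.resource('dynamodb').Table('target-table')  # placeholder name

def transform(item):
    # Placeholder for the actual record transformation.
    return item

def handler(event, context):
    # Invoked by the DynamoDB stream on the source table.
    for record in event['Records']:
        if record['eventName'] not in ('INSERT', 'MODIFY'):
            continue
        # Stream images use DynamoDB JSON ({'S': ...}, {'N': ...}), so
        # deserialize into plain Python values before transforming.
        new_image = record['dynamodb']['NewImage']
        item = {k: deserializer.deserialize(v) for k, v in new_image.items()}
        target.put_item(Item=transform(item))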

EXPERT
answered 2 months ago
