Update Records with AWS Glue

I have two S3 buckets holding data tables A and B, and a Glue job that transforms data from A into B. Both tables contain a column called x. The Glue job performs a GroupBy on column x, which turns every other column of table A into a list-type column in table B. I have enabled job bookmarks for the Glue job so that it processes only new data. This means the job also has to read table B (the output of its previous runs) and, whenever a record with a given value of x already exists, append the new items to that record's list-type columns. It is unclear to me how to update table B when saving the job's output without ending up with duplicate values of column x. Does anybody have a hint? Thanks!

asked 3 years ago, 1.8K views
1 Answer
It sounds like you are reading source table A and combining it with the already processed table B. In that case, I would simply overwrite B with the result, combining tables A and B on every run of the Glue job.
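
To make the suggested merge concrete, here is a minimal plain-Python sketch of the logic only, not of the Glue job itself. The helper names `group_batch` and `merge_into_b` and the sample column `y` are hypothetical. In an actual PySpark job, roughly the same effect could presumably be obtained by unioning the grouped new batch with table B, grouping by x again with `flatten(collect_list(...))`, and writing the result in overwrite mode, but that is an assumption about your setup.

```python
from collections import defaultdict


def group_batch(rows, key="x"):
    """Group rows by `key`, collecting every other column into a list."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for col, value in row.items():
            if col != key:
                grouped[row[key]][col].append(value)
    return {k: dict(cols) for k, cols in grouped.items()}


def merge_into_b(table_b, new_groups):
    """Return a copy of table B with the new lists appended per key.

    Each key appears once in the result, so no value of x is duplicated.
    """
    merged = {k: {c: list(v) for c, v in cols.items()}
              for k, cols in table_b.items()}
    for k, cols in new_groups.items():
        target = merged.setdefault(k, {})
        for col, values in cols.items():
            target.setdefault(col, []).extend(values)
    return merged


# Hypothetical data: table B from the previous run, plus a new batch from A.
table_b = {"k1": {"y": [1, 2]}}
batch_a = [{"x": "k1", "y": 3}, {"x": "k2", "y": 4}]

# The merged result replaces the old table B entirely (the "overwrite").
table_b = merge_into_b(table_b, group_batch(batch_a))
```

Because the whole of table B is rebuilt and written back, every existing key is carried over even if it does not occur in the current batch.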

AWS
answered 3 years ago
  • In that case I would have to process the entire table B on every run so as not to lose any records (e.g. those that are not in the currently processed batch because of bookmarking). For a large table B, that does not sound very efficient to me.