Update Records with AWS Glue
I have two S3 buckets containing data tables, A and B, and a Glue job that transforms data from A to B. Both tables contain a column called x. The Glue job performs a GroupBy operation on column x, which turns all other columns from table A into list-type columns in table B. I have enabled the bookmarking mechanism for the Glue job so that it processes only new data. This requires that I also read inputs from table B (the outputs of the previous run of the Glue job) in this job and append new items to the list-type columns whenever a record with a given value of column x already exists. It is unclear to me how I can update table B when saving the outputs of the Glue job while avoiding duplicate values of column x. Does anybody have a hint here? Thanks!
It seems like you are reading the source table A and combining it with the processed table B. In that case I would simply overwrite the result in B, i.e. combine tables A and B in the Glue job on every run.
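To make the merge concrete, here is a minimal pure-Python sketch of the semantics (not actual Glue/PySpark API — function and variable names are illustrative assumptions): table B is modeled as a dict keyed by x, whose values are the list-type columns, and each new batch from A is appended under its x key, so x never gets duplicated.

```python
# Sketch only: models table B as {x: {column_name: [values...]}}.
# In a real Glue job the same effect would come from unioning the new
# batch with the previous output and re-aggregating with collect_list,
# then overwriting B.

def merge_batch(existing_b, new_a_rows):
    """Append rows from a new table-A batch into table B's list columns,
    grouping on column 'x' so each x appears exactly once."""
    # Copy existing B so the input is not mutated.
    merged = {x: {col: list(vals) for col, vals in cols.items()}
              for x, cols in existing_b.items()}
    for row in new_a_rows:
        target = merged.setdefault(row["x"], {})
        for col, val in row.items():
            if col == "x":
                continue
            target.setdefault(col, []).append(val)
    return merged

# Example: B already holds one group for x=1; the new batch extends it
# and introduces a new group for x=2.
b = {1: {"y": [10]}}
batch = [{"x": 1, "y": 20}, {"x": 2, "y": 30}]
print(merge_batch(b, batch))
# → {1: {'y': [10, 20]}, 2: {'y': [30]}}
```

The key point is that every run must see both the new batch and the previous output of B before writing, which is why overwriting B with the re-combined result avoids duplicate x keys.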
In that case I would have to process the entire table B every time in order not to lose any records (e.g. records that are not in the currently processed batch due to bookmarking). For large tables B, that does not sound very efficient to me.