Update Records with AWS Glue


I have two S3 buckets holding data tables A and B, and a Glue job that transforms data from A into B. Both tables contain a column called x. The Glue job performs a GroupBy operation on column x, which turns all other columns of table A into list-type columns in table B. I have enabled job bookmarks for the Glue job so that it processes only new data. This means the job also has to read table B (the output of its previous run) and, whenever a record with a given value of x already exists, append the new items to that record's list-type columns. It is unclear to me how to update table B when saving the job's output without ending up with duplicate values of column x. Does anybody have a hint? Thanks!

asked 2 years ago · 1538 views
1 Answer

It seems you are reading source table A and combining it with the already processed table B. In that case, I would simply combine tables A and B in every run of the Glue job and overwrite B with the result.
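
To make the suggestion concrete, here is a minimal sketch of the combine-and-overwrite logic in plain Python rather than Glue or Spark code. It is an illustration only, not the answerer's implementation: the function names `group_by_x` and `combine`, the column `y`, and the in-memory dict layout of table B are all assumptions made for the example.

```python
from collections import defaultdict


def group_by_x(rows):
    """Group flat rows from table A by column x; every other column becomes a list."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        key = row["x"]
        for col, value in row.items():
            if col != "x":
                grouped[key][col].append(value)
    return {key: dict(cols) for key, cols in grouped.items()}


def combine(previous_b, new_batch_a):
    """Return the new content of B: the old lists with the new batch appended.

    previous_b maps each value of x to its list-type columns (hypothetical layout).
    The input is copied, so the previous state of B is left untouched.
    """
    merged = {
        key: {col: list(values) for col, values in cols.items()}
        for key, cols in previous_b.items()
    }
    for key, cols in group_by_x(new_batch_a).items():
        target = merged.setdefault(key, {})
        for col, values in cols.items():
            target.setdefault(col, []).extend(values)
    return merged


# Output of the previous run (table B) and the new bookmarked batch from A.
previous_b = {"k1": {"y": [1, 2]}}
new_batch_a = [{"x": "k1", "y": 3}, {"x": "k2", "y": 10}]

# The result has exactly one record per value of x and would overwrite B.
result = combine(previous_b, new_batch_a)
```

Because the result is keyed by x, each value of x appears once, which is what avoids the duplicates the question asks about. The cost is that all of B is read and rewritten on every run.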

AWS
Zahid
answered 2 years ago
  • In that case I would have to process the entire table B on every run so as not to lose any records (e.g. those that are not in the currently processed batch because of bookmarking). For a large table B, that does not sound very efficient to me.
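
One possible way to limit that cost, sketched here in plain Python, is to rewrite only the parts of B that the new batch actually touches. This assumes that table B is split into buckets derived from x, which the thread does not state; `NUM_BUCKETS`, `bucket_of`, `update_buckets`, and the column `y` are hypothetical names used only for illustration.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical number of buckets (partitions) of table B


def bucket_of(x_value, num_buckets=NUM_BUCKETS):
    """Map a value of x to a bucket id; crc32 is stable across runs and machines."""
    return zlib.crc32(str(x_value).encode("utf-8")) % num_buckets


def update_buckets(table_b, new_batch_a):
    """Append a new batch from A to table B in place, one bucket at a time.

    table_b maps bucket id -> {x value -> {column -> list of values}}.
    Returns the set of bucket ids that changed; under the assumption above,
    only these buckets would have to be read and written back.
    """
    touched = set()
    for row in new_batch_a:
        bucket_id = bucket_of(row["x"])
        touched.add(bucket_id)
        record = table_b.setdefault(bucket_id, {}).setdefault(row["x"], {})
        for col, value in row.items():
            if col != "x":
                record.setdefault(col, []).append(value)
    return touched


# First run creates the record for k1, second run appends to it.
table_b = {}
update_buckets(table_b, [{"x": "k1", "y": 1}])
rewritten = update_buckets(table_b, [{"x": "k1", "y": 2}])
```

In Spark, the closest counterpart is probably to partition B on such a bucket column and overwrite only the affected partitions, for instance with `spark.sql.sources.partitionOverwriteMode` set to `dynamic`. Whether that fits depends on how B is stored and queried.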
