Update Records with AWS Glue

0

I have two S3 buckets with data tables, namely A and B, and a Glue job, that transforms data from A to B. Both tables contain a column called x. The Glue job performs a GroupBy operation on this column x, which results in transforming all other columns from table A into list type columns for table B. I activate the bookmarking mechanism for the Glue job, so that it processes only new data. That requires, that I also read in inputs from table B (which are outputs of the previous run of the Glue job) in this job and append new items to the list type columns in case a record with a specific value for column x already exists. It is unclear for me how I could update the table B when saving outputs of the Glue job and avoid duplicated values of column x. Does anybody have a hint here? Thanks!

feita há 2 anos1538 visualizações
1 Resposta
0

Seems like you are reading Source A table and combining that with processed table B. In that case I would say simple overwrite the result in B and combine Table A and B all the time in glue job.

AWS
Zahid
respondido há 2 anos
  • In that case I would have to process the entire table B all the time in order not to loose any records (e.g. that are not in the currently processed batch due to bookmarking). For large tables B that does not sound very efficient to me.

Você não está conectado. Fazer login para postar uma resposta.

Uma boa resposta responde claramente à pergunta, dá feedback construtivo e incentiva o crescimento profissional de quem perguntou.

Diretrizes para responder a perguntas