Update Records with AWS Glue

I have two S3 buckets holding data tables A and B, and a Glue job that transforms data from A into B. Both tables contain a column called x. The Glue job performs a GroupBy on column x, which turns all other columns of table A into list-type columns in table B. I have enabled job bookmarks for the Glue job so that it processes only new data. That requires the job to also read table B (the output of its previous run) and, when a record with a given value of x already exists, append the new items to that record's list-type columns. It is unclear to me how to update table B when saving the job's output without ending up with duplicate values of column x. Does anybody have a hint? Thanks!

Asked 2 years ago, 1537 views
1 Answer
It seems you are reading source table A and combining it with the already processed table B. In that case, I would simply overwrite B with the result and combine tables A and B on every run of the Glue job.
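The combine-then-overwrite logic can be illustrated with a minimal sketch in plain Python. This is not Glue or Spark API code: the function names `group_batch` and `merge_into_b`, the column name `y`, and the dict-based stand-in for table B are all hypothetical. It only shows why the result contains each value of x exactly once.

```python
from collections import defaultdict


def group_batch(rows, key="x"):
    """Group new rows from table A by `key`, collecting every other column into a list."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for col, value in row.items():
            if col != key:
                grouped[row[key]][col].append(value)
    return {k: dict(cols) for k, cols in grouped.items()}


def merge_into_b(table_b, batch_rows, key="x"):
    """Return a new table B with the batch's values appended to the existing lists."""
    # Copy the existing lists so the previous state of B is left untouched.
    merged = {k: {c: list(v) for c, v in cols.items()} for k, cols in table_b.items()}
    for k, cols in group_batch(batch_rows, key).items():
        target = merged.setdefault(k, {})  # a new key gets a fresh record
        for col, values in cols.items():
            target.setdefault(col, []).extend(values)
    return merged


# Hypothetical data: B from the previous run, plus a new bookmarked batch from A.
table_b = {1: {"y": ["a"]}, 2: {"y": ["b"]}}
new_rows = [{"x": 1, "y": "c"}, {"x": 3, "y": "d"}, {"x": 3, "y": "e"}]
result = merge_into_b(table_b, new_rows)
```

Because the merged result is keyed by x, writing it back in place of the old B cannot produce duplicate keys. In a Spark job the same idea would typically be a union of B with the grouped batch followed by a second aggregation per x.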

AWS
Zahid
Answered 2 years ago
  • In that case I would have to process the entire table B on every run in order not to lose any records (e.g. those that are not in the currently processed batch because of bookmarking). For a large table B, that does not sound very efficient to me.
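One possible way to limit how much of B is rewritten, sketched here under assumptions that are not part of the thread: suppose B were stored partitioned by a hash bucket of x. Then only the buckets containing keys from the current batch need to be read, merged, and overwritten. The bucket count `NUM_BUCKETS` and the helpers `bucket_of` and `buckets_to_rewrite` are hypothetical names for illustration.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical number of partitions for table B


def bucket_of(x_value, num_buckets=NUM_BUCKETS):
    """Map a value of column x to a stable bucket id (CRC32 is deterministic across runs)."""
    return zlib.crc32(str(x_value).encode("utf-8")) % num_buckets


def buckets_to_rewrite(batch_rows, key="x", num_buckets=NUM_BUCKETS):
    """Return the sorted bucket ids that contain at least one key from the new batch."""
    return sorted({bucket_of(row[key], num_buckets) for row in batch_rows})


# Hypothetical bookmarked batch from table A.
batch = [{"x": "k1", "y": 1}, {"x": "k2", "y": 2}, {"x": "k1", "y": 3}]
touched = buckets_to_rewrite(batch)
```

Only the partitions listed in `touched` would have to be loaded from B and written back; all other partitions stay as they are. Whether this pays off depends on how the new keys spread across buckets. In Spark, setting `spark.sql.sources.partitionOverwriteMode` to `dynamic` is one option worth checking for overwriting only the partitions present in the output.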
