Data processing best practices


Hello everyone. JSON data from a REST API is loaded daily by a Lambda function into s3-bucket-1. This data then needs to be stored in s3-bucket-2 as a flat Parquet table.

I built this as a Glue job, but I have two questions:

1. The Lambda function updates only some partitions each day (id=parameter). How can I make the Glue job process only the updated data as well?
2. The Glue job always writes a new file as its output, so the data ends up duplicated. How can I avoid this? (One option would be to delete the existing files before writing new ones.)

The Glue job was built in the visual editor, and I did not find the necessary settings there. Am I right that this can only be solved with code?

In general, what are the best practices for such a process? Should I overwrite the files, or create a new version every time and filter for the latest one when reading? Was a Glue job the right choice for this?
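To make the two questions concrete, here is a rough sketch of the "overwrite only what changed" option, not a confirmed solution. It assumes the source keys look like `<prefix>/id=<value>/<file>.json`, that the target table is partitioned by `id`, and that the job keeps the timestamp of its previous run somewhere. The function names `changed_partition_ids` and `overwrite_changed_partitions` are made up for illustration. The write step relies on Spark's `spark.sql.sources.partitionOverwriteMode` setting and expects an existing Spark session to be passed in.

```python
from datetime import datetime, timezone
import re

# Assumed key layout: "<prefix>/id=<value>/<file>.json"
_PARTITION_RE = re.compile(r"(?:^|/)id=([^/]+)/")


def changed_partition_ids(objects, since):
    """Return the sorted id values of partitions that have an object modified after `since`.

    `objects` is an iterable of dicts with "Key" and "LastModified" entries,
    the same shape as the "Contents" items of an S3 ListObjectsV2 response.
    """
    ids = set()
    for obj in objects:
        match = _PARTITION_RE.search(obj["Key"])
        if match and obj["LastModified"] > since:
            ids.add(match.group(1))
    return sorted(ids)


def overwrite_changed_partitions(spark, source_root, target_root, ids):
    """Re-read only the changed partitions and replace them in the target (hypothetical helper)."""
    if not ids:
        return
    # In "dynamic" mode, mode("overwrite") replaces only the partitions present
    # in the DataFrame instead of clearing the whole target path.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    paths = [f"{source_root}/id={value}/" for value in ids]
    # basePath keeps "id" as a column even though specific partition paths are read.
    df = spark.read.option("basePath", source_root).json(paths)
    # Flattening of nested JSON fields would go here (omitted in this sketch).
    df.write.mode("overwrite").partitionBy("id").parquet(target_root)


# Example with a hand-written listing (no AWS call is made here).
listing = [
    {"Key": "id=a/2024-01-03.json", "LastModified": datetime(2024, 1, 3, tzinfo=timezone.utc)},
    {"Key": "id=b/2024-01-01.json", "LastModified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"Key": "raw/id=c/data.json", "LastModified": datetime(2024, 1, 2, 12, tzinfo=timezone.utc)},
]
last_run = datetime(2024, 1, 2, tzinfo=timezone.utc)
print(changed_partition_ids(listing, last_run))  # prints ['a', 'c']
```

With this approach each run rewrites whole partitions, so rerunning the job for the same `id` replaces its files rather than adding duplicates. Whether that is preferable to keeping versions and filtering on read is exactly what I am unsure about.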

Anton
Asked 9 months ago, 70 views
No answers
