Data processing best practices


Hello everyone. A Lambda function loads JSON data from a REST API into s3-bucket-1 every day. This data then needs to be stored in s3-bucket-2 as a flat Parquet table.

I built this as a Glue job, but I have two questions:
1 - Lambda updates only some partitions each day (id=parameter). How can I make the Glue job process only the updated data as well?
2 - The Glue job always writes a new file as its result, so the data ends up duplicated. How can I avoid this? One option would be to delete the existing files before writing new ones.

I built the Glue job in the visual editor and could not find the necessary settings there. Am I right that this can only be solved in code?

More generally, what are the best practices for a process like this? Should I overwrite the files, or write a new version each time and filter for the latest one when reading? Was a Glue job the right choice here?
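To make the two questions concrete, here is a minimal sketch in plain Python of the logic involved: find which id= partitions changed since the last run, then derive the output prefixes that would need to be cleared or overwritten. The key layout (`raw/id=<value>/...`), the output prefix `flat`, and the helper names `changed_partitions` and `target_prefixes` are all hypothetical, and the listing is hard-coded rather than read from S3. Within Glue itself, job bookmarks (for tracking already processed input) and Spark's `spark.sql.sources.partitionOverwriteMode=dynamic` setting (for replacing only the partitions being written) look like the built-in options worth checking before hand-rolling something like this.

```python
from datetime import datetime, timezone
import re

# Assumed (hypothetical) key layout: <prefix>/id=<value>/<file>.json
PARTITION_RE = re.compile(r"(?:^|/)id=([^/]+)/")


def changed_partitions(objects, last_run):
    """Return the sorted ids of partitions with objects modified after last_run.

    objects: iterable of (key, last_modified) pairs, e.g. built from an S3 listing.
    """
    changed = set()
    for key, last_modified in objects:
        match = PARTITION_RE.search(key)
        if match and last_modified > last_run:
            changed.add(match.group(1))
    return sorted(changed)


def target_prefixes(partition_ids, base="flat"):
    """Output prefixes to clear or overwrite before writing new Parquet files."""
    return [f"{base}/id={pid}/" for pid in partition_ids]


# Hard-coded stand-in for a listing of s3-bucket-1.
last_run = datetime(2024, 1, 2, tzinfo=timezone.utc)
listing = [
    ("raw/id=a/data.json", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ("raw/id=b/data.json", datetime(2024, 1, 3, tzinfo=timezone.utc)),
    ("raw/id=c/data.json", datetime(2024, 1, 2, 12, tzinfo=timezone.utc)),
]

ids = changed_partitions(listing, last_run)
print(ids)                   # ['b', 'c']
print(target_prefixes(ids))  # ['flat/id=b/', 'flat/id=c/']
```

Overwriting only the affected partitions keeps the target table free of duplicates without requiring readers to filter for the latest version, at the cost of losing the history that a write-new-version approach would keep.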

Anton
asked 9 months ago, 68 views
No Answers
