Data processing best practices

Hello everyone. A Lambda function loads JSON data from a REST API into s3-bucket-1 every day. This data then needs to be stored in s3-bucket-2 as a flat Parquet table.

I implemented this as a Glue job, but I have two questions:
1 - Lambda updates only some partitions each day (id=parameter). How can I make the Glue job process only the updated data as well?
2 - The Glue job always writes a new file as its result, so the data ends up duplicated. How can I avoid this? (One option would be to delete the existing files before writing new ones.)
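To make question 1 concrete, here is a minimal sketch of what "process only updated data" could mean, written as plain Python over an object listing rather than as Glue code. It assumes a key layout of `id=<parameter>/<file>` in s3-bucket-1 and a stored timestamp of the previous run. The post confirms neither, and the function names `updated_partitions` and `target_prefixes` are hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumed key layout: "id=<parameter>/<file>.json" (not stated in the post).
PARTITION_RE = re.compile(r"(?:^|/)id=([^/]+)/")


def updated_partitions(objects, last_run):
    """Return the sorted id values that have objects modified after last_run.

    objects: iterable of (key, last_modified) pairs, e.g. built from an
    S3 object listing of the source bucket.
    """
    changed = set()
    for key, last_modified in objects:
        match = PARTITION_RE.search(key)
        if match and last_modified > last_run:
            changed.add(match.group(1))
    return sorted(changed)


def target_prefixes(partition_ids, base="s3://s3-bucket-2/"):
    """Build the destination prefixes that would be rewritten for those ids."""
    return [f"{base}id={pid}/" for pid in partition_ids]


if __name__ == "__main__":
    # Illustrative listing; keys and timestamps are made up for the example.
    listing = [
        ("id=a/data.json", datetime(2024, 1, 3, tzinfo=timezone.utc)),
        ("id=b/data.json", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ]
    previous_run = datetime(2024, 1, 2, tzinfo=timezone.utc)
    print(target_prefixes(updated_partitions(listing, previous_run)))
```

The prefixes returned by `target_prefixes` are the only locations in s3-bucket-2 that would need to be cleared and rewritten, which corresponds to the "delete existing files before writing new ones" option from question 2, limited to the partitions that actually changed.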

The Glue job was built in the visual editor, and I could not find the relevant settings there. Am I right that this can only be solved in code?

More generally, what are the best practices for this kind of process? Should I overwrite the files, or write a new version each time and filter for the latest one when reading? And was a Glue job the right choice here?

Anton
asked 9 months ago · 70 views
No answers
