Data processing best practices

Hello everyone. A Lambda function loads JSON data from a REST API into s3-bucket-1 every day. This data then needs to be stored in s3-bucket-2 as a flat Parquet table.

I built this as a Glue job, but I have two questions:

1. The Lambda updates only some partitions each day (partitioned by id=parameter). How can I make the Glue job process only the updated data as well?
2. The Glue job always writes a new file, so the data ends up duplicated. How can I avoid this? (One option would be to delete the existing files before writing the new ones.)
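To make the two questions concrete, this is roughly the logic I have in mind, as a minimal plain-Python sketch rather than actual Glue code. The `id=<value>/` key layout, the stored `last_run` timestamp, the `table` output prefix, and both function names are my own assumptions for illustration; only the bucket names come from my setup.

```python
import re
from datetime import datetime, timezone

# Assumed key layout: "id=<value>/<file>.json" (hypothetical, adjust to the real bucket).
PARTITION_RE = re.compile(r"(?:^|/)id=([^/]+)/")


def updated_partitions(objects, last_run):
    """Return the sorted ids of partitions that hold objects modified after last_run.

    objects:  iterable of (key, last_modified) pairs, e.g. taken from an S3 listing.
    last_run: timezone-aware datetime of the previous successful job run.
    """
    ids = set()
    for key, last_modified in objects:
        match = PARTITION_RE.search(key)
        if match and last_modified > last_run:
            ids.add(match.group(1))
    return sorted(ids)


def target_prefix(partition_id, base="s3://s3-bucket-2/table"):
    """Build the output prefix whose files would be replaced for one partition.

    The "table" path segment is illustrative; it is not part of the real setup.
    """
    return f"{base}/id={partition_id}/"


if __name__ == "__main__":
    listing = [
        ("id=a/data.json", datetime(2024, 1, 1, tzinfo=timezone.utc)),
        ("id=b/data.json", datetime(2024, 1, 3, tzinfo=timezone.utc)),
    ]
    last_run = datetime(2024, 1, 2, tzinfo=timezone.utc)
    for pid in updated_partitions(listing, last_run):
        # Only these prefixes would be cleared and rewritten by the job.
        print(target_prefix(pid))
```

The idea is that the job would read only the partitions returned by `updated_partitions` and replace only the files under the matching output prefixes, so untouched partitions are neither reprocessed nor duplicated.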

I built the Glue job in the visual editor and did not find the relevant settings there. Am I right that this can only be solved in code?

In general, what are the best practices for a process like this? Is it better to overwrite the files, or to write a new version each time and filter for the latest one when reading? And was a Glue job the right choice here?
