Data processing best practices


Hello everyone. JSON data from a REST API is loaded daily by a Lambda function into s3-bucket-1. This data should then be stored in s3-bucket-2 as a flat Parquet table.

I did this with a Glue job, but I have two questions:
1. Lambda updates only some partitions each day (id=parameter). How can I make the Glue job process only the updated data as well?
2. The Glue job always writes a new file, so the data ends up duplicated. How can I avoid this? (One option would be to delete the existing files before writing new ones.)

The Glue job was built in the visual editor, and I did not find the necessary settings there. Am I right that this can only be solved with code?

In general, what are the best practices for such a process? Should I overwrite the files, or create a new version every time and filter for the latest one when reading? Was a Glue job the right choice for this?
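To make the two questions concrete (process only the updated partitions, and replace rather than append the output), here is a minimal sketch in plain Python of the logic I have in mind. It is not Glue code. It assumes that object keys start with `id=<value>/`, that a `last_run` timestamp of the previous successful run is stored somewhere, and that the bucket listings are already available as plain lists. The function names and the sample data are hypothetical.

```python
from datetime import datetime, timezone


def updated_partitions(objects, last_run):
    """Return sorted partition prefixes (e.g. 'id=2/') that contain
    at least one source object modified after last_run.

    objects: iterable of (key, last_modified) pairs, keys assumed to
    look like 'id=<value>/<file>'.
    """
    prefixes = set()
    for key, modified in objects:
        if modified > last_run and "/" in key:
            prefixes.add(key.split("/", 1)[0] + "/")
    return sorted(prefixes)


def plan_overwrite(objects, existing_output_keys, last_run):
    """Map each updated partition prefix to the existing output files
    that would be deleted before the partition is rewritten."""
    plan = {}
    for prefix in updated_partitions(objects, last_run):
        plan[prefix] = [k for k in existing_output_keys if k.startswith(prefix)]
    return plan


# Hypothetical listings of s3-bucket-1 (source) and s3-bucket-2 (output).
last_run = datetime(2024, 1, 2, tzinfo=timezone.utc)
source = [
    ("id=1/data.json", datetime(2024, 1, 1, 6, tzinfo=timezone.utc)),
    ("id=2/data.json", datetime(2024, 1, 2, 6, tzinfo=timezone.utc)),
]
output = ["id=1/part-0000.parquet", "id=2/part-0000.parquet"]

plan = plan_overwrite(source, output, last_run)
print(plan)
```

With this sample data only `id=2/` was touched after `last_run`, so only that partition would be reprocessed and only its old Parquet file would be replaced. As far as I know, a Spark-based job can get a similar effect by writing with overwrite mode while `spark.sql.sources.partitionOverwriteMode` is set to `dynamic`, and Glue job bookmarks are meant to track already-processed input, but I have not confirmed how either one is exposed in the visual editor.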

Anton
Asked 9 months ago · 70 views
No answers
