How to batch-process delta JSON files from S3 following the standard Kinesis Firehose partitioning?


Hello, I am using Kinesis Firehose to save the raw streamed data as JSON files in S3, using the standard Firehose partitioning <stream_name>/YYYY/MM/DD/HH.

For data that is really urgent, a Lambda function is triggered as soon as the file lands in S3 and processes the data in that file. Other data is not as urgent, so we can run batches every 5 or 10 minutes. My question is about the data that can be processed in batches.
I don't know what processing strategy or methodology I should implement so that every time the batch runs it only processes the JSON files that have not been processed before. For example, at 2022-01-28 14:15:00 there are 2 files in the current partition. My process runs and loads those 2 files. Then at 2022-01-28 14:25:00 the process runs again and there are now 3 files in the partition. The previous batch already processed 2 of them, so the new batch should only process the remaining one. How can I know which files were already processed, so I don't read them again on the next batch?
I was planning to use Airflow to schedule some Spark jobs for the batch processing. What tool or technology would you recommend for this kind of batching?
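
For reference, here is a minimal sketch of the kind of batch step I have in mind, assuming a hypothetical DynamoDB table (processed_files, keyed on s3_key) is used as a checkpoint for keys that were already handled; the bucket, stream, and table names are just illustrative:

```python
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("processed_files")  # hypothetical checkpoint table, partition key: s3_key

BUCKET = "my-firehose-bucket"  # illustrative bucket name
STREAM = "my-stream"           # illustrative stream/prefix name

def unprocessed_keys_for_current_hour():
    """List keys in the current Firehose hour partition and keep only the
    ones not yet recorded in the checkpoint table."""
    now = datetime.now(timezone.utc)
    prefix = f"{STREAM}/{now:%Y/%m/%d/%H}/"  # <stream_name>/YYYY/MM/DD/HH/
    new_keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            try:
                # Conditional write: succeeds only if this key was never
                # recorded, so each file is claimed by exactly one batch run.
                table.put_item(
                    Item={"s3_key": key},
                    ConditionExpression="attribute_not_exists(s3_key)",
                )
                new_keys.append(key)
            except dynamodb.meta.client.exceptions.ConditionalCheckFailedException:
                pass  # already processed by a previous batch
    return new_keys
```

The conditional write is there so a file can only be claimed once, even if two batch runs overlap; but I'm not sure if a checkpoint table like this is the right approach versus something built into Spark or Airflow.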

Asked 2 years ago · 112 views
No answers
