How to batch process new (delta) JSON files from S3 following the standard Kinesis Firehose partitioning?

Hello, I am using Kinesis Firehose and saving the raw streamed data as JSON files in S3. I am using the standard Firehose partitioning scheme <stream_name>/YYYY/MM/DD/HH.

For the data that is really urgent, a Lambda function is triggered as soon as the file is saved to S3 and processes the data in that file. Other data doesn't have the same urgency, so we can process it in batches every 5 or 10 minutes. My question is about the data that can be processed in batches.
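For context, the urgent path is essentially an S3 "object created" notification invoking a Lambda. A simplified sketch of what that handler looks like is below; the bucket, handler, and `process()` names are placeholders, and it assumes the Firehose records are newline-delimited JSON:

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Invoked by an S3 ObjectCreated notification on the Firehose bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        # Each Firehose file can contain many records, one JSON object per line.
        for line in body.splitlines():
            process(json.loads(line))

def process(payload):
    ...  # placeholder for the actual business logic
```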
I don't know what processing strategy or methodology I should implement so that every time the batch runs, it only processes the JSON files that have not been processed before. For example, say it is 2022-01-28 14:15:00 and there are 2 files in the current partition. My process runs and loads those 2 files. Then at 2022-01-28 14:25:00 the process runs again and there are 3 files in the partition. The previous batch already processed 2 of them, so the new batch should only process the remaining one. How can I know which files were already processed, so I don't read them again in my next batch?
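To make the gap concrete, here is a minimal sketch of what a batch run looks like today. The bucket and stream names are placeholders, and the in-memory `processed_keys` set is exactly the part that doesn't work, since it doesn't survive between scheduled runs; that is the state I don't know where to keep (a manifest file? a DynamoDB table of processed keys? moving or tagging objects after processing?):

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

BUCKET = "my-firehose-bucket"   # placeholder
STREAM_NAME = "my-stream"       # placeholder

# Naive placeholder: this set is lost when the batch job exits,
# so the next run has no idea which files were already handled.
processed_keys = set()

def list_current_partition():
    """List all object keys in the current hourly Firehose partition."""
    now = datetime.now(timezone.utc)
    prefix = f"{STREAM_NAME}/{now:%Y/%m/%d/%H}/"
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

def run_batch():
    for key in list_current_partition():
        if key in processed_keys:
            continue  # skip files a previous run already handled
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        # ... hand the file off to the real processing job (Spark, etc.) ...
        processed_keys.add(key)
```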
I was planning to use Airflow to schedule Spark jobs for the batch processing. What tool or technology would you recommend for this kind of batching?

asked 2 years ago · 109 views
No Answers
