How to batch-process only new (delta) JSON files from S3 following the standard Kinesis Firehose partitioning?
Hello,
I am using Kinesis Firehose and saving the raw streamed data as JSON files in S3, with the standard Firehose partitioning scheme <stream_name>/YYYY/MM/DD/HH.
For the data that is really urgent, a Lambda function is triggered as soon as the file lands in S3 and processes the data in that file.
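For context, the urgent path looks roughly like the sketch below (simplified; `process_record` is just a placeholder for our real logic, and I'm assuming the records in each file are newline-delimited):

```python
import json
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")


def process_record(doc):
    # Placeholder for the actual urgent processing
    print(doc)


def handler(event, context):
    # Invoked by the S3 "object created" notification for each new Firehose file
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        for line in body.splitlines():
            if line.strip():
                process_record(json.loads(line))
```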
Other data doesn't have the same urgency, so we can process it in batches every 5 or 10 minutes.
My question is related to the data that can be processed in batches.
I don't know what processing strategy or methodology I should implement so that every time the batch runs it only processes the JSON files that have not been processed before.
For example, at 2022-01-28 14:15:00 there are 2 files in the current partition. My process runs and loads those 2 files. Then at 2022-01-28 14:25:00 the process runs again and there are 3 files in the partition. The previous batch already processed 2 of those files, so the new batch should only process the one new file.
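To make the problem concrete, a naive batch run today would look roughly like this sketch: it lists everything under the current hour's Firehose prefix and has nothing to compare against, so it would re-read the files the previous run already handled (the bucket name, stream name, and the `already_processed` set are placeholders):

```python
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")


def list_partition_keys(bucket, stream_name, when):
    # Standard Firehose prefix: <stream_name>/YYYY/MM/DD/HH
    prefix = f"{stream_name}/{when:%Y/%m/%d/%H}/"
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


# Every run sees all files in the partition, including ones already processed.
already_processed = set()  # <-- this is the part I don't know how to track/persist
keys = list_partition_keys("my-bucket", "my-stream", datetime.now(timezone.utc))
new_keys = [k for k in keys if k not in already_processed]
```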
How can I know which files were already processed so I don't read them again on my next batch?
I was planning to use Airflow to schedule Spark jobs for the batch processing. What tool or technology would you recommend for this kind of batching?
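For reference, this is roughly what I had in mind on the Airflow side (a sketch using SparkSubmitOperator from the apache-spark provider; the DAG id, application path, and connection id are placeholders, and the Spark job itself is where I still need the "only new files" logic):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="firehose_batch_processing",
    start_date=datetime(2022, 1, 28),
    schedule_interval=timedelta(minutes=10),
    catchup=False,
) as dag:
    # Submits a Spark job that reads the current hour's partition;
    # the open question is how that job skips files processed by earlier runs.
    process_partition = SparkSubmitOperator(
        task_id="process_partition",
        application="/opt/jobs/process_firehose_partition.py",
        conn_id="spark_default",
        application_args=["--execution-ts", "{{ ts }}"],
    )
```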