How to batch delta JSON files from S3 following the standard Kinesis Firehose partitioning?


Hello, I am using Kinesis Firehose and saving the raw streamed data as JSON files in S3. I am using the standard Firehose partitioning scheme <stream_name>/YYYY/MM/DD/HH.

For the data that is really urgent, a Lambda function is triggered as soon as the file is saved to S3 and processes the data in the file. Other data doesn't have the same urgency, so we can run batches every 5 or 10 minutes. My question is about the data that can be processed in batches.
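For context, the urgent path looks roughly like this (a minimal sketch; process_urgent and the newline-delimited record layout are assumptions about my setup):

```python
import json

import boto3

s3 = boto3.client("s3")

def process_urgent(record: dict) -> None:
    ...  # placeholder for the real urgent-path logic

def handler(event, context):
    """Triggered by s3:ObjectCreated:* on the Firehose delivery prefix."""
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = rec["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        # Firehose concatenates records into one object; assuming the
        # producer appends '\n', each line is one JSON record.
        for line in body.splitlines():
            if line.strip():
                process_urgent(json.loads(line))
```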
I don't know what processing strategy or methodology I should implement so that each batch run only processes the JSON files that have not been processed before. For example, say it is 2022-01-28 14:15:00 and we have 2 files in the same partition. My process runs and loads those 2 files. Then at 2022-01-28 14:25:00 the process runs again and there are 3 files in the partition. The previous batch already processed 2 of those files, so the new batch should only process one file. How can I know which files were already processed, so I don't read them again on the next batch?
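The only idea I have so far is to keep a manifest of processed keys, for example in DynamoDB, and claim each key with a conditional write before reading it (a minimal sketch; the processed_files table and its s3_key hash key are placeholders I made up):

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
processed = boto3.resource("dynamodb").Table("processed_files")  # hypothetical table

def unprocessed_keys(bucket: str, prefix: str):
    """Yield keys under the partition prefix that no batch has claimed yet."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            try:
                # The conditional write succeeds only the first time a key
                # is seen, so overlapping runs can't claim the same file.
                processed.put_item(
                    Item={"s3_key": obj["Key"]},
                    ConditionExpression="attribute_not_exists(s3_key)",
                )
                yield obj["Key"]
            except ClientError as e:
                if e.response["Error"]["Code"] != "ConditionalCheckFailedException":
                    raise

# e.g. unprocessed_keys("my-bucket", "stream_name/2022/01/28/14/")
```

Since the write is conditional, the claim is atomic even if two batch runs overlap, but I don't know if a manifest table like this is the right pattern here.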
I was planning to use Airflow to schedule some Spark jobs to do the batch processing. What tool or technology would you recommend for this kind of batching?
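This is roughly the schedule I had in mind (a sketch only; the DAG id, the 10-minute interval, and the job path are placeholders):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="firehose_batch_processing",
    start_date=datetime(2022, 1, 1),
    schedule_interval=timedelta(minutes=10),
    catchup=False,
) as dag:
    # batch_job.py would list the current HH partition and process
    # only the files selected by the dedup step sketched above.
    process_partition = SparkSubmitOperator(
        task_id="process_new_files",
        application="/opt/jobs/batch_job.py",  # hypothetical job path
    )
```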

Asked 2 years ago · 112 views
No answers
