How to batch process delta JSON files from S3 following the standard Kinesis Firehose partitioning?


Hello, I am using Kinesis Firehose and saving the raw streamed data as JSON files in S3. I am using the standard Firehose partitioning: <stream_name>/YYYY/MM/DD/HH.

For data that is really urgent, a Lambda function is triggered to process the file as soon as it is saved to S3. Other data doesn't have the same urgency, so we can run batches every 5 or 10 minutes. My question is about the data that can be processed in batches.
I don't know what processing strategy or methodology I should implement so that every time the batch runs it only processes the JSON files that have not been processed before. For example, say it is 2022-01-28 14:15:00 and there are 2 files in the current partition. My process runs and loads those 2 files. Then at 2022-01-28 14:25:00 the process runs again and there are 3 files in the partition. The previous batch already processed 2 of those files, so the new batch should only process the remaining one. How can I know which files were already processed so I don't read them again in the next batch?
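For context, here is a rough sketch of the kind of bookkeeping I have in mind: keep a manifest of already-processed object keys, list the current hourly Firehose prefix, and only process keys not in the manifest. The bucket, stream, and manifest names below are placeholders, not my real setup, and the manifest-in-S3 approach is just one option (a DynamoDB table or Airflow state could serve the same purpose).

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

BUCKET = "my-firehose-bucket"                    # placeholder: Firehose destination bucket
STREAM_NAME = "my-stream"                        # placeholder: delivery stream name / prefix
MANIFEST_KEY = "manifests/processed-keys.json"   # placeholder: where processed keys are recorded


def load_processed_keys() -> set:
    """Read the set of already-processed object keys (empty on the first run)."""
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=MANIFEST_KEY)
        return set(json.loads(obj["Body"].read()))
    except s3.exceptions.NoSuchKey:
        return set()


def save_processed_keys(keys: set) -> None:
    s3.put_object(Bucket=BUCKET, Key=MANIFEST_KEY, Body=json.dumps(sorted(keys)))


def process_file(key: str) -> None:
    # Placeholder for the real work (e.g. submitting the file to a Spark job).
    print(f"processing s3://{BUCKET}/{key}")


def run_batch(now: datetime | None = None) -> None:
    now = now or datetime.now(timezone.utc)
    # Hourly Firehose prefix, e.g. my-stream/2022/01/28/14/
    prefix = now.strftime(f"{STREAM_NAME}/%Y/%m/%d/%H/")
    processed = load_processed_keys()

    # List everything in the current partition and keep only unseen keys.
    new_keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"] not in processed:
                new_keys.append(obj["Key"])

    for key in new_keys:
        process_file(key)
        processed.add(key)

    save_processed_keys(processed)


if __name__ == "__main__":
    run_batch()
```

Is this a reasonable direction, or is there a more standard pattern for this?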
I was planning to use Airflow to schedule some Spark jobs to do the batch processing. What tool or technology would you recommend for this kind of batching?

Asked 2 years ago · 110 views
No answers


