Serverless file processing


Hi, I am building an architecture that follows the serverless pattern. The scenario: a file arrives in an S3 folder, and an S3 event triggers a Lambda function to validate and process it. Do we need a staging folder in S3 to perform our business operations? I am thinking of reading the file from the S3 landing folder where files arrive, performing the validation and business operations, and then moving the file to a duplicate or complete folder. In this flow, does it make sense to have a staging folder in addition to the landing folder?

Also, which option is more suitable for triggering Lambda from S3?

  • S3 events --> Lambda
  • S3 events --> SQS --> Lambda

Thank you.

asked 2 years ago, 397 views
3 Answers

It depends on your use case. If this is more of a pre-processing step, then it can be beneficial to move all valid files to the final destination and all invalid files to another folder. Moving the files gives you visibility into which files have been processed and which have not.

AWS
Marco
answered 2 years ago

I would definitely have one folder/prefix where files are delivered, and then put processed files somewhere else. If you put the processed files back in the original folder, the Lambda function will very likely be triggered again. To avoid this, you would need logic in the Lambda function that detects whether it is processing a new file, or you would have to rely on file extensions. It is much easier to avoid this situation (and a potential circular loop that invokes many Lambda functions) by using a different folder.
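As a rough illustration of the separate-prefix idea, here is a minimal sketch of a guard that only accepts keys under the landing prefix and maps them to a different output prefix. The prefix names `landing/` and `processed/` and the function `destination_key` are hypothetical, not something from the question. An S3 event notification can also be configured with a prefix filter so that only the landing prefix triggers the function; the check below is just a second line of defence inside the code.

```python
from typing import Optional

LANDING_PREFIX = "landing/"      # hypothetical prefix where files arrive
PROCESSED_PREFIX = "processed/"  # hypothetical prefix for finished files


def destination_key(key: str) -> Optional[str]:
    """Map a landing key to its processed key.

    Returns None for any key outside the landing prefix, so an event
    caused by the function's own output is ignored instead of looping.
    """
    if not key.startswith(LANDING_PREFIX):
        return None
    return PROCESSED_PREFIX + key[len(LANDING_PREFIX):]
```

A handler would call `destination_key` first and return early when it gets `None`.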

AWS
EXPERT
answered 2 years ago

I worked on a similar architecture in the past. We had a couple of issues with S3 >> Lambda that I would like to highlight.

  • For Lambda updates/deployments, we had to wait for off hours. If any files were delivered during that time, we had to copy them again to initiate the pipeline.
  • The same applied to Lambda failures: we had to find the file/object and copy it again to initiate the pipeline.

With S3 >> SQS >> Lambda,

  • For continuous delivery, the unprocessed messages (files) remain available in SQS for the configured retention period during a deployment. No manual effort is needed.
  • SQS enables retries and dequeuing, and failed messages likewise stay available for the retention period.
  • If you want to decouple the file upload from the start of processing, or delay the start, you can have the source team/system deliver the file and start the process only after a message is sent to SQS. Modern data pipeline platforms offer the ability to send a notification after a successful event.
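To make the S3 >> SQS >> Lambda flow a bit more concrete, here is a minimal sketch of a handler that unwraps the S3 event from each SQS message body and reports failed messages back so that only those are retried. It assumes the event source mapping has `ReportBatchItemFailures` enabled (without it, the return value is ignored). `process_file` is a hypothetical placeholder for the validation and business logic, and the extra `process` parameter exists only to make the sketch easy to test.

```python
import json
from urllib.parse import unquote_plus


def process_file(bucket: str, key: str) -> None:
    """Hypothetical placeholder for validation and business logic."""
    print(f"processing s3://{bucket}/{key}")


def handler(event, context, process=process_file):
    """Process each S3 object referenced by a batch of SQS messages."""
    failures = []
    for message in event.get("Records", []):
        try:
            body = json.loads(message["body"])
            # The s3:TestEvent that S3 sends has no "Records" key.
            for record in body.get("Records", []):
                bucket = record["s3"]["bucket"]["name"]
                # Object keys arrive URL-encoded in S3 notifications.
                key = unquote_plus(record["s3"]["object"]["key"])
                process(bucket, key)
        except Exception:
            # Only this message returns to the queue for a retry.
            failures.append({"itemIdentifier": message["messageId"]})
    return {"batchItemFailures": failures}
```

Messages listed in `batchItemFailures` become visible in the queue again after the visibility timeout, which is what removes the manual "find the file and copy it again" step described above.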

Back to the first part of the question: yes, having different folders would help. We had Staging, Processed and Failure.
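Since S3 has no rename operation, moving a file between folders like these is a copy followed by a delete. A minimal sketch, assuming `s3` is a boto3 S3 client (for example `boto3.client("s3")`); the function name `move_object` and the prefix strings in the usage line are hypothetical:

```python
def move_object(s3, bucket: str, key: str, dest_prefix: str) -> str:
    """Copy the object under dest_prefix, then delete the original.

    `s3` is expected to behave like a boto3 S3 client. Returns the new key.
    """
    filename = key.rsplit("/", 1)[-1]  # keep only the file name
    dest_key = dest_prefix + filename
    s3.copy_object(
        Bucket=bucket,
        Key=dest_key,
        CopySource={"Bucket": bucket, "Key": key},
    )
    # Delete only after the copy call has returned without raising.
    s3.delete_object(Bucket=bucket, Key=key)
    return dest_key
```

For example, `move_object(s3, "my-bucket", "staging/orders.csv", "failure/")` would leave the file at `failure/orders.csv`. Note that a single `copy_object` call only handles objects up to 5 GB; larger files need a multipart copy.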

AWS
answered 2 years ago
