It depends on your use case. If this is more of a pre-processing step, then it can be beneficial to move all the valid files to the final destination and all the invalid files to another folder. Moving the files gives you visibility into which files have been processed and which have not.
I would definitely have one folder/prefix where files are delivered, and put processed files somewhere else. If you put the processed files back in the original folder, the Lambda will likely be triggered again. To avoid this you would need logic in the Lambda function that detects whether it is processing a new file, or you would have to rely on file extensions. It is much easier to avoid the situation entirely (and the potential for a circular loop that invokes many Lambda functions) by using a different folder.
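A minimal sketch of the two safeguards above, assuming hypothetical `incoming/` and `processed/` prefixes (adjust to your own bucket layout); the prefix check is what prevents the copied object from re-triggering the function:

```python
# Hypothetical prefixes -- adjust to your bucket layout.
INCOMING_PREFIX = "incoming/"
PROCESSED_PREFIX = "processed/"

def is_new_upload(key: str) -> bool:
    """Only keys under the incoming prefix should be processed, so a
    copy into processed/ never re-invokes the function."""
    return key.startswith(INCOMING_PREFIX)

def processed_key(key: str) -> str:
    """Rewrite incoming/foo.csv -> processed/foo.csv."""
    return PROCESSED_PREFIX + key[len(INCOMING_PREFIX):]

def move_to_processed(bucket: str, key: str) -> None:
    """S3 has no atomic move, so 'moving' is a copy followed by a delete.
    Requires AWS credentials and boto3 at runtime."""
    import boto3  # imported lazily so the pure helpers above stay testable
    s3 = boto3.client("s3")
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": key},
                   Key=processed_key(key))
    s3.delete_object(Bucket=bucket, Key=key)
```

Ideally you also configure the S3 event notification itself with a prefix filter of `incoming/`, so the guard in code is only a second line of defense.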
I worked on a similar architecture in the past. There are a couple of issues I would like to highlight with the S3 >> Lambda approach:
- For Lambda updates/deployments, we had to wait for off hours. If any files were delivered during that time, we had to copy them again to re-initiate the pipeline.
- The same applied to Lambda failures: we had to find the file/object and copy it again to re-initiate the pipeline.
With S3 >> SQS >> Lambda:
- For continuous delivery, unprocessed messages (files) remain available in SQS for the configured retention period during a deployment. No manual effort is needed.
- SQS also handles retries: failed messages are redelivered, and remain available for the retention period (or can be routed to a dead-letter queue).
- If you want to decouple the file upload from the start of processing, or delay the start, the source team/system can deliver the file and processing can begin only once a message is sent to SQS. Modern data pipeline platforms offer the ability to notify after a successful event.
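With SQS in between, the Lambda receives SQS records whose bodies contain the S3 event notification JSON. A minimal handler sketch under that assumption (`process_file` is a hypothetical placeholder for your own logic):

```python
import json

def extract_s3_objects(sqs_event: dict) -> list:
    """Each SQS record's body is the JSON S3 event notification;
    pull out (bucket, key) pairs from it."""
    objects = []
    for record in sqs_event.get("Records", []):
        body = json.loads(record["body"])
        for s3_record in body.get("Records", []):
            s3 = s3_record["s3"]
            objects.append((s3["bucket"]["name"], s3["object"]["key"]))
    return objects

def process_file(bucket: str, key: str) -> None:
    """Hypothetical placeholder for the actual processing logic."""
    print(f"processing s3://{bucket}/{key}")

def handler(event, context):
    for bucket, key in extract_s3_objects(event):
        process_file(bucket, key)
    # Raising an exception here leaves the message on the queue for
    # redelivery; after maxReceiveCount attempts it moves to the DLQ.
```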
Back to the first part of the question: yes, having different folders helps. We had Staging, Processed and Failure.
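The three-folder routing above can be sketched as a small key-rewriting helper; the prefix names mirror the ones mentioned (Staging, Processed, Failure) but are otherwise assumptions:

```python
STAGING = "Staging/"

def route_key(key: str, succeeded: bool) -> str:
    """Staging/foo.csv -> Processed/foo.csv on success,
    Failure/foo.csv otherwise."""
    name = key[len(STAGING):] if key.startswith(STAGING) else key
    return ("Processed/" if succeeded else "Failure/") + name

def run_pipeline(bucket: str, key: str, process, move) -> str:
    """Process a staged object, then move it to Processed or Failure.
    `process` and `move` are injected so the routing logic stays pure."""
    try:
        process(bucket, key)
        dest = route_key(key, succeeded=True)
    except Exception:
        dest = route_key(key, succeeded=False)
    move(bucket, key, dest)  # e.g. an S3 copy + delete, as in the other answer
    return dest
```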