How to process millions of files concurrently with Lambda


I have millions of files in an S3 bucket and a Lambda function that processes a single file. What is the best way to process all these files as quickly as possible? It's not a one-time issue but something that I want to automate for future runs as well.

I understand that Lambda can, by default, run a function up to 1,000 times concurrently; I just wonder what the best way to trigger all of this is. Can I have S3 generate an event per file, or should I write another Lambda function that traverses the bucket and then sends each file to the processing function? Or is there some other clever service that can orchestrate this?

jvdb
Asked 6 months ago · 527 views

2 Answers
Accepted Answer

Processing millions of files in S3 is exactly what Step Functions Distributed Map was built for.
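To make that concrete, here is a minimal sketch (not part of the original answer) of a Distributed Map state machine created with boto3: it lists the bucket with an S3 ItemReader and invokes the processing Lambda once per object, fanning out child executions up to MaxConcurrency. The bucket name, function ARN, and role ARN are placeholder assumptions you would replace with your own.

```python
import json
import boto3

# Placeholders (assumptions): the bucket, the processing Lambda, and a
# Step Functions role with s3:ListBucket and lambda:InvokeFunction already exist.
BUCKET = "my-bucket"
PROCESS_FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:process-file"
STATES_ROLE_ARN = "arn:aws:iam::123456789012:role/StepFunctionsDistributedMapRole"

# Distributed Map state: reads the object listing directly from S3 and runs
# one child workflow per object, up to MaxConcurrency at a time.
definition = {
    "StartAt": "ProcessAllFiles",
    "States": {
        "ProcessAllFiles": {
            "Type": "Map",
            "ItemReader": {
                "Resource": "arn:aws:states:::s3:listObjectsV2",
                "Parameters": {"Bucket": BUCKET},
            },
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
                "StartAt": "ProcessFile",
                "States": {
                    "ProcessFile": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::lambda:invoke",
                        "Parameters": {
                            "FunctionName": PROCESS_FUNCTION_ARN,
                            # Each listed item has the shape {"Key": "...", "Size": ..., ...}
                            "Payload": {"bucket": BUCKET, "key.$": "$.Key"},
                        },
                        "End": True,
                    }
                },
            },
            "MaxConcurrency": 1000,
            "Label": "ProcessAllFiles",
            "End": True,
        }
    },
}

sfn = boto3.client("stepfunctions")
state_machine = sfn.create_state_machine(
    name="process-all-files",
    definition=json.dumps(definition),
    roleArn=STATES_ROLE_ARN,
)
sfn.start_execution(stateMachineArn=state_machine["stateMachineArn"])
```

Distributed Map handles the bucket listing pagination and concurrency throttling for you, which is what makes it a good fit for a backfill over millions of objects; you can rerun the same state machine for future batches as well.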

Answered 6 months ago · Reviewed by AWS Expert Uri 6 months ago

I would advise a two-part solution for this scenario.

First, to automate processing of future files, set up a trigger so that your Lambda is invoked every time a new file is uploaded to the S3 bucket. One of the simpler approaches is to use EventBridge, as shown in this example: https://serverlessland.com/patterns/s3-eventbridge. Alternatively, depending on how frequently files will be uploaded, a solution that uses SQS to trigger the Lambda may be preferable to avoid issues with invocation limits (e.g. https://serverlessland.com/patterns/s3-sqs-lambda).

As for processing the existing files, there are again multiple possible solutions, and which one is optimal depends on the details of the processing. One option, as you mention, is a separate Lambda that traverses the bucket and invokes the processing Lambda on each file while managing concurrency; a drawback is that the parent Lambda may run into the execution time limit. Similarly, a script with the same logic to traverse the bucket and invoke the Lambdas could be run on an EC2 instance or your local machine as a one-time process.

One final approach is to first implement the future automation using the SQS solution, then write a separate Lambda that traverses the bucket and, rather than invoking the processing Lambda directly, places a message on the SQS queue for each file. This removes the need to manage Lambda concurrency in the script, though the Lambda execution timeout should still be considered.
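For illustration, here is a rough sketch of that last backfill step, written as a standalone script rather than a Lambda (not part of the original answer): it pages through the bucket and enqueues one SQS message per object, in batches of ten. The bucket name, queue URL, and message shape are placeholder assumptions; your processing Lambda would need to parse whatever body you choose here (this simple JSON, not the native S3 event notification format).

```python
import json
import uuid
import boto3

# Placeholders (assumptions): replace with your bucket and the queue that
# already triggers the processing Lambda.
BUCKET = "my-bucket"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/file-processing-queue"

s3 = boto3.client("s3")
sqs = boto3.client("sqs")


def backfill() -> None:
    """List every object in the bucket and enqueue one SQS message per object."""
    paginator = s3.get_paginator("list_objects_v2")
    batch = []
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            batch.append({
                "Id": str(uuid.uuid4()),  # unique per batch entry
                "MessageBody": json.dumps({"bucket": BUCKET, "key": obj["Key"]}),
            })
            if len(batch) == 10:  # SQS batch limit is 10 messages
                send(batch)
                batch = []
    if batch:
        send(batch)


def send(entries: list) -> None:
    response = sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
    # Production code should retry failed entries; here we just report them.
    for failure in response.get("Failed", []):
        print(f"failed to enqueue: {failure}")


if __name__ == "__main__":
    backfill()
```

Because the queue absorbs the burst, SQS and the Lambda event source mapping control how fast the processing function is invoked, so the script itself does not need to manage concurrency.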

AWS
Answered 6 months ago · Reviewed by an AWS Expert 6 months ago
