Preventing Maximum Lambda Invocation Errors and Managing Large Traffic Spikes


Hi AWS,

Here is the improved version of my follow-up question: https://repost.aws/questions/QUfdvnhP_1TQKWkpLWxEjZjQ/high-frequency-of-aws-cloudwatch-alarm-for-invoking-lambda-functions.

We have created a Lambda function in our production environment whose purpose is to copy files from an S3 bucket in one platform (source account) to an S3 bucket in another platform (destination account).

The Lambda function processes events triggered by S3 (object uploads to the bucket). It extracts details from the event (source bucket and object key), processes this information, and copies the object from the source S3 bucket to a destination S3 bucket after modifying its folder structure and object key. It also handles different environments and performs some conditional logic based on the naming conventions of the source folder. This works well for files that arrive at low frequency, but files that arrive at higher frequency cause throttling issues: the resulting spikes exceed the configured threshold, and the Lambda function cannot scale at the same speed as the events in the S3 bucket. There are also account-level service quotas for Lambda functions:

Concurrency scaling rate --> 1000, Concurrent executions --> 5000

The CloudWatch alarm attached to the Lambda function was capturing errors at such a high rate that it has been disabled.

We have proposed some alternatives that would put less stress on the Lambda function and keep the spikes smaller or normal. Some of these are:

  1. S3 Replication (but since the Lambda function parses the naming convention, that functionality cannot be achieved with replication alone)
  2. Using a Python Glue job
  3. Creating an event-driven architecture (EDA) using S3 -> SQS -> Lambda (which helps Lambda scale at the same speed as the events in the S3 bucket). Which SQS type would be the best fit, Standard or FIFO?

Can you please advise which approach I should go with to avoid any downtime, considering this is a production change that might affect the business? And if there is another alternative that could be better, please let me know.

2 Answers

Hi,

My suggestion is to:

  1. Use S3 Replication to copy to the target bucket: it scales very high with high resilience and no effort on your side (much less work than Lambda).
  2. Use an S3 trigger to start a Lambda (see https://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html) when the object is replicated to the target bucket: this Lambda will "modify its folder structure and object key", in your own words.

Having the trigger on the target bucket will work fine, since S3 folders do not really exist and are just a mental model for users: see https://medium.com/@chamaln/why-s3-folders-dont-really-exist-35f29cf70477

Best,

Didier

AWS EXPERT · answered 4 days ago
  • Hi, but the Lambda function does some parsing, e.g. changing file names from hyphen (-) to underscore (_), which is not possible with S3 Replication; that requires custom logic, for which AWS Lambda is the suitable service.

  • Yes, that's exactly my point: 1/ you replicate to the target bucket with the original name via S3 Replication; 2/ when the file reaches the destination via replication, it is processed by the Lambda trigger, where you can change hyphens to underscores, etc. My proposal boils down to doing the name transformation after replication rather than before, as you do currently.


To address the issue of Lambda throttling and scaling problems when dealing with high-frequency file uploads to S3, I recommend implementing an event-driven architecture using S3 -> SQS -> Lambda. This approach will help manage large traffic spikes and prevent maximum Lambda invocation errors. Here's why this solution is effective and how to implement it:

  1. Use S3 event notifications to send messages to an SQS queue when new objects are uploaded.

  2. Configure your Lambda function to use the SQS queue as its event source instead of directly triggering from S3 events.

  3. Use a Standard SQS queue rather than a FIFO queue. Standard queues offer higher throughput and are better suited for this use case where exact ordering is not critical.

  4. Implement batch processing in your Lambda function. Configure the Lambda to process multiple messages from the queue in a single invocation. This reduces the number of Lambda invocations and helps manage concurrency more effectively.

  5. Set up a batch window for your Lambda function. AWS Lambda supports batch windows of up to 5 minutes for functions triggered by SQS. This allows you to accumulate multiple file events before processing them, which can help smooth out traffic spikes.

  6. Adjust the visibility timeout and message retention period of your SQS queue to ensure messages are not lost if processing takes longer than expected.

  7. Monitor the SQS queue depth and Lambda concurrency metrics in CloudWatch. Set up alarms to notify you if the queue starts to back up or if Lambda concurrency approaches its limit.
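The steps above could be sketched as an SQS-triggered handler that uses partial batch responses (`ReportBatchItemFailures`), so a single bad message does not force the whole batch back onto the queue. Here, `copy_one` is a stand-in for your existing copy/rename logic:

```python
import json
import urllib.parse


def s3_objects_from_sqs_record(sqs_record: dict):
    """Extract (bucket, key) pairs from an SQS message that wraps an S3 event."""
    body = json.loads(sqs_record["body"])
    for rec in body.get("Records", []):
        yield (rec["s3"]["bucket"]["name"],
               urllib.parse.unquote_plus(rec["s3"]["object"]["key"]))


def lambda_handler(event, context):
    failures = []
    for sqs_record in event["Records"]:
        try:
            for bucket, key in s3_objects_from_sqs_record(sqs_record):
                copy_one(bucket, key)  # hypothetical: your existing copy/rename logic
        except Exception:
            # Report only the failed message; successfully processed
            # messages in the batch are deleted from the queue.
            failures.append({"itemIdentifier": sqs_record["messageId"]})
    return {"batchItemFailures": failures}
```

For this return value to take effect, the event source mapping must be configured with the `ReportBatchItemFailures` function response type.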

This approach offers several advantages:

  • It decouples the event generation (S3 uploads) from event processing (Lambda function), allowing better handling of traffic spikes.
  • SQS acts as a buffer, smoothing out traffic and preventing Lambda from being overwhelmed during high-volume periods.
  • It allows for more efficient use of Lambda concurrency, as you can process multiple files in a single invocation.
  • It's more resilient to failures, as unprocessed messages remain in the queue and can be retried.

Implementing this solution should not cause significant downtime, as you can gradually transition to the new architecture:

  1. Set up the SQS queue and configure S3 event notifications to send to both the existing Lambda function and the new SQS queue.
  2. Create a new version of your Lambda function that processes events from SQS instead of directly from S3.
  3. Test the new setup thoroughly in a non-production environment.
  4. Once confident, switch the production Lambda to use the SQS queue as its event source.
  5. Monitor closely and adjust configurations as needed.

This approach should significantly reduce the stress on your Lambda function and help manage large traffic spikes more effectively.
Sources
AWS Lambda functions now scale 12 times faster when handling high-volume requests | AWS News Blog
Multiple Uploads to S3 trigger the Lambda Multiple times | AWS re:Post
Lambda behavior | AWS re:Post

answered 4 days ago
