[Merge Fellas Unlimited Shake] - Cost spike & throttling on Lambda. What's the right pattern?

Hi AWS Community,

My team is struggling to implement a data processing pipeline using what we call the "Merge Fellas Unlimited Shake" pattern.

The core of the problem is a specific Lambda function, invoke-unlimited-shake, which is part of a larger Step Functions state machine. This function takes multiple data sources, runs the Merge Fellas logic, and then performs the "unlimited shake", a heavy, idempotent data reconciliation process.

We are seeing two major issues:

  1. Throttling: The unlimited shake step often causes Lambda invocation throttling during peak hours.
  2. Cost: When the function isn't throttled, its duration is long and the cost is unpredictable. Running the Merge Fellas Unlimited Shake process at scale is becoming a financial concern.

We've considered a few options:

  • Provisioned Concurrency for the invoke-unlimited-shake Lambda.
  • Moving the entire Merge Fellas logic to a Fargate task.
  • Adding an SQS queue before the shake step to buffer requests.

Each has its own trade-offs. Has anyone implemented a similar unlimited shake or other heavy reconciliation task? What is the most cost-effective and scalable pattern for it, and how do you run a process like the Merge Fellas Unlimited Shake without breaking the bank?

I've posted a simplified version of our Step Function definition and the logs from the problematic invoke-unlimited-shake step here for context: https://modhello.com/merge-fellas/

Appreciate any insights or experiences you can share. Thanks!

asked a month ago · 80 views
1 Answer

When dealing with Lambda throttling and cost issues for heavy data processing workloads like your "Merge Fellas Unlimited Shake" pattern, there are several approaches to consider.

The throttling most likely occurs because the function hits a concurrency limit during peak hours: either the account-level concurrent executions quota for the Region or a reserved concurrency setting on the function itself. Lambda also limits how quickly concurrency can ramp up, so a sudden burst can be throttled before the quota is reached. In CloudWatch, you would see the ConcurrentExecutions metric plateau at the limit while the Throttles metric rises.
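
A rough way to check whether peak traffic fits under a concurrency limit is the relationship used in the Lambda documentation: concurrency is approximately requests per second multiplied by average duration in seconds. The sketch below applies it. The request rate, duration, and limit are made-up inputs, not values from the question, and 1,000 is only the default Regional quota, which may differ for your account.

```python
import math


def estimate_concurrency(requests_per_second: float, avg_duration_s: float) -> int:
    """Estimate steady-state concurrent executions (rate x duration), rounded up."""
    return math.ceil(requests_per_second * avg_duration_s)


def headroom(requests_per_second: float, avg_duration_s: float, limit: int) -> int:
    """Return remaining concurrency under `limit`; negative means throttling is likely."""
    return limit - estimate_concurrency(requests_per_second, avg_duration_s)


if __name__ == "__main__":
    # Hypothetical peak: 20 shake requests per second, 60 seconds each.
    needed = estimate_concurrency(20, 60)
    print(needed, headroom(20, 60, limit=1000))
```

With these hypothetical inputs the estimate is 1,200 concurrent executions, 200 more than a 1,000 limit, which is why long durations make throttling much more likely at the same request rate.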

Here are recommendations to address both issues:

  1. Implement an event-driven architecture with SQS:
  • Put an SQS queue in front of the Lambda function, for example S3 → SQS → Lambda, or a queue that your workflow writes to before the shake step
  • This decouples event generation from processing, so traffic spikes accumulate in the queue instead of becoming throttled invocations
  • Use a Standard SQS queue (not FIFO) for higher throughput when exact ordering isn't critical; because delivery is at-least-once, the consumer must be idempotent, which your reconciliation step already is
  • Configure a batch size on the SQS event source mapping so that each invocation processes multiple messages
  2. Consider Provisioned Concurrency:
  • Provisioned Concurrency keeps pre-initialized execution environments ready to respond
  • This reduces cold starts and provides more predictable latency
  • It's particularly useful for functions that need to respond quickly during traffic spikes
  • Be aware that you pay for the configured concurrency for as long as it is enabled, whether or not it is used
  3. Implement throttle tolerance strategies:
  • Use timeouts, retries, and backoff with jitter in the callers so that retried invocations are spread out over time
  • This gives Lambda time to scale up and keeps retries from arriving in synchronized bursts
  4. Optimize your function:
  • Review your function's memory configuration and execution time; more memory also allocates more CPU, so a larger setting can shorten duration enough to lower the total cost
  • Performance testing can help determine the optimum memory size
  • Consider breaking down the "unlimited shake" process into smaller, more manageable chunks
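
The retry and backoff-with-jitter recommendation above can be sketched as follows. This is a minimal illustration for a caller that invokes the function directly; inside Step Functions you would more likely declare retries in the state's Retry field. ThrottledError, backoff_delay, and call_with_retries are hypothetical names introduced for the example, and the base delay, cap, and attempt count are assumptions rather than tuned values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for a throttling error such as HTTP 429."""


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 30.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ThrottledError with jittered exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

Randomizing the whole delay, rather than adding a small random offset to a fixed schedule, keeps many throttled callers from retrying at the same instant. The sleep parameter is injectable only so the function can be exercised without real waiting.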

For your specific options:

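  • Provisioned Concurrency makes latency more predictable and absorbs bursts up to the configured amount, but it does not raise your concurrency quota and increases costs if oversized
  • Moving to Fargate might be appropriate if the workload is consistently high and long-running, especially if a single run approaches Lambda's 15-minute timeout
  • Adding an SQS queue is likely the most flexible solution, as it absorbs traffic spikes while letting you tune Lambda batch size and concurrency independently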
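
For the SQS option, a consumer might look roughly like the sketch below. It assumes the standard SQS event shape (Records with messageId and body) and uses Lambda's partial batch response, so only failed messages return to the queue. The reconcile function and its "sources" check are hypothetical placeholders for the shake logic, and the mapping settings are illustrative values of the kind you could pass to the event source mapping, not recommendations.

```python
import json
from typing import Any, Dict, List

# Settings one might pass to the SQS event source mapping (for example as
# keyword arguments to boto3's create_event_source_mapping). The values are
# illustrative assumptions, not tuned recommendations.
EVENT_SOURCE_MAPPING_SETTINGS: Dict[str, Any] = {
    "BatchSize": 10,
    "MaximumBatchingWindowInSeconds": 5,
    "FunctionResponseTypes": ["ReportBatchItemFailures"],
    "ScalingConfig": {"MaximumConcurrency": 50},  # caps concurrent consumers
}


def reconcile(payload: Dict[str, Any]) -> None:
    """Hypothetical placeholder for the idempotent 'unlimited shake' step."""
    if "sources" not in payload:
        raise ValueError("payload has no 'sources' field")


def handler(event: Dict[str, Any], context: Any) -> Dict[str, List[Dict[str, str]]]:
    """Process an SQS batch and report only the failed messages for retry."""
    failures: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        try:
            reconcile(json.loads(record["body"]))
        except Exception:
            # Only this message returns to the queue; the rest are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Because the reconciliation is described as idempotent, redelivery of a failed or duplicated message is safe. Capping the mapping's maximum concurrency keeps the queue consumer from using up the concurrency that other functions in the account need.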

The best approach may be a combination: implement the SQS buffer to manage traffic spikes, optimize your Lambda function's performance, and consider using provisioned concurrency during known peak periods if predictable response times are critical.
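
Before settling on a combination, it can help to put rough numbers on the Lambda and Fargate options. The sketch below is a simplified cost model, not a quote: every price and workload figure is a placeholder to be replaced with current values from the pricing pages for your Region. It also ignores Provisioned Concurrency charges and assumes a Fargate task processes one job at a time at the same speed as the Lambda function.

```python
def lambda_monthly_cost(
    invocations: int,
    avg_duration_s: float,
    memory_gb: float,
    price_per_gb_second: float,
    price_per_million_requests: float,
) -> float:
    """On-demand Lambda cost: GB-seconds of compute plus a per-request charge."""
    compute = invocations * avg_duration_s * memory_gb * price_per_gb_second
    requests = invocations / 1_000_000 * price_per_million_requests
    return compute + requests


def fargate_monthly_cost(
    task_hours: float,
    vcpu: float,
    memory_gb: float,
    price_per_vcpu_hour: float,
    price_per_gb_hour: float,
) -> float:
    """Fargate cost: vCPU-hours plus GB-hours for the time tasks are running."""
    return task_hours * (vcpu * price_per_vcpu_hour + memory_gb * price_per_gb_hour)


if __name__ == "__main__":
    # Placeholder workload and prices; substitute your own measurements.
    invocations, duration_s = 200_000, 45.0
    lam = lambda_monthly_cost(invocations, duration_s, 2.0, 0.0000167, 0.20)
    # Assumption: the same total work, run back to back on Fargate tasks.
    work_hours = invocations * duration_s / 3600
    far = fargate_monthly_cost(work_hours, 1.0, 2.0, 0.040, 0.0044)
    print(f"Lambda ~ ${lam:,.2f}/month, Fargate ~ ${far:,.2f}/month")
```

The comparison is only as good as the measured duration and the assumption about Fargate throughput, so treat the result as a starting point for load testing rather than a decision.
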
Sources
Best practices for working with AWS Lambda functions - AWS Lambda
Serverless Computing – AWS Lambda Pricing – Amazon Web Services
Troubleshoot event source mapping issues in Lambda - AWS Lambda
Preventing Maximum Lambda Invocation Errors and Managing Large Traffic Spikes | AWS re:Post

answered a month ago
