I'm working on making Lambda our primary compute environment. So far, that amounts to funneling data ingested via API Gateway to various endpoints (often similar in effect to the AWS IoT rules engine) and using DynamoDB to store configuration data.
The obstacle I'm currently grappling with is DynamoDB's provisioned throughput limits. In normal operation, we have a slow, steady stream of requests that doesn't begin to approach those limits. On rare occasions, though, I need to add a large data store, and as things are set up, that translates to a large number of near-simultaneous requests to DynamoDB. We don't have a latency requirement, however: within reason, I don't care when this operation completes, just that it does. If I could space these requests out to stay below our limits, the problem would be solved.
In essence, I want our burst response to distribute the load over time
as opposed to scaling up our systems.
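
For concreteness, the naive version of that spacing would look something like this (the table name and write budget are made up):

```python
import time
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("config")  # placeholder table name

def paced_bulk_write(items, writes_per_second=25):
    """Spread a burst of writes out so the sustained rate stays
    under the table's provisioned write capacity."""
    interval = 1.0 / writes_per_second  # e.g. 40 ms between writes
    for item in items:
        table.put_item(Item=item)
        time.sleep(interval)
```

The catch is that sleeping inside a single Lambda invocation burns billable time and will hit the function timeout on a big enough batch, which is what pushed me toward scheduling the work instead.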
Initially, I tried to set up a scheduler with CloudWatch Events: a function I could call to simply say "try this Lambda function again in X.Y minutes." However, I ran into a different limitation there of only being able to make 5 CloudWatch API requests per second. I hadn't solved the throughput issue so much as moved it to a different service.
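
Roughly, the attempt looked like this (a simplified sketch; the rule naming and payload shape are illustrative, and the target Lambda would still need to grant events.amazonaws.com invoke permission and clean up the rule after it fires):

```python
import json
import boto3

events = boto3.client("events")

def schedule_retry(payload, minutes, target_lambda_arn):
    """Create a one-shot CloudWatch Events rule that re-invokes the
    worker Lambda with the same payload after `minutes` minutes."""
    rule_name = f"retry-{payload['id']}"  # hypothetical unique name
    n = max(1, round(minutes))  # rate expressions only take whole minutes
    unit = "minute" if n == 1 else "minutes"
    events.put_rule(
        Name=rule_name,
        ScheduleExpression=f"rate({n} {unit})",
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{
            "Id": "retry-target",
            "Arn": target_lambda_arn,
            "Input": json.dumps(payload),  # delivered as the event
        }],
    )
```

Every deferred retry costs two CloudWatch Events API calls (put_rule and put_targets), so a burst of retries runs straight into that 5 requests/second ceiling.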
I have a couple of ways to solve this specific problem, but the overall scheduling design pattern is what I'm really interested in.
My initial thought is to introduce SQS between the API Gateway-fronted Lambda and DynamoDB. That Lambda would write the payload to SQS, and a CloudWatch alarm on queue depth would kick off an additional Lambda to process messages whenever the queue is non-empty. If there is an issue writing to DynamoDB, the message simply isn't removed from the queue and can be processed later.
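
A rough sketch of that pattern (the queue URL and table name are placeholders, and the alarm wiring is omitted):

```python
import json
import boto3
from botocore.exceptions import ClientError

sqs = boto3.client("sqs")
table = boto3.resource("dynamodb").Table("config")  # placeholder table
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest"  # placeholder

def ingest_handler(event, context):
    """API Gateway-fronted Lambda: buffer the payload in SQS instead
    of writing straight to DynamoDB."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(event))
    return {"statusCode": 202}

def drain_handler(event, context):
    """Worker kicked off by the queue-depth alarm: drain a bounded
    batch so DynamoDB writes stay under provisioned capacity."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
    for msg in resp.get("Messages", []):
        try:
            table.put_item(Item=json.loads(msg["Body"]))
        except ClientError:
            # Skip the delete: the message stays on the queue and
            # reappears after the visibility timeout for a retry.
            continue
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
```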