Unfortunately, there is no direct way at the ALB level to put a hard restriction on the number of connections a target will receive.
As you mentioned, Auto Scaling is an option: use an Auto Scaling group to dynamically adjust the number of instances based on traffic. If you haven't already explored Target Tracking (dynamic) and Predictive scaling policies, I'd suggest seeing which fits best for your use case. I've been in the same situation in the past, and I chose predictive scaling for my workloads since I could predict some of the traffic patterns.
For more details, refer to the AWS Documentation: Dynamic Scaling and AWS Documentation: Predictive Scaling.
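To make the target-tracking option concrete, here is a minimal sketch of attaching such a policy with boto3. The group name, policy name, and the 50% CPU target below are illustrative assumptions, not values from this thread:

```python
def build_target_tracking_config(target_value: float = 50.0) -> dict:
    """Build the TargetTrackingConfiguration payload for put_scaling_policy.
    Target-tracking adds/removes instances to keep the metric near target_value."""
    return {
        "PredefinedMetricSpecification": {
            # Scale so average CPU across the group stays near target_value.
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": target_value,
    }

def attach_policy(asg_name: str, target_value: float = 50.0) -> None:
    """Attach the policy to an existing ASG (requires AWS credentials)."""
    import boto3
    boto3.client("autoscaling").put_scaling_policy(
        AutoScalingGroupName=asg_name,
        PolicyName="keep-cpu-near-target",  # hypothetical policy name
        PolicyType="TargetTrackingScaling",
        TargetTrackingConfiguration=build_target_tracking_config(target_value),
    )
```

Predictive scaling is configured similarly but with `PolicyType="PredictiveScaling"` and its own configuration block.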
Comment here if you have additional questions, happy to help.
Abhishek
The main problem with auto scaling is that by the time new instances are up, all the requests have already been handed to upstream servers, i.e. there's no way for them to offload the work they've been given. The load balancer seems like the correct place to handle a work queue.
ELB doesn't directly support queuing like you're looking for. You'd need to implement your own custom load balancer layer using something like Nginx or HAProxy to do that (and even with those, it might be complicated to set up).
I think a better option might be to set up a queue-based system where you're able to make things a bit more async.
There are a bunch of StackOverflow discussions on it, but I've never implemented it, so I'm not going to link to a specific one since I can't vouch for them. In general, how you set this up depends on how much control you have over the client.
- If the client is just a web browser or something else you can't control, you'd send a reply back to the client synchronously saying "we received your request, check back here<link> for results" - and let the end user check back on their own.
- If you control the client, you could synchronously pass back a token, and have the client know to automatically poll back periodically using that token
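Either variant reduces to the same server-side pattern: accept the request, hand back a token immediately, and let results be fetched later. A minimal in-memory sketch (the store, token scheme, and function names are illustrative assumptions; a real service would keep jobs in a shared store such as DynamoDB or Redis so any frontend instance can answer polls):

```python
import uuid

# In-memory job store; illustrative only — a real deployment needs a
# shared store so polls can land on any frontend instance.
_jobs: dict[str, dict] = {}

def submit(payload: str) -> str:
    """Accept work and return a token immediately instead of blocking."""
    token = uuid.uuid4().hex
    _jobs[token] = {"status": "pending", "payload": payload, "result": None}
    return token

def poll(token: str) -> dict:
    """Client checks back later with its token."""
    job = _jobs.get(token)
    if job is None:
        return {"status": "unknown"}
    return {"status": job["status"], "result": job["result"]}

def worker_step(token: str) -> None:
    """Stand-in for a backend worker completing the job."""
    job = _jobs[token]
    job["result"] = job["payload"].upper()  # placeholder for real processing
    job["status"] = "done"
```

The synchronous reply to the browser case is then just `submit()` plus a page that re-requests `poll()`; the controlled-client case calls `poll()` automatically.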
For the ELB/ASG side of things, you could also then split into a two-tier setup, where a small frontend takes in these requests and responds to the follow-up polling requests from clients, but doesn't actually process the requests. The frontend just puts the request info in an SQS (or other) queue. The backend workers then pull jobs from that queue, and you autoscale on messages visible in the queue, or on AverageMessagesPerWorker (Note: You can do something similar to the approach in this doc with metric math on a single scaling policy without needing to go through all the steps of publishing a custom metric)
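The backlog-per-worker idea behind that policy is plain arithmetic: pick a target number of messages each worker should have outstanding, then size the group from the queue depth. A sketch of the calculation the metric-math expression effectively drives (the target and clamping bounds are illustrative assumptions):

```python
import math

def desired_capacity(visible_messages: int,
                     target_per_worker: int,
                     min_size: int = 1,
                     max_size: int = 20) -> int:
    """Size the worker group from SQS backlog: ceil(backlog / target),
    clamped to the group's min/max. Target tracking on a metric-math
    expression like ApproximateNumberOfMessagesVisible / group size
    converges on the same result."""
    raw = math.ceil(visible_messages / target_per_worker)
    return max(min_size, min(max_size, raw))
```

For example, with a target of 10 messages per worker, a backlog of 95 calls for 10 workers, and an empty queue falls back to the group minimum.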
Are you able to change the architecture around to have SQS (or some other queue) in between? Otherwise, load shedding by having the server send a quick reply to the client asking them to retry might be your best bet.
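Load shedding in that sense usually means returning 503 with a Retry-After header once in-flight work crosses a threshold, so the request fails fast instead of queueing on a busy server. A hedged sketch (the threshold and retry delay are illustrative assumptions):

```python
MAX_IN_FLIGHT = 100  # illustrative capacity limit, not from this thread

def shed_or_accept(in_flight: int, retry_after_s: int = 5):
    """Return (status, headers): 503 plus Retry-After when overloaded,
    200 with no extra headers otherwise. Well-behaved clients back off
    for Retry-After seconds before retrying."""
    if in_flight >= MAX_IN_FLIGHT:
        return 503, {"Retry-After": str(retry_after_s)}
    return 200, {}
```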
That's a good suggestion. A queue might be tricky since it would involve significant changes to our upstream architecture. That seems like the obvious approach if you have requests that are so slow that they clearly need to be async, such as video encoding, but do you have any links on typical ways to architect that when handling traditional synchronous HTTP requests?
EDIT: This turned into multiple long comments, just going to add an answer to the question