Auto Scaling: Refresh instances without downtime

0

I'm using an AWS Auto Scaling group with an AWS ALB and the following settings: Desired capacity: 1, Minimum capacity: 1, Maximum capacity: 3.

When I start an "Instance refresh" (with Minimum healthy percentage = 100%) for the Auto Scaling group, the one and only healthy instance is terminated before the new instance is ready and healthy, which results in downtime for my service.

When I set the desired capacity to 2 and start an instance refresh, the service remains available.

How can I make the instance refresh start a second instance first, wait until it is ready, and only then terminate the old instance, so that the desired capacity can stay at 1?

Max
asked a year ago, 2648 views
2 Answers
1
Accepted Answer

UPDATE

Instance Maintenance Policy is now live! This allows you to set a minimum healthy percentage on the ASG (or as part of an instance refresh). Setting MinHealthyPercentage to 100% means the group will launch replacement instances before terminating old ones during most replacement processes. There's also an accompanying blog post with more details.
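For anyone scripting this, here is a minimal sketch of starting a refresh with those preferences through boto3. The helper names `build_refresh_request` and `start_refresh` are made up for illustration, and the `MaxHealthyPercentage` of 200 is my assumption for a single-instance group (it leaves room for one extra instance during the replacement), so check it against your own capacity limits.

```python
def build_refresh_request(asg_name, min_healthy=100, max_healthy=200):
    """Build keyword arguments for the start_instance_refresh API call.

    min_healthy=100 asks the group to keep full capacity in service;
    max_healthy=200 is an assumed value that lets the group temporarily
    run one extra instance while the replacement launches.
    """
    return {
        "AutoScalingGroupName": asg_name,
        "Strategy": "Rolling",
        "Preferences": {
            "MinHealthyPercentage": min_healthy,
            "MaxHealthyPercentage": max_healthy,
        },
    }


def start_refresh(client, asg_name):
    """Start an instance refresh and return its ID.

    `client` is expected to be a boto3 Auto Scaling client,
    e.g. boto3.client("autoscaling").
    """
    response = client.start_instance_refresh(**build_refresh_request(asg_name))
    return response["InstanceRefreshId"]
```

Usage would look like `start_refresh(boto3.client("autoscaling"), "my-asg")`, where `my-asg` is a placeholder group name.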

Original answer below

Currently, an instance refresh always starts terminating the old instance and launching its replacement at the same time, which causes downtime in a single-instance ASG: https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-instance-refresh.html#instance-refresh-limitations

There is an internal feature request for fully launching the new instance before starting to terminate the old one, and I've added this post to it as a +1.

For now, you'll need to set the desired capacity to 2 before running the instance refresh, or, if the ASG is in a single AZ, you can just lower the desired capacity back down to 1 after the new instance has finished launching for a similar outcome.
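A rough sketch of the first workaround, assuming a boto3 Auto Scaling client is passed in as `client`. The function names, the polling interval, and the choice to leave the group at 2 when the refresh does not succeed are all my own assumptions, not something prescribed by the service. There is no timeout here, so a real script would want one.

```python
import time

# Statuses after which an instance refresh will not change any further.
TERMINAL_STATUSES = {
    "Successful", "Failed", "Cancelled",
    "RollbackSuccessful", "RollbackFailed",
}


def in_service_count(client, asg_name):
    """Count instances in the group that are InService and Healthy."""
    groups = client.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    instances = groups["AutoScalingGroups"][0]["Instances"]
    return sum(
        1 for inst in instances
        if inst["LifecycleState"] == "InService" and inst["HealthStatus"] == "Healthy"
    )


def refresh_with_spare_instance(client, asg_name, poll_seconds=15, sleep=time.sleep):
    """Scale to 2, run the refresh, then scale back to 1 on success."""
    client.set_desired_capacity(
        AutoScalingGroupName=asg_name, DesiredCapacity=2, HonorCooldown=False
    )
    # Wait until the second instance is actually serving.
    while in_service_count(client, asg_name) < 2:
        sleep(poll_seconds)

    refresh_id = client.start_instance_refresh(
        AutoScalingGroupName=asg_name,
        Preferences={"MinHealthyPercentage": 100},  # same setting as in the question
    )["InstanceRefreshId"]

    # Poll the refresh until it reaches a terminal status.
    while True:
        refreshes = client.describe_instance_refreshes(
            AutoScalingGroupName=asg_name, InstanceRefreshIds=[refresh_id]
        )
        status = refreshes["InstanceRefreshes"][0]["Status"]
        if status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)

    # Only shrink back when the refresh succeeded; otherwise leave the
    # extra capacity in place so someone can investigate.
    if status == "Successful":
        client.set_desired_capacity(
            AutoScalingGroupName=asg_name, DesiredCapacity=1, HonorCooldown=False
        )
    return status
```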

AWS
answered a year ago
AWS
EXPERT
reviewed 9 months ago
0

Hi,

Basically, with a single instance we can't really use the instance refresh mechanism if we want zero downtime. What I've found people doing is what Shahad suggested, i.e. determine that you are running only 1 instance -> increase the desired capacity -> decrease it back.

The problem I have with this solution is the timing. In my case, it's a Lambda function that triggers the instance refresh after a new AMI is available. So, in the case of 1 running instance, that Lambda could indeed increase the desired capacity to 2, but doesn't it have to wait until that instance is fully available before it resets the desired capacity back to 1 on the ASG? Otherwise we are back to the dreaded downtime. I could force the Lambda to wait with a loop that checks the status, but that's anything but elegant, not to mention the billed time it incurs. Any ideas? Is there any event we could listen for?
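One possibility, as far as I can tell from the EventBridge events that Auto Scaling emits, is a second Lambda triggered by the "EC2 Auto Scaling Instance Refresh Succeeded" event, so nothing has to sit in a loop. Below is a minimal sketch under that assumption. The function name `handle_refresh_succeeded` and the target capacity of 1 are hypothetical, and the EventBridge rule should be scoped to the one group you care about, since this handler would shrink any group it receives the event for.

```python
def handle_refresh_succeeded(event, client, target_capacity=1):
    """Lower the desired capacity once an instance refresh has succeeded.

    Returns the group name that was updated, or None if the event
    is not the one we are waiting for.
    """
    if event.get("source") != "aws.autoscaling":
        return None
    if event.get("detail-type") != "EC2 Auto Scaling Instance Refresh Succeeded":
        return None

    asg_name = event["detail"]["AutoScalingGroupName"]
    client.set_desired_capacity(
        AutoScalingGroupName=asg_name,
        DesiredCapacity=target_capacity,
        HonorCooldown=False,
    )
    return asg_name


def lambda_handler(event, context):
    # boto3 is imported here so the helper above can be exercised
    # without AWS dependencies; the Lambda runtime provides boto3.
    import boto3
    return handle_refresh_succeeded(event, boto3.client("autoscaling"))
```

The first Lambda would then only raise the desired capacity and start the refresh, and this one would finish the job when the event arrives.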

I have read about the Terraform 'create_before_destroy' solution, which basically duplicates the whole ASG to ensure high availability. What do you guys think about it?

Mehdi
answered a year ago
