Auto Scaling: Refresh instances without downtime


I'm using an AWS Auto Scaling group with an AWS ALB and the following settings: desired capacity 1, minimum capacity 1, maximum capacity 3.

When I start an "Instance refresh" (with Minimum healthy percentage = 100%) for the Auto Scaling group, the one and only healthy instance is terminated before the new instance is ready and healthy, which results in downtime for my service.

When I set the desired capacity to 2 and start an instance refresh, the service stays available.

How can I make instance refresh start a second instance first, wait until it is ready, and only then terminate the old instance, so that the desired capacity can stay at 1?

Max
Asked 1 year ago, viewed 2712 times
2 answers
Accepted answer

UPDATE

Instance maintenance policy is now live! This allows you to set a minimum healthy percentage on the ASG (or as part of your instance refresh). Setting MinHealthyPercentage to 100% means the group will now launch replacement instances before terminating old ones during most replacement processes. There's also an accompanying blog post with more details.
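As a rough illustration of the update above, here is a minimal Python sketch that starts an instance refresh with a launch-before-terminate preference. It assumes a boto3-style Auto Scaling client is passed in; the function names, the group name, and the choice of 200 for `MaxHealthyPercentage` are illustrative assumptions, not values from the answer, so check them against the current API reference before relying on them.

```python
from typing import Any


def build_refresh_request(group_name: str,
                          min_healthy: int = 100,
                          max_healthy: int = 200) -> dict[str, Any]:
    """Build keyword arguments for a StartInstanceRefresh call.

    min_healthy=100 asks the group never to drop below its desired
    capacity; max_healthy=200 (an assumed value) allows it to run
    extra instances temporarily while replacements come up.
    """
    return {
        "AutoScalingGroupName": group_name,
        "Strategy": "Rolling",
        "Preferences": {
            "MinHealthyPercentage": min_healthy,
            "MaxHealthyPercentage": max_healthy,
        },
    }


def start_refresh(client: Any, group_name: str) -> str:
    """Start the refresh and return its ID.

    `client` is expected to behave like boto3.client("autoscaling").
    """
    response = client.start_instance_refresh(**build_refresh_request(group_name))
    return response["InstanceRefreshId"]
```

With boto3 installed and credentials configured, `start_refresh(boto3.client("autoscaling"), "my-asg")` would be the call, where `my-asg` is a placeholder group name.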

Original answer below

Currently, an instance refresh always terminates the old instance and launches its replacement at the same time, which causes downtime in a single-instance ASG: https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-instance-refresh.html#instance-refresh-limitations

There is an internal feature request for fully launching the new instance before starting to terminate the old one, and I've added this post to it as a +1.

For now, you'll need to set the desired capacity to 2 before running the instance refresh. Alternatively, if the ASG is in a single AZ, you can just lower the desired capacity back down to 1 after the new instance is done launching, for a similar outcome.
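The workaround in the original answer could be sketched in Python roughly as follows: raise the desired capacity to 2, wait until two instances are in service, then start the refresh. This assumes a boto3-style Auto Scaling client is passed in; the function names, polling interval, and timeout are hypothetical, and waiting for the second instance before refreshing is an extra precaution of this sketch rather than something the answer prescribes.

```python
import time
from typing import Any, Callable


def count_in_service(client: Any, group_name: str) -> int:
    """Count instances the group reports as InService and Healthy."""
    resp = client.describe_auto_scaling_groups(AutoScalingGroupNames=[group_name])
    instances = resp["AutoScalingGroups"][0]["Instances"]
    return sum(
        1 for inst in instances
        if inst["LifecycleState"] == "InService" and inst["HealthStatus"] == "Healthy"
    )


def refresh_with_spare(client: Any,
                       group_name: str,
                       poll_seconds: float = 15.0,
                       timeout_seconds: float = 900.0,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Scale out to 2, wait for both instances, then start a refresh."""
    client.set_desired_capacity(
        AutoScalingGroupName=group_name, DesiredCapacity=2, HonorCooldown=False
    )
    waited = 0.0
    while count_in_service(client, group_name) < 2:
        if waited >= timeout_seconds:
            raise TimeoutError(f"{group_name} did not reach 2 healthy instances")
        sleep(poll_seconds)
        waited += poll_seconds
    # Only start replacing instances once a spare is serving traffic.
    return client.start_instance_refresh(AutoScalingGroupName=group_name)["InstanceRefreshId"]
```

Lowering the desired capacity back to 1 is deliberately left out here, since it should only happen once the refresh has finished.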

AWS
Answered 1 year ago
AWS
Expert
Reviewed 9 months ago

Hi,

Basically, with a single instance we can't really use the instance refresh mechanism if we want zero downtime. What I've found people doing is what Shahad suggested, i.e. determine that you are running only 1 instance, increase the desired capacity, then decrease it back. The problem I have with this solution is the timing. In my case, it's a Lambda that triggers the instance refresh after a new AMI is available. With 1 running instance, that Lambda could indeed increase the desired capacity to 2, but doesn't it have to wait until the new instance is fully available before it resets the desired capacity back to 1 on the ASG? Otherwise we are back to the dreaded downtime. I could force the Lambda to wait in a loop that checks the status, but that's anything but elegant, not to mention the billed time it incurs. Any ideas? Is there an event we could listen for?

I have read about the 'create_before_destroy' Terraform solution, which basically duplicates the whole ASG to ensure high availability. What do you guys think about it?
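On the question of an event to listen for: one possible approach, sketched below in Python, is a second Lambda triggered by an EventBridge rule that lowers the desired capacity once the refresh reports success, so the first Lambda never has to wait. The `detail-type` string and the `AutoScalingGroupName` detail field are written as I recall them from the Auto Scaling event reference and should be verified; `make_handler` and the steady capacity of 1 are hypothetical names and values for illustration.

```python
from typing import Any, Callable

# Assumed EventBridge pattern for the rule that triggers the handler.
REFRESH_SUCCEEDED_PATTERN = {
    "source": ["aws.autoscaling"],
    "detail-type": ["EC2 Auto Scaling Instance Refresh Succeeded"],
}


def make_handler(client: Any, steady_capacity: int = 1) -> Callable[..., bool]:
    """Return a Lambda-style handler bound to an Auto Scaling client.

    `client` is expected to behave like boto3.client("autoscaling").
    """
    def handler(event: dict, context: Any = None) -> bool:
        # Ignore anything that is not a successful instance refresh.
        if event.get("source") != "aws.autoscaling":
            return False
        if event.get("detail-type") != "EC2 Auto Scaling Instance Refresh Succeeded":
            return False
        group_name = event["detail"]["AutoScalingGroupName"]
        # All instances are refreshed by now, so scaling in is safe
        # regardless of which one the termination policy picks.
        client.set_desired_capacity(
            AutoScalingGroupName=group_name,
            DesiredCapacity=steady_capacity,
            HonorCooldown=False,
        )
        return True

    return handler
```

The trade-off is an extra rule and function to maintain, in exchange for not paying for a Lambda that sits in a polling loop.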

Mehdi
Answered 1 year ago
