AWS Auto Scaling rebalance starts too often!

Hello

We are facing an issue with too-frequent rebalancing in the Auto Scaling group for our gitlab-runner. The instance type is c5.xlarge (Spot lifecycle, of course). A rebalance is invoked almost every minute. We have 3 Availability Zones and 6 subnets attached in total. If I am not mistaken, the issue is not related to a lack of resources in one specific AZ, because we use 3 AZs simultaneously and still have the problem. We also have rebalancing enabled on another ASG, with instance types from t3.micro up to t3.small or even t3.medium, and everything works smoothly there (for now). It looks like the issue happens only with compute optimized instance types like c5.xlarge. Any ideas how it could be solved? Could it be that AWS does not have enough c5.xlarge capacity in this particular Region? We are in eu-west-2. Or could changing the instance type from c5 to c5n or c5a fix the issue? Thanks.

asked 6 months ago, 225 views
1 Answer
  1. What kind of rebalancing is happening? (It should say in the activity history message.) It sounds like this is from Spot rebalance notifications. Or is there also AZ rebalancing happening?
  2. What allocation strategy are you using? You should not use lowest-price with Capacity Rebalancing enabled, since the rebalance launch can end up going right back into a low-capacity pool. We recommend using price-capacity-optimized for most ASG use cases, as this strategy balances both the price and the capacity of each instance type when launching.
  3. How many instance types are in the ASG? Are you using attributes to define them, or an explicit list? When using Spot, we recommend having 10 instance types at a minimum (with more being better, of course, to reduce capacity-related issues). If you're using attributes to define your list, make sure you've included the Price Protection attributes to include all matching instance types.
  4. Is there a MaxPrice set? This increases chances of interruption, since only instance types

When the rebalance launch is attempted, the ASG will first try to launch instances into the same AZ as the termination is going to happen in (to maintain AZ balance). So if the instance which received the rebalance notification is in AZ 1a, the replacement will also be launched in 1a if there's available spot capacity. The ASG will only fail over to try one of your other AZs if EC2 doesn't return capacity in the first one being attempted.
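
For question 1 above, the activity history can also be tallied programmatically. The following is a minimal sketch, not an official tool. It assumes the message shows up in the `Cause` field of each scaling activity (for example, the `Activities` list returned by boto3's `describe_scaling_activities`). The Spot phrase is the one quoted in this thread; the AZ rebalancing phrase is an assumption about the wording and should be checked against your own history. The sample data is hypothetical.

```python
from collections import Counter

# Phrase quoted in this thread for Spot-driven replacements.
SPOT_REBALANCE_PHRASE = "instance rebalance recommendation"
# Assumed wording for AZ rebalancing activities; verify against your own history.
AZ_REBALANCE_PHRASE = "balancing the group's zones"


def classify_cause(cause):
    """Label one scaling activity by the text of its Cause field."""
    text = cause.lower()
    if SPOT_REBALANCE_PHRASE in text:
        return "spot_rebalance"
    if AZ_REBALANCE_PHRASE in text:
        return "az_rebalance"
    return "other"


def summarize_activities(activities):
    """Count activities per category from a list of activity dicts."""
    return Counter(classify_cause(a.get("Cause", "")) for a in activities)


# Hypothetical sample shaped like the "Activities" list of the API response.
sample = [
    {"Cause": "At 2024-01-01T10:00:00Z an instance was taken out of service "
              "in response to an EC2 instance rebalance recommendation."},
    {"Cause": "At 2024-01-01T10:01:00Z an instance was launched to aid in "
              "balancing the group's zones."},
    {"Cause": "At 2024-01-01T10:02:00Z a user request updated the group."},
]
summary = summarize_activities(sample)
print(dict(summary))
```

If nearly every entry falls under `spot_rebalance`, the churn comes from Spot rebalance recommendations rather than from AZ rebalancing.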

AWS
answered 6 months ago
EXPERT
reviewed 24 days ago
  • Hi. Answering the questions:

    1. The only thing I can see in the Auto Scaling group's events when a rebalance happens is: "an instance was taken out of service in response to an EC2 instance rebalance recommendation."
    2. We do not use any allocation strategy, because we use only one instance type, c5.xlarge. At least I cannot find any reference to an allocation strategy in the AWS console, either in the ASG or in the launch template. These settings are simply absent there. I even ran the command aws autoscaling describe-auto-scaling-groups --auto-scaling-group-name 'gitlab-runner' --output text | grep -i "*startegy*" and it returns nothing.
    3. A single instance type with no attributes. Just a default template with a predefined custom image, subnet, security group, and some binaries installed.
    4. There is no such option enabled at the moment. We used it before, but the market price limit was somehow reached and we lost the runner at that time, so we dropped the idea of limiting the price, and this option is now unset. So now we have this instance constantly, but with the AWS CLI option "--capacity-rebalance" enabled the instance is recreated every minute, although its health status is fine. If we remove rebalancing (just update the ASG with the parameter "--no-capacity-rebalance"), the instance is stable (no recreation) and we can use it for a day or even more, until the instance's Spot request is closed. We use a one-time Spot request, so no expiration date is set.
  • Thanks for that additional info! Based on that, what's likely happening is:

    1. The instance in a given AZ gets a rebalance recommendation because it is at an elevated risk of interruption, and the ASG tries to launch a replacement.
    2. There is still enough capacity in the same AZ for a replacement to be launched.
    3. The replacement is the same instance type in the same AZ, so it is at the same elevated risk, and the loop repeats.

    While EC2 and ASG do try and prevent behaviors like this, the feature is designed around having multiple instance types in an ASG for spot to move between: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-capacity-rebalancing.html#capacity-rebalancing-behavior

    I would recommend you look into adding multiple instance types to the ASG for use with Spot. If that's not possible for your workload, then disabling Capacity Rebalance will reduce the churn you're seeing, but it means Spot instances will be interrupted without any proactive action from the ASG.
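
As a rough illustration of the recommendation above, here is a minimal sketch of what moving from a single instance type to a mixed instances policy could look like. It only builds the request payload, using the parameter names that boto3's `update_auto_scaling_group` accepts. The launch template name is hypothetical, and the list of instance types is purely illustrative: check which types suit your runner and are offered in eu-west-2. As far as I know, a launch template used this way should not itself request Spot, so the Spot setting would move from the template into the ASG. The helper `current_spot_strategy` shows that a group defined by a plain launch template has no allocation strategy field to find at all.

```python
import json


def current_spot_strategy(group):
    """Return the Spot allocation strategy of an ASG description, or None.

    `group` is one entry of "AutoScalingGroups" from describe-auto-scaling-groups.
    Groups without a MixedInstancesPolicy carry no strategy at all.
    """
    policy = group.get("MixedInstancesPolicy")
    if policy is None:
        return None
    return policy.get("InstancesDistribution", {}).get("SpotAllocationStrategy")


def build_mixed_instances_update(asg_name, template_name, instance_types):
    """Build kwargs for update_auto_scaling_group with an all-Spot mixed policy."""
    return {
        "AutoScalingGroupName": asg_name,
        "CapacityRebalance": True,
        "MixedInstancesPolicy": {
            "LaunchTemplate": {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateName": template_name,
                    "Version": "$Latest",
                },
                # One override per instance type the group may launch.
                "Overrides": [{"InstanceType": t} for t in instance_types],
            },
            "InstancesDistribution": {
                # Zero On-Demand capacity means the whole group runs on Spot.
                "OnDemandBaseCapacity": 0,
                "OnDemandPercentageAboveBaseCapacity": 0,
                "SpotAllocationStrategy": "price-capacity-optimized",
            },
        },
    }


# Illustrative candidates only (assumption): confirm fit and regional availability.
CANDIDATE_TYPES = [
    "c5.xlarge", "c5a.xlarge", "c5d.xlarge", "c5n.xlarge", "c6i.xlarge",
    "c6a.xlarge", "m5.xlarge", "m5a.xlarge", "m6i.xlarge", "r5.xlarge",
]

# Hypothetical description of a group that uses a single launch template.
single_type_group = {"AutoScalingGroupName": "gitlab-runner", "CapacityRebalance": True}
print(current_spot_strategy(single_type_group))

# "gitlab-runner-template" is a made-up launch template name.
payload = build_mixed_instances_update(
    "gitlab-runner", "gitlab-runner-template", CANDIDATE_TYPES
)
print(json.dumps(payload, indent=2))
# To apply it (requires boto3 and credentials):
#   boto3.client("autoscaling").update_auto_scaling_group(**payload)
```

Building the payload separately from the API call makes it easy to review the change before applying it to a live group.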
