How frequently does an ASG attempt to remove instances when current size is greater than desired?


I have an EC2 ASG with scaling triggers based on CPU utilization. It usually follows a predictable pattern: scaling out during periods of high usage and removing instances as load decreases. My instances will sometimes mark themselves as protected from scale-in when they are working on something longer-running than their normal tasks. If all instances are protected, I get the message "Could not scale to desired capacity because all remaining instances are protected from scale-in" in CloudWatch. It appears that after that message, the next scale-in attempt doesn't occur for quite a while: when this happened yesterday, it was 10 hours later. Since my instances only protect themselves for a short amount of time, the scale-in would have succeeded during most of those 10 hours.
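
For reference, the self-protection step looks roughly like the minimal sketch below (assuming boto3, an instance role that allows autoscaling:SetInstanceProtection, and a hypothetical ASG name of "my-asg"; the real mechanism may differ):

```python
# Minimal sketch: an instance toggling its own scale-in protection.
# Assumes boto3, permission for autoscaling:SetInstanceProtection, and a
# hypothetical ASG name ("my-asg").
import urllib.request

import boto3

ASG_NAME = "my-asg"  # hypothetical ASG name


def _instance_id() -> str:
    """Look up this instance's ID from the instance metadata service (IMDSv2)."""
    token_req = urllib.request.Request(
        "http://169.254.169.254/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req).read().decode()
    id_req = urllib.request.Request(
        "http://169.254.169.254/latest/meta-data/instance-id",
        headers={"X-aws-ec2-metadata-token": token},
    )
    return urllib.request.urlopen(id_req).read().decode()


def set_my_protection(protected: bool) -> None:
    """Mark (or unmark) this instance as protected from scale-in."""
    boto3.client("autoscaling").set_instance_protection(
        InstanceIds=[_instance_id()],
        AutoScalingGroupName=ASG_NAME,
        ProtectedFromScaleIn=protected,
    )


# e.g. set_my_protection(True) before a long-running task,
#      set_my_protection(False) when it finishes
```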

My question: is there a way to configure the ASG so that it retries the scale-in sooner than 10 hours later? Or is there a way I could respond to the failed attempt so that an instance could take itself offline?

(I do understand that ideally the instances wouldn't protect themselves in the first place, and that's part of a larger update to the architecture. But a short-term fix to the existing solution would be great.)

To respond to the questions: the alarm triggered based on low utilization and immediately reduced the desired count; at that point the alarm was no longer set. I'm looking at the ASG Activity History pane, and there is nothing between message 1 (which indicates that the desired size was reduced and that no instance could be removed) and message 2 (which indicates that a particular instance was removed due to a difference between current and desired capacity).

  • If you check the ASG activity history, did it actually not scale for 10 hours? Or was the next message in CloudWatch not for 10 hours? CloudWatch will only record messages on state change, but it keeps retrying as long as the alarm stays in the ALARM state.

    Did the Alarm stay in the ALARM state the whole time, or did it move back to OK at some point?

    Both the ASG and CloudWatch will retry much more often than every 10 hours, but the answers to these two questions will help me help you more clearly.

1 Answer

Hello,

First, some quick answers to your questions:

  • Is there a way to configure the backoff? No.
  • Can instances take themselves offline? Yes (see the sketch after this list).
  • Is there a way to respond faster? Yes: any UpdateAutoScalingGroup call (i.e., most changes made from the overview tab of the ASG console) will reset the backoff counter.
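
A minimal sketch of those two workarounds, assuming boto3, an instance role with the relevant Auto Scaling permissions, and a hypothetical ASG name of "my-asg" (this is one way to do it, not the only one):

```python
# Sketch of the two short-term workarounds described above.
# Assumes boto3 and a hypothetical ASG name ("my-asg").
import boto3

ASG_NAME = "my-asg"  # hypothetical
autoscaling = boto3.client("autoscaling")


def take_myself_offline(instance_id: str) -> None:
    """Let an instance remove itself once its long-running work is done:
    terminate it and decrement the desired capacity in one call.
    (If the instance is still marked protected, it can clear that first
    via SetInstanceProtection.)"""
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=True,
    )


def nudge_asg() -> None:
    """Re-assert the group's current max size via UpdateAutoScalingGroup.
    The update itself changes nothing, but (per the answer above) the call
    resets the ASG's scale-in backoff counter."""
    group = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[ASG_NAME]
    )["AutoScalingGroups"][0]
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=ASG_NAME,
        MaxSize=group["MaxSize"],
    )
```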

Longer answer: There are two different asynchronous processes here, the alarm and the ASG. While the alarm is in the ALARM state, it triggers the scaling policy once per minute, and the scaling policy decides whether the desired capacity needs to be changed. In your case this happened right away (note that the alarm history only shows the entry from when the state changed, but the action is happening once a minute in the background).

The actual values for the ASG timers are internal, but I've provided some rough values here for context. After the desired capacity has been changed, the ASG will periodically (less than a minute) check for differences between the desired and actual capacity to see if it needs to launch or terminate instances to make them match. If that fails, it keeps retrying each period, and in some situations enters an exponential backoff state. Unless there are a lot of consecutive failures, this backoff will usually be less than an hour between attempts.

Since you mentioned 10 hours, I'm guessing that either:

  • The instances stayed protected from scale-in for a long time
  • The high usage alarm went off again after that and increased the desired back up to the original value (this wouldn't show in the ASG activity history, since there wasn't an actual launch or terminate event from it; but it would show on the high alarm). Example:
    • ASG started at a desired of 8
    • low alarm and policy moved desired to 7 (termination failed from scale-in protection)
    • before scale-in protection was removed the high alarm set desired back to 8
    • hours later the low alarm again lowers the desired to 7 and you see scaling happen
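
To check which of these happened, something like the sketch below can pull the recent scaling activities and the high alarm's state changes side by side (assuming boto3 and hypothetical names for the ASG and the high-utilization alarm):

```python
# Diagnostic sketch: compare ASG scaling activities with the high alarm's
# state changes. Assumes boto3 and hypothetical resource names.
import boto3

ASG_NAME = "my-asg"             # hypothetical
HIGH_ALARM = "my-asg-high-cpu"  # hypothetical

# Recent scaling activities: every launch/terminate attempt and its cause.
autoscaling = boto3.client("autoscaling")
for activity in autoscaling.describe_scaling_activities(
    AutoScalingGroupName=ASG_NAME, MaxRecords=20
)["Activities"]:
    print(activity["StartTime"], activity["StatusCode"], activity["Description"])

# State changes of the high alarm: shows whether it re-entered ALARM and
# pushed the desired capacity back up between the two activity messages.
cloudwatch = boto3.client("cloudwatch")
for item in cloudwatch.describe_alarm_history(
    AlarmName=HIGH_ALARM, HistoryItemType="StateUpdate", MaxRecords=20
)["AlarmHistoryItems"]:
    print(item["Timestamp"], item["HistorySummary"])
```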

As a side note, you may want to look into the AWS Batch service if you're running batch-processing jobs.
