Hello,
First some quick answers to your questions:
Is there a way to configure the backoff:
No
Can instances take themselves offline:
Yes:
- https://docs.aws.amazon.com/cli/latest/reference/autoscaling/terminate-instance-in-auto-scaling-group.html
- If you go this route, you'll probably want to suspend the AZRebalance process to prevent excessive churn in the ASG (Auto Scaling Group); a sketch of both calls follows this list
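For reference, here's a minimal sketch of that approach with the AWS CLI; the group name and instance ID are placeholders for your own values:

```
# Suspend AZRebalance first so the ASG doesn't churn instances across AZs
# while they are taking themselves offline
aws autoscaling suspend-processes \
    --auto-scaling-group-name my-asg \
    --scaling-processes AZRebalance

# Run from the instance itself (or anywhere with credentials): terminate
# the instance and decrement the desired capacity in one step
aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id i-0123456789abcdef0 \
    --should-decrement-desired-capacity
```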
Is there a way to respond faster:
Yes:
- Any UpdateAutoScalingGroup call (that is, most changes made from the overview tab of the ASG console) will reset the backoff counter; see the sketch below
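For example, one low-impact way to make such a call from the CLI is to re-assert the current desired capacity (the name and value below are placeholders); I'm assuming here that a no-op update still counts as an UpdateAutoScalingGroup call:

```
# Re-asserting the current desired capacity is still an
# UpdateAutoScalingGroup call, which resets the backoff counter
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-asg \
    --desired-capacity 7
```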
Longer answer: There are two different asynchronous processes here, the alarm and the ASG. While the alarm is in the ALARM state, it triggers the scaling policy once per minute, and the scaling policy decides whether the desired capacity needs to change. In your case this happened right away (note that the alarm history only shows an entry from when the state changed, but the action fires once a minute in the background).
The actual values for the ASG timers are internal, but I've provided some rough values here for context. After the desired capacity has been changed, the ASG periodically (at sub-minute intervals) checks for a gap between the desired and actual capacity to see if it needs to launch or terminate instances to make them match. If that fails, it keeps retrying each period, and in some situations enters an exponential backoff state. Unless there are a lot of consecutive failures, this backoff will usually be less than an hour between attempts.
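Each of those launch/terminate attempts, including failed ones, is recorded in the ASG activity history, so you can watch the retries there. A quick way to pull it (the group name is a placeholder):

```
# Lists recent scaling activities, newest first, including failure causes
aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name my-asg \
    --max-items 20
```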
Since you mentioned 10 hours, I'm guessing that either:
- The instances stayed protected from scale-in for a long time (you can verify this; see the sketch after this list)
- The high usage alarm went off again afterwards and raised the desired capacity back to the original value (this wouldn't show in the ASG activity history, since no actual launch or terminate event resulted from it, but it would show on the high alarm). Example:
- The ASG started at a desired capacity of 8
- The low alarm and its policy moved the desired capacity to 7 (the termination failed because of scale-in protection)
- Before the scale-in protection was removed, the high alarm set the desired capacity back to 8
- Hours later, the low alarm lowered the desired capacity to 7 again, and this time you see the scaling happen
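If the first theory (lingering scale-in protection) is the culprit, you can check which instances still have it set and clear it; the group name, instance ID, and query expression here are illustrative:

```
# List instances in the group that still have scale-in protection enabled
aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names my-asg \
    --query 'AutoScalingGroups[0].Instances[?ProtectedFromScaleIn].[InstanceId]' \
    --output text

# Remove the protection from a specific instance
aws autoscaling set-instance-protection \
    --auto-scaling-group-name my-asg \
    --instance-ids i-0123456789abcdef0 \
    --no-protected-from-scale-in
```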
As a side note, you may want to look into the AWS Batch service if you're running batch processing jobs.
If you check the ASG activity history, did it actually not scale for 10 hours? Or was it just that the next message in CloudWatch didn't appear for 10 hours? CloudWatch only records messages on state change, but it keeps retrying as long as the alarm stays in the ALARM state.
Did the Alarm stay in the ALARM state the whole time, or did it move back to OK at some point?
Both the ASG and CloudWatch will retry much more often than every 10 hours, but the answers to these two questions will help me help you more clearly.
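To answer the second question, the alarm's state transitions are queryable; something like this (the alarm name is a placeholder) will show every move between OK, ALARM, and INSUFFICIENT_DATA:

```
# State transitions only; actions repeat silently while the state is ALARM
aws cloudwatch describe-alarm-history \
    --alarm-name my-low-usage-alarm \
    --history-item-type StateUpdate
```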