Target-tracking-metric autoscaling policy causes thrashing


Hi, we have a target-tracking autoscaling policy that was created almost verbatim from this example: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-target-tracking-metric-math.html The policy had been working fine for quite some time (~1 year) until it caused a recent incident. We observed thrashing behavior that we could not explain; it only stopped after all instances in the Auto Scaling group were manually terminated and relaunched.

During the incident we saw multiple notifications like the one below, where the group would scale out from 1 to 8 and then immediately terminate those instances, going from 8 back to 1. Our service could not run because instances kept getting killed and relaunched.

Successful Terminating EC2 instance: <i->
At 2025-01-13T19:18:47Z a monitor alarm <TargetTracking-> in state ALARM triggered policy <VisibleMessage-Scale-> changing the desired capacity from 8 to 1.
At 2025-01-13T19:18:50Z an instance was taken out of service in response to a difference between desired and actual capacity, shrinking the capacity from 8 to 1.
At 2025-01-13T19:18:50Z instance <i-> was selected for termination.
At 2025-01-13T19:18:50Z instance <i-> was selected for termination.
At 2025-01-13T19:18:50Z instance <i-> was selected for termination.
At 2025-01-13T19:18:50Z instance <i-> was selected for termination.
At 2025-01-13T19:18:50Z instance <i-> was selected for termination.
At 2025-01-13T19:18:51Z instance <i-> was selected for termination.
At 2025-01-13T19:18:51Z instance <i-> was selected for termination.
2025 January 13, 02:18:50 PM -05:00

Successful Launching a new EC2 instance: <i->
At 2025-01-13T19:15:47Z a monitor alarm <TargetTracking-> in state ALARM triggered policy <VisibleMessage-Scale> changing the desired capacity from 1 to 8.
At 2025-01-13T19:15:53Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 1 to 8.
2025 January 13, 02:15:54 PM -05:00

Also attached are pictures of the tracked alarm metric. The m2 metric (number of running instances) fluctuates between 1 and 8 very frequently (every ~3 minutes), while the m1 metric (number of messages in the SQS queue) stays constantly high.

m1 = number of messages in the SQS queue
m2 = number of instances
e1 = m1 / m2

Thank you!

1 Answer

The behavior you're experiencing is known as "thrashing" in the context of Auto Scaling, and it can occur with target tracking policies under certain conditions. In your case, it appears that the Auto Scaling group is rapidly scaling out and then immediately scaling in, causing instability in your service.

This thrashing is most likely caused by a mismatch between the scaling metric (SQS queue depth divided by the number of instances) and the actual processing capacity of your instances. When the group scales out, the newly launched instances are counted in m2 almost as soon as they are in service, even though they have not yet started pulling messages from the queue. The backlog-per-instance value e1 = m1/m2 therefore drops sharply (from m1/1 to m1/8 in your case), the target tracking alarm sees a value far below the target, and the group scales back in. Once capacity is back to 1, e1 spikes again and the cycle repeats. The fact that m1 stayed constantly high also suggests the instances were never actually draining the queue, which would be consistent with the problem clearing only after the instances were manually replaced.
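
For reference, a policy modeled on that AWS guide looks roughly like the sketch below when created with boto3. The group name (my-asg), queue name (my-queue), warm-up value, and target value are placeholders, not your actual configuration; the important parts are the m1/m2/e1 metric math and the EstimatedInstanceWarmup field, which is one of the levers discussed below.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Sketch of a backlog-per-instance target tracking policy, modeled on the
# linked AWS guide. Names and values are placeholders.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-asg",                      # placeholder
    PolicyName="backlog-per-instance-target-tracking",
    PolicyType="TargetTrackingScaling",
    EstimatedInstanceWarmup=300,  # seconds a new instance is given to warm up
    TargetTrackingConfiguration={
        "CustomizedMetricSpecification": {
            "Metrics": [
                {   # m1: messages waiting in the queue
                    "Id": "m1",
                    "MetricStat": {
                        "Metric": {
                            "Namespace": "AWS/SQS",
                            "MetricName": "ApproximateNumberOfMessagesVisible",
                            "Dimensions": [{"Name": "QueueName", "Value": "my-queue"}],
                        },
                        "Stat": "Sum",
                    },
                    "ReturnData": False,
                },
                {   # m2: InService instances (requires group metrics collection enabled)
                    "Id": "m2",
                    "MetricStat": {
                        "Metric": {
                            "Namespace": "AWS/AutoScaling",
                            "MetricName": "GroupInServiceInstances",
                            "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "my-asg"}],
                        },
                        "Stat": "Average",
                    },
                    "ReturnData": False,
                },
                {   # e1: backlog per instance -- the value that is actually tracked
                    "Id": "e1",
                    "Expression": "m1 / m2",
                    "Label": "Backlog per instance",
                    "ReturnData": True,
                },
            ],
        },
        "TargetValue": 100.0,  # placeholder: acceptable backlog per instance
    },
)
```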

To address this issue, consider the following adjustments:

  1. Increase the instance warm-up time for your Auto Scaling group. Note that cooldown periods only apply to simple scaling policies; target tracking uses the instance warm-up setting instead (EstimatedInstanceWarmup on the policy or DefaultInstanceWarmup on the group). A longer warm-up gives new instances more time to start processing messages before another scaling decision is made (see the instance warm-up sketch after this list).

  2. Implement scale-in protection for a short period after launch. This prevents new instances from being terminated immediately after they're launched, giving them time to start processing the workload (see the scale-in protection sketch after this list).

  3. Adjust your target tracking metric. Instead of using a simple ratio of queue depth to instance count, consider using a more complex metric that accounts for instance warm-up time or message processing rate.

  4. Implement step scaling instead of target tracking. This allows you to define more granular scaling actions based on different alarm thresholds, potentially reducing the likelihood of extreme scaling events such as the 1 -> 8 -> 1 swings you observed (a step scaling sketch follows this list).

  5. Review your application code to ensure that new instances can quickly begin processing messages from the queue. Slow startup times can contribute to scaling instability.

  6. Consider using ECS Capacity Providers with managed scaling if you're using Amazon ECS. This can provide more stable scaling behavior for container workloads.
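
For item 1, a minimal sketch of raising the group-level default instance warm-up with boto3. The group name and the 300-second value are placeholders; pick a value that reflects how long your workers actually take to start polling SQS.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Give newly launched instances time to boot and start polling SQS before
# they are treated as fully contributing members of the group.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-asg",   # placeholder
    DefaultInstanceWarmup=300,       # seconds; tune to your actual startup time
)
```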
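
For item 2, a sketch of enabling scale-in protection for newly launched instances and then releasing it once a worker is actually consuming messages. The group name and instance ID are placeholders; the release step would typically run from a startup script or health check on the instance itself.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# 1) Launch all new instances with scale-in protection enabled by default.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-asg",              # placeholder
    NewInstancesProtectedFromScaleIn=True,
)

# 2) Once a worker has warmed up and is consuming from the queue,
#    release its protection so normal scale-in can manage it again.
autoscaling.set_instance_protection(
    AutoScalingGroupName="my-asg",              # placeholder
    InstanceIds=["i-0123456789abcdef0"],        # hypothetical instance ID
    ProtectedFromScaleIn=False,
)
```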
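
For item 4, a sketch of what a step scaling alternative could look like: the same m1/m2/e1 metric math is moved into a CloudWatch metric math alarm, and the policy adds capacity in increments instead of jumping straight to a computed desired capacity. All names, thresholds, and step sizes below are placeholders to adapt to your workload.

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Step scaling scale-out policy: add capacity in increments based on how far
# the backlog-per-instance metric is above the alarm threshold.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-asg",                 # placeholder
    PolicyName="backlog-step-scale-out",
    PolicyType="StepScaling",
    AdjustmentType="ChangeInCapacity",
    EstimatedInstanceWarmup=300,
    StepAdjustments=[
        # 0-50 above threshold: add 1 instance
        {"MetricIntervalLowerBound": 0.0, "MetricIntervalUpperBound": 50.0, "ScalingAdjustment": 1},
        # more than 50 above threshold: add 3 instances
        {"MetricIntervalLowerBound": 50.0, "ScalingAdjustment": 3},
    ],
)

# Metric math alarm on the same backlog-per-instance expression (e1 = m1 / m2).
# A matching scale-in policy with its own low-threshold alarm would be created the same way.
cloudwatch.put_metric_alarm(
    AlarmName="backlog-per-instance-high",         # placeholder
    ComparisonOperator="GreaterThanThreshold",
    Threshold=100.0,                               # placeholder backlog-per-instance target
    EvaluationPeriods=3,                           # require several breaching datapoints
    DatapointsToAlarm=3,                           # before scaling, to dampen brief spikes
    Metrics=[
        {"Id": "m1", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "AWS/SQS",
                       "MetricName": "ApproximateNumberOfMessagesVisible",
                       "Dimensions": [{"Name": "QueueName", "Value": "my-queue"}]},
            "Period": 60, "Stat": "Sum"}},
        {"Id": "m2", "ReturnData": False, "MetricStat": {
            "Metric": {"Namespace": "AWS/AutoScaling",
                       "MetricName": "GroupInServiceInstances",
                       "Dimensions": [{"Name": "AutoScalingGroupName", "Value": "my-asg"}]},
            "Period": 60, "Stat": "Average"}},
        {"Id": "e1", "Expression": "m1 / m2",
         "Label": "Backlog per instance", "ReturnData": True},
    ],
    AlarmActions=[policy["PolicyARN"]],
)
```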

By implementing these changes, you should be able to achieve more stable and predictable scaling behavior, avoiding the thrashing you've experienced. Remember to monitor your Auto Scaling group closely after making changes to ensure the new configuration meets your needs.

