ECS capacity provider keeps EC2 instances with 0 running tasks

We have

  • an ECS cluster.
  • two ECS Capacity Providers, one for EC2 On-Demand instances and one for EC2 Spot instances. Both providers have Target Capacity = 100% and have "Managed scaling", "Managed instance protection", and "Managed draining" enabled.
  • two ASGs linked to capacity providers. "Protected from Scale in" is enabled on both ASGs.
  • a few ECS Services that use both ECS Capacity Providers. The weights and bases are as follows:
    Capacity provider | Capacity provider weight | Capacity provider base
    EC2 provider      | 1                        | 1
    EC2 spot provider | 3                        | 0 
  • an Auto Scaling policy on each ECS Service that tracks the ALBRequestCountPerTarget metric. Scale-in is also enabled in the policy.

Given this configuration, we expected AWS to start new EC2 instances only when ECS needed to place new ECS tasks, and to drain and terminate any EC2 instance that was not running ECS tasks.

However, we faced two issues:

  1. Sometimes ECS starts new EC2 instances even though the Auto Scaling policies are not in alarm. Those EC2 instances can remain with 0 running tasks for days or weeks.
  2. The "Desired size" of an ASG is sometimes larger than its "Current size". This can happen even when the Auto Scaling policies are not in alarm, all EC2 instances are running tasks, and ECS is not about to run new tasks.

Why does ECS start new EC2 instances when it's not required to run new tasks?

Why can't ECS terminate EC2 instances with 0 running tasks?

1 Answer

Q: Why can't ECS terminate EC2 instances with 0 running tasks? ECS starts new EC2 instances even though Auto Scaling Policies are not in alarm.

  • One quick bit of background clarification: ECS itself should never launch new instances directly. It changes the CapacityProviderReservation metric, which triggers the CloudWatch alarm, which in turn triggers the ASG scaling policy. That's the only way ECS should be able to (indirectly) cause instances to launch or terminate.
  • If instances are being launched some other way, it's possible they're not being registered with the capacity provider correctly, so the capacity provider can't calculate the cluster size correctly and can't scale in.
  • An ASG usually only scales when the desired capacity is changed through a scaling policy, but there can be other times it launches or terminates instances. I'd suggest going into the Activity History of the ASG to see the reason for the launch/terminate events. I'd guess in your case it's from Spot instances being reclaimed by EC2 and the ASG replacing them. If you don't have Capacity Rebalance enabled on the ASG, the instances will be terminated and not gracefully drained. With this disabled, the activity history message will show that the instances were replaced due to failing EC2 health checks.
  • Are weights set on the ASG? If so, this isn't supported by ECS and will cause scaling issues, since the capacity provider assumes they're not configured.
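To make the scaling chain in the first bullet concrete, here is a minimal Python sketch of how CapacityProviderReservation is commonly described: M / N * 100, where M is the number of instances ECS estimates it needs and N is the number currently running. The function names are hypothetical, the zero-instance special cases are as I recall them from the ECS cluster auto scaling documentation, and the real M is computed internally by ECS, so treat this as an illustration rather than the service's implementation.

```python
def capacity_provider_reservation(needed: int, running: int) -> float:
    """Approximate CapacityProviderReservation = M / N * 100.

    needed  (M): instances ECS estimates are required for all tasks.
    running (N): instances currently in the Auto Scaling group.
    """
    if running == 0:
        # Assumed special cases: no instances but tasks waiting -> 200,
        # nothing running and nothing needed -> 100.
        return 200.0 if needed > 0 else 100.0
    return needed / running * 100


def scaling_direction(reservation: float, target_capacity: float = 100.0) -> str:
    """Direction the target tracking policy would push the ASG."""
    if reservation > target_capacity:
        return "scale out"
    if reservation < target_capacity:
        return "scale in"
    return "steady"


# Four instances registered, but tasks only need three of them.
value = capacity_provider_reservation(needed=3, running=4)
print(value, scaling_direction(value))
```

With Target Capacity = 100%, an instance that runs no tasks should pull the metric below 100 and lead to scale-in. If that doesn't happen, the capacity provider is probably miscounting instances, which is what the points above are trying to narrow down.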

Q: "Desired size" of ASG is sometimes bigger than "Current size". That also might happen even if Auto Scaling Policies are not in alarm

  • The ASG will always try to meet the desired capacity, and will keep retrying if there are launch (or terminate) failures.
  • Check the Activity History of the ASG to see if there are launch failures. Since you're using Spot, my guess is you'll find launch failures saying there's no capacity. This means the ASG has asked EC2 for all your configured instance types, and none of them had capacity at the time. If you don't already, we recommend at least 10 different instance types when using Spot. This is a very general recommendation, and you might need more for some regions or instance types, or if it's a large workload. Use the Spot Placement Score (SPS) as a better rough guide to see if you have enough instance types in the ASG.
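If you'd rather script the Activity History check than click through the console, something like the following Python sketch could work. It assumes a client that behaves like `boto3.client("autoscaling")` and uses its `describe_scaling_activities` call. The function name and the choice to treat "Failed" and "Cancelled" as the interesting status codes are my own assumptions.

```python
def failed_scaling_activities(client, asg_name, max_records=50):
    """Return the ASG scaling activities that did not finish successfully.

    `client` is expected to behave like boto3.client("autoscaling").
    """
    response = client.describe_scaling_activities(
        AutoScalingGroupName=asg_name,
        MaxRecords=max_records,
    )
    failures = []
    for activity in response.get("Activities", []):
        if activity.get("StatusCode") in ("Failed", "Cancelled"):
            failures.append(
                {
                    "StartTime": activity.get("StartTime"),
                    "Description": activity.get("Description"),
                    "Cause": activity.get("Cause"),
                    # StatusMessage is optional in the response.
                    "StatusMessage": activity.get("StatusMessage", ""),
                }
            )
    return failures
```

You would call it as `failed_scaling_activities(boto3.client("autoscaling"), "my-spot-asg")`, where the group name is a placeholder. For the first question, the `Cause` field of the successful activities is also worth reading, since it states why each launch or termination happened.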

Those are some general answers based on your setup, but in the end it's hard to tell exactly what's happening without seeing the resources themselves, so you might be best off opening a support case for a more exact answer.
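Before opening a case, the instance-weights check is easy to script. This is a rough Python sketch that inspects one entry of the `AutoScalingGroups` list returned by the `describe_auto_scaling_groups` call and reports any launch template override that carries a `WeightedCapacity`. The function name is hypothetical and the sample group below is made-up data for illustration.

```python
def weighted_overrides(group: dict) -> list:
    """List (instance_type, weight) pairs for overrides that set WeightedCapacity.

    `group` is one item from describe_auto_scaling_groups()["AutoScalingGroups"].
    An empty result means no instance weights were found on the group.
    """
    policy = group.get("MixedInstancesPolicy") or {}
    launch_template = policy.get("LaunchTemplate") or {}
    overrides = launch_template.get("Overrides") or []
    return [
        # Overrides based on instance requirements have no InstanceType.
        (override.get("InstanceType", "unknown"), override["WeightedCapacity"])
        for override in overrides
        if override.get("WeightedCapacity")
    ]


# Made-up example group: one override with a weight, one without.
sample_group = {
    "AutoScalingGroupName": "example-spot-asg",
    "MixedInstancesPolicy": {
        "LaunchTemplate": {
            "Overrides": [
                {"InstanceType": "m5.large", "WeightedCapacity": "2"},
                {"InstanceType": "m5a.large"},
            ]
        }
    },
}
print(weighted_overrides(sample_group))
```

Any non-empty result on an ASG attached to a capacity provider is worth removing, per the weights point above.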

AWS
Answered 9 days ago
  • Thank you for your answer! Our ASGs had instance weighting settings. We've removed them and are now observing whether it helps
