ECS cluster with EC2 capacity provider is left with instances running only a daemon task


I have an ECS cluster backed by an EC2 capacity provider. The cluster has two services: a replica doing the actual work, and a daemon running the Datadog log collection agent. The replica service is autoscaled with a target tracking policy tracking CPU utilisation at 80%. The capacity provider is configured with managed scaling with target capacity 100%, and managed termination protection is enabled.

My understanding is that capacity provider managed scaling turns instance protection on when adding instances and off when there are no longer any replica tasks running on an instance (daemon tasks are not counted). But what I see in my cluster are some ECS instances with only the single daemon task running on them. When I go to the autoscaling group, I see that the associated EC2 instance has "Protected from: scale in". Why is it like this, and is this what's causing EC2 instances with single daemon tasks to not be terminated?

  • I think, though I'm not sure, that the issue may have been that I was stopping tasks manually rather than letting ECS scaling do it, so perhaps the scale-in protection wasn't being removed.

1 Answer

I see that the associated EC2 instance has "Protected from: scale in". Why is it like this, and is this what's causing EC2 instances with single daemon tasks to not be terminated?

Yes. An instance that is protected from scale-in won't be terminated by a scale-in event. It can still be terminated if an issue is detected (for example, a failed health check) or if an explicit termination request is sent for that instance.

As for why, there could be several reasons. A few common ones:

  • Something other than ECS re-enabled it. Check CloudTrail for SetInstanceProtection calls and see whether any were made by something other than ECS.
  • ECS tried to disable protection, but the call was throttled. This also shows up in CloudTrail: search for SetInstanceProtection calls and see whether any of them show a RateExceeded error. If so, check whether you have any scripts making large numbers of Auto Scaling API calls that can be reduced. If not, open a case with support to evaluate increasing the API limit.
  • Verify that the ASG (Auto Scaling group) has scale-in protection enabled at the group level, so that new instances have it on by default. ECS requires this when you enable managed termination protection, but it may have been toggled off later.
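The CloudTrail checks above can be scripted. The following is a rough sketch, not a definitive tool: it assumes boto3 with credentials for the account, and it assumes that calls made by ECS carry the service-linked role name AWSServiceRoleForECS in the caller ARN, which you should confirm against your own trail. The function names are made up for illustration.

```python
import json


def fetch_protection_events(max_events=200):
    """Fetch recent SetInstanceProtection events (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the classifier below works offline

    paginator = boto3.client("cloudtrail").get_paginator("lookup_events")
    pages = paginator.paginate(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "SetInstanceProtection"}
        ],
        PaginationConfig={"MaxItems": max_events},
    )
    return [event for page in pages for event in page["Events"]]


def classify_protection_events(events, ecs_marker="AWSServiceRoleForECS"):
    """Split events into failed calls and calls not attributed to ECS.

    ecs_marker is an assumption: the substring expected in the caller ARN
    when ECS itself makes the call.
    """
    failed, non_ecs = [], []
    for event in events:
        # Each lookup result carries the full record as a JSON string.
        record = json.loads(event["CloudTrailEvent"])
        caller = record.get("userIdentity", {}).get("arn", "")
        summary = {
            "time": record.get("eventTime"),
            "caller": caller,
            "error": record.get("errorCode"),
            "request": record.get("requestParameters"),
        }
        if summary["error"]:
            failed.append(summary)  # includes throttled calls
        if ecs_marker not in caller:
            non_ecs.append(summary)  # something other than ECS changed protection
    return failed, non_ecs


# Usage, with AWS access:
#   failed, non_ecs = classify_protection_events(fetch_protection_events())
```

Any entry in `failed` points at the throttling case, and any entry in `non_ecs` points at the "something else re-enabled it" case.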

Additionally, it's possible for instances to still not be terminated after protection is removed. For an instance to be terminated, the desired capacity of the ASG has to go down (generally through a scaling policy lowering it). If the desired capacity hasn't gone down, instances won't be scaled in, even if the capacity provider has removed scale-in protection.
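To see which situation you are in, you can compare the group's desired capacity with its per-instance protection flags. This is a minimal sketch assuming boto3 and a group name you supply; `fetch_group` and `scale_in_report` are illustrative names, and the report only reads fields returned by DescribeAutoScalingGroups.

```python
def fetch_group(asg_name):
    """Return the DescribeAutoScalingGroups entry for one group (needs boto3)."""
    import boto3  # imported lazily so the report below works on saved responses

    response = boto3.client("autoscaling").describe_auto_scaling_groups(
        AutoScalingGroupNames=[asg_name]
    )
    return response["AutoScalingGroups"][0]


def scale_in_report(group):
    """Summarise desired capacity and scale-in protection for a group."""
    in_service = [i for i in group["Instances"] if i["LifecycleState"] == "InService"]
    protected = [i["InstanceId"] for i in in_service if i["ProtectedFromScaleIn"]]
    unprotected = [i["InstanceId"] for i in in_service if not i["ProtectedFromScaleIn"]]
    return {
        "desired_capacity": group["DesiredCapacity"],
        "in_service": len(in_service),
        # Group-level default applied to newly launched instances.
        "new_instances_protected": group["NewInstancesProtectedFromScaleIn"],
        "protected": protected,
        "unprotected": unprotected,
        # True only when the group wants fewer instances than it has.
        "wants_scale_in": group["DesiredCapacity"] < len(in_service),
    }


# Usage, with AWS access:
#   print(scale_in_report(fetch_group("my-ecs-asg")))  # hypothetical group name
```

If `unprotected` is non-empty but `wants_scale_in` is false, protection has been removed and the desired capacity simply hasn't dropped. If `wants_scale_in` is true and every instance is in `protected`, the protection flags are what is blocking termination.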

answered 20 days ago
