ECS Managed Termination Protection stops working after a while


Hi there, I've run into a particular issue with ECS capacity providers & auto-scaling that keeps coming up. I believe it's a bug, but I'm not entirely sure.

The situation is: we have two ECS clusters where we deploy our applications, one for production and one for staging. We use an ECS capacity provider backed by an EC2 autoscaling group. We have scale-in protection enabled on the autoscaling group, and managed termination protection, managed draining, and managed scaling enabled on the capacity provider, so our autoscaling group scales up and down depending on how many tasks we're trying to run. Some of our services have auto-scaling enabled as well, so when there's an increase in demand those services scale up, which triggers the autoscaling group to provision more instances to run the additional tasks on.

When setting these resources up from scratch, everything works as expected: the cluster scales up appropriately when demand increases, and it scales back down when demand decreases. I've watched the process happen manually: when our services scale down, after 15 minutes the CloudWatch alarm for scaling the autoscaling group down triggers. Initially the autoscaling group fails to scale in because termination protection is still enabled on all of the instances, but shortly afterwards ECS removes the termination protection from the instances that should be terminated. They then go into the Terminating:Wait status in the autoscaling group, ECS drains them, and shortly afterwards they're finally terminated.
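
For reference, this is roughly how the capacity provider is set up. A minimal boto3 sketch, with hypothetical names and ARN for illustration; note that managed termination protection requires that the autoscaling group itself protect new instances from scale-in:

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical names/ARN for illustration; substitute your own. The ASG must
# have new-instance scale-in protection enabled for managedTerminationProtection
# to be accepted by ECS.
ecs.create_capacity_provider(
    name="prod-capacity-provider",
    autoScalingGroupProvider={
        "autoScalingGroupArn": (
            "arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:"
            "example-uuid:autoScalingGroupName/prod-asg"
        ),
        "managedScaling": {
            "status": "ENABLED",
            "targetCapacity": 100,  # scale so tasks fill ~100% of instance capacity
        },
        "managedTerminationProtection": "ENABLED",
        "managedDraining": "ENABLED",
    },
)
```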

However, what I've found pretty consistently is that after some period of time, usually between a few days and a few weeks, scaling down ceases to work altogether. The autoscaling group's desired capacity decreases, but ECS never removes the termination protection from any instances, so they're never terminated. I've tried manually removing the termination protection from instances that I can see don't have any tasks running, but then they go into a "Draining" state in ECS and stay there seemingly indefinitely with 0 tasks running. The only way I've found to get rid of the instances is to manually terminate them. In addition, I've found that the instances that are "in limbo" (meaning they should have been deleted but weren't) appear to be in some kind of bad state where ECS no longer tries to schedule tasks on them: if something scales up, there will be tasks that remain in the "Provisioning" state forever unless I intervene. If I manually terminate the instances, new ones will come up and tasks will be placed on them, but when the cluster later tries and fails to scale back down, those new instances will in turn no longer be "usable" without being manually terminated.
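
In case it helps anyone reproduce the diagnosis, here is a small boto3 sketch of the checks I run by hand (cluster, group, and instance names are hypothetical): list the container instances stuck in "Draining", see what ECS thinks is running on them, and optionally remove the scale-in protection as described above:

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

CLUSTER = "prod-cluster"  # hypothetical name
ASG = "prod-asg"          # hypothetical name

# List container instances stuck in DRAINING and what ECS thinks is on them.
arns = ecs.list_container_instances(cluster=CLUSTER, status="DRAINING")["containerInstanceArns"]
if arns:
    described = ecs.describe_container_instances(cluster=CLUSTER, containerInstances=arns)
    for ci in described["containerInstances"]:
        print(ci["ec2InstanceId"],
              "running:", ci["runningTasksCount"],
              "pending:", ci["pendingTasksCount"])

# Manually remove scale-in protection from an idle instance (what I tried by hand):
# autoscaling.set_instance_protection(
#     AutoScalingGroupName=ASG,
#     InstanceIds=["i-0123456789abcdef0"],
#     ProtectedFromScaleIn=False,
# )
```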

Another possibly relevant piece of information is that this appears to be new. We've been running our ECS clusters for a few years, but when the "Managed Draining" feature came out in the last few months I recreated our capacity providers and autoscaling groups (enabling this setting on an existing capacity provider didn't seem to work), and since then we've seen this issue consistently. We never saw it before, even though we already had "Managed Termination Protection" and "Managed Scaling" enabled. Once again, I confirmed right after recreating them that the scaling was working; it was only a few days later that it got into this state. The first time it happened, I recreated them again and it worked initially, but today I've found the same issue happening again.
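
For completeness, this is roughly what enabling the setting on an existing capacity provider looks like (a boto3 sketch with a hypothetical capacity provider name); this is the in-place update that didn't seem to take effect for us, which is why I recreated everything instead:

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical name; the in-place update that didn't seem to work for us.
ecs.update_capacity_provider(
    name="prod-capacity-provider",
    autoScalingGroupProvider={
        "managedScaling": {"status": "ENABLED", "targetCapacity": 100},
        "managedTerminationProtection": "ENABLED",
        "managedDraining": "ENABLED",
    },
)
```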

Overall we've been very happy with ECS, but this has really been a big thorn in our side. Has anyone else experienced this or have any insight into what might be going wrong or additional steps I can take to troubleshoot the issue?

EDIT: I've already been through all of the points here: https://repost.aws/knowledge-center/ecs-capacity-provider-scaling and everything is as it should be: the capacity provider's status is ACTIVE, and the relevant "attachments" on the cluster are all there. One thing the article mentions is that if the capacity provider and cluster are created at separate times, the capacity provider can end up in a bad state, which is why the last time this happened I tried recreating the capacity provider and cluster; that did work temporarily. However, that's a relatively involved thing to do with how we manage our infrastructure, and recreating our cluster and capacity provider every few days or weeks is not a workable solution for us.
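
For anyone verifying the same things, these are the two calls I used to check the capacity provider status and the cluster attachments (a boto3 sketch; the names are hypothetical):

```python
import boto3

ecs = boto3.client("ecs")

# Capacity provider status should be ACTIVE.
cp = ecs.describe_capacity_providers(
    capacityProviders=["prod-capacity-provider"]
)["capacityProviders"][0]
print("capacity provider status:", cp["status"])

# The cluster's capacity-provider attachments should all be there.
cluster = ecs.describe_clusters(
    clusters=["prod-cluster"], include=["ATTACHMENTS"]
)["clusters"][0]
print("attachments status:", cluster.get("attachmentsStatus"))
for attachment in cluster.get("attachments", []):
    print(attachment["type"], attachment["status"])
```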

I know I could just disable managed termination protection altogether, and it's possible that would fix the issue; that's likely what I'll do eventually if I can't find a better solution.

1 Answer

If anyone ever comes across this question and has the same issue, I thought I'd post what I ended up doing.

As far as I know this is still an issue; when I first create a capacity provider it seems to scale up and down without issue. However, after some period of time (usually a couple of weeks, from what I've seen) the cluster seems to get into a state where it thinks something is running on instances when it actually isn't, and it won't scale them down as a result.

However, if you configure your capacity providers so that Managed Termination Protection is disabled and Managed Draining is enabled, it significantly mitigates the issue. In this configuration, when the desired number of instances decreases, the instances that should be terminated hit the managed draining "lifecycle hook". My understanding of the expected behavior: the instance in the autoscaling group goes into a "Terminating:Wait" status, ECS sets the instance to "Draining" and waits until no more tasks are running on it, then the lifecycle hook completes and the autoscaling group finally terminates the instance. However, because of the behavior described in the original question, what I actually see once the cluster gets into its "bad state" after a couple of weeks is that those instances do go into the "Terminating:Wait" status in the autoscaling group, do get set to "Draining" in ECS, and do drain any tasks running on them, but ECS never decides that they're done draining. That said, the lifecycle hook on the autoscaling group has a timeout of 1 hour, so after that period they'll be terminated no matter what.
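
To see the hook and its timeout, or to release a stuck instance early instead of waiting out the full hour, something like this boto3 sketch works (the group name is hypothetical, and the hook name in the commented-out call is an assumption; use the name printed by the loop):

```python
import boto3

autoscaling = boto3.client("autoscaling")

ASG = "prod-asg"  # hypothetical name

# Inspect the draining lifecycle hook and its timeout (1 hour in our case).
for hook in autoscaling.describe_lifecycle_hooks(AutoScalingGroupName=ASG)["LifecycleHooks"]:
    print(hook["LifecycleHookName"], hook["HeartbeatTimeout"], hook["DefaultResult"])

# If an instance is stuck in Terminating:Wait with nothing running on it,
# you can complete the hook yourself instead of waiting out the timeout.
# The hook name below is an assumption; use the name printed above.
# autoscaling.complete_lifecycle_action(
#     LifecycleHookName="ecs-managed-draining-termination-hook",
#     AutoScalingGroupName=ASG,
#     InstanceId="i-0123456789abcdef0",
#     LifecycleActionResult="CONTINUE",
# )
```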

As a result, this setup behaves better because the cluster will always eventually fix itself; it just means that any instance will sit idle for up to 1 hour before it's terminated. Whether that's acceptable in terms of cost or behavior depends on the specifics of your use case, but it's good enough for us at the moment.

answered 2 years ago
