Hi there,
I've run into a recurring issue with ECS capacity providers and auto-scaling. I believe it's a bug, but I'm not entirely sure.
The situation is: we have two ECS clusters where we deploy our applications, one for production and one for staging. We use an ECS capacity provider backed by an EC2 autoscaling group. We have instance scale-in protection enabled on the autoscaling group, and managed termination protection, managed draining, and managed scaling enabled on the capacity provider, so the autoscaling group scales up and down depending on how many tasks we're trying to run. Some of our services have auto-scaling enabled as well, so when demand increases those services scale up, which triggers the autoscaling group to provision more instances to run the additional tasks on.

When setting these resources up from scratch, everything works as expected: the cluster scales up in response to increased demand and scales back down when demand decreases. I've watched the process happen in real time: when our services scale down, after 15 minutes the CloudWatch alarm for scaling the autoscaling group down triggers. Initially the autoscaling group fails to scale in because termination protection is still enabled on all of the instances, but shortly afterwards ECS removes the termination protection from the instances that should be terminated. They then go into the Terminating:Wait state in the autoscaling group, ECS drains them, and shortly afterwards they're finally terminated.
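For reference, the setup is roughly equivalent to this (a boto3 sketch, not our actual infrastructure code; the names, ARN, and target capacity are placeholders):

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

# Scale-in protection on the ASG is a prerequisite for managed
# termination protection on the capacity provider.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="app-asg",  # placeholder
    NewInstancesProtectedFromScaleIn=True,
)

ecs.create_capacity_provider(
    name="app-capacity-provider",  # placeholder
    autoScalingGroupProvider={
        "autoScalingGroupArn": "arn:aws:autoscaling:...",  # placeholder
        "managedScaling": {
            "status": "ENABLED",
            "targetCapacity": 100,  # scale the ASG to match task demand
        },
        "managedTerminationProtection": "ENABLED",
        "managedDraining": "ENABLED",
    },
)
```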
However, what I've found pretty consistently is that after some period of time, usually between a few days and a few weeks, scaling down stops working altogether. The autoscaling group's desired capacity decreases, but ECS never removes the termination protection from any instances, so they're never terminated. I've tried manually removing the termination protection from instances that I can see have no tasks running, but then they go into a DRAINING state in ECS and stay there seemingly indefinitely with 0 tasks running. The only way I've found to get rid of the instances is to terminate them manually.

In addition, I've found that the instances that are "in limbo" (meaning they should have been terminated but weren't) appear to be in some kind of bad state where ECS no longer tries to schedule tasks on them: if something scales up, there will be tasks that remain in the PROVISIONING state forever unless I intervene. If I manually terminate the limbo instances, new ones come up and tasks are placed on them, but then when the cluster tries and fails to scale back down, those new instances are in turn no longer usable without being manually terminated.
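In case it's useful, this is roughly the check I've been doing by hand to spot the limbo instances (boto3 sketch; the cluster and ASG names are placeholders):

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

CLUSTER = "production"  # placeholder
ASG = "app-asg"         # placeholder

# ASG side: which instances still have scale-in protection, and what
# lifecycle state are they in?
group = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG])
for inst in group["AutoScalingGroups"][0]["Instances"]:
    print(inst["InstanceId"], inst["LifecycleState"],
          "protected:", inst["ProtectedFromScaleIn"])

# ECS side: instances stuck in DRAINING with nothing left running on them.
arns = ecs.list_container_instances(cluster=CLUSTER, status="DRAINING")[
    "containerInstanceArns"
]
if arns:
    described = ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns
    )
    for ci in described["containerInstances"]:
        print(ci["ec2InstanceId"],
              "running:", ci["runningTasksCount"],
              "pending:", ci["pendingTasksCount"])
```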
Another possibly relevant piece of information: this appears to be new. We've been running our ECS clusters for a few years, but when the "Managed Draining" feature came out in the last few months I recreated our capacity providers and auto scaling groups (enabling this setting on an existing capacity provider didn't seem to work), and since then we've seen this issue consistently. We never saw it before, even though we already had "Managed Termination Protection" and "Managed Scaling" enabled. Once again, I confirmed right after recreating them that the scaling was working; it was only a few days later that it got into this state. The first time it happened, I recreated them again and it worked initially, but today I found the same issue happening again.
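For completeness, the in-place change I attempted before recreating everything was roughly this (sketch with a placeholder name; as noted above, it didn't seem to take effect, which is why I recreated the capacity providers instead):

```python
import boto3

ecs = boto3.client("ecs")

# Attempt to enable managed draining on the existing capacity provider.
# This is the step that didn't seem to work for us.
ecs.update_capacity_provider(
    name="app-capacity-provider",  # placeholder
    autoScalingGroupProvider={"managedDraining": "ENABLED"},
)
```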
Overall we've been very happy with ECS, but this has been a real thorn in our side. Has anyone else experienced this, or does anyone have insight into what might be going wrong or additional steps I can take to troubleshoot the issue?
EDIT: I've already been through all of the points here: https://repost.aws/knowledge-center/ecs-capacity-provider-scaling and everything is as it should be: the capacity provider's status is ACTIVE, and the relevant "attachments" on the cluster are all there. One thing the article mentions is that if the capacity provider and cluster are created at separate times, the capacity provider can end up in a bad state, which is why the last time this happened I tried recreating both the capacity provider and the cluster, which did work temporarily. However, that's a relatively involved thing to do with how we manage our infrastructure, and recreating our cluster and capacity provider every few days or weeks is not a workable solution for us.
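Concretely, these are the checks from that article translated into boto3 (placeholder names again); both come back clean for us:

```python
import boto3

ecs = boto3.client("ecs")

# The capacity provider's status should be ACTIVE -- and it is.
cp = ecs.describe_capacity_providers(
    capacityProviders=["app-capacity-provider"]
)["capacityProviders"][0]
print(cp["name"], cp["status"])

# The cluster's capacity provider attachments should all be present -- they are.
cluster = ecs.describe_clusters(
    clusters=["production"], include=["ATTACHMENTS"]
)["clusters"][0]
print("attachmentsStatus:", cluster.get("attachmentsStatus"))
for att in cluster.get("attachments", []):
    print(att["type"], att["status"])
```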
I know that I could just disable termination protection altogether, and it's possible that would fix the issue; that's likely what I'll do eventually if I can't find a better solution.
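If I do end up going that route, the change would look roughly like this (sketch, placeholder names; turning the feature off on the capacity provider, then removing scale-in protection from the ASG and its existing instances):

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

# Turn managed termination protection off on the capacity provider...
ecs.update_capacity_provider(
    name="app-capacity-provider",  # placeholder
    autoScalingGroupProvider={"managedTerminationProtection": "DISABLED"},
)

# ...stop protecting newly launched instances...
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="app-asg",  # placeholder
    NewInstancesProtectedFromScaleIn=False,
)

# ...and unprotect the instances that already exist.
group = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=["app-asg"])
ids = [i["InstanceId"] for i in group["AutoScalingGroups"][0]["Instances"]]
if ids:
    autoscaling.set_instance_protection(
        AutoScalingGroupName="app-asg",
        InstanceIds=ids,
        ProtectedFromScaleIn=False,
    )
```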