You are observing ECS's retry behavior. If ECS cannot place a Task for a Service because the cluster has insufficient resources, ECS retries scheduling the Task until it succeeds (or until the desired count is reduced, so a replacement task is no longer needed).
Per AWS best practices, ECS implements an exponential backoff algorithm: the interval between retries grows after each scheduling failure. So if a Task has failed to schedule for some time, you may see a significant delay after adding new capacity before the Task is rescheduled. The retry interval tops out at roughly 10-15 minutes.
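To make the timing concrete, here is a minimal sketch of capped exponential backoff. The base delay, growth factor, and 15-minute cap are illustrative assumptions, not ECS's documented parameters:

```python
# Capped exponential backoff: each retry waits longer than the last,
# up to a fixed ceiling. Parameters are assumptions for illustration.
def retry_delays(base=5.0, factor=2.0, cap=15 * 60, attempts=10):
    """Yield the wait (in seconds) before each scheduling retry."""
    delay = base
    for _ in range(attempts):
        yield min(delay, cap)
        delay *= factor

delays = list(retry_delays())
# The delays grow geometrically until they hit the cap, which is why a
# Task that has been failing for a while can sit idle for up to the cap
# duration before it notices newly added capacity.
```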
If you are intentionally terminating EC2 instances to save cost, it is recommended that you also reduce the Desired Count of the ECS services on the cluster so that the remaining tasks fit. If you use EC2 Auto Scaling Capacity Providers with ECS, ECS manages the ASG's desired capacity for you. Alternatively, you can use Fargate, which is serverless: you pay only for Tasks that are actively running.
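As a rough way to pick a Desired Count that fits, you can estimate how many task copies the remaining instances can hold. This helper is hypothetical (it is not an ECS API) and uses a simple per-instance CPU/memory check rather than ECS's actual placement logic:

```python
# Hypothetical helper: estimate how many copies of a task fit on the
# cluster's remaining EC2 capacity. ECS's real placement engine also
# considers ports, placement constraints, etc. -- this is only a bound.
def tasks_that_fit(instances, task_cpu, task_mem):
    """instances: list of (free_cpu_units, free_memory_mib) tuples.
    task_cpu / task_mem: the task definition's CPU units and MiB."""
    total = 0
    for free_cpu, free_mem in instances:
        # Each instance holds as many tasks as its scarcest resource allows.
        total += min(free_cpu // task_cpu, free_mem // task_mem)
    return total

# Example: two instances with 2048 CPU units / 3968 MiB free each,
# tasks requesting 512 CPU units and 1024 MiB.
fit = tasks_that_fit([(2048, 3968), (2048, 3968)], 512, 1024)
```

You could then lower the service to that count, e.g. with `aws ecs update-service --cluster my-cluster --service my-service --desired-count N` (cluster and service names here are placeholders).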
Thanks Michael for this sound advice: 1) don't leave an ECS Service with a Desired Count that the ASG can't fulfill (it will suffer backoff retries and slow recovery), and 2) prefer EC2 Auto Scaling Capacity Providers.
I've just done some testing, and the Auto Scaling Capacity Provider works nicely when configured correctly. I wasn't on this path yet because a) my starting CloudFormation template didn't use it, and b) in my specific case it was easier to detect idleness and scale the ASG to zero directly than to scale the ECS Service, without custom metrics.