ECS Capacity Provider Auto-Scaler Instance Selection
I am working with AWS ECS capacity providers to scale out instances for jobs we run. The amount of memory needed per ECS task varies widely between jobs, and is set at the task and container level. We have a capacity provider connected to an EC2 Auto Scaling group (ASG). The ASG uses attribute-based instance selection, so we specify instance attributes. We gave it a wide range for memory and CPU, and it shows hundreds of possible instance types.
When we run a small job (1GB of memory), it scales up an m6i.large instance and the job runs. This works in that our task runs, but the instance it selected is much larger than we need. We then let the ASG scale back down to 0. We then run a large job (16GB) and it begins scaling up, but it starts the same instance types as before. Those instance types have 8GB of memory, while our task needs double that on a single instance.
For the small job, I would have expected the capacity provider to scale up only 1 instance closer in size to the memory needs of the job (1GB). For the larger job, I would have expected it to scale up only 1 instance with more than 16GB of memory to accommodate the job (16GB).
- Is there a way to get capacity providers and autoscaling groups to be more responsive to the resource needs of the pending tasks?
- Are there any configs I might have wrong?
- Am I understanding something incorrectly? Are there any resources you would point me towards?
- Is there a better approach to accomplish what I want with ECS?
- Is the behavior I outlined actually to be expected?
I am going to assume in my answer that these tasks use different task definitions (i.e. you have one task definition for small jobs and one or more others for big jobs needing the 16GB of RAM).
In that case, I would simply create separate ASGs (each used as its own cluster capacity provider) and set the capacity provider per service (i.e. have one for tasks from 1 to 8GB of RAM, another for 16GB of RAM and more, and if need be, one for in-between). Given the ASG is driven by ECS, you need either 1 Launch Template per ASG or 1 Launch Template re-used with overrides (3 different ones is probably best).
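To make the per-size split concrete, here is a minimal Python sketch of routing a task to a capacity provider by its memory reservation. The provider names (SMALL-ASG, MEDIUM-ASG, BIG-ASG) and the exact boundaries are assumptions for illustration; the returned list only mirrors the shape of a `capacityProviderStrategy` entry and nothing here calls AWS.

```python
# Hypothetical capacity provider names, one per ASG; adjust to your cluster.
SMALL, MEDIUM, BIG = "SMALL-ASG", "MEDIUM-ASG", "BIG-ASG"


def pick_capacity_provider(task_memory_mib: int) -> str:
    """Map a task's memory reservation (MiB) to one of the three ASG tiers."""
    if task_memory_mib <= 0:
        raise ValueError("task memory must be positive")
    if task_memory_mib <= 8 * 1024:   # tasks from 1 to 8GB
        return SMALL
    if task_memory_mib < 16 * 1024:   # the in-between tier
        return MEDIUM
    return BIG                        # 16GB and more


def strategy_for(task_memory_mib: int) -> list[dict]:
    """Build a single-provider strategy in the shape ECS expects for a service or task run."""
    return [{
        "capacityProvider": pick_capacity_provider(task_memory_mib),
        "base": 0,
        "weight": 1,
    }]


if __name__ == "__main__":
    print(strategy_for(1024))       # small job
    print(strategy_for(16 * 1024))  # big job
```

With a split like this, a 16GB task can only ever trigger scaling in the ASG whose instance requirements start above 16GB.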
Now I get that, given you had the placement strategy, this seems like overkill, but it would certainly prevent ECS from deploying small instances in a race condition (i.e. getting an instance type sized for small tasks first) and would instead clearly isolate the ASG for the bigger jobs.
This could also help future-proof the configuration: if you needed bigger disks or GPUs for the bigger jobs, you would only have the one LT/ASG to change (and reflect in the capacity provider), instead of changing it for all of them or paying for GPU instances for tasks that do not need them.
The good thing about capacity providers is that you can define them in the cluster and also override them at the service level, so you could have 4 capacity providers in your cluster, default to one of them (base/weight assignment), and at the same time have some services specifically use their own configuration.
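As a rough illustration of what base/weight assignment means, here is a small Python sketch that splits a number of tasks across providers: base counts are satisfied first, and the rest are shared in proportion to the weights. This is a simplified model, not ECS's actual placement logic, and the provider names are made up.

```python
def split_tasks(total: int, strategy: list[dict]) -> dict[str, int]:
    """Approximate how `total` tasks spread across a base/weight strategy."""
    counts = {s["capacityProvider"]: 0 for s in strategy}
    remaining = total

    # Step 1: satisfy each provider's base, in order, while tasks remain.
    for s in strategy:
        take = min(s.get("base", 0), remaining)
        counts[s["capacityProvider"]] += take
        remaining -= take

    total_weight = sum(s.get("weight", 0) for s in strategy)
    if remaining == 0:
        return counts
    if total_weight == 0:
        raise ValueError("tasks remain beyond base but every weight is 0")

    # Step 2: share the rest by weight (largest-remainder rounding).
    leftovers = []
    assigned = 0
    for index, s in enumerate(strategy):
        numerator = remaining * s.get("weight", 0)
        counts[s["capacityProvider"]] += numerator // total_weight
        assigned += numerator // total_weight
        leftovers.append((-(numerator % total_weight), index))
    for _, index in sorted(leftovers)[: remaining - assigned]:
        counts[strategy[index]["capacityProvider"]] += 1
    return counts


if __name__ == "__main__":
    strategy = [
        {"capacityProvider": "A", "base": 2, "weight": 1},
        {"capacityProvider": "B", "base": 0, "weight": 3},
    ]
    print(split_tasks(10, strategy))  # {'A': 4, 'B': 6}
```

Note that nothing in this split looks at task size; it only decides which provider a task lands on, which is why the size isolation has to come from the ASG behind each provider.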
Hope this helps :)
For example, using x-ecs you can do the following.
    services:
      small-jobs-service:
        deploy:
          resources:
            limits:
              memory: 2GB
            reservations:
              memory: 1GB
        x-ecs:
          CapacityProviderStrategy:
            - CapacityProvider: SMALL-ASG
              Base: 1
              Weight: 2
            - CapacityProvider: MEDIUM-ASG
              Base: 4
              Weight: 8
      medium-jobs-service:
        # No override, use cluster defaults.
        deploy:
          resources:
            reservations:
              memory: 6GB
      big-jobs-service:
        deploy:
          resources:
            reservations:
              memory: 12GB
        x-ecs:
          CapacityProviderStrategy:
            - CapacityProvider: BIG-ASG
              Base: 1
              Weight: 2
+1 to this suggestion. The capacity provider doesn't tell the ASG anything about the instance requirements; it just increases the CapacityProviderReservation metric so that the ASG scales to the correct amount of total capacity. This means there's no way for ECS to influence which instances get picked. (In reality, the ASG doesn't control this either: it passes along the settings of the MixedInstancesPolicy and EC2 picks the instance types.)
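A minimal Python sketch of the CapacityProviderReservation metric, based on my understanding of AWS's published description of ECS cluster auto scaling (M = instances needed, N = instances running). The function name and the special-case values are my reading of that description, so treat them as assumptions rather than the service's exact implementation.

```python
def capacity_provider_reservation(instances_needed: int, instances_running: int) -> float:
    """Return the metric value as a percentage: M / N * 100, with two edge cases."""
    if instances_needed < 0 or instances_running < 0:
        raise ValueError("instance counts cannot be negative")
    if instances_needed == 0 and instances_running == 0:
        return 100.0  # nothing needed, nothing running: treated as on target
    if instances_running == 0:
        return 200.0  # scaling out from zero
    return instances_needed / instances_running * 100


if __name__ == "__main__":
    # One pending task with the ASG at zero: the metric says "scale out",
    # but carries no hint about how much memory the task needs.
    print(capacity_provider_reservation(1, 0))
    print(capacity_provider_reservation(2, 1))
```

The only inputs are instance counts, which is why the same value is reported whether the pending task needs 1GB or 16GB.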
This is very helpful. I think I was running with the assumption that the resource needs were being propagated from ECS tasks to the ASG. This suggestion seems like a good way to achieve what I am looking for. Thank you!