How do I troubleshoot scaling issues with my Amazon ECS capacity provider?


I created a capacity provider for my Amazon Elastic Container Service (Amazon ECS) cluster with the Amazon Elastic Compute Cloud (Amazon EC2) launch type. However, the capacity provider doesn't scale as expected.

Short description

The following issues cause your Amazon EC2 capacity provider not to automatically scale in or scale out:

  • You didn't associate the Amazon ECS service with the capacity provider.
  • You didn't attach the capacity provider scaling policies to the Amazon EC2 Auto Scaling group.
  • You didn't correctly configure the target capacity percentage.
  • You're using managed scaling for the capacity provider, and custom scaling policies are attached to the EC2 Auto Scaling group.
  • The Amazon EC2 Auto Scaling group launched the container instance, but the instance can't join the cluster.
  • Your container instances can't scale in or down.
  • The capacity provider is stuck in the Failed state.
  • The Auto Scaling group is stuck in a scaling loop.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Check that you associated the Amazon ECS service with the capacity provider

To confirm whether you associated the Amazon ECS service with the capacity provider, run the following describe-services AWS CLI command:

aws ecs describe-services --cluster example-cluster --services example-service --region example-region --query 'services[].capacityProviderStrategy'

Note: Replace example-cluster with your cluster name, example-service with your service name, and example-region with your AWS Region.

If you associated the Amazon ECS service with the capacity provider, then you receive an output that's similar to the following example:

[  
  [
    {
      "capacityProvider": "example-capacity-provider",
      "weight": 1,
      "base": 1
    }
  ]
]

Make sure that the capacityProviderStrategy value isn't [].
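To script this check, the following minimal sketch wraps the same describe-services call in two shell functions. The function names strategy_count and service_uses_capacity_provider are hypothetical helpers, not part of the AWS CLI, and the sketch assumes that the AWS CLI is installed and configured with credentials.

```shell
# Print how many capacity provider strategy items the service has.
# Arguments: cluster name, service name, Region.
strategy_count() {
  aws ecs describe-services \
    --cluster "$1" \
    --services "$2" \
    --region "$3" \
    --query 'length(services[0].capacityProviderStrategy || `[]`)' \
    --output text
}

# Succeed only when the service has at least one strategy item.
service_uses_capacity_provider() {
  [ "$(strategy_count "$1" "$2" "$3")" -gt 0 ]
}
```

For example, run service_uses_capacity_provider example-cluster example-service example-region and check the exit status. A nonzero status suggests that the strategy is empty or that the call failed.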

To add a capacity provider to the service, run the following update-service command:

aws ecs update-service --cluster example-cluster --service example-service --region example-region --capacity-provider-strategy capacityProvider=capacity-provider-name,weight=weight-value,base=base-value --force-new-deployment

Note: Replace example-cluster with your cluster name, example-service with your service name, example-region with your Region, and capacity-provider-name with your capacity provider name. Also, replace weight-value with the relative share of tasks that the capacity provider runs compared with the other capacity providers in the strategy, and base-value with the minimum number of tasks that run on the capacity provider.

You can also use the Amazon ECS console to update the service.

Make sure that you attached the capacity provider scaling policies to the Auto Scaling group

When you associate a capacity provider with an Auto Scaling group, Amazon ECS creates a scaling policy that modifies capacity based on the cluster load.

To troubleshoot scaling policy issues, review the AWS CloudTrail events for PutScalingPolicy, UpdateAutoScalingGroup, CreateCapacityProvider, and UpdateCapacityProvider API calls. Make sure that the policy can associate with the Auto Scaling group, and that the capacity provider is working as expected.

To verify that the Auto Scaling group is a cluster attachment, run the following describe-clusters command:

aws ecs describe-clusters --clusters example-cluster --include ATTACHMENTS --region example-region --query 'clusters[].attachments[]'

Note: Replace example-cluster with your cluster name, and example-region with your Region.

Example output:

[  
  {
    "id": "100a23456-5f0b-4abc-b998-d6789d111a",
    "type": "as_policy",
    "status": "CREATED",
    "details": [
      {
        "name": "capacityProviderName",
        "value": "example-capacityProvider"
      },
      {
        "name": "scalingPlanName",
        "value": "ECSManagedAutoScalingPlan-bb60c8fa-3ed7-4808-b39c-abcdef2345"
      }
    ]
  }
]

If you use a managed scaling policy, then complete the following steps to check whether you attached the policy to the Auto Scaling group:

  1. Open the Amazon ECS console.
  2. In the navigation pane, choose Clusters.
  3. Select your cluster.
  4. Choose the Infrastructure tab.
  5. Choose the Capacity providers tab.
  6. Select your Auto Scaling group.
    Note: This action redirects you to the Auto Scaling groups page in the Amazon EC2 console.
  7. Choose the Automatic scaling tab.
  8. Choose Actions, and then select Edit dynamic scaling policy.
  9. In the Custom metric JSON field, check that the policy includes the CapacityProviderReservation metric.
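As a command line alternative to the console steps above, the following sketch lists each scaling policy on the Auto Scaling group with its type and customized metric name. The function names list_policy_metrics and has_reservation_policy are hypothetical, and the query assumes that the managed policy exposes its metric under TargetTrackingConfiguration.CustomizedMetricSpecification.MetricName.

```shell
# List every scaling policy on the Auto Scaling group with its type and,
# for target tracking policies, the customized metric name.
# Arguments: Auto Scaling group name, Region.
list_policy_metrics() {
  aws autoscaling describe-policies \
    --auto-scaling-group-name "$1" \
    --region "$2" \
    --query 'ScalingPolicies[].[PolicyName, PolicyType, TargetTrackingConfiguration.CustomizedMetricSpecification.MetricName]' \
    --output text
}

# Succeed only when a policy that tracks CapacityProviderReservation exists.
has_reservation_policy() {
  list_policy_metrics "$1" "$2" | grep -q 'CapacityProviderReservation'
}
```

The list_policy_metrics output also shows any custom policies that are attached to the same group, which can help when you use managed scaling and suspect a conflict.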

Check your target capacity percentage configuration

Check the CapacityProviderReservation Amazon CloudWatch metric for your capacity provider to track the usage of its container instances. The target tracking scaling policy that's associated with the Auto Scaling group adjusts the number of running instances so that CapacityProviderReservation matches the target capacity value. For example, if you set the target capacity to 100%, then Amazon ECS uses all instances and scales in instances that aren't running tasks.

To set up extra capacity, update Set target capacity to a value that's lower than 100.
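To see how the target capacity translates into instance counts, the following sketch works through the arithmetic. Amazon ECS documents CapacityProviderReservation as the number of instances needed divided by the number of instances running, multiplied by 100. The function names are hypothetical, and the second function is a simplification that ignores scaling step sizes, warm-up periods, and the special cases for zero running instances.

```shell
# CapacityProviderReservation = (instances needed / instances running) * 100
# Arguments: instances needed, instances running. Uses integer division.
reservation_percent() {
  if [ "$2" -le 0 ]; then
    echo "running instance count must be greater than 0" >&2
    return 1
  fi
  echo $(( $1 * 100 / $2 ))
}

# Smallest instance count that keeps the reservation at or below the target.
# Arguments: instances needed, target capacity percentage (1-100).
instances_for_target() {
  echo $(( ($1 * 100 + $2 - 1) / $2 ))
}
```

For example, reservation_percent 8 10 prints 80, and instances_for_target 8 80 prints 10. With a target capacity of 80, a cluster that needs 8 instances settles at about 10, which leaves roughly 2 instances as spare capacity.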

Make sure that the instance that launched from the Auto Scaling group can join the cluster

If your instance can't join the cluster, then see Why can't my Amazon EC2 instance join the Amazon ECS cluster?

Make sure that your container instances aren't protected from scale in or scale down actions

For capacity providers that use managed termination protection, Amazon ECS blocks the termination of Amazon EC2 instances with tasks during a scale in action.

To stop all running tasks and allow the Auto Scaling group to terminate the EC2 instance, use the Amazon ECS console to drain the instance. Or, run the following update-container-instances-state command:

aws ecs update-container-instances-state --cluster example-cluster --container-instances example-container --status DRAINING --region example-region

Note: Replace example-cluster with your cluster name, example-container with your container instance, and example-region with your Region.

If the tasks still run on the container instance after you drain, then see How do I troubleshoot Amazon ECS tasks that take a long time to stop when the container instance is set to DRAINING?
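To check whether tasks still run on a draining container instance, the following sketch counts the running tasks that list-tasks reports for that instance. The function name remaining_task_count is a hypothetical helper.

```shell
# Print the number of running tasks on one container instance.
# Arguments: cluster name, container instance ID or ARN, Region.
remaining_task_count() {
  aws ecs list-tasks \
    --cluster "$1" \
    --container-instance "$2" \
    --region "$3" \
    --query 'length(taskArns)' \
    --output text
}
```

For example, run remaining_task_count example-cluster example-container example-region a few minutes after you set the instance to DRAINING. A count that stays above 0 suggests that tasks are slow to stop.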

To further troubleshoot managed termination protection issues, see How do I resolve the managed termination protection setting for the capacity provider error in Amazon ECS?

If scaling protections block scale down actions in your instance, then you receive the following error message in the Auto Scaling activity history:

"Could not scale to desired capacity because all remaining instances are protected from scale-in."

To resolve this issue, check your tooling or third-party tools, such as Terraform or GitLab. Make sure that they don't remove the AmazonECSManaged tag from the Auto Scaling group. Amazon ECS requires this tag to manage scaling. To check whether the AmazonECSManaged tag is missing, check your CloudTrail event history for the SetInstanceProtection event. If you see SetInstanceProtection, then you must add the tag back to your Auto Scaling group.
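The following sketch shows one way to run both checks from the AWS CLI. The function names are hypothetical. The sketch assumes that you pass your own Auto Scaling group name and that the CloudTrail event history for the Region still holds the events.

```shell
# Print how many AmazonECSManaged tags are on the Auto Scaling group.
# Arguments: Auto Scaling group name, Region. A result of 0 means the tag is missing.
ecs_managed_tag_count() {
  aws autoscaling describe-tags \
    --region "$2" \
    --filters "Name=auto-scaling-group,Values=$1" "Name=key,Values=AmazonECSManaged" \
    --query 'length(Tags)' \
    --output text
}

# List recent SetInstanceProtection events from the CloudTrail event history.
# Argument: Region.
recent_set_instance_protection_events() {
  aws cloudtrail lookup-events \
    --region "$1" \
    --lookup-attributes AttributeKey=EventName,AttributeValue=SetInstanceProtection \
    --max-items 10 \
    --query 'Events[].[EventTime, Username]' \
    --output text
}
```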

Check the status of your capacity provider

When you use a capacity provider, it's a best practice to create a new Auto Scaling group and not reuse an existing group. Instances in the Running state that are associated with the existing group and registered to an Amazon ECS cluster might not correctly register.

To view the status of the capacity provider, run the describe-capacity-providers command:

aws ecs describe-capacity-providers --capacity-providers MyCapacityProvider

If the capacity provider status is INACTIVE, then the capacity provider was deleted.
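To narrow the describe-capacity-providers output to the status fields, the following sketch adds a query. The function name capacity_provider_status is a hypothetical helper. The updateStatus and updateStatusReason fields appear only after an update or delete operation, so they might print as None.

```shell
# Print the name, status, update status, and update status reason
# for one capacity provider.
# Arguments: capacity provider name, Region.
capacity_provider_status() {
  aws ecs describe-capacity-providers \
    --capacity-providers "$1" \
    --region "$2" \
    --query 'capacityProviders[].[name, status, updateStatus, updateStatusReason]' \
    --output text
}
```

An updateStatus value such as UPDATE_FAILED or DELETE_FAILED, together with the reason text, is usually the quickest pointer to why the capacity provider is stuck.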

Also, review the CloudTrail events for errors that are related to the CreateCapacityProvider API.

Make sure that the Auto Scaling group isn't stuck in a scaling loop

When the metric that your Amazon ECS service scaling policy tracks spikes above the target value, the Auto Scaling group scales out and launches instances. However, if the metric value drops after the sudden spike, then the Auto Scaling group scales in the instances. If the metric frequently fluctuates around the target value within a short period of time, then the Auto Scaling group gets stuck in a scaling loop. To avoid this issue, configure the target value to match your workload.
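To look for a scaling loop, you can review the recent activity history of the Auto Scaling group. The following sketch lists the start time, status, and description of recent scaling activities. The function name recent_scaling_activities is a hypothetical helper, and the limit of 20 items is an arbitrary choice.

```shell
# List recent scaling activities for the Auto Scaling group.
# Arguments: Auto Scaling group name, Region.
recent_scaling_activities() {
  aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name "$1" \
    --region "$2" \
    --max-items 20 \
    --query 'Activities[].[StartTime, StatusCode, Description]' \
    --output text
}
```

Launch and terminate activities that alternate within a few minutes of each other suggest that the group is in a scaling loop.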

Related information

Deep dive on Amazon ECS cluster auto scaling

How do I resolve errors when I delete a capacity provider in Amazon ECS?

Amazon ECS clusters for the AWS Fargate launch type

AWS OFFICIAL
Updated 4 months ago