FAQs: Amazon ECS tasks that are stuck in a lifecycle state


I want to know why my Amazon Elastic Container Service (Amazon ECS) task is stuck in a particular lifecycle state.

Q: Why is my Amazon ECS task stuck in the PROVISIONING state?

When a task uses the Amazon Elastic Compute Cloud (Amazon EC2) launch type and the ECS scheduler can't find a suitable container instance for it, the task might remain in the PROVISIONING state. This happens even if space is available on existing container instances for the task to run.

Tasks in the PROVISIONING state form a queue. Amazon ECS adds tasks to the queue that it can't deploy because of resource constraints related to memory, ports, elastic network interfaces, GPU, and CPUs. As additional resources become available, these tasks transition to the RUNNING state.

If you manage scaling for the capacity provider, then tasks don't immediately fail when they can't get sufficient resources on existing instances. Instead, the tasks enter the PROVISIONING state. If you don't manage scaling, then tasks that can't find capacity immediately fail.

Tasks that use the awsvpc network mode, including AWS Fargate tasks, might also remain in the PROVISIONING state when Amazon ECS can't provision elastic network interfaces for them.
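To see where a task currently sits in its lifecycle, you can query its last status, desired status, and stopped reason. The following is a minimal sketch that wraps the `aws ecs describe-tasks` command. The function name `task_status` and the example cluster and task identifiers are hypothetical.

```shell
# Sketch: print lastStatus, desiredStatus, and stoppedReason for one task.
# $1 = cluster name, $2 = task ID or ARN (both are placeholders you supply).
task_status() {
  aws ecs describe-tasks \
    --cluster "$1" \
    --tasks "$2" \
    --query 'tasks[0].[lastStatus,desiredStatus,stoppedReason]' \
    --output text
}

# Example: task_status my-cluster 0123456789abcdef0
```

A task that reports PROVISIONING as its last status for a long time is waiting on capacity or on an elastic network interface.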

Q: Why do my Amazon ECS tasks transition from the PROVISIONING state to the STOPPED state when they use a capacity provider?

An ECS task might get stuck in the PROVISIONING state, and then stop, for the following reasons:

  • Configuration errors
  • Resource constraints
  • Misalignment with the capacity provider's specifications

To troubleshoot these issues, take the following actions:

  • Check if the Auto Scaling group that's associated with the capacity provider reached its maximum instance count. If the group is at its limit, then Amazon ECS can't launch additional instances for the tasks.

  • If you're using the Amazon EC2 launch type with a launch template that's associated with your Auto Scaling group, then check the user data. Make sure that you have the correct cluster name in the following format:

    #!/bin/bash
    echo "ECS_CLUSTER=MyCluster" >> /etc/ecs/ecs.config

  • Make sure that the ecsInstanceRole role, with the correct trust relationship, is attached to the instances that are associated with your Auto Scaling group.

  • Check the events of the Amazon ECS service. If you see the following error, then the instance is launched but not set up for the attached load balancer:
    Target is in an Availability Zone that is not enabled for the load balancer

  • Make sure that the designated subnets for the task and the container instances that the capacity provider manages are in the same Availability Zone. They must be in the same Availability Zone for tasks that you launch with a capacity provider strategy in the awsvpc network mode.

  • For tasks that use the awsvpc network mode on EC2 Linux instances, elastic network interfaces aren't provided with public IP addresses. To access the internet, you must launch tasks in a private subnet that's configured to use a NAT gateway. Tasks that you launch in public subnets don't have access to the internet and might be stuck in the PROVISIONING state.

For more information, see View Amazon ECS service event messages.
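The first check in the preceding list can be sketched with the AWS CLI. The following example compares the desired capacity of an Auto Scaling group with its maximum size. The function name `asg_at_max` and the group name in the example are hypothetical, and the sketch assumes that the group exists.

```shell
# Sketch: report whether an Auto Scaling group is already at its maximum size.
# $1 = Auto Scaling group name (a placeholder you supply).
asg_at_max() {
  group_name="$1"
  desired=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$group_name" \
    --query 'AutoScalingGroups[0].DesiredCapacity' \
    --output text)
  max=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$group_name" \
    --query 'AutoScalingGroups[0].MaxSize' \
    --output text)
  if [ "$desired" -ge "$max" ]; then
    echo "at-max desired=$desired max=$max"
  else
    echo "has-headroom desired=$desired max=$max"
  fi
}

# Example: asg_at_max my-ecs-asg
```

If the function reports at-max, then the capacity provider has no room to add instances until you raise the maximum size of the group.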

Q: Why is my Amazon ECS task stuck in the PENDING state?

When Amazon ECS is waiting for the container agent to take further action, your task enters the PENDING state. The task stays in this state until resources are made available for the task.

For more information, see Why is my Amazon ECS task stuck in the PENDING state?

Q: Why is my Amazon ECS task stuck in the ACTIVATING state?

After the containers in the task start, Amazon ECS performs additional steps before the task transitions to the RUNNING state. During these steps, your task is in the ACTIVATING state. For example, Amazon ECS creates service discovery resources for tasks that use service discovery.

For tasks that are part of a service that uses multiple Elastic Load Balancing target groups, the target group registration occurs during this state. When the task is in the ACTIVATING state, Amazon ECS waits for the target group registration or the creation of service discovery resources to complete.

Q: Why is my task that uses service discovery stuck in the ACTIVATING state?

Check the AWS CloudTrail Event history for the corresponding service discovery service-related API calls, such as the following ones:

  • CreatePrivateDnsNamespace
  • CreateService
  • RegisterInstance
  • UpdateInstanceCustomHealthStatus

The preceding API actions are defined in the AmazonECSServiceRolePolicy AWS managed policy. If the API call failed with an error code, such as ServiceNotFound or InternalFailure, then make sure that you follow the service discovery considerations.
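As an alternative to the console's Event history, you can search recent events from the AWS CLI. The following is a minimal sketch that uses the `aws cloudtrail lookup-events` command to return events for one API action. The function name `lookup_sd_events` is hypothetical.

```shell
# Sketch: list recent CloudTrail events for one service discovery API action.
# $1 = event name, for example RegisterInstance.
# Each result is a JSON string; failed calls include an errorCode field.
lookup_sd_events() {
  event_name="$1"
  aws cloudtrail lookup-events \
    --lookup-attributes "AttributeKey=EventName,AttributeValue=$event_name" \
    --max-results 20 \
    --query 'Events[].CloudTrailEvent' \
    --output text
}

# Example: lookup_sd_events RegisterInstance | grep errorCode
```

Run the function once for each API action in the preceding list to find the call that failed.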

Q: Why is a task that uses the Elastic Load Balancing target group stuck in the ACTIVATING state?

When Amazon ECS takes a long time to determine the health of the container, the task might be stuck in the ACTIVATING state. To calculate the total time it takes, multiply the parameters HealthCheckIntervalSeconds and HealthyThresholdCount. To quicken the health-check process, reduce the number of checks and the interval between checks. Check the CloudTrail events to see if any errors or failures are reported for the corresponding Elastic Load Balancing RegisterTargets API calls.
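The calculation above can be sketched as simple shell arithmetic. The function name `min_healthy_seconds` is hypothetical, and the values 30 and 5 are example inputs only. You can read the real values for your target group with the `aws elbv2 describe-target-groups` command.

```shell
# Sketch: estimate how long the health checks take before a target is healthy.
# $1 = HealthCheckIntervalSeconds, $2 = HealthyThresholdCount
min_healthy_seconds() {
  interval="$1"
  threshold="$2"
  echo $(( interval * threshold ))
}

min_healthy_seconds 30 5   # prints 150
```

With these example values, a task waits at least 150 seconds for the health checks to pass. An interval of 10 seconds and a threshold of 2 reduce that wait to 20 seconds.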

Q: Why is my Amazon ECS task stuck in the RUNNING state when it's supposed to be in the STOPPED state?

Issues with the application or a misconfigured task can cause your task to remain in the RUNNING state. 

To troubleshoot these issues, take the following actions:

  • Review the application logs for any errors or information about why the container doesn't exit. After the essential container exits, the task transitions to the STOPPED state.
  • If the task is part of an Amazon ECS service, then verify that your DeploymentConfiguration parameters are correctly set.
  • If the task is a part of an Amazon ECS service that uses a load balancer, then verify that the deregistration delay parameter is correctly set.
  • Check whether the ECS_CONTAINER_STOP_TIMEOUT value is correctly set.
  • Turn on Amazon ECS Exec for your task. Then, use ECS Exec to log in to the container and troubleshoot the issue in the application.

For more information, see How do I troubleshoot Amazon ECS tasks that take a long time to stop when the container instance is set to DRAINING?
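The ECS Exec step in the preceding list can be sketched with two AWS CLI commands. The function names `enable_exec` and `open_shell` are hypothetical. The sketch assumes that the task belongs to a service, that the task role already has the permissions that ECS Exec requires, that the Session Manager plugin is installed locally, and that the container image includes `/bin/sh`.

```shell
# Sketch: turn on ECS Exec for a service, then open a shell in a container.
# Tasks that were already running must be replaced before ECS Exec works on
# them, so the service is redeployed.
# $1 = cluster name, $2 = service name
enable_exec() {
  aws ecs update-service \
    --cluster "$1" \
    --service "$2" \
    --enable-execute-command \
    --force-new-deployment
}

# $1 = cluster name, $2 = task ID, $3 = container name
open_shell() {
  aws ecs execute-command \
    --cluster "$1" \
    --task "$2" \
    --container "$3" \
    --interactive \
    --command "/bin/sh"
}

# Example: enable_exec my-cluster my-service
# Example: open_shell my-cluster 0123456789abcdef0 app
```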

Q: Why is my Amazon ECS task stuck in the DEACTIVATING state?

In the DEACTIVATING state, Amazon ECS performs additional steps before the task is stopped. For example, for tasks that are part of a service that uses multiple Elastic Load Balancing target groups, the target group deregistration occurs during this state.

The following reasons can cause your task to remain in the DEACTIVATING state:

  • You delete the resources that are related to the Elastic Load Balancing target groups before the task can deregister from the target groups.
  • The custom role that's specified as a parameter when you create the Amazon ECS service doesn't have the required Elastic Load Balancing permissions.
  • You delete the custom role while the task is active.

Verify that the target group and AWS Identity and Access Management (IAM) role that you specified in the service definition exist. Also, make sure that the IAM role includes the required trust policy and Elastic Load Balancing permissions.

To optimize the time it takes for a task to move from the DEACTIVATING state, tune the deregistration_delay.timeout_seconds parameter. For more information, see Load balancer connection draining.
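The following is a minimal sketch of how you might read and change the deregistration delay with the AWS CLI. The function names `get_dereg_delay` and `set_dereg_delay` are hypothetical, and the target group ARN and the value of 30 seconds in the example are placeholders.

```shell
# Sketch: read the current deregistration delay of a target group.
# $1 = target group ARN
get_dereg_delay() {
  aws elbv2 describe-target-group-attributes \
    --target-group-arn "$1" \
    --query "Attributes[?Key=='deregistration_delay.timeout_seconds'].Value" \
    --output text
}

# Sketch: set a new deregistration delay in seconds.
# $1 = target group ARN, $2 = delay in seconds
set_dereg_delay() {
  aws elbv2 modify-target-group-attributes \
    --target-group-arn "$1" \
    --attributes "Key=deregistration_delay.timeout_seconds,Value=$2"
}

# Example: set_dereg_delay TARGET-GROUP-ARN 30
```

A shorter delay lets tasks leave the DEACTIVATING state sooner, but it also gives in-flight requests less time to complete.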

Q: Why is my Amazon ECS task stuck in the DEPROVISIONING state?

Amazon ECS performs additional steps after the task stops but before the task transitions to the STOPPED state. For example, for tasks that use the awsvpc network mode, Amazon ECS detaches and deletes the elastic network interface.

The following reasons can cause your task to remain in the DEPROVISIONING state:

  • You deleted the service-linked roles that are associated with the service before the task is terminated. The service predefines the service-linked roles and includes all permissions that the service requires to call other AWS services on your behalf. Run the following command to check if a service-linked role is associated with your service:

    aws ecs describe-services --cluster CLUSTER-NAME --services SERVICE-NAME

  • You deleted the target group before the tasks deregister from the target group.

AWS OFFICIAL | Updated 25 days ago