Why is my Amazon ECS task stuck in the PENDING state?


My Amazon Elastic Container Service (Amazon ECS) task is stuck in the PENDING state.

Short description

The following scenarios cause Amazon ECS tasks to get stuck in the PENDING state:

  • The Docker daemon is unresponsive.
  • There's a resource constraint in the cluster.
  • The Docker image is large.
  • The Amazon ECS container agent lost connectivity with the Amazon ECS service in the middle of a task launch.
  • The Amazon ECS container agent is taking a long time to stop an existing task.
  • You didn't correctly configure your Amazon Virtual Private Cloud (Amazon VPC) routing.
  • An essential container depends on non-essential containers that aren't in the HEALTHY state.
  • The AWS Identity and Access Management (IAM) role that you associated with your Amazon ECS tasks is missing or incorrect.
  • There are image compatibility issues with the Windows version that you selected.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

The Docker daemon is unresponsive or there's a resource constraint in the cluster

In the task definition, check whether the task requests more memory or CPU than the container instance has capacity to support. Adjust your container instance resources based on your needs.
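
For example, you can compare the resources that the task definition requests with the resources that remain on the container instance. The cluster name, task definition name, and container instance ID in the following commands are placeholders:

aws ecs describe-task-definition --task-definition my-task-def \
  --query 'taskDefinition.containerDefinitions[].[name,cpu,memory,memoryReservation]'

aws ecs describe-container-instances --cluster my-cluster \
  --container-instances CONTAINER_INSTANCE_ID \
  --query 'containerInstances[].{registered:registeredResources,remaining:remainingResources}'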

For CPU issues, complete the following steps:

  1. Use Amazon CloudWatch metrics to check whether your container instance exceeded its available CPU (see the example command after these steps).
  2. Increase the size of your container instance as needed.
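
For example, the following command retrieves the average and maximum CPU utilization for a container instance over one hour. The instance ID and time range are placeholders:

aws cloudwatch get-metric-statistics --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T01:00:00Z \
  --period 300 --statistics Average Maximum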

For memory issues, complete the following steps:

  1. Run the free command to see how much memory is available for your system.
  2. Increase the size of your container instance as needed.
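
In addition to the free command, you can use CloudWatch to check the memory that's reserved across the cluster. The cluster name and time range in the following example command are placeholders:

aws cloudwatch get-metric-statistics --namespace AWS/ECS \
  --metric-name MemoryReservation \
  --dimensions Name=ClusterName,Value=my-cluster \
  --start-time 2024-01-01T00:00:00Z --end-time 2024-01-01T01:00:00Z \
  --period 300 --statistics Average Maximum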

For I/O issues, complete the following steps:

  1. Run the iotop command.
  2. Identify the tasks in each service that use the most input/output operations per second (IOPS).
  3. Use task placement constraints and strategies to distribute the tasks to different container instances.
    -or-
    Use CloudWatch to create an alarm for your Amazon Elastic Block Store (Amazon EBS) burst balance metrics. Then, use an AWS Lambda function or your own custom logic to balance tasks.
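
For example, the following command creates an alarm that activates when an EBS volume's burst balance drops below 20 percent. The volume ID, threshold, and SNS topic ARN are placeholders:

aws cloudwatch put-metric-alarm --alarm-name ebs-burst-balance-low \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --statistic Average --period 300 --evaluation-periods 1 \
  --threshold 20 --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:111111111111:my-alerts-topic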

The Docker image is large

Large images take longer to download and increase the amount of time that the task is in the PENDING state.

To shorten the time that tasks spend in the PENDING state, tune the ECS_IMAGE_PULL_BEHAVIOR parameter to use image caching. For example, set the ECS_IMAGE_PULL_BEHAVIOR parameter to prefer-cached in /etc/ecs/ecs.config. With prefer-cached, Amazon ECS pulls the image remotely only when there's no cached image on the instance. Otherwise, Amazon ECS uses the cached image.
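
For example, on an Amazon Linux 2 container instance, you can append the setting to the agent configuration file and then restart the agent so that the change takes effect:

echo "ECS_IMAGE_PULL_BEHAVIOR=prefer-cached" | sudo tee -a /etc/ecs/ecs.config
sudo systemctl restart ecs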

The Amazon ECS container agent lost connectivity with the Amazon ECS service in the middle of a launch

To verify the Amazon ECS container agent's status and connectivity, run the following commands on your container instance based on your Amazon Linux version.

Amazon Linux 1:

sudo status ecs
sudo docker ps -f name=ecs-agent

Amazon Linux 2:

sudo systemctl status ecs
sudo docker ps -f name=ecs-agent

If the status in the output is inactive, then the agent isn't running. To resolve this issue, run the following commands to restart your container agent.

Amazon Linux 1:

sudo stop ecs
sudo start ecs

Amazon Linux 2:

sudo systemctl stop ecs
sudo systemctl start ecs

On Amazon Linux 1, you receive an output that's similar to the following message:

ecs start/running, process abcd

To determine agent connectivity, check the following logs during the relevant timeframe for keywords such as error, warn, or agent transition state:

  • View the Amazon ECS container agent log at /var/log/ecs/ecs-agent.log.yyyy-mm-dd-hh.
  • View the Amazon ECS init log at /var/log/ecs/ecs-init.log.
  • View the Docker logs at /var/log/docker.

Use the information in the logs to identify the root cause of the connectivity issues.

Note: You can also use the Amazon ECS logs collector to collect general operating system (OS) logs, Docker logs, and container agent logs for Amazon ECS.
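
For example, assuming that the collector script's location in the awslabs GitHub repository hasn't changed, you can download and run it on the container instance as follows:

curl -O https://raw.githubusercontent.com/awslabs/ecs-logs-collector/master/ecs-logs-collector.sh
sudo bash ./ecs-logs-collector.sh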

To check the agent's local status in real time, run the following command on the container instance to view the container instance metadata:

curl http://localhost:51678/v1/metadata

You receive an output similar to the following example:

{  "Cluster": "CLUSTER_ID",
  "ContainerInstanceArn": "arn:aws:ecs:REGION:ACCOUNT_ID:container-instance/TASK_ID",
  "Version": "Amazon ECS Agent - AGENT "
}

In the output, make sure that the task environment variables, CPU, memory, and IAM role configuration are correct. Also, make sure that the task has the required secrets.

To view detailed information about all tasks running in the service, run the following command:

curl http://localhost:51678/v1/tasks

You receive an output similar to the following example:

{  "Tasks": [
    {
      "Arn": "arn:aws:ecs:REGION:ACCOUNT_ID:task/TASK_ID",
      "DesiredStatus": "RUNNING",
      "KnownStatus": "RUNNING",
      ... ...
    }
  ]
}

In the preceding command outputs, check whether there are differences between the local agent and the Amazon ECS service. Use this information to identify where and why the task is stuck.
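
To see the Amazon ECS service's view of the same task, you can describe the task from the AWS CLI and compare the statuses with the local agent output. The cluster name and task ID are placeholders:

aws ecs describe-tasks --cluster my-cluster --tasks TASK_ID \
  --query 'tasks[].{lastStatus:lastStatus,desiredStatus:desiredStatus}'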

The Amazon ECS container agent takes a long time to stop an existing task

When Amazon ECS launches new tasks, the container agent might first have to stop existing tasks. In this case, the agent doesn't move the new tasks from the PENDING state to the RUNNING state until it stops the existing tasks.

To control the container stop and start timeouts at the container instance level, set the ECS_CONTAINER_STOP_TIMEOUT and ECS_CONTAINER_START_TIMEOUT environment variables in /etc/ecs/ecs.config. ECS_CONTAINER_STOP_TIMEOUT sets the amount of time that passes before Amazon ECS forcibly ends your containers if they don't exit on their own. The default stop timeout value for Linux and Windows is 30 seconds. ECS_CONTAINER_START_TIMEOUT sets the amount of time that passes before the Amazon ECS container agent stops trying to start the container. The default start timeout value is 3 minutes for Linux and 8 minutes for Windows.
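
For example, the following commands set the stop timeout to 2 minutes and the start timeout to 5 minutes on an Amazon Linux 2 container instance, and then restart the agent. The timeout values are illustrative:

echo "ECS_CONTAINER_STOP_TIMEOUT=2m" | sudo tee -a /etc/ecs/ecs.config
echo "ECS_CONTAINER_START_TIMEOUT=5m" | sudo tee -a /etc/ecs/ecs.config
sudo systemctl restart ecs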

If your agent version is 1.26.0 or later, then you can define the stop and start timeout parameters for each container in the task definition. Note that when you use these parameters, the task might change to the STOPPED state. For example, container A has a dependency on container B reaching a COMPLETE, SUCCESS, or HEALTHY state. If you specify a startTimeout value for container B and container B doesn't reach the required state within that time, then container A doesn't start.

For an example of container dependency, see Example: Container dependency on the GitHub website.
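
As a rough, hypothetical illustration of these parameters, the relevant containerDefinitions fields look similar to the following fragment. The container names and timeout values are placeholders:

"containerDefinitions": [
  {
    "name": "containerB",
    "essential": false,
    "healthCheck": { ... },
    "startTimeout": 120,
    "stopTimeout": 60
  },
  {
    "name": "containerA",
    "essential": true,
    "dependsOn": [
      {
        "containerName": "containerB",
        "condition": "HEALTHY"
      }
    ]
  }
]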

You didn't correctly configure your Amazon VPC routing

Check the configuration for the VPC subnet that your Amazon ECS or AWS Fargate tasks run in. Your subnet must have access to Amazon ECS or Amazon Elastic Container Registry (Amazon ECR). To resolve configuration issues, make sure that the route table for your subnet has a route to an internet gateway or a NAT gateway. If you launch a task in a subnet that doesn't have an outbound route to the internet, then use AWS PrivateLink. This configuration allows you to access Amazon ECS APIs with private IP addresses.

Also, make sure that your security group rules allow inbound and outbound communication over your configuration's required ports.
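
To confirm the routes, you can describe the route table that's associated with the task subnet. The subnet ID is a placeholder:

aws ec2 describe-route-tables \
  --filters Name=association.subnet-id,Values=subnet-0123456789abcdef0 \
  --query 'RouteTables[].Routes'

If the subnet isn't explicitly associated with a route table, then the command returns no results. In that case, check the VPC's main route table.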

An essential container depends on non-essential containers that aren't in the HEALTHY state

If a non-essential container that an essential container depends on fails to reach a HEALTHY state, then your task becomes stuck in PENDING. You receive the "stoppedReason":"Service ABCXYZ: task last status remained in PENDING too long" message.

To resolve this issue, make sure that your non-essential containers work as expected. If you can't resolve the underlying issue, then update the task definition for the containers, and set the essential parameter to true. If the task is still stopped, then check the stopped reason. For more troubleshooting steps, see Why is my Amazon ECS task stopped?
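
To find the stopped reason from the AWS CLI, you can list the stopped tasks for the service and then describe one of them. The cluster, service, and task identifiers are placeholders:

aws ecs list-tasks --cluster my-cluster --service-name my-service --desired-status STOPPED

aws ecs describe-tasks --cluster my-cluster --tasks TASK_ID \
  --query 'tasks[].stoppedReason'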

The IAM role is missing or misconfigured

If the task is placed on a container instance that doesn't have the required permissions, then you receive an error that's similar to the following example:

"(service test) failed to launch a task with (error ECS was unable to assume the role 'arn:aws:iam::111111111111:role/test-fTa-T3J4hVnyL53E' that was provided for this task. Please verify that the role being passed has the proper trust relationship and permissions and that your IAM user has permissions to pass this role.)"

To resolve this issue, make sure that the container instance has the required permissions.
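
To check the trust relationship on the task role from the error message, you can retrieve the role's trust policy. The role name below is taken from the example error and is a placeholder. For task roles, the trust policy must allow the ecs-tasks.amazonaws.com service principal to assume the role:

aws iam get-role --role-name test-fTa-T3J4hVnyL53E --query 'Role.AssumeRolePolicyDocument'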

Also, if you don't use an Amazon ECS optimized Amazon Machine Image (AMI) for your container instances, then check your Amazon ECS agent configurations.

There are image compatibility issues with the Windows version that you selected

Tasks fail when the image that you use in Windows Fargate tasks isn't compatible with your platform. To check whether your image is compatible with the Windows Server host, see Windows container version compatibility on the Microsoft website. Then, check the prerequisites to run the Windows tasks.

Also, make sure that the image URL that you defined is accurate.
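
One way to check which Windows build an image targets is to inspect the image manifest with Docker and compare the os.version value under each platform entry with your host's Windows version. The image URI is a placeholder:

docker manifest inspect IMAGE_URI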

Related information

Container dependency

amazon-ecs-agent on the GitHub website
