How do I troubleshoot my failed Amazon ECS deployments?

I want to troubleshoot why my Amazon Elastic Container Service (Amazon ECS) deployment fails.

Short description

When you deploy a new version of your application, Amazon ECS might encounter deployment gridlock. The older tasks continue to run, but the new tasks can't reach a stable state. To find the cause of a failed deployment, review the following areas:

  • Resource constraints
  • Container image issues
  • Network issues
  • Task definition issues
  • Amazon CloudWatch Logs
  • Health check configuration

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

Resource constraints

Your Amazon ECS service deployment might fail when there isn't enough CPU or memory capacity available where you deploy your task. A RESOURCE:* error in your service events, together with Amazon ECS tasks that don't transition to the RUNNING state, indicates a resource constraint.
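To look for RESOURCE:* errors in the service events from the command line, you can use something like the following sketch. The function names recent_service_events and filter_resource_errors are hypothetical helpers, not AWS CLI commands, and the limit of 20 events is an arbitrary choice.

```shell
# Print the 20 most recent service event messages, one per line.
# Usage: recent_service_events your-cluster-name your-service-name
recent_service_events() {
  aws ecs describe-services \
    --cluster "$1" \
    --services "$2" \
    --query 'services[0].events[:20].[message]' \
    --output text
}

# Read event messages on stdin and keep only the lines that mention a
# RESOURCE:* error, such as RESOURCE:CPU or RESOURCE:MEMORY.
filter_resource_errors() {
  grep -E 'RESOURCE:[A-Z]+' || true
}

# Example (hypothetical names):
#   recent_service_events your-cluster-name your-service-name | filter_resource_errors
```

If the filter prints nothing, then the recent events don't contain a RESOURCE:* error and the cause is likely elsewhere.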

To resolve this issue, take the following actions:

  • Use Amazon CloudWatch metrics to monitor your Amazon ECS cluster resource capacity.
  • Set up Container Insights for detailed task and container level resource usage metrics.
  • Make sure that your Auto Scaling group settings align with your workload requirements for your Amazon Elastic Compute Cloud (Amazon EC2) launch type clusters. Monitor instance capacity and configure scaling policies based on CPU usage, memory usage, or custom metrics.
  • For tasks that use the awsvpc network mode, review the elastic network interface capacity of your container instances. Each task requires its own elastic network interface, so use instance types with higher elastic network interface limits for dense deployments. Plan your CIDR ranges and monitor elastic network interface usage to make sure that sufficient IP addresses are available in the subnets where you launch your Amazon ECS tasks.

To check for available resources on container instances, run the list-container-instances AWS CLI command:

aws ecs list-container-instances --cluster your-cluster-name

Note: Replace your-cluster-name with your cluster's name.

Then, run the describe-container-instances command:

aws ecs describe-container-instances --cluster your-cluster-name --container-instances container-instance-id

Note: Replace your-cluster-name with your cluster's name and container-instance-id with the ID or ARN of your container instance.
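The two commands can be combined into one step. The following is a minimal sketch under a few assumptions: the helper names remaining_resources and task_fits are hypothetical, the cluster has no more than 100 container instances (the limit for a single describe-container-instances call), and only CPU and memory are compared. Ports, elastic network interfaces, and GPUs aren't checked.

```shell
# Print the remaining CPU units and memory (MiB) for every container
# instance in a cluster, one instance per line:
#   <ec2-instance-id> <remaining-cpu> <remaining-memory>
# Usage: remaining_resources your-cluster-name
remaining_resources() {
  cluster="$1"
  arns="$(aws ecs list-container-instances --cluster "$cluster" \
    --query 'containerInstanceArns[]' --output text)"
  [ -n "$arns" ] || { echo "no container instances in $cluster" >&2; return 1; }
  # $arns is intentionally unquoted so that each ARN becomes its own argument.
  aws ecs describe-container-instances --cluster "$cluster" \
    --container-instances $arns \
    --query "containerInstances[].[ec2InstanceId, remainingResources[?name=='CPU'].integerValue | [0], remainingResources[?name=='MEMORY'].integerValue | [0]]" \
    --output text
}

# Succeed when an instance with the given remaining CPU units and memory
# can hold a task that reserves task_cpu and task_memory.
# Usage: task_fits remaining_cpu remaining_memory task_cpu task_memory
task_fits() {
  [ "$1" -ge "$3" ] && [ "$2" -ge "$4" ]
}

# Example: list the instances that can place a task with 512 CPU units
# and 1024 MiB of memory (illustrative values).
#   remaining_resources your-cluster-name | while read -r id cpu mem; do
#     task_fits "$cpu" "$mem" 512 1024 && echo "$id has room"
#   done
```

If no instance has room, then the deployment can't place new tasks until capacity is added or older tasks stop.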

Container image issues

When Amazon ECS can't pull images from an image repository, the tasks fail with a CannotPullContainerError error in your service events. You might also see related errors in the agent logs of your container instance.

To resolve this error, confirm the following task definition configurations for your Amazon ECS task and the networking configuration for the service:

  • Configure the container image URI.
    For Amazon Elastic Container Registry (Amazon ECR), confirm that the image URI matches the following format:

    account-id.dkr.ecr.region.amazonaws.com/repository-name:tag

    Note: Replace account-id with your account ID, region with your AWS Region, repository-name with your repository's name, and tag with your image tag.

    For Docker Hub, confirm that the image name matches the following format:

    repository/image:tag

    Note: Replace repository with your repository's name, image with your image's name, and tag with your image tag.

  • Verify that the Amazon ECS task execution IAM role has the AmazonECSTaskExecutionRolePolicy managed policy attached.

  • Confirm that there aren't missing or incorrect environment placeholders in your task definition.

  • When you deploy the Amazon ECS service in a private subnet, associate the following VPC endpoints with the same subnet and security group.
    Interface endpoints:
      • For Amazon ECS, use com.amazonaws.region.ecs.
      • For CloudWatch Logs, use com.amazonaws.region.logs.
      • For Amazon ECR (Docker), use com.amazonaws.region.ecr.dkr.
      • For Amazon ECR API, use com.amazonaws.region.ecr.api.
    Gateway endpoints:
      • For Amazon S3, use com.amazonaws.region.s3.

    Note: Replace region with your AWS Region, for example us-east-1.
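Some of the checks in this list can be scripted. The following sketch makes several assumptions: the function names are hypothetical, the URI check expects a 12-digit account ID with a tag (digest references and non-standard Region domains aren't matched), and the role name and VPC ID are values that you supply.

```shell
# Succeed when the argument looks like an Amazon ECR image URI of the form
# account-id.dkr.ecr.region.amazonaws.com/repository-name:tag
is_ecr_image_uri() {
  printf '%s\n' "$1" | grep -Eq \
    '^[0-9]{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com/[a-z0-9._/-]+:[A-Za-z0-9_.-]+$'
}

# List the managed policies attached to the task execution role so that you
# can confirm AmazonECSTaskExecutionRolePolicy is among them.
# Usage: list_execution_role_policies your-execution-role-name
list_execution_role_policies() {
  aws iam list-attached-role-policies --role-name "$1" \
    --query 'AttachedPolicies[].[PolicyName]' --output text
}

# List the VPC endpoints in a VPC with their type and state.
# Usage: list_vpc_endpoints your-vpc-id
list_vpc_endpoints() {
  aws ec2 describe-vpc-endpoints \
    --filters "Name=vpc-id,Values=$1" \
    --query 'VpcEndpoints[].[ServiceName, VpcEndpointType, State]' \
    --output text
}
```

A URI that fails is_ecr_image_uri isn't necessarily wrong, but it's worth a second look before you redeploy.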

Network issues

Network issues occur when containers can't communicate with external services, service discovery fails, or tasks can't reach required endpoints. You might see timeout errors in the application logs, DNS resolution failures, or connectivity issues between containers and other AWS services. To resolve this issue, take the following actions:

  • Verify that the inbound rules of the security group for your Amazon ECS service allow traffic to your containers.
  • Confirm that the security group rules are correctly configured between services.
  • Confirm that the public subnets have routes to the internet through an internet gateway.
  • Verify your NAT gateway configurations for private subnets.
  • Check that the CIDR ranges have sufficient IP addresses available in the subnets.
  • If you use an Application Load Balancer, make sure that the security group for your Amazon ECS tasks has an inbound rule that allows traffic from the Application Load Balancer security group.
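The check for available IP addresses can be scripted. The following is a rough sketch: subnet_free_ips and low_ip_subnets are hypothetical helper names, and the threshold is a value that you choose based on how many tasks you expect to launch.

```shell
# Print "<subnet-id> <available-ip-count>" for each subnet.
# Usage: subnet_free_ips subnet-id [subnet-id ...]
subnet_free_ips() {
  aws ec2 describe-subnets --subnet-ids "$@" \
    --query 'Subnets[].[SubnetId, AvailableIpAddressCount]' \
    --output text
}

# Read "<subnet-id> <count>" lines on stdin and print the IDs of the subnets
# whose free IP address count is below the threshold in the first argument.
# Usage: subnet_free_ips subnet-id ... | low_ip_subnets 10
low_ip_subnets() {
  awk -v min="$1" '$2 + 0 < min + 0 { print $1 }'
}
```

Any subnet that low_ip_subnets prints might not have enough addresses for new tasks that use the awsvpc network mode.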

CloudWatch Logs

To review the CloudWatch Logs to troubleshoot a failed Amazon ECS task, complete the following steps:

  1. Open the Amazon ECS console.
  2. In the navigation pane, choose Clusters.
  3. Select your cluster.
  4. Choose the Tasks tab.
  5. Select the Task ID for the task that failed.
  6. Check the Stopped Status to determine why the container failed.
    Note: A stopped task remains visible in the Amazon ECS console for only a limited time (at least 1 hour) after the task stops.
  7. Troubleshoot your container failure.
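The same information is available from the AWS CLI. The following sketch uses hypothetical function names. The last helper assumes that the task definition uses the awslogs log driver with a stream prefix, in which case the log stream name takes the form prefix/container-name/task-id.

```shell
# Extract the task ID from a task ARN such as
# arn:aws:ecs:region:account-id:task/cluster-name/task-id
# A bare task ID is returned unchanged.
task_id_from_arn() {
  printf '%s\n' "${1##*/}"
}

# List the ARNs of recently stopped tasks in a cluster, one per line.
# Usage: list_stopped_tasks your-cluster-name
list_stopped_tasks() {
  aws ecs list-tasks --cluster "$1" --desired-status STOPPED \
    --query 'taskArns[]' --output text | tr '\t' '\n'
}

# Show why a task stopped, plus the exit code and reason for each container.
# Usage: stopped_task_reason your-cluster-name task-arn-or-id
stopped_task_reason() {
  aws ecs describe-tasks --cluster "$1" --tasks "$2" \
    --query 'tasks[0].{stoppedReason: stoppedReason, containers: containers[].{name: name, exitCode: exitCode, reason: reason}}' \
    --output json
}

# Print the log messages for one container of a task.
# Usage: task_log_events log-group stream-prefix container-name task-arn-or-id
task_log_events() {
  aws logs get-log-events --log-group-name "$1" \
    --log-stream-name "$2/$3/$(task_id_from_arn "$4")" \
    --query 'events[].[message]' --output text
}
```

A typical sequence is to run list_stopped_tasks, pass one of the ARNs to stopped_task_reason, and then read the container's logs with task_log_events.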

Health check configuration

Review the following health check settings for your Application Load Balancer or Network Load Balancer target group and for your Amazon ECS service:

  • Confirm that the HealthCheckTimeoutSeconds setting of the target group is long enough for your container to respond. If the Amazon ECS task fails the load balancer health check within a short period of time, then increase this value.
  • Make sure that the HealthCheckGracePeriodSeconds setting of the Amazon ECS service is long enough for your container to start.
  • Check that the application container responds with status code 200 at the HealthCheckPath configured in the load balancer. For more information, see the HealthCheckPath setting in Health checks for Application Load Balancer target groups.
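To inspect these settings and test the health check path from the command line, you can adapt the following sketch. The function names are hypothetical, the URL that you pass to http_status must be reachable from where you run the command, and the values in the commented update commands are illustrative only.

```shell
# Show the health check settings of a target group.
# Usage: show_target_group_health_check your-target-group-arn
show_target_group_health_check() {
  aws elbv2 describe-target-groups --target-group-arns "$1" \
    --query 'TargetGroups[0].{path: HealthCheckPath, timeoutSeconds: HealthCheckTimeoutSeconds, intervalSeconds: HealthCheckIntervalSeconds}' \
    --output json
}

# Show the health check grace period of a service, in seconds.
# Usage: show_grace_period your-cluster-name your-service-name
show_grace_period() {
  aws ecs describe-services --cluster "$1" --services "$2" \
    --query 'services[0].healthCheckGracePeriodSeconds' --output text
}

# Print the HTTP status code that a URL returns. curl prints 000 when it
# can't connect or the request times out after 10 seconds.
# Usage: http_status http://your-host:your-port/your-health-check-path
http_status() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# Succeed only when the status code is 200.
is_healthy_status() {
  [ "$1" = "200" ]
}

# To change the settings (illustrative values):
#   aws elbv2 modify-target-group --target-group-arn your-target-group-arn \
#     --health-check-timeout-seconds 10
#   aws ecs update-service --cluster your-cluster-name --service your-service-name \
#     --health-check-grace-period-seconds 120
```

For example, is_healthy_status "$(http_status http://your-host/your-health-check-path)" succeeds only when the application answers with status code 200.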

To find failed health checks, complete the following steps:

  1. Open the Amazon EC2 console.
  2. In the navigation pane, expand Load Balancing, and then choose Target Groups.
  3. Select your Target group name.
  4. Review the Details of your target group for healthy or unhealthy instances.
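As an alternative to the console, the describe-target-health command reports the state of each registered target. The following is a minimal sketch in which target_health and unhealthy_only are hypothetical helper names.

```shell
# Print one line per registered target:
#   <target-id> <port> <state> <reason>
# Usage: target_health your-target-group-arn
target_health() {
  aws elbv2 describe-target-health --target-group-arn "$1" \
    --query 'TargetHealthDescriptions[].[Target.Id, Target.Port, TargetHealth.State, TargetHealth.Reason]' \
    --output text
}

# Read target_health output on stdin and keep only the targets whose state
# (third column) isn't "healthy".
# Usage: target_health your-target-group-arn | unhealthy_only
unhealthy_only() {
  awk '$3 != "healthy"'
}
```

The reason column, for example Target.Timeout or Target.ResponseCodeMismatch, usually indicates which health check setting to review.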

To troubleshoot your unhealthy instances, see Troubleshoot your Application Load Balancers.

Related information

AWS::ECS::Service

AWS OFFICIAL
Updated 13 days ago