How do I troubleshoot a disconnected Amazon ECS agent?
My container instances for Amazon Elastic Container Service (Amazon ECS) are disconnected.
Short description
Your Amazon ECS container agent might disconnect and reconnect several times an hour as part of normal operation. These state changes are expected and aren't a cause for concern. Disconnections that last only a few minutes don't necessarily indicate an issue with the container agent or your container instance. However, if the container agent remains disconnected for a longer time, then the container instance can't operate as part of your Amazon ECS cluster. A prolonged disconnection can have the following causes:
- Networking issues prevent communication between the instance and Amazon ECS.
- The container agent doesn't have the required AWS Identity and Access Management (IAM) permissions to communicate with Amazon ECS endpoints.
- There are problems with the host or Docker daemon inside the container instance.
- There is resource contention in the underlying host.
Note: It's a best practice to use the latest version of the Amazon ECS container agent when possible. For more information, see Container instance lifecycle.
Resolution
Note: The following resolution applies to Amazon ECS-optimized Amazon Linux 2 AMIs. For a resolution that applies to Amazon ECS-optimized Amazon Linux 1 AMIs, see Why are my Amazon ECS container instances with Amazon Linux 1 AMIs disconnected?
You can connect to your Amazon EC2 instances using SSH keys. If you don't have SSH keys, you can connect to your instance using Session Manager. By default, AWS Systems Manager Agent is installed on Amazon Linux 2 AMIs and on the Amazon ECS-optimized Amazon Linux 2 base AMI.
Verify that the container agent is running on the container instance
To verify the status and connectivity of the Amazon ECS container agent, run either of the following commands on your container instance:
sudo systemctl status ecs
sudo docker ps -f name=ecs-agent
The output specifies active (running) and looks similar to the following:
ecs.service - Amazon Elastic Container Service - container agent
   Loaded: loaded (/usr/lib/systemd/system/ecs.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2022-02-15 15:51:09 UTC; 37min ago
     Docs: https://aws.amazon.com/documentation/ecs/
  Process: 30039 ExecStopPost=/usr/libexec/amazon-ecs-init post-stop (code=exited, status=0/SUCCESS)
  Process: 29987 ExecStop=/usr/libexec/amazon-ecs-init stop (code=exited, status=0/SUCCESS)
  Process: 30077 ExecStartPre=/usr/libexec/amazon-ecs-init pre-start (code=exited, status=0/SUCCESS)
 Main PID: 30123 (amazon-ecs-init)
    Tasks: 5
   Memory: 3.7M
   CGroup: /system.slice/ecs.service
           └─30123 /usr/libexec/amazon-ecs-init start

CONTAINER ID   IMAGE                            COMMAND    CREATED      STATUS                 PORTS   NAMES
eb1dc8d4ab3b   amazon/amazon-ecs-agent:latest   "/agent"   3 days ago   Up 3 days (healthy)            ecs-agent
If the agent isn't running or remains disconnected, then restart the ECS agent by running the following command:
sudo systemctl restart ecs
Note: This command doesn't return any output.
To verify that the agent is running, run the following command:
sudo systemctl status ecs
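The check-then-restart steps above can be combined into a small helper. The following is a minimal sketch, assuming a systemd-based instance such as Amazon Linux 2. The function names unit_action and ensure_unit are hypothetical and aren't part of any Amazon ECS tooling.

```shell
#!/bin/sh
# Map the one-word state printed by `systemctl is-active` to a suggested action.
# unit_action is a hypothetical helper name used only for illustration.
unit_action() {
  case "$1" in
    active)               echo "ok" ;;
    activating|reloading) echo "wait" ;;
    *)                    echo "restart" ;;
  esac
}

# Restart a unit only when it is not already running, then print its state.
ensure_unit() {
  unit="$1"
  state=$(systemctl is-active "$unit" 2>/dev/null)
  if [ "$(unit_action "$state")" = "restart" ]; then
    sudo systemctl restart "$unit"
  fi
  systemctl is-active "$unit"
}
```

For example, ensure_unit ecs checks the container agent, and the same helper accepts any other systemd unit name.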
Verify that the Docker service is running on the container instance
To verify that the Docker service is running on the affected container instance, run the following command:
sudo systemctl status docker
The output specifies active (running) and looks similar to the following:
docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2022-02-11 17:42:32 UTC; 3 days ago
     Docs: https://docs.docker.com
  Process: 4307 ExecStartPre=/usr/libexec/docker/docker-setup-runtimes.sh (code=exited, status=0/SUCCESS)
  Process: 4296 ExecStartPre=/bin/mkdir -p /run/docker (code=exited, status=0/SUCCESS)
 Main PID: 4315 (dockerd)
    Tasks: 24
   Memory: 360.5M
   CGroup: /system.slice/docker.service
           ├─4315 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=32768:65536
           ├─6010 /usr/bin/docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 80 -container-ip 172.17.0.2 -container-port 80
           └─6016 /usr/bin/docker-proxy -proto tcp -host-ip :: -host-port 80 -container-ip 172.17.0.2 -container-port 80
If the Docker service is inactive, then run the following command to restart the Docker service:
sudo systemctl restart docker
Note: The command doesn't return any output.
To verify that the Docker service has restarted, run the following command:
sudo systemctl status docker
Review log files for the container agent and Docker
If your container instance is still disconnected, then review the log files on the container host for the container agent and Docker.
Check the following log files for keywords, such as "error", "warn", or "agent transition state":
- View the Amazon ECS container agent's latest logs at /var/log/ecs/ecs-agent.log
  Note: Rotated logs are stored with a timestamp suffix, at /var/log/ecs/ecs-agent.log.timestamp
- View the Amazon ECS init log at /var/log/ecs/ecs-init.log
- View the userdata execution logs at /var/log/cloud-init.log
- View the Docker Daemon logs with the command sudo journalctl -u docker
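Searching these files for the keywords named above can be scripted. The following is a minimal sketch; scan_log is a hypothetical helper name, and the keyword list is taken directly from the text above.

```shell
#!/bin/sh
# Print matching lines, prefixed with their line numbers, for any of the
# keywords "error", "warn", or "agent transition state" (case-insensitive).
# scan_log is a hypothetical helper name; pass it the path of a log file.
scan_log() {
  grep -n -i -E 'error|warn|agent transition state' "$1"
}
```

For example, scan_log /var/log/ecs/ecs-agent.log | tail -n 20 shows the 20 most recent matches in the agent log. Reading files under /var/log might require sudo, depending on file permissions.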
If you are using Linux, you can also review the exit code for more information on the stopped agent container. To get the exit code, run the following command:
docker inspect <your container ID>
Replace <your container ID> with the ID of the stopped container.
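The full docker inspect output is long, so it can help to print only the exit code. The following is a minimal sketch that uses the --format option of docker inspect with the State.ExitCode and State.OOMKilled fields. The helper names are hypothetical, and the interpretation of each code is a rough guide, not a definitive diagnosis.

```shell
#!/bin/sh
# Print only the exit code and the OOM-killed flag of a container.
# agent_exit_info is a hypothetical helper name; pass it a container ID.
agent_exit_info() {
  docker inspect --format '{{.State.ExitCode}} {{.State.OOMKilled}}' "$1"
}

# Give a rough reading of common exit codes.
# Codes above 128 conventionally mean 128 plus a signal number.
describe_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    137) echo "killed by SIGKILL (often the kernel OOM killer or a forced stop)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)   echo "exited with code $1; check the agent log" ;;
  esac
}
```

Depending on how Docker is set up on the instance, agent_exit_info might need to be run by a user with access to the Docker socket.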
Note: You can choose to use the Amazon ECS logs collector to collect general operating system logs, Docker logs, and container agent logs for Amazon ECS.
Verify that the IAM instance profile has the necessary permissions
If the container agent is still disconnected, verify that the IAM instance profile associated with the container instance has the necessary IAM permissions.
1. Connect to the instance using SSH or Session Manager.
2. To view the instance metadata on the instance profile associated with the instance, run the following command:
curl http://169.254.169.254/latest/meta-data/iam/info
The output looks similar to the following:
{
  "Code" : "Success",
  "LastUpdated" : "2022-02-16T22:42:17Z",
  "InstanceProfileArn" : "arn:aws:iam::1122334455:instance-profile/ecsInstanceRole",
  "InstanceProfileId" : "AIPA4VIZXOFF55F72XIZN"
}
3. Verify that the IAM role contains the correct permissions for your container instances.
4. To check for specific credential errors, review the container agent log for the period when the issue occurred. Run a command similar to the following:
cat /var/log/ecs/ecs-agent.log.YYYY-MM-DD-**
Be sure to replace YYYY-MM-DD-** with the relevant timestamp.
Note: The container agent log is rotated every hour. The suffix automatically changes to reflect the current date and time. Update the command to include the date range and log ID for when the issue occurred.
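To identify which instance profile to review in IAM, you can extract its name from the metadata output shown in step 2. The following is a minimal sketch. The helper names are hypothetical, and fetch_iam_info assumes that the instance might require IMDSv2 session tokens, in which case the plain curl request in step 2 is rejected. The instance profile name often matches the role name, but the two can differ.

```shell
#!/bin/sh
# Read the iam/info JSON on stdin and print the instance profile name,
# that is, the part of InstanceProfileArn after "instance-profile/".
# profile_name_from_info is a hypothetical helper name.
profile_name_from_info() {
  sed -n 's/.*"InstanceProfileArn"[[:space:]]*:[[:space:]]*"[^"]*instance-profile\/\([^"]*\)".*/\1/p'
}

# Fetch iam/info with an IMDSv2 session token, for instances that reject
# requests without a token.
fetch_iam_info() {
  token=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
  curl -s -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/meta-data/iam/info"
}
```

On the instance, fetch_iam_info | profile_name_from_info prints the profile name, for example ecsInstanceRole for the output shown in step 2.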
Verify that your container instance has enough resources to run the ECS agent
If your tasks have high memory or CPU utilization, then your container instance might not have enough resources left to run the ECS agent.
The Amazon ECS container agent uses the Docker ReadMemInfo() function to query the amount of memory that's available for the operating system.
Run the following command on your container instance to view the total memory that's recognized by the operating system:
free -b
Example output for a t2.large instance running the Amazon ECS-optimized Amazon Linux AMI:
              total        used        free      shared  buff/cache   available
Mem:     8361193472   298577920  7325388800      405504   737226752  7844274176
Swap:             0           0           0
You can choose to reserve some memory for the Amazon ECS container agent and other critical system processes on your container instances, so that your task's containers don't contend for the same memory. For more information, see Container instance memory management.
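To get a quick sense of memory headroom, you can turn the free -b output into a single percentage. The following is a minimal sketch, assuming the column layout shown in the example output, where "available" is the last column of the Mem: row. The helper name mem_available_pct is hypothetical.

```shell
#!/bin/sh
# Read `free -b` output on stdin and print the share of memory that is still
# available as a whole-number percentage: column 7 (available) of the "Mem:"
# row divided by column 2 (total), truncated toward zero.
# mem_available_pct is a hypothetical helper name.
mem_available_pct() {
  awk '/^Mem:/ { printf "%d\n", ($7 * 100) / $2 }'
}
```

On the instance, free -b | mem_available_pct prints the percentage. For the example output above, the result is 93.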
Verify that the environment variable ECS_CLUSTER has the correct cluster name
If the Amazon ECS container agent configuration parameter ECS_CLUSTER has the incorrect cluster name, then the container instance can't join the cluster. Check the contents of the /etc/ecs/ecs.config file to verify this parameter.
cat /etc/ecs/ecs.config
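To compare the configured value against the expected cluster name without reading the whole file, you can print only the ECS_CLUSTER entry. The following is a minimal sketch, assuming the file uses plain KEY=value lines. The helper name configured_cluster is hypothetical.

```shell
#!/bin/sh
# Print the value assigned to ECS_CLUSTER in an agent configuration file.
# If the key appears more than once, this helper prints the last occurrence.
# configured_cluster is a hypothetical helper name; pass it the file path.
configured_cluster() {
  sed -n 's/^[[:space:]]*ECS_CLUSTER=//p' "$1" | tail -n 1
}
```

For example, configured_cluster /etc/ecs/ecs.config prints the configured cluster name, which you can compare with the name shown in the Amazon ECS console. If the agent is running, its local introspection endpoint (curl -s http://localhost:51678/v1/metadata) also reports the cluster that the agent registered with.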
Verify that the ECS agent can communicate with ECS endpoints
Be sure that the network access control lists and security group used by the container instance allow outbound connections on port 443 (HTTPS) to connect with ECS endpoints.
Run either of the following commands on your container instance to check the outbound connections to ECS endpoints (ACS/TCS). Replace region with the AWS Region of your cluster:
sudo yum install telnet -y
telnet ecs.region.amazonaws.com 443
-or-
curl https://ecs.region.amazonaws.com
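The curl check above can wait a long time when outbound traffic is silently dropped, so a timeout is useful. The following is a minimal sketch, assuming the standard amazonaws.com domain used in the commands above. The helper names ecs_endpoint and check_ecs_endpoint are hypothetical.

```shell
#!/bin/sh
# Build the regional ECS endpoint host name from a Region code.
# Assumes the amazonaws.com domain shown in the commands above.
# ecs_endpoint and check_ecs_endpoint are hypothetical helper names.
ecs_endpoint() {
  echo "ecs.$1.amazonaws.com"
}

# Send an HTTPS request with a 5-second connection timeout. Any HTTP status
# code means that the TCP and TLS connection on port 443 worked; a curl
# error means that it did not.
check_ecs_endpoint() {
  host=$(ecs_endpoint "$1")
  if code=$(curl -s -o /dev/null --connect-timeout 5 -w '%{http_code}' "https://$host"); then
    echo "reachable: $host (HTTP $code)"
  else
    echo "unreachable: $host"
    return 1
  fi
}
```

For example, check_ecs_endpoint us-east-1 tests the endpoint for that Region. An "unreachable" result points to security group, network ACL, route table, or DNS settings rather than to the agent itself.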
The following are some of the best practices to keep in mind:
- Use the Amazon ECS-optimized AMI for your container instances unless your application requires a specific operating system or a Docker version that's not yet available in that AMI to run your ECS workloads.
- When possible, use the latest version of the Amazon ECS container agent. The latest version has enhanced features and includes bug fixes for issues in previous versions.
- Configure tasks with CPU and memory limits.