Hello Dan Turner,
Thank you for bringing this ECS agent issue to our attention. Based on your description and recent reports from other users, there are several potential factors to consider:
- DOCKER_HOST Configuration: Explicitly specifying DOCKER_HOST in the ECS agent configuration (even with the default value) can cause issues with cgroup management on AL2023. Try the following (see the check commands after this list):
  - Remove any DOCKER_HOST specification from /etc/ecs/ecs.config.
  - Ensure it's not set in your user data script either.
  - After removing it, restart the ECS agent: sudo systemctl restart ecs
- Container Image Version: Another customer experiencing a similar issue found that a faulty container image version was the root cause. They resolved the problem by reverting to an older version of their container image. Consider the following:
  - Review your task definitions and the container images they use.
  - Try rolling back to a previous version of your container images, especially if the issue started after updating them.
  - Ensure your container images are compatible with the ECS environment and don't conflict with cgroup management.
- ECS Agent Version: The update from ECS agent version 1.89.1 to 1.89.2 included a change that updated the containerd/cgroups library from v3.0.2 to v3.0.4. This change might be contributing to the issue you're experiencing. You could try the following:
  - Roll back to ECS agent version 1.89.1 if it was working correctly for you before.
  - Wait for a newer version of the ECS agent that might address this issue.
- AMI Version: If the issue persists, you might want to temporarily roll back to the previous AMI (al2023-ami-ecs-hvm-2023.0.20241213-kernel-6.1-x86_64) that was working correctly for you.
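For the DOCKER_HOST and agent-version checks above, here's a minimal sketch of what to run on an affected instance; the paths and the introspection endpoint are the defaults on the ECS-optimized AL2023 AMI, so adjust them if you've customized anything:

```
# Confirm DOCKER_HOST is not set in the agent config or in cloud-init user data
grep -n 'DOCKER_HOST' /etc/ecs/ecs.config || echo "not set in ecs.config"
sudo grep -rn 'DOCKER_HOST' /var/lib/cloud/instance/ 2>/dev/null || echo "not found in user data"

# After removing any DOCKER_HOST entries, restart the agent
sudo systemctl restart ecs

# Confirm which agent version is actually running (agent introspection endpoint)
curl -s http://localhost:51678/v1/metadata
```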
Additional Troubleshooting:
- The ECS Logs Collector can gather the relevant agent and system logs into one archive so you can check for additional errors (usage sketched below).
- Ensure that your task definitions don't have any configurations that might conflict with cgroup management.
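If you go the logs route, the ECS Logs Collector is typically fetched from the awslabs/ecs-logs-collector repository and run directly on the affected instance; review the script before running it:

```
# Download and run the ECS Logs Collector to bundle agent, Docker, and system logs
curl -O https://raw.githubusercontent.com/awslabs/ecs-logs-collector/master/ecs-logs-collector.sh
sudo bash ./ecs-logs-collector.sh
# It writes a tarball in the working directory that you can attach to a support case
```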
If these steps don't resolve your issue, please provide more details about your ECS task definitions, container images, instance types, and any custom configurations you might be using. This will help in further troubleshooting.
Lastly, it would be helpful to open a support case with AWS if you haven't already. They can provide more specific guidance based on your account and configuration details, and may have additional insights into this issue if it's affecting multiple customers.
Based on the information provided, it appears you're experiencing a specific issue with Amazon ECS (Elastic Container Service) related to cgroup errors when trying to launch new tasks. This problem seems to have started after updating to a newer AMI and ECS agent version.
The error "CgroupError: Agent could not create task's platform resources" suggests that there's an issue with the ECS agent's ability to create and manage cgroups (control groups) for your tasks. The additional error message you're seeing about "cgroupv2 create: unable to create v2 manager" indicates that this is specifically related to cgroup v2.
The fact that tasks can launch at boot but the instance eventually "dies" (unable to launch new tasks) while existing tasks continue to run is interesting. This behavior suggests that there might be a resource leak or a gradual degradation of the ECS agent's ability to manage cgroups over time.
The difference in behavior between your production instances (running ECS agent 1.89.1 and an older AMI) and the problematic instances (running ECS agent 1.89.2 and a newer AMI) suggests that this issue might be related to changes in either the ECS agent or the underlying AMI.
Given that the problem started occurring after switching to the new AMI and agent version, it's possible that there's a compatibility issue or a bug in the newer versions. The error about connecting to the systemd socket (/run/systemd/private) is particularly interesting, as it suggests there might be an issue with how the ECS agent is interacting with systemd.
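A couple of quick checks on an affected instance can confirm the cgroup v2 and systemd details those errors point at; these are standard Linux/systemd commands, and the log path is the default on the ECS-optimized AMI:

```
# Confirm the host is on the unified cgroup v2 hierarchy (the AL2023 default)
stat -fc %T /sys/fs/cgroup          # expect "cgroup2fs"

# Check that the systemd private socket from the agent error is present
ls -l /run/systemd/private

# Check agent health and recent errors
sudo systemctl status ecs --no-pager
sudo tail -n 50 /var/log/ecs/ecs-agent.log
```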
To address this issue:
- Consider rolling back to the previous AMI and ECS agent version that were working correctly (1.89.1 and the older AMI) as a temporary measure.
- Check for any recent updates or known issues with ECS agent version 1.89.2 that might be related to cgroup management or systemd interaction.
- Verify that your instance has the necessary permissions and configurations to manage cgroups and interact with systemd.
- Monitor resource usage on the instances to see if there's any correlation between resource consumption and the onset of the issue (see the monitoring sketch after this list).
- Consider reaching out to AWS Support for further assistance, as they may have more insight into potential bugs or compatibility issues with the specific AMI and ECS agent versions you're using.
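On the monitoring point, here is a rough sketch of what to capture periodically on an affected instance to spot a gradual cgroup or resource leak; it uses generic systemd tooling and assumes nothing ECS-specific beyond the ecs service unit:

```
# Snapshot the cgroup tree and per-cgroup usage; re-run periodically and compare
systemd-cgls --no-pager | head -n 60
systemd-cgtop -n 1 --order=memory

# A steadily growing unit count can indicate leaking scopes/slices
systemctl list-units --type=scope,slice --no-pager | wc -l

# Correlate with agent-side cgroup errors around the time new task launches start failing
sudo journalctl -u ecs --since "2 hours ago" --no-pager | grep -i cgroup
```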
It's possible that this is a bug in the newer ECS agent or AMI, especially given the timing of when the issue started occurring. However, without more widespread reports of similar issues, it's difficult to definitively say if this is a general bug or a specific configuration issue in your environment.
Sources
Troubleshooting Amazon ECS ContainerRuntimeError errors - Amazon Elastic Container Service
Troubleshooting Amazon ECS InternalError errors - Amazon Elastic Container Service
Howdy!
We've not set DOCKER_HOST in any user data script or in /etc/ecs/ecs.config.

The container image version was working prior to the AMI upgrade (≈Sun Dec 22); we haven't changed the container image -- didn't even rebuild it, just redeployed the known-safe version from ECR. Sister teams have also started reporting the same issue with their ECS/EC2-based services. It also affects services in the cluster that have not had a new version built in >3 weeks.
I rolled back the ECS agent version and rebooted one affected instance yesterday. Unfortunately, the same problem came back after a few hours. ECS logs confirm it's running 1.89.1.
We're going to roll back the AMI image ID to see if that fixes it, roughly as sketched below. Thanks for your help :-)
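For reference, the rollback we have in mind looks roughly like this; the launch template ID, base version, and Auto Scaling group name are placeholders:

```
# Find the known-good AMI ID by name (name taken from the answer above)
aws ec2 describe-images --owners amazon \
  --filters "Name=name,Values=al2023-ami-ecs-hvm-2023.0.20241213-kernel-6.1-x86_64" \
  --query 'Images[0].ImageId' --output text

# Pin that AMI in a new launch template version (IDs and version below are placeholders)
aws ec2 create-launch-template-version \
  --launch-template-id lt-0123456789abcdef0 \
  --source-version 1 \
  --launch-template-data '{"ImageId":"ami-0123456789abcdef0"}'

# Roll the instances (assumes the ASG tracks the template's latest version)
aws autoscaling start-instance-refresh --auto-scaling-group-name my-ecs-asg
```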