ECS service has stopped 1 running tasks, but without a specific reason

Hey there.

On June 20th, a service task stopped and a new one started, but we could not find the cause. This happened in AWS ECS in us-east-2, backed by two EC2 instances. (It is not a Fargate cluster.)

  • there were no deployments.
  • there were no auto-scaling operations.
  • EC2 instances weren't rebooted.
  • Docker was running and hadn't been restarted in the last 20 days.
  • CloudTrail doesn't show anything related to this service around that time.

By the time I checked today, the "Tasks" tab showed only the new running task; no older stopped tasks appeared. The application logs show nothing unusual apart from the new instance starting.

Under the service's Events tab we can see the moment the task stopped, but there are no further details about the cause.

2024-06-20T20:06:52.081Z service ... has reached a steady state.
2024-06-20T20:06:42.224Z service ... has stopped 1 running tasks: task 9f429fd7a19c88ae18f4ce2546d48bb.
2024-06-20T17:38:18.923Z service ... has reached a steady state.
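When auditing many services for stops like this, the stopped-task ID can be pulled straight out of these event messages. A minimal sketch in Python, assuming the exact message format shown above:

```python
import re

# Sample service event message, in the format shown above.
event = ("service ... has stopped 1 running tasks: "
         "task 9f429fd7a19c88ae18f4ce2546d48bb.")

# "has stopped N running tasks: task <id>." -- capture the hex task ID.
match = re.search(r"has stopped \d+ running tasks?: task ([0-9a-f]+)\.", event)
task_id = match.group(1) if match else None
print(task_id)  # 9f429fd7a19c88ae18f4ce2546d48bb
```

With the ID in hand, `aws ecs describe-tasks` returns a `stoppedReason` and `stopCode`, but only while ECS still retains the stopped task, which is why checking promptly matters.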

We SSH'd into one of the EC2 instances and grepped the ECS agent logs; this is what we found:

level=info time=2024-06-20T20:03:10Z msg="Connected to TCS endpoint"
level=info time=2024-06-20T20:06:43Z msg="Received task payload from ACS" taskARN="arn:aws:ecs:us-east-2:id:task/cluster/9f429fd7a19c88ae18f4ce2546d48bb" taskVersion="35" desiredStatus=STOPPED
level=info time=2024-06-20T20:06:43Z msg="Managed task got acs event" task="9f429fd7a19c88ae18f4ce2546d48bb" desiredStatus="STOPPED" seqnum=0
level=info time=2024-06-20T20:06:43Z msg="New acs transition" task="9f429fd7a19c88ae18f4ce2546d48bb" desiredStatus="STOPPED" seqnum=0
level=info time=2024-06-20T20:06:43Z msg="Stopping container" task="9f429fd7a19c88ae18f4ce2546d48bb" container="insert"
level=info time=2024-06-20T20:06:43Z msg="Managed task got resource" task="9f429fd7a19c88ae18f4ce2546d48bb" resource="cgroup" status="REMOVED"
level=info time=2024-06-20T20:07:02Z msg="Received task payload from ACS" taskARN="arn:aws:ecs:us-east-2:id:task/cluster/9f429fd7a19c88ae18f4ce2546d48bb" taskVersion="35" desiredStatus=STOPPED
level=info time=2024-06-20T20:07:02Z msg="Managed task got acs event" task="9f429fd7a19c88ae18f4ce2546d48bb" desiredStatus="STOPPED" seqnum=0
level=info time=2024-06-20T20:07:02Z msg="New acs transition" task="9f429fd7a19c88ae18f4ce2546d48bb" desiredStatus="STOPPED" seqnum=0
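The agent log lines above are structured key=value pairs, so they can be turned into a per-task timeline rather than eyeballed. A minimal parsing sketch, assuming the line format shown above:

```python
import re

# key=value pairs; values are either a double-quoted string or a bare token.
LINE_RE = re.compile(r'(\w+)=(".*?"|\S+)')

def parse_agent_line(line):
    """Parse one ECS agent log line of key=value pairs into a dict."""
    return {k: v.strip('"') for k, v in LINE_RE.findall(line)}

line = ('level=info time=2024-06-20T20:06:43Z msg="Stopping container" '
        'task="9f429fd7a19c88ae18f4ce2546d48bb" container="insert"')
fields = parse_agent_line(line)
print(fields["time"], fields["msg"], fields["task"])
```

Filtering the parsed dicts by `task` and sorting by `time` gives a clean transition history for the stopped task.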

We also found the following in the messages-* logs, which made us think something might be wrong with the virtual interfaces.

Jun 20 20:02:56 ip-10-0-40-79 dhclient[790]: XMT: Solicit on eth0, interval 114240ms.
Jun 20 20:04:50 ip-10-0-40-79 dhclient[790]: XMT: Solicit on eth0, interval 130370ms.
Jun 20 20:07:01 ip-10-0-40-79 dhclient[790]: XMT: Solicit on eth0, interval 108460ms.
Jun 20 20:07:03 ip-10-0-40-79 kernel: docker0: port 1(veth2651025) entered blocking state
Jun 20 20:07:03 ip-10-0-40-79 kernel: docker0: port 1(veth2651025) entered disabled state
Jun 20 20:07:03 ip-10-0-40-79 kernel: device veth2651025 entered promiscuous mode
Jun 20 20:07:03 ip-10-0-40-79 kernel: IPv6: ADDRCONF(NETDEV_UP): veth2651025: link is not ready
Jun 20 20:07:03 ip-10-0-40-79 dockerd: time="2024-06-20T20:07:03.403036749Z" level=info msg="Configured log driver does not support reads, enabling local file cache for container logs" container=7f4e3a48d6008bb128cb348cebaeac2367864e78e7eb6a6bd75218e67a7e6af9 driver=awslogs
Jun 20 20:07:03 ip-10-0-40-79 containerd: time="2024-06-20T20:07:03.412707748Z" level=info msg="loading plugin \"io.containerd.internal.v1.shutdown\"..." runtime=io.containerd.runc.v2 type=io.containerd.internal.v1
Jun 20 20:07:03 ip-10-0-40-79 containerd: time="2024-06-20T20:07:03.412776531Z" level=info msg="loading plugin \"io.containerd.ttrpc.v1.pause\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Jun 20 20:07:03 ip-10-0-40-79 containerd: time="2024-06-20T20:07:03.412797594Z" level=info msg="loading plugin \"io.containerd.event.v1.publisher\"..." runtime=io.containerd.runc.v2 type=io.containerd.event.v1
Jun 20 20:07:03 ip-10-0-40-79 containerd: time="2024-06-20T20:07:03.412813717Z" level=info msg="loading plugin \"io.containerd.ttrpc.v1.task\"..." runtime=io.containerd.runc.v2 type=io.containerd.ttrpc.v1
Jun 20 20:07:03 ip-10-0-40-79 kernel: eth0: renamed from vethc164f34
Jun 20 20:07:03 ip-10-0-40-79 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth2651025: link becomes ready
Jun 20 20:07:03 ip-10-0-40-79 kernel: docker0: port 1(veth2651025) entered blocking state
Jun 20 20:07:03 ip-10-0-40-79 kernel: docker0: port 1(veth2651025) entered forwarding state
Jun 20 20:07:13 ip-10-0-40-79 dockerd: time="2024-06-20T20:07:13.532232981Z" level=info msg="Container failed to exit within 30s of signal 15 - using the force" container=e4756f5844a8fa6bcf8a5e916ef75e52add4b82822a913d8cf31de7e5e5b0afb
Jun 20 20:07:13 ip-10-0-40-79 dockerd: time="2024-06-20T20:07:13.651243993Z" level=info msg="ignoring event" container=e4756f5844a8fa6bcf8a5e916ef75e52add4b82822a913d8cf31de7e5e5b0afb module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
Jun 20 20:07:13 ip-10-0-40-79 containerd: time="2024-06-20T20:07:13.651360791Z" level=info msg="shim disconnected" id=e4756f5844a8fa6bcf8a5e916ef75e52add4b82822a913d8cf31de7e5e5b0afb namespace=moby
Jun 20 20:07:13 ip-10-0-40-79 containerd: time="2024-06-20T20:07:13.651416307Z" level=warning msg="cleaning up after shim disconnected" id=e4756f5844a8fa6bcf8a5e916ef75e52add4b82822a913d8cf31de7e5e5b0afb namespace=moby
Jun 20 20:07:13 ip-10-0-40-79 containerd: time="2024-06-20T20:07:13.651425423Z" level=info msg="cleaning up dead shim" namespace=moby
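One detail worth noting: the agent's "Stopping container" entry (20:06:43) and dockerd's forced kill (20:07:13) are exactly 30 seconds apart, matching the "failed to exit within 30s of signal 15" message. That is the normal SIGTERM, grace period, SIGKILL sequence, which suggests an orchestrated stop rather than a container crash. Checking the gap from the timestamps above:

```python
from datetime import datetime, timezone

# Timestamps from the logs above: agent sends SIGTERM, dockerd forces SIGKILL.
sigterm = datetime(2024, 6, 20, 20, 6, 43, tzinfo=timezone.utc)  # "Stopping container"
sigkill = datetime(2024, 6, 20, 20, 7, 13, tzinfo=timezone.utc)  # "using the force"

grace = (sigkill - sigterm).total_seconds()
print(grace)  # 30.0 -- matches ECS's default 30-second stop timeout
```

The 30-second grace period is the ECS default (configurable per container via `stopTimeout`), so the sequence itself is unremarkable; the open question is why ACS requested the stop at all.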

A quick Google search for what "Received task payload from ACS" means turned up almost nothing; I'd like to learn about this and understand what happened.

Any input will be appreciated, thanks!

2 Answers

Hi,

Please try the following steps; hopefully they help you narrow down the cause.

To resolve the issue of an unexpected ECS task stop:

Step 1: Check ECS Service Events

Go to the ECS Console:

Navigate to your cluster and then to the specific service.

Check Events Tab:

Look for any events around the time the task stopped for any specific error messages or reasons.

Step 2: Review ECS Agent Logs

SSH into the EC2 instances:

Locate ECS agent logs in /var/log/ecs/ecs-agent.log*.

Check Logs:

Look for log entries around the time of the incident for any errors or warnings.

Step 3: Monitor Resource Utilization

CloudWatch Metrics:

Open CloudWatch and check CPU and memory usage for the tasks and EC2 instances around the incident time to ensure there were no resource shortages.

Step 4: Investigate Network Issues

Review EC2 Network Logs:

Check /var/log/messages or /var/log/syslog for network-related entries, especially DHCP solicitations or network interface state changes.

Step 5: Review Task Definition and ECS Configuration

Task Definition:

Ensure health checks and task timeouts are correctly configured.

Service Configuration:

Verify there are no misconfigurations in the deployment or health check settings.
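To illustrate step 5, a container-level health check in a task definition looks like the fragment below. This is a sketch with placeholder command and thresholds, not your actual configuration; a failing container health check will cause ECS to stop and replace a task even with no deployment or scaling activity:

```json
{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60
  },
  "stopTimeout": 30
}
```

`stopTimeout` controls how long ECS waits after SIGTERM before force-killing the container; the 30-second default matches the "failed to exit within 30s" dockerd message in your logs.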

Step 6: Use CloudWatch Logs Insights

Go to CloudWatch Logs:

Navigate to your log group for the ECS service.

Run Insights Query:

Set the query's time range to 2024-06-20 20:00 to 20:10 UTC using the console's time picker (Logs Insights takes the time window from the picker, not from the filter expression), then run:

fields @timestamp, @message
| sort @timestamp desc
| limit 100

For more information, please see the AWS documentation:

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/stopped-task-errors.html

https://repost.aws/knowledge-center/ecs-tasks-container-exit-issues

EXPERT
answered 2 years ago
EXPERT
reviewed 2 years ago
  • Hi,

    1. there are no related events in the Events tab.
    2. the relevant logs were posted in my original post.
    3. there were no spikes in resource usage, not in the node or container.
    4. the relevant logs related to networking were posted in my original post.
    5. the task definition(s) are ok.
    6. CloudWatch Logs Insights doesn't show anything besides the logs I posted.

    thanks!

Known factors:

A service task (ID: 9f429fd7a19c88ae18f4ce2546d48bb) stopped on June 20th at 20:06:42 UTC. No deployments, scaling operations, reboots, or Docker restarts occurred around that time. CloudTrail logs show no relevant activity. Application logs appear normal for the newly running task.

ECS event logs:

The ECS agent logs show the task transitioning to a STOPPED state at the request of ACS (the ECS Agent Communication Service, the control-plane endpoint from which the agent receives instructions).

EC2 instance logs (messages-*):

The logs show DHCP client (dhclient) activity around the time of the task stop, suggesting a potential network renewal attempt. Docker logs show the container failing to exit gracefully within 30 seconds of receiving signal 15 (SIGTERM).

Key takeaways:

The lack of usual triggers for a task stop suggests an external factor might be at play. The network renewal attempt by the DHCP client and the docker container failing to exit gracefully raise suspicion of a network connectivity issue around the time of the stop.

Possible causes:

  • Network connectivity issue: a temporary network disruption on the EC2 instance could have caused the container to become unresponsive, leading ECS to stop the task. This aligns with the observed DHCP renewal attempt and the container's termination behavior.
  • Resource exhaustion: though less likely without evidence, memory or CPU constraints could have caused the container to crash and the task to stop.

Recommendations:

  • Investigate network logs on the EC2 instance for anomalies around June 20th, 20:06 UTC, such as errors or dropped packets.
  • Consider enabling ECS service logs to capture detailed information about task failures in the future; these logs can be found in CloudWatch.
  • If network connectivity issues are suspected, review network configurations and ensure proper communication between the container and external services.
  • If resource exhaustion is a concern, monitor resource utilization on the EC2 instances and consider scaling if necessary.
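For the first recommendation, here is a minimal sketch of scanning a syslog-format file for network-interface events inside a time window. The helper name and keyword list are illustrative, and since syslog timestamps omit the year, it must be supplied:

```python
from datetime import datetime

# Substrings that indicate network-interface activity worth reviewing.
KEYWORDS = ("entered disabled state", "link is not ready", "dhclient", "veth")

def interface_events(lines, start, end, year=2024):
    """Yield syslog lines whose timestamp falls in [start, end] and that
    mention network-interface activity."""
    for line in lines:
        try:
            # Syslog prefix is fixed-width: "Jun 20 20:07:03" (15 chars).
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a syslog-format line
        if start <= ts <= end and any(k in line for k in KEYWORDS):
            yield line

log = [
    "Jun 20 20:04:50 ip-10-0-40-79 dhclient[790]: XMT: Solicit on eth0, interval 130370ms.",
    "Jun 20 20:07:03 ip-10-0-40-79 kernel: docker0: port 1(veth2651025) entered disabled state",
]
window = (datetime(2024, 6, 20, 20, 6, 0), datetime(2024, 6, 20, 20, 8, 0))
for hit in interface_events(log, *window):
    print(hit)
```

Running this over /var/log/messages with a window bracketing 20:06 UTC would surface exactly the veth and dhclient entries quoted in the question.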

EXPERT
answered 2 years ago
