(Fargate) ExecuteCommandAgent transitions from RUNNING to STOPPED


Hi, I recently followed all the guidance to enable ECS ExecuteCommand access for my containers (excellent feature).

I followed https://aws.amazon.com/blogs/containers/new-using-amazon-ecs-exec-access-your-containers-fargate-ec2/, deployed the new infrastructure, and I was able to connect to my ECS Fargate task. Success! Very exciting.

I went back today to troubleshoot a problem in our beta environment, and I keep getting:

"An error occurred (InvalidParameterException) when calling the ExecuteCommand operation: The execute command failed because execute command was not enabled when the task was run or the execute command agent isn’t running. Wait and try again or run a new task with execute command enabled and try again."

My IAM permissions haven't changed and the cluster/service/task configuration hasn't changed. I double-checked by re-running Terraform to ensure the plans were the same (they are - no diff between what's in the account and what's in the templates).

I spun up a new task and described the state of the task:

$ aws --region us-west-2 --profile <redacted> ecs describe-tasks --cluster <redacted> --tasks 1fe5bd64db4d428794fa0b956c1efda6 | jq '.tasks[0] | {"managedAgents": .containers[0].managedAgents, "enableExecuteCommand": .enableExecuteCommand}'
{
  "managedAgents": [
    {
      "lastStartedAt": "2021-04-07T09:10:01.885000-04:00",
      "name": "ExecuteCommandAgent",
      "lastStatus": "RUNNING"
    }
  ],
  "enableExecuteCommand": true
}

I then tried to connect:

aws --region us-west-2 --profile <redacted> ecs execute-command --cluster <redacted> \
  --task 1fe5bd64db4d428794fa0b956c1efda6 \
  --container <redacted> \
  --interactive \
  --command "/bin/bash"

I got the error pasted above, then checked the status again:

$ aws --region us-west-2 --profile <redacted> ecs describe-tasks --cluster <redacted> --tasks 1fe5bd64db4d428794fa0b956c1efda6 | jq '.tasks[0] | {"managedAgents": .containers[0].managedAgents, "enableExecuteCommand": .enableExecuteCommand}'
{
  "managedAgents": [
    {
      "name": "ExecuteCommandAgent",
      "lastStatus": "STOPPED"
    }
  ],
  "enableExecuteCommand": true
}

There's ample RAM available. I'm not sure why it stopped working when nothing has changed, and I don't have any additional tools to troubleshoot this. Worse, it's happening in our production environment too.

Any suggestions?

nscott
asked 3 years ago · 5,162 views
10 Answers

Hi,

One of the AWS Containers Developer Advocates put together this handy tool to check whether your environment is fully set up for ECS Exec:

https://github.com/aws-containers/amazon-ecs-exec-checker

Would you mind trying that and seeing if it reveals anything interesting?

Thanks!

/Mats

AWS
Mats
answered 3 years ago

+1 for the ecs-exec-checker tool mentioned by Mats; it's helpful in most cases.

Looking at the logs we have in ECS and Fargate, the backend and agent look good. The issue instead is that the ExecuteCommand agent is not able to start up inside the container, which usually happens in the following scenarios (note that ECS Exec is built on top of the SSM agent):

  1. The task uses PrivateLink and does not have the required SSM endpoints; see https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html
  2. The containers/tasks reuse port mappings across multiple containers or tasks, especially when the tasks are configured to use the same App Mesh.
  3. Your container filesystem is read-only, so the SSM agent is not able to create the required files and folders.
  4. Only one SSM agent can run inside a container. If you already have an SSM agent built into your image and started as part of the container launch, the SSM agent started for ECS Exec will always fail.

There might be more corner cases we don't know of that could cause this issue. If you can surface the log at /var/log/amazon/ssm/amazon-ssm-agent.log inside your container as part of your container logs, we can take a deeper look.

Edited by: xue-aws on Apr 7, 2021 9:22 PM

AWS
answered 3 years ago
  • I have a similar problem, but I'm using read-only filesystems. What directories need to be writable so I can map them to tmpfs?
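
For what it's worth, the two agent paths that appear elsewhere in this thread are /var/log/amazon/ssm (the agent log) and /var/lib/amazon/ssm (the agent's IPC sockets). Below is a minimal sketch of keeping the root filesystem read-only while mounting writable ephemeral volumes over just those paths, assuming those are the only paths the agent needs to write to (the exact set may vary by agent version). Note that the tmpfs setting under linuxParameters isn't supported on Fargate, so plain task volumes are used instead:

"volumes": [
  { "name": "ssm-lib" },
  { "name": "ssm-log" }
],
"containerDefinitions": [
  {
    "name": "<redacted>",
    "readonlyRootFilesystem": true,
    "mountPoints": [
      { "sourceVolume": "ssm-lib", "containerPath": "/var/lib/amazon/ssm", "readOnly": false },
      { "sourceVolume": "ssm-log", "containerPath": "/var/log/amazon/ssm", "readOnly": false }
    ]
  }
]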


That's a great tool! I may have missed it in the official docs; it should definitely be added if it isn't in there.

Anyway, here's my output:

~ ⌚ 9:47:49
$ bash <( curl -Ls https://raw.githubusercontent.com/aws-containers/amazon-ecs-exec-checker/main/check-ecs-exec.sh ) <redacted> f0cd742fafca4430bdc1fc1fc2939d18

Prerequisites for check-ecs-exec.sh

jq | OK (/usr/local/bin/jq)
AWS CLI | OK (/usr/local/bin/aws)


Prerequisites for the AWS CLI to use ECS Exec

AWS CLI Version | OK (aws-cli/2.1.32 Python/3.9.2 Darwin/19.6.0 source/x86_64 prompt/off)
Session Manager Plugin | OK (1.2.54.0)


Configurations for ECS task and other resources

Region : us-west-2
Cluster: <redacted>
Task : f0cd742fafca4430bdc1fc1fc2939d18

Cluster Configuration | Audit Logging Not Configured
Can I ExecuteCommand? | arn:aws:iam::<redacted>
ecs:ExecuteCommand: allowed
ssm:StartSession denied?: allowed
Launch Type | Fargate
Platform Version | 1.4.0
Exec Enabled for Task | OK
Managed Agent Status |
1. STOPPED (Reason: null) for "<redacted>" container
Task Role Permissions | arn:aws:iam::<redacted>
ssmmessages:CreateControlChannel: allowed
ssmmessages:CreateDataChannel: allowed
ssmmessages:OpenControlChannel: allowed
ssmmessages:OpenDataChannel: allowed
VPC Endpoints | SKIPPED (vpc-<redacted> - No additional VPC endpoints required)

Unfortunately it looks like everything is okay but the agent is stopped.

nscott
answered 3 years ago

xue-aws wrote:

  1. The task uses PrivateLink and does not have the required SSM endpoints; see https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html
  2. The containers/tasks reuse port mappings across multiple containers or tasks, especially when the tasks are configured to use the same App Mesh.
  3. Your container filesystem is read-only, so the SSM agent is not able to create the required files and folders.
  4. Only one SSM agent can run inside a container. If you already have an SSM agent built into your image and started as part of the container launch, the SSM agent started for ECS Exec will always fail.

Taking those point by point:

  1. We aren't using PrivateLink yet. The tool confirmed this as well.
  2. Interesting. We're only running a single container in the task, so I don't think there should be any overlapping ports.
  3. I didn't set the container filesystem to read-only; I would be surprised if our server worked in that case, since we write logs to disk as well as to stdout.
  4. Hmmm, we're using an image that shouldn't have the SSM agent in it, but this is an interesting lead.

I'll try to get some logs!

nscott
answered 3 years ago

I didn't spend time pulling the logs (yet) and instead took a different approach. I'm not 100% clear on exactly what initProcessEnabled does (I'm not a Docker expert) other than reaping the child processes a user spawns when they call execute-command. Maybe that's the whole point?

Anyway, I modified our task definition and set initProcessEnabled to true. Our environments can now run execute-command again. Hooray!
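
For reference, that flag lives under linuxParameters in the container definition; the change was roughly this fragment (everything else in the definition omitted):

"containerDefinitions": [
  {
    "name": "<redacted>",
    "linuxParameters": {
      "initProcessEnabled": true
    }
  }
]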

Maybe there's something in the non-initProcessEnabled code path that causes the initial bug. I'm not convinced this change should have done anything meaningful to allow the agent to work. I still think it's very strange that it was reported as "RUNNING" (i.e. it started successfully) and only transitions to "STOPPED" when I try to connect.

I was unable to pull logs since I didn't have access to the containers. I figured I would have to add a script to tail/dump the logs when the container starts, but wanted to try the simpler infrastructure change first.
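
(A minimal sketch of that tail/dump approach, for anyone who does need those logs: a wrapper entrypoint that streams the agent log to stdout so it lands in the container's normal log driver. It assumes the image has a shell and tail, and /path/to/your-app is a placeholder for the real command.)

#!/bin/sh
# entrypoint.sh (sketch): stream the ECS Exec / SSM agent log to stdout, then run the app.
mkdir -p /var/log/amazon/ssm
touch /var/log/amazon/ssm/amazon-ssm-agent.log
# -F keeps following the file across rotation/recreation; run it in the background.
tail -F /var/log/amazon/ssm/amazon-ssm-agent.log &
# Replace the shell with the real application so signals are delivered normally.
exec /path/to/your-app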

Thanks for all your help.

nscott
answered 3 years ago

The problem started happening again.

nscott
answered 3 years ago

xue-aws wrote:
+1 for the ecs-exec-checker tool mentioned by Mats; it's helpful in most cases.

Looking at the logs we have in ECS and Fargate, the backend and agent look good. The issue instead is that the ExecuteCommand agent is not able to start up inside the container, which usually happens in the following scenarios (note that ECS Exec is built on top of the SSM agent):

  1. The task uses PrivateLink and does not have the required SSM endpoints; see https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html
  2. The containers/tasks reuse port mappings across multiple containers or tasks, especially when the tasks are configured to use the same App Mesh.
  3. Your container filesystem is read-only, so the SSM agent is not able to create the required files and folders.
  4. Only one SSM agent can run inside a container. If you already have an SSM agent built into your image and started as part of the container launch, the SSM agent started for ECS Exec will always fail.

There might be more corner cases we don't know of that could cause this issue. If you can surface the log at /var/log/amazon/ssm/amazon-ssm-agent.log inside your container as part of your container logs, we can take a deeper look.

Edited by: xue-aws on Apr 7, 2021 9:22 PM

Unfortunately it started happening again. It seems random. I'll try to get some logs this time around.

Edit: Woohoo, success. Here are the logs.

2021-04-16 00:13:59 INFO [ssm-agent-worker] [MessageGatewayService] [EngineProcessor] Starting
2021-04-16 00:13:59 INFO [ssm-agent-worker] [MessageGatewayService] SSM Agent is trying to setup control channel for Session Manager module.
2021-04-16 00:13:59 INFO [ssm-agent-worker] [MessageGatewayService] listening reply.
2021-04-16 00:13:59 INFO [ssm-agent-worker] [MessageGatewayService] agent telemetry cloudwatch metrics disabled
2021-04-16 00:13:59 INFO [ssm-agent-worker] [MessageGatewayService] Setting up websocket for controlchannel for instance: ecs:<redacted>_c864dbc6e9404505ae8a07a0f79b3992_c864dbc6e9404505ae8a07a0f79b3992-2907402, requestId: 424ec1d7-16c9-45a2-ac61-ada12d5640b4
2021-04-16 00:14:00 INFO [ssm-agent-worker] [MessageGatewayService] Opening websocket connection to: wss://ssmmessages.us-west-2.amazonaws.com/v1/control-channel/ecs:<redacted>_c864dbc6e9404505ae8a07a0f79b3992_c864dbc6e9404505ae8a07a0f79b3992-2907402?role=subscribe&stream=input
2021-04-16 00:14:00 INFO [ssm-agent-worker] [MessageGatewayService] Successfully opened websocket connection to: wss://ssmmessages.us-west-2.amazonaws.com/v1/control-channel/ecs:<redacted>_c864dbc6e9404505ae8a07a0f79b3992_c864dbc6e9404505ae8a07a0f79b3992-2907402?role=subscribe&stream=input
2021-04-16 00:14:00 INFO [ssm-agent-worker] [MessageGatewayService] Setting up agent telemetry scheduler
2021-04-16 00:14:00 INFO [ssm-agent-worker] [MessageGatewayService] Starting receiving message from control channel
2021-04-16 00:14:00 INFO [ssm-agent-worker] [MessageGatewayService] [EngineProcessor] Initial processing
2021-04-16 00:14:15 ERROR error occurred when starting amazon-ssm-agent: failed to start message bus, failed to start health channel: failed to listen on the channel: ipc:///var/lib/amazon/ssm/ipc/health, address in use

https://stackoverflow.com/questions/65218749/unable-to-start-the-amazon-ssm-agent-failed-to-start-message-bus is my only lead.

I'm unsure how it could possibly be in use, given it's a brand new container. Looks like this is an existing GitHub issue: https://github.com/aws/amazon-ssm-agent/issues/361
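
One thing that might be worth ruling out (a hedged check, not a confirmed fix): whether the image itself already bundles or starts an SSM agent, which is scenario 4 from the earlier answer and would explain the health channel address already being in use. Locally, assuming Docker and a shell in the image, and with <your-image> as a placeholder:

# Does the image bundle its own SSM agent binary?
docker run --rm --entrypoint sh <your-image> -c 'command -v amazon-ssm-agent || echo "no agent binary on PATH"'

# Does the normal entrypoint start one? Run the image as-is briefly and inspect the process list.
docker run -d --name ssm-check <your-image>
sleep 10 && docker top ssm-check
docker rm -f ssm-check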

Edited by: nscott on Apr 15, 2021 5:15 PM

Edited by: nscott on Apr 15, 2021 5:19 PM

Edited by: nscott on Apr 15, 2021 5:24 PM

nscott
answered 3 years ago

This is happening to me as well. I did everything suggested by the ECS Exec checker and still see this behavior.

answered 3 years ago

I fixed this in my case by removing the read-only filesystem setting in the task definition.
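
In the task definition JSON that corresponds to the readonlyRootFilesystem flag on the container definition; a minimal sketch of turning it off (at the cost of a writable root filesystem), with the container name as a placeholder:

"containerDefinitions": [
  {
    "name": "<redacted>",
    "readonlyRootFilesystem": false
  }
]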

AWS
answered a year ago

I have the same problem as well; amazon-ecs-exec-checker shows only a warning for

Init Process Enabled

@nscott, did you find a solution?

answered 7 months ago
