Hi,
One of the AWS Containers Developer Advocates put together this handy tool to check whether your environment is fully set up for ECS Exec:
https://github.com/aws-containers/amazon-ecs-exec-checker
Would you mind trying that to see if it reveals anything interesting?
Thanks!
/Mats
+1 for the ecs-exec-checker tool mentioned by Mats; it is helpful for most cases.
Looking at the logs we have in ECS & Fargate, the backend and agent look good. The issue instead is that the ExecuteCommand agent is not able to start up inside the containers, which usually happens in the following scenarios (note that ECS Exec is built on top of the SSM agent):
- The task uses PrivateLink and does not have the required SSM endpoints; see https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html
- The containers/tasks reuse port mappings across multiple containers or tasks, especially when the tasks are configured to use the same App Mesh
- Your container filesystem is read-only and the SSM agent is not able to create the required files and folders
- Only one SSM agent can run inside a container. If you already have an SSM agent built into your image and started as part of container launch, the SSM agent started for ECS Exec will always fail
There might be more corner cases we don't know about that could cause this issue. If you can surface the log inside your container at /var/log/amazon/ssm/amazon-ssm-agent.log as part of your container logs, we can take a deeper look.
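One way to surface that log file in the container logs is a small entrypoint wrapper that tails it alongside the main process. This is a sketch, not an official AWS mechanism; the script name and everything except the log path above are hypothetical:

```shell
#!/bin/sh
# entrypoint.sh -- hypothetical wrapper script.
# Streams the SSM agent log to stdout so it reaches the container's log
# driver (e.g. CloudWatch Logs). -F keeps retrying because the file does
# not exist until the ECS Exec agent actually starts.
tail -F /var/log/amazon/ssm/amazon-ssm-agent.log 2>/dev/null &
# Hand off to the container's real command (the image's original CMD).
exec "$@"
```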
Edited by: xue-aws on Apr 7, 2021 9:22 PM
That's a great tool! I may have missed it in the official docs; it should definitely be added if it isn't in there.
Anyway, here's my output:
~ ⌚ 9:47:49
$ bash <( curl -Ls https://raw.githubusercontent.com/aws-containers/amazon-ecs-exec-checker/main/check-ecs-exec.sh ) <redacted> f0cd742fafca4430bdc1fc1fc2939d18
Prerequisites for check-ecs-exec.sh
jq | OK (/usr/local/bin/jq)
AWS CLI | OK (/usr/local/bin/aws)
Prerequisites for the AWS CLI to use ECS Exec
AWS CLI Version | OK (aws-cli/2.1.32 Python/3.9.2 Darwin/19.6.0 source/x86_64 prompt/off)
Session Manager Plugin | OK (1.2.54.0)
Configurations for ECS task and other resources
Region : us-west-2
Cluster: <redacted>
Task : f0cd742fafca4430bdc1fc1fc2939d18
Cluster Configuration | Audit Logging Not Configured
Can I ExecuteCommand? | arn:aws:iam::<redacted>
ecs:ExecuteCommand: allowed
ssm:StartSession denied?: allowed
Launch Type | Fargate
Platform Version | 1.4.0
Exec Enabled for Task | OK
Managed Agent Status |
1. STOPPED (Reason: null) for "<redacted>" container
Task Role Permissions | arn:aws:iam::<redacted>
ssmmessages:CreateControlChannel: allowed
ssmmessages:CreateDataChannel: allowed
ssmmessages:OpenControlChannel: allowed
ssmmessages:OpenDataChannel: allowed
VPC Endpoints | SKIPPED (vpc-<redacted> - No additional VPC endpoints required)
Unfortunately it looks like everything is okay but the agent is stopped.
xue-aws wrote:
- The task uses PrivateLink and does not have the required SSM endpoints; see https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html
- The containers/tasks reuse port mappings across multiple containers or tasks, especially when the tasks are configured to use the same App Mesh
- Your container filesystem is read-only and the SSM agent is not able to create the required files and folders
- Only one SSM agent can run inside a container. If you already have an SSM agent built into your image and started as part of container launch, the SSM agent started for ECS Exec will always fail
- We aren't using PrivateLink yet. The tool confirmed this as well.
- Interesting. We're only running a single container in the task, so I don't think there should be any overlapping ports.
- I didn't set the container filesystem to read-only; I'd be surprised if our server worked in that case, since we write logs to disk as well as to stdout.
- Hmm, we're using an image that shouldn't have the SSM agent in it, but this is an interesting lead.
I'll try to get some logs!
I didn't spend time pulling the logs (yet) but instead took a different approach. I'm not 100% clear on exactly what initProcessEnabled does (not a Docker expert) other than reaping child processes that a user runs when they call execute-command. Maybe that's the whole point?
Anyway, I modified our task definition and set initProcessEnabled to true. Our environments can now run execute-command again. Hooray!
Maybe there's something in the non-initProcessEnabled code path that causes the initial bug. I'm not convinced this change should have done anything meaningful to allow the agent to work. I still find it very strange that the agent was reported as "RUNNING" (i.e. started successfully) and only transitioned to "STOPPED" when I tried to connect.
I was unable to pull logs since I didn't have access to the containers. I figured I would have to add a script to tail/dump the logs when the container starts, but wanted to try the simpler infrastructure change first.
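For anyone else landing here: the change is setting initProcessEnabled under linuxParameters in each container definition. A sketch of one way to apply it with the tools already used in this thread; the file names are placeholders, and it assumes you have the task definition JSON saved locally:

```shell
# Flip initProcessEnabled on for every container in a locally saved task
# definition, then register the result as a new revision.
# (If taskdef.json came from `aws ecs describe-task-definition`, strip
# read-only fields like taskDefinitionArn/revision/status first.)
jq '.containerDefinitions |= map(.linuxParameters.initProcessEnabled = true)' \
  taskdef.json > taskdef-init.json
aws ecs register-task-definition --cli-input-json file://taskdef-init.json
```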
Thanks for all your help.
xue-aws wrote:
+1 for the ecs-exec-checker tool mentioned by Mats; it is helpful for most cases. Looking at the logs we have in ECS & Fargate, the backend and agent look good. The issue instead is that the ExecuteCommand agent is not able to start up inside the containers, which usually happens in the following scenarios (note that ECS Exec is built on top of the SSM agent):
- The task uses PrivateLink and does not have the required SSM endpoints; see https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-getting-started-privatelink.html
- The containers/tasks reuse port mappings across multiple containers or tasks, especially when the tasks are configured to use the same App Mesh
- Your container filesystem is read-only and the SSM agent is not able to create the required files and folders
- Only one SSM agent can run inside a container. If you already have an SSM agent built into your image and started as part of container launch, the SSM agent started for ECS Exec will always fail
There might be more corner cases we don't know about that could cause this issue. If you can surface the log inside your container at /var/log/amazon/ssm/amazon-ssm-agent.log as part of your container logs, we can take a deeper look.
Edited by: xue-aws on Apr 7, 2021 9:22 PM
Unfortunately it started happening again. It seems random. I'll try to get some logs this time around.
Edit: Woohoo, success. Here are the logs.
2021-04-16 00:13:59 INFO [ssm-agent-worker] [MessageGatewayService] [EngineProcessor] Starting
2021-04-16 00:13:59 INFO [ssm-agent-worker] [MessageGatewayService] SSM Agent is trying to setup control channel for Session Manager module.
2021-04-16 00:13:59 INFO [ssm-agent-worker] [MessageGatewayService] listening reply.
2021-04-16 00:13:59 INFO [ssm-agent-worker] [MessageGatewayService] agent telemetry cloudwatch metrics disabled
2021-04-16 00:13:59 INFO [ssm-agent-worker] [MessageGatewayService] Setting up websocket for controlchannel for instance: ecs:<redacted>_c864dbc6e9404505ae8a07a0f79b3992_c864dbc6e9404505ae8a07a0f79b3992-2907402, requestId: 424ec1d7-16c9-45a2-ac61-ada12d5640b4
2021-04-16 00:14:00 INFO [ssm-agent-worker] [MessageGatewayService] Opening websocket connection to: wss://ssmmessages.us-west-2.amazonaws.com/v1/control-channel/ecs:<redacted>_c864dbc6e9404505ae8a07a0f79b3992_c864dbc6e9404505ae8a07a0f79b3992-2907402?role=subscribe&stream=input
2021-04-16 00:14:00 INFO [ssm-agent-worker] [MessageGatewayService] Successfully opened websocket connection to: wss://ssmmessages.us-west-2.amazonaws.com/v1/control-channel/ecs:<redacted>_c864dbc6e9404505ae8a07a0f79b3992_c864dbc6e9404505ae8a07a0f79b3992-2907402?role=subscribe&stream=input
2021-04-16 00:14:00 INFO [ssm-agent-worker] [MessageGatewayService] Setting up agent telemetry scheduler
2021-04-16 00:14:00 INFO [ssm-agent-worker] [MessageGatewayService] Starting receiving message from control channel
2021-04-16 00:14:00 INFO [ssm-agent-worker] [MessageGatewayService] [EngineProcessor] Initial processing
2021-04-16 00:14:15 ERROR error occurred when starting amazon-ssm-agent: failed to start message bus, failed to start health channel: failed to listen on the channel: ipc:///var/lib/amazon/ssm/ipc/health, address in use
My only lead is https://stackoverflow.com/questions/65218749/unable-to-start-the-amazon-ssm-agent-failed-to-start-message-bus
I'm unsure how the channel could possibly be in use, given it's a brand-new container. It looks like this is an existing GitHub issue: https://github.com/aws/amazon-ssm-agent/issues/361
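If a stale IPC file really is what's behind the "address in use" error, one thing that might be worth trying is clearing the agent's IPC directory at container start, before anything else runs. This is an unverified workaround sketch, not an official fix; the only path it relies on is the one from the error message above:

```shell
#!/bin/sh
# Hypothetical entrypoint snippet: remove leftover SSM IPC channel files
# (the error above points at ipc:///var/lib/amazon/ssm/ipc/health).
# Only safe at container start, before any SSM agent is running.
rm -f /var/lib/amazon/ssm/ipc/* 2>/dev/null || true
# Then continue with the container's real command.
exec "$@"
```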
Edited by: nscott on Apr 15, 2021 5:15 PM
This is happening to me as well. I did everything suggested by the ECS Exec checker and still see this behavior.
I fixed this in my case by removing the readonlyRootFilesystem setting from the task definition.
Having the same problem here as well; amazon-ecs-exec-checker shows only a warning for Init Process Enabled.
@nscott did you find a solution?
I have a similar problem, but I'm using a read-only filesystem. Which directories need to be writable so I can map them to tmpfs?
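For what it's worth, the two paths that appear earlier in this thread are the log under /var/log/amazon/ssm/ and the IPC sockets under /var/lib/amazon/ssm/ipc/, so at minimum /var/lib/amazon/ssm and /var/log/amazon/ssm would need to be writable; there may be others. A local Docker sketch of the idea, assuming those two are enough (the image name is a placeholder, and an ECS task definition would need the equivalent writable mounts instead):

```shell
# Local reproduction sketch with Docker, not the ECS task definition itself:
# keep the root filesystem read-only, but give the agent writable tmpfs
# mounts at the paths this thread shows it writing to.
docker run --rm --read-only \
  --tmpfs /var/lib/amazon/ssm \
  --tmpfs /var/log/amazon/ssm \
  --tmpfs /tmp \
  my-image   # placeholder image name
```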