ECS execute-command fails when running service on EC2 cluster with network mode awsvpc

As the title says, I ran into an issue where aws ecs execute-command fails with the error:

An error occurred (TargetNotConnectedException) when calling the ExecuteCommand operation: The execute command failed due to an internal error. Try again later.

If I take the same service and switch to NetworkMode: bridge, it works; if I switch to LaunchType: FARGATE, it also works. These two results make me assume the permissions are correct and the issue is somewhere in the network configuration, but I cannot figure out where (I don't have much experience with networking in general).

This is the CloudFormation stack I'm using to run this service:

AWSTemplateFormatVersion: 2010-09-09

Parameters:
    ClusterArn:
        Type: String
    VpcId:
        Type: String
    SubnetIds:
        Type: List<AWS::EC2::Subnet::Id>

Resources:
    AppService:
        Type: AWS::ECS::Service
        Properties:
            Cluster: !Ref ClusterArn
            DeploymentController:
                Type: ECS
            DesiredCount: 1
            EnableExecuteCommand: true
            LaunchType: EC2
            PropagateTags: SERVICE
            SchedulingStrategy: REPLICA
            TaskDefinition: !Ref TaskDefinition
            NetworkConfiguration:
                AwsvpcConfiguration:
                    Subnets: !Ref SubnetIds
                    SecurityGroups:
                        - !Ref ECSSecurityGroup

    ECSSecurityGroup:
        Type: AWS::EC2::SecurityGroup
        Properties:
            GroupDescription: "Security Group for ECS tasks allowing SSM access"
            VpcId: !Ref VpcId
            SecurityGroupIngress:
                - IpProtocol: "-1"  # -1 represents all protocols
                  CidrIp: 0.0.0.0/0
                - IpProtocol: "-1"
                  CidrIpv6: ::/0
            SecurityGroupEgress:
                - IpProtocol: "-1"
                  CidrIp: 0.0.0.0/0
                - IpProtocol: "-1"
                  CidrIpv6: ::/0

    TaskDefinition:
        Type: AWS::ECS::TaskDefinition
        Properties:
            ContainerDefinitions:
                -   Name: nginx
                    Essential: true
                    HealthCheck:
                        Command: [ "CMD-SHELL", "curl -f http://localhost || exit 1" ]
                    Image: nginx
                    LogConfiguration:
                        LogDriver: awslogs
                        Options:
                            awslogs-group: !Ref LogGroup
                            awslogs-region: !Ref AWS::Region
                            awslogs-stream-prefix: !Ref AWS::StackName
                    MemoryReservation: 128
                    PortMappings:
                        -   ContainerPort: 80
            ExecutionRoleArn: !Ref TaskExecutionRole
            Family: !Ref AWS::StackName
            RequiresCompatibilities:
                - EC2
            TaskRoleArn: !Ref TaskRole
            NetworkMode: awsvpc

    LogGroup:
        Type: AWS::Logs::LogGroup
        Properties:
            LogGroupName: !Join [ '-', [ !Ref AWS::StackName, 'logs' ] ]
            RetentionInDays: 30

    TaskExecutionRole:
        Type: AWS::IAM::Role
        Properties:
            Policies:
                -   PolicyName: TaskRolePolicy
                    PolicyDocument:
                        Version: 2012-10-17
                        Statement:
                            -   "Effect": "Allow"
                                "Action":
                                    - "ssm:StartSession"
                                    - "ssm:TerminateSession"
                                    - "ssm:DescribeSessions"
                                    - "ssm:GetConnectionStatus"
                                "Resource": "*"
                            -   "Effect": "Allow"
                                "Action":
                                    - "logs:CreateLogGroup"
                                    - "logs:CreateLogStream"
                                    - "logs:PutLogEvents"
                                "Resource": "*"

            AssumeRolePolicyDocument:
                Statement:
                    -   Action:
                            - sts:AssumeRole
                        Condition: { }
                        Effect: Allow
                        Principal:
                            Service: ecs-tasks.amazonaws.com
                Version: 2012-10-17
            ManagedPolicyArns:
                - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
                - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

    TaskRole: # Based on https://github.com/aws/aws-cli/issues/6242#issuecomment-1079214960
        Type: AWS::IAM::Role
        Properties:
            AssumeRolePolicyDocument:
                Version: 2012-10-17
                Statement:
                    -   Action:
                            - sts:AssumeRole
                        Effect: Allow
                        Principal:
                            Service: ecs-tasks.amazonaws.com
            ManagedPolicyArns:
                - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
            Policies:
                -   PolicyName: TaskRolePolicy
                    PolicyDocument:
                        Version: 2012-10-17
                        Statement:
                            -   Effect: Allow
                                Action:
                                    - logs:DescribeLogGroups
                                Resource: '*'
                            -   Effect: Allow
                                Action:
                                    - logs:CreateLogStream
                                    - logs:DescribeLogStreams
                                    - logs:PutLogEvents
                                Resource: !Join [ '', [ 'arn:aws:logs:', !Ref AWS::Region, ':', !Ref AWS::AccountId , ':log-group:/aws/ecs/', !Ref AWS::StackName, ':*' ] ]

This is a sample service I created trying to isolate the problem.

Thank you and any ideas would be greatly appreciated.

asked 2 months ago, 66 views
2 Answers

The aws ecs execute-command failure you're seeing with the EC2 launch type and awsvpc network mode is likely related to network configuration. Tasks that use awsvpc network mode on EC2 instances come with some specific networking considerations:

  1. Public IP Assignment: Tasks using awsvpc network mode on EC2 instances are not automatically assigned public IP addresses. This means they don't have direct internet access, which can affect the ability to use execute-command.

  2. NAT Gateway Requirement: Tasks in private subnets reach the internet through a NAT gateway. execute-command needs a network path to the Systems Manager endpoints, so unless you use VPC endpoints (see item 3), the tasks must be launched in a private subnet that routes through a NAT gateway.

  3. VPC Endpoints: If you rely on VPC endpoints instead of internet access, ensure the necessary ones are set up. You need endpoints not just for Amazon ECS, but also for Systems Manager Session Manager (the ssmmessages service).
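
As a rough illustration of item 3, an interface endpoint for Session Manager could be added to the question's template along these lines. This is only a sketch: the logical ID SsmMessagesEndpoint is made up, it reuses the VpcId and SubnetIds parameters and the ECSSecurityGroup resource from the question, and it assumes the VPC has DNS resolution and DNS hostnames enabled and that the listed subnets are in different Availability Zones.

```yaml
# Sketch only: merge this entry into the existing Resources section.
# SsmMessagesEndpoint is a hypothetical logical ID; VpcId, SubnetIds and
# ECSSecurityGroup are the names already used in the question's template.
Resources:
    SsmMessagesEndpoint:
        Type: AWS::EC2::VPCEndpoint
        Properties:
            VpcEndpointType: Interface
            # Session Manager data channel used by ECS Exec
            ServiceName: !Sub com.amazonaws.${AWS::Region}.ssmmessages
            VpcId: !Ref VpcId
            SubnetIds: !Ref SubnetIds
            SecurityGroupIds:
                - !Ref ECSSecurityGroup
            # Assumes DNS support and DNS hostnames are enabled on the VPC
            PrivateDnsEnabled: true
```

With private DNS enabled, the regular ssmmessages hostname should resolve to the endpoint's private IP addresses inside the VPC, so the task definition itself would not need to change.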

To resolve this issue, consider the following steps:

  1. If your tasks need internet access, launch them in private subnets configured with a NAT gateway.

  2. Set up the required VPC endpoints, including one for Systems Manager Session Manager.

  3. Verify that your security group allows outbound HTTPS traffic (TCP port 443) to the necessary AWS services.

  4. Double-check that the SSM agent is installed and running on your EC2 instances.

  5. Ensure that your task role has the necessary permissions for SSM and execute-command operations.
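
For step 5, the permissions ECS Exec needs on the task role are the four ssmmessages actions. A minimal sketch of attaching them to the TaskRole from the question is below. The logical ID EcsExecPolicy and the policy name are hypothetical, and the AmazonSSMManagedInstanceCore managed policy already attached in the question is broader and should already include these actions, so this is mainly useful for comparing against a narrower role.

```yaml
# Sketch only: merge this entry into the existing Resources section.
# EcsExecPolicy and EcsExecSsmMessages are hypothetical names; TaskRole is
# the task role defined in the question's template.
Resources:
    EcsExecPolicy:
        Type: AWS::IAM::Policy
        Properties:
            PolicyName: EcsExecSsmMessages
            Roles:
                - !Ref TaskRole
            PolicyDocument:
                Version: "2012-10-17"
                Statement:
                    -   Effect: Allow
                        Action:
                            - ssmmessages:CreateControlChannel
                            - ssmmessages:CreateDataChannel
                            - ssmmessages:OpenControlChannel
                            - ssmmessages:OpenDataChannel
                        Resource: '*'
```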

If you've confirmed all these settings and the issue persists, you may want to use the Amazon ECS Exec Checker tool to diagnose any remaining configuration issues.

Remember, when using awsvpc network mode with EC2 instances, inbound network access must come from within the VPC using private IP addresses or be routed through a load balancer within the VPC. Tasks launched in public subnets without these configurations may not have the necessary internet access for execute-command to function properly.
Sources
Tasks with public IP on EC2 backed ECS | AWS re:Post
Troubleshooting Amazon ECS TaskFailedToStart errors - Amazon Elastic Container Service
Error when try verify from the ECS instance | AWS re:Post
Troubleshoot Amazon ECS Exec issues - Amazon Elastic Container Service
ExecuteCommand - Amazon Elastic Container Service

answered 2 months ago
EXPERT
reviewed 2 months ago

Please check this doc, which describes the additional configuration that Amazon EC2 instances need for enabling ECS Exec:

https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task-iam-roles.html#task-iam-role-considerations

answered 2 months ago
EXPERT
reviewed 2 months ago