How to investigate ECS Services Stuck in DRAINING State with Service Connect TLS

7 minute read
Content level: Intermediate

This playbook explains how to diagnose and resolve Amazon ECS services stuck in the DRAINING state, particularly when Service Connect with TLS is enabled.

Issue Summary

Symptoms:

  • ECS service remains in DRAINING status indefinitely
  • Unable to delete service even with force delete command
  • Error: "Create service is not idempotent" when attempting to recreate service
  • Error: "Unable to Start a service that is still Draining"
  • Service may not appear in Console or ListServices API but is visible via DescribeServices

Common Root Causes:

  • Service Connect TLS infrastructure IAM role deleted or misconfigured
  • Missing trust policy on the TLS role allowing ecs.amazonaws.com to assume it
  • Missing required permissions on the TLS role
  • KMS key deleted, disabled, or inaccessible
  • Missing AmazonECSManaged tags on CloudMap/Service Discovery resources
  • Infrastructure as Code (Terraform/CloudFormation) deleting roles before services

Important Note: These errors are almost always caused by misconfigured IAM roles. If you already know what the TLS infrastructure role is for your service, you can skip directly to the Resolution Steps section.

Diagnostic Steps

Step 1: Verify Service Status and Associated Tasks

Run the following command to check the service details:

aws ecs describe-services \
  --cluster <cluster-name> \
  --services <service-name> \
  --region <region>

What to look for:

  • Service status showing "DRAINING"
  • Service events containing error messages
  • Note the service ARN and creation timestamp
  • Check the serviceConnectConfiguration section for TLS role ARN(s) - you may find one or more roleArns listed here

CRITICAL: Check for "MISSING" Status First

If the output shows fields marked as "MISSING", this indicates the associated cluster was deleted before the service was fully deleted. In this case:

  • You can skip most of the diagnostic steps below
  • The issue is almost certainly due to a misconfigured or deleted IAM role
  • Proceed directly to Step 3 to identify the TLS role, then jump to Resolution Steps

Example Failure Events to Look For:

"events": [
  {
    "message": "AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/ECSInfrastructureTlsRole/ECSServiceConnectForTLS is not authorized to perform: secretsmanager:GetSecretValue"
  },
  {
    "message": "InvalidParameterException: Unable to assume role arn:aws:iam::123456789012:role/ECSInfrastructureTlsRole"
  }
]

Step 2: Review Service Events and Logs

Check the service events section in the DescribeServices output for:

  • AccessDeniedException errors
  • InvalidParameterException errors
  • References to IAM roles or KMS keys
  • sts:AssumeRole failures
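
The checks above can be applied mechanically to saved DescribeServices output. The following is a minimal Python sketch, assuming the JSON from Step 1 has been loaded into a dictionary (for example with json.load). The function name find_failure_events and the marker list are illustrative assumptions, not part of any AWS tooling, and the list is not exhaustive.

```python
# Substrings drawn from the list above; extend as needed (illustrative, not exhaustive).
FAILURE_MARKERS = (
    "AccessDeniedException",
    "InvalidParameterException",
    "AssumeRole",
    "Unable to assume role",
    "kms",
)


def find_failure_events(describe_services_output):
    """Return (service name, message) pairs for events that contain a failure marker."""
    hits = []
    for service in describe_services_output.get("services", []):
        for event in service.get("events", []):
            message = event.get("message", "")
            if any(marker in message for marker in FAILURE_MARKERS):
                hits.append((service.get("serviceName", "<unknown>"), message))
    return hits
```

Each returned message can then be compared against the example failure events shown in Step 1.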

Step 3: Identify the TLS Infrastructure Role

Primary Method - Check Service Configuration: From the DescribeServices output, look in the serviceConnectConfiguration section for the TLS role ARN(s):

"serviceConnectConfiguration": {
  "services": [
    {
      "tls": {
        "issuerCertificateAuthority": {
          "awsPcaAuthorityArn": "string"
        },
        "kmsKey": "string",
        "roleArn": "string"
      }
    }
  ]
}

Reference: ServiceConnectTlsConfiguration-roleArn

Important: One ECS Service can have multiple TLS roleArns (though uncommon). Make sure to verify that ALL of them exist and have the correct policies.
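
Because a service can carry more than one TLS role, it can help to collect the ARNs programmatically. The sketch below is a hedged example in Python: it walks a parsed DescribeServices response and gathers every tls.roleArn found beneath any serviceConnectConfiguration key, wherever that key appears in the output (for example, nested under a deployment entry). The function name collect_tls_role_arns is hypothetical.

```python
def collect_tls_role_arns(node):
    """Collect each distinct tls.roleArn found under any serviceConnectConfiguration."""
    found = []

    def walk(value, inside_sc):
        if isinstance(value, dict):
            for key, child in value.items():
                # Only count "tls" blocks that sit inside a Service Connect configuration.
                if inside_sc and key == "tls" and isinstance(child, dict):
                    role = child.get("roleArn")
                    if role and role not in found:
                        found.append(role)
                walk(child, inside_sc or key == "serviceConnectConfiguration")
        elif isinstance(value, list):
            for item in value:
                walk(item, inside_sc)

    walk(node, False)
    return found
```

Every ARN in the returned list should then be taken through Steps 4 to 6.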

Alternative Method - Use CloudTrail: If you cannot describe the service or need to identify the role used:

  1. To find permissions issues:
    • Filter CloudTrail by User name = ECSServiceConnectForTLS
    • Look for calls made by the infrastructure role to KMS/SecretsManager/PCA
    • Failed calls will indicate permission problems
  2. To find trust policy issues:
    • Filter CloudTrail by Resource name = ECSServiceConnectForTLS
    • Look for AssumeRole requests
    • Failed AssumeRole calls indicate trust policy or role existence issues

Using these CloudTrail techniques, you should only need to contact support if you truly cannot determine what role was configured for your service.
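
The same filters are available from the command line through aws cloudtrail lookup-events, for example with --lookup-attributes AttributeKey=Username,AttributeValue=ECSServiceConnectForTLS. As a rough sketch, assuming that command's JSON output has been loaded into a dictionary, the Python function below lists only the calls that recorded an error. The function name summarize_failed_calls is an assumption made for this example.

```python
import json


def summarize_failed_calls(lookup_events_output):
    """Return (eventSource, eventName, errorCode) for each event that recorded an error."""
    failures = []
    for entry in lookup_events_output.get("Events", []):
        # Each entry carries the full CloudTrail record as a JSON string.
        record = json.loads(entry.get("CloudTrailEvent", "{}"))
        error_code = record.get("errorCode")
        if error_code:
            failures.append(
                (record.get("eventSource"), record.get("eventName"), error_code)
            )
    return failures
```

Errors against KMS, Secrets Manager, or PCA point to missing permissions; errors on AssumeRole point to the trust policy or a missing role.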

Step 4: Verify IAM Role Existence

aws iam get-role --role-name <role-name> 

  • If the role doesn't exist: proceed to Resolution Steps
  • If the role exists: verify its trust policy and permissions, starting with Step 5

Step 5: Check Trust Policy

aws iam get-role --role-name <role-name> --query 'Role.AssumeRolePolicyDocument' 

Verify the trust policy allows ecs.amazonaws.com to assume the role.
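
For reference, this check can be expressed as a small Python function over the policy document returned by the command above. This is a simplified sketch: it only looks for an Allow statement that names the service principal and sts:AssumeRole, and it deliberately ignores Condition blocks and Deny statements, which would need separate review. The function name trust_policy_allows_ecs is hypothetical.

```python
def trust_policy_allows_ecs(policy_document, service="ecs.amazonaws.com"):
    """Return True if an Allow statement lets the service principal call sts:AssumeRole."""
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given as an object
        statements = [statements]

    def as_list(value):
        return value if isinstance(value, list) else [value]

    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        services = as_list(principal.get("Service", [])) if isinstance(principal, dict) else []
        actions = as_list(statement.get("Action", []))
        if service in services and any(a in ("sts:AssumeRole", "sts:*", "*") for a in actions):
            return True
    return False
```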

Step 6: Verify Attached Policies

aws iam list-attached-role-policies --role-name <role-name> 

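CRITICAL: Confirm that the AWS managed policy AmazonECSInfrastructureRolePolicyForServiceConnectTransportLayerSecurity is attached. Attaching the AWS managed policy by ARN always applies its current default version; a customer-managed copy made from an older version may lack permissions that were added later and will not resolve the issue.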
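
A short Python sketch of this check, assuming the JSON output of list-attached-role-policies has been loaded into a dictionary. It matches on the policy name and on the AWS managed ARN prefix, so a customer-managed policy with the same name is not counted. The function name has_tls_managed_policy is an assumption for illustration.

```python
TLS_POLICY_NAME = "AmazonECSInfrastructureRolePolicyForServiceConnectTransportLayerSecurity"
AWS_MANAGED_PREFIX = "arn:aws:iam::aws:policy/"


def has_tls_managed_policy(list_attached_output):
    """Return True if the AWS managed Service Connect TLS policy is attached."""
    for policy in list_attached_output.get("AttachedPolicies", []):
        name_matches = policy.get("PolicyName") == TLS_POLICY_NAME
        is_aws_managed = policy.get("PolicyArn", "").startswith(AWS_MANAGED_PREFIX)
        if name_matches and is_aws_managed:
            return True
    return False
```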

Step 7: Check KMS Key Status (if applicable)

If the service uses a KMS key for encryption:

aws kms describe-key --key-id <key-id> 

Verify the key is enabled and the IAM role has permissions to use it.
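
The key state is reported in KeyMetadata.KeyState of the describe-key output. As an illustrative sketch in Python, the function below maps the states most relevant to this playbook to a suggested next action; the function name and the returned strings are assumptions for this example, and any other state is simply flagged for manual review.

```python
def kms_key_next_action(describe_key_output):
    """Map the KeyState from describe-key output to a suggested next step."""
    state = describe_key_output.get("KeyMetadata", {}).get("KeyState")
    if state == "Enabled":
        return "ok: check that the role is allowed to use the key"
    if state == "Disabled":
        return "run: aws kms enable-key"
    if state == "PendingDeletion":
        return "run: aws kms cancel-key-deletion, then aws kms enable-key"
    return f"investigate manually (KeyState={state})"
```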

Step 8: Review CloudTrail Logs

Search for recent events related to AccessDenied errors from:

  • SecretsManager
  • KMS
  • ACM-PCA

Reference: AmazonECSInfrastructureRolePolicyForServiceConnectTransportLayerSecurity

Use the CloudTrail filtering techniques from Step 3:

  • Filter by User name = ECSServiceConnectForTLS to identify permission issues with KMS, Secrets Manager, or PCA
  • Filter by Resource name = ECSServiceConnectForTLS to identify AssumeRole failures indicating trust policy or role existence problems

Resolution Steps

Step 1: Recreate the TLS Infrastructure Role (if deleted)

CRITICAL: The role must have the exact same name as originally configured.

Create the trust policy document (trust-policy.json):

{ 
 "Version": "2012-10-17", 
 "Statement": [ 
   { 
     "Sid": "AllowAccessToECSForInfrastructureManagement", 
     "Effect": "Allow", 
     "Principal": { 
       "Service": "ecs.amazonaws.com" 
     }, 
     "Action": "sts:AssumeRole" 
   } 
 ] 
} 

Create the role:

aws iam create-role \
  --role-name <original-role-name> \
  --assume-role-policy-document file://trust-policy.json

Step 2: Attach Required Managed Policy

aws iam attach-role-policy \
  --role-name <role-name> \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSInfrastructureRolePolicyForServiceConnectTransportLayerSecurity

Note: Use the latest version of this managed policy for the most current permissions.

Reference: AmazonECSInfrastructureRolePolicyForServiceConnectTransportLayerSecurity

Step 3: Fix Trust Policy (if role exists but misconfigured)

Update the trust policy:

aws iam update-assume-role-policy \
  --role-name <role-name> \
  --policy-document file://trust-policy.json

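Step 4: Restore KMS Key (if disabled or pending deletion)

If the KMS key is disabled, re-enable it with the command below. A key that is pending deletion must first be restored with aws kms cancel-key-deletion, which leaves it in the disabled state. A key whose deletion has already completed cannot be recovered.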

aws kms enable-key --key-id <key-id> 

Ensure the IAM role has KMS permissions:

aws kms put-key-policy \
  --key-id <key-id> \
  --policy-name default \
  --policy file://kms-policy.json

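Example: kms-policy.json. Note that put-key-policy replaces the entire key policy, so add this statement to the key's existing policy rather than using it as the only statement.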

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "id",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/role-name"
      },
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:GenerateDataKey",
        "kms:GenerateDataKeyPair"
      ],
      "Resource": "*"
    }
  ]
}

Reference: KMS policy
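
A sketch of merging the example statement into the key's current policy, so that the statements already present survive the update. It assumes the current policy was retrieved with aws kms get-key-policy --key-id <key-id> --policy-name default --query Policy --output text and parsed as JSON. The function name upsert_statement is hypothetical; a statement with the same Sid is replaced, and everything else is kept.

```python
import copy


def upsert_statement(key_policy, statement):
    """Return a copy of key_policy with statement added, replacing any with the same Sid."""
    merged = copy.deepcopy(key_policy)
    existing = merged.get("Statement", [])
    if isinstance(existing, dict):  # a single statement may be given as an object
        existing = [existing]
    sid = statement.get("Sid")
    kept = [s for s in existing if sid is None or s.get("Sid") != sid]
    merged["Statement"] = kept + [copy.deepcopy(statement)]
    merged.setdefault("Version", "2012-10-17")
    return merged
```

The result can be written to kms-policy.json with json.dump and passed to the put-key-policy command shown above.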

Step 5: Fix CloudMap/Service Discovery Tags (if applicable)

If errors mention Service Discovery or CloudMap:

aws servicediscovery tag-resource \
  --resource-arn <service-discovery-arn> \
  --tags Key=AmazonECSManaged,Value=true

Verify the AWSServiceRoleForECS service-linked role exists:

aws iam get-role --role-name AWSServiceRoleForECS 

Step 6: Wait for Automatic Cleanup

Once the IAM role is correctly configured:

  1. ECS will automatically retry the cleanup operations
  2. Service will transition from DRAINING to INACTIVE (typically within minutes)
  3. Once the service reaches INACTIVE status, you can immediately recreate a service with the same name

Step 7: Verify Resolution

Monitor the service status:

aws ecs describe-services \
  --cluster <cluster-name> \
  --services <service-name> \
  --region <region> \
  --query 'services[0].status'

Check service events for successful cleanup messages.
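
Rather than re-running the command by hand, the status can be polled. The following is a minimal Python sketch under stated assumptions: cli_status_fetcher shells out to the same describe-services command (with --output text added so the status comes back as a bare word), and wait_for_status repeats the check until the target status appears or the attempts run out. Both function names, the 30-second interval, and the attempt limit are illustrative choices, not AWS recommendations.

```python
import subprocess
import time


def cli_status_fetcher(cluster, service, region):
    """Build a function that returns the service status by calling the AWS CLI."""
    def fetch():
        result = subprocess.run(
            ["aws", "ecs", "describe-services",
             "--cluster", cluster, "--services", service, "--region", region,
             "--query", "services[0].status", "--output", "text"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    return fetch


def wait_for_status(fetch_status, target="INACTIVE", interval_seconds=30,
                    max_attempts=20, sleep=time.sleep):
    """Poll fetch_status() until it returns target; True on success, False on timeout."""
    for attempt in range(max_attempts):
        if fetch_status() == target:
            return True
        if attempt < max_attempts - 1:
            sleep(interval_seconds)
    return False
```

For example, wait_for_status(cli_status_fetcher("<cluster-name>", "<service-name>", "<region>")) returns True once the service reports INACTIVE.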

Prevention Best Practices

For Infrastructure as Code Users

Terraform: Add explicit dependencies to ensure services are deleted before IAM roles:

resource "aws_ecs_service" "example" { 
 # ... service configuration ... 
  
 depends_on = [aws_iam_role.ecs_tls_role] 
} 

CloudFormation: Use DependsOn attribute:

ECSService: 
 Type: AWS::ECS::Service 
 DependsOn: ECSInfrastructureTLSRole 

General Recommendations

  • Never delete infrastructure IAM roles while services are active
  • Implement IAM role deletion protection for critical infrastructure roles
  • Monitor CloudTrail for unauthorized IAM role deletions
  • Regularly audit the IAM roles used by ECS services
  • Document role names and configurations in runbooks
  • Test deletion order in non-production environments first
