How to investigate ECS Services Stuck in DRAINING State with Service Connect TLS
This playbook is a guide to diagnosing and resolving Amazon ECS services that are stuck in the DRAINING state, particularly when Service Connect with TLS is enabled.
Issue Summary
Symptoms:
- ECS service remains in DRAINING status indefinitely
- Unable to delete service even with force delete command
- Error: "Create service is not idempotent" when attempting to recreate service
- Error: "Unable to Start a service that is still Draining"
- Service may not appear in Console or ListServices API but is visible via DescribeServices
Common Root Causes:
- Service Connect TLS infrastructure IAM role deleted or misconfigured
- Missing trust policy on the TLS role allowing ecs.amazonaws.com to assume it
- Missing required permissions on the TLS role
- KMS key deleted, disabled, or inaccessible
- Missing AmazonECSManaged tags on CloudMap/Service Discovery resources
- Infrastructure as Code (Terraform/CloudFormation) deleting roles before services
Important Note: These errors are almost always caused by misconfigured IAM roles. If you already know what the TLS infrastructure role is for your service, you can skip directly to the Resolution Steps section.
Diagnostic Steps
Step 1: Verify Service Status and Associated Tasks
Run the following command to check the service details:
aws ecs describe-services \
--cluster <cluster-name> \
--services <service-name> \
--region <region>
What to look for:
- Service status showing "DRAINING"
- Service events containing error messages
- Note the service ARN and creation timestamp
- Check the serviceConnectConfiguration section for TLS role ARN(s) - you may find one or more roleArns listed here
CRITICAL: Check for "MISSING" Status First
If the output shows fields marked as "MISSING", this indicates the associated cluster was deleted before the service was fully deleted. In this case:
- You can skip most of the diagnostic steps below
- The issue is almost certainly due to a misconfigured or deleted IAM role
- Proceed directly to Step 3 to identify the TLS role, then jump to Resolution Steps
Example Failure Events to Look For:
"events": [
  {
    "message": "AccessDeniedException: User: arn:aws:sts::123456789012:assumed-role/ECSInfrastructureTlsRole/ECSServiceConnectForTLS is not authorized to perform: secretsmanager:GetSecretValue"
  },
  {
    "message": "InvalidParameterException: Unable to assume role arn:aws:iam::123456789012:role/ECSInfrastructureTlsRole"
  }
]
Step 2: Review Service Events and Logs
Check the service events section in the DescribeServices output for:
- AccessDeniedException errors
- InvalidParameterException errors
- References to IAM roles or KMS keys
- sts:AssumeRole failures
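The event review above can be narrowed with a client-side `--query` filter. The following is a minimal sketch, not an official procedure: the function name `list_failure_events` and the exact set of marker strings are assumptions chosen for illustration, and the filter only matches the literal text of each event message.

```shell
# Sketch: print only the service events whose message mentions a common
# failure marker. The function name and marker list are illustrative.
# Arguments: $1 = cluster name, $2 = service name, $3 = region
list_failure_events() {
  aws ecs describe-services \
    --cluster "$1" \
    --services "$2" \
    --region "$3" \
    --query "services[0].events[?contains(message, 'AccessDenied') || contains(message, 'InvalidParameterException') || contains(message, 'AssumeRole')].[createdAt, message]" \
    --output text
}
```

Call it as `list_failure_events <cluster-name> <service-name> <region>`. An empty result does not prove the service is healthy; it only means none of the listed markers appeared in the returned events.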
Step 3: Identify the TLS Infrastructure Role
Primary Method - Check Service Configuration:
From the DescribeServices output, look in the serviceConnectConfiguration section for the TLS role ARN(s):
"serviceConnectConfiguration": {
  "services": [
    {
      "tls": {
        "issuerCertificateAuthority": {
          "awsPcaAuthorityArn": "string"
        },
        "kmsKey": "string",
        "roleArn": "string"
      }
    }
  ]
}
Reference: ServiceConnectTlsConfiguration-roleArn
Important: One ECS Service can have multiple TLS roleArns (though uncommon). Make sure to verify that ALL of them exist and have the correct policies.
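To collect every TLS role ARN in one pass, a `--query` expression can pull the `roleArn` values out of the DescribeServices output. This sketch assumes the `serviceConnectConfiguration` section is reported under each entry of `deployments`; adjust the path if your output places it elsewhere. The function names `list_tls_role_arns` and `role_name_from_arn` are hypothetical helpers, not part of the AWS CLI.

```shell
# Sketch: list the distinct TLS role ARNs configured for a service.
# Assumes serviceConnectConfiguration appears under deployments[].
# Arguments: $1 = cluster name, $2 = service name, $3 = region
list_tls_role_arns() {
  aws ecs describe-services \
    --cluster "$1" \
    --services "$2" \
    --region "$3" \
    --query 'services[0].deployments[].serviceConnectConfiguration.services[].tls.roleArn' \
    --output text | tr '\t' '\n' | sort -u
}

# Sketch: strip everything up to the last "/" so the result can be passed
# to IAM commands that expect a role name rather than an ARN.
role_name_from_arn() {
  printf '%s\n' "${1##*/}"
}
```

Each ARN printed by `list_tls_role_arns` can be converted with `role_name_from_arn` and then checked individually in the steps that follow.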
Alternative Method - Use CloudTrail:
If you cannot describe the service or need to identify the role used:
- To find permissions issues:
  - Filter CloudTrail by User name = ECSServiceConnectForTLS
  - Look for calls made by the infrastructure role to KMS/SecretsManager/PCA
  - Failed calls will indicate permission problems
- To find trust policy issues:
  - Filter CloudTrail by Resource name = ECSServiceConnectForTLS
  - Look for AssumeRole requests
  - Failed AssumeRole calls indicate trust policy or role existence issues
Using these CloudTrail techniques, you should only need to contact support if you truly cannot determine what role was configured for your service.
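The same two filters can be run from the CLI with `aws cloudtrail lookup-events`, which accepts `Username` and `ResourceName` as lookup attribute keys. This is a minimal sketch under a few assumptions: the function name `lookup_tls_session_events` is hypothetical, the 50-item cap is arbitrary, and the error details of each call live inside the raw event record, which this sketch does not parse.

```shell
# Sketch: list recent CloudTrail events tied to the ECSServiceConnectForTLS
# session name. The function name and the 50-item cap are illustrative.
# Arguments: $1 = lookup attribute key (Username or ResourceName), $2 = region
lookup_tls_session_events() {
  aws cloudtrail lookup-events \
    --region "$2" \
    --lookup-attributes "AttributeKey=$1,AttributeValue=ECSServiceConnectForTLS" \
    --max-items 50 \
    --query 'Events[].[EventTime, EventSource, EventName]' \
    --output text
}
```

Use `lookup_tls_session_events Username <region>` for permission problems and `lookup_tls_session_events ResourceName <region>` for AssumeRole problems. To see why a specific call failed, inspect the full event record in the console or drop the `--query` option.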
Step 4: Verify IAM Role Existence
aws iam get-role --role-name <role-name>
- If the role doesn't exist: proceed to Resolution Steps.
- If the role exists: verify its trust policy and permissions in Steps 5 and 6.
Step 5: Check Trust Policy
aws iam get-role --role-name <role-name> --query 'Role.AssumeRolePolicyDocument'
Verify the trust policy allows ecs.amazonaws.com to assume the role.
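The trust policy check can be scripted so that it returns a simple pass or fail. The sketch below is a rough first check, not a full policy evaluation: it only looks for `ecs.amazonaws.com` among the service principals of `Allow` statements and ignores the `Action` and any `Condition` elements. The function name `trusts_ecs_service` is hypothetical.

```shell
# Sketch: succeed (exit 0) if any Allow statement in the role's trust policy
# names ecs.amazonaws.com as a service principal. Does not inspect Action
# or Condition elements.
# Arguments: $1 = role name
trusts_ecs_service() {
  aws iam get-role \
    --role-name "$1" \
    --query "Role.AssumeRolePolicyDocument.Statement[?Effect=='Allow'].Principal.Service" \
    --output text | tr '\t' '\n' | grep -qxF 'ecs.amazonaws.com'
}
```

For example, `trusts_ecs_service <role-name> && echo "trusted" || echo "not trusted"`. The exact-line match is deliberate, so a principal such as `ecs-tasks.amazonaws.com` does not count as a match.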
Step 6: Verify Attached Policies
aws iam list-attached-role-policies --role-name <role-name>
CRITICAL: Confirm the latest version of the AmazonECSInfrastructureRolePolicyForServiceConnectTransportLayerSecurity managed policy is attached. Older versions of this policy will not resolve the issue.
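A quick way to confirm the attachment is to match on the policy name in the command output. This is a minimal sketch with a hypothetical function name, `has_tls_managed_policy`; it only confirms that the managed policy is attached and says nothing about inline policies or permission boundaries on the role.

```shell
# Sketch: succeed (exit 0) if the Service Connect TLS managed policy is
# attached to the role. Matches on the policy name, not the ARN.
# Arguments: $1 = role name
has_tls_managed_policy() {
  aws iam list-attached-role-policies \
    --role-name "$1" \
    --query 'AttachedPolicies[].PolicyName' \
    --output text | tr '\t' '\n' \
    | grep -qxF 'AmazonECSInfrastructureRolePolicyForServiceConnectTransportLayerSecurity'
}
```

For example, `has_tls_managed_policy <role-name> || echo "managed policy is not attached"`.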
Step 7: Check KMS Key Status (if applicable)
If the service uses a KMS key for encryption:
aws kms describe-key --key-id <key-id>
Verify the key is enabled and the IAM role has permissions to use it.
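The key state reported by `describe-key` determines what to do next, so it can help to map the state to an action. The sketch below is illustrative only: the function name `kms_key_next_step` and the wording of the hints are assumptions, and it handles just the three states most relevant here (`Enabled`, `Disabled`, `PendingDeletion`).

```shell
# Sketch: read KeyMetadata.KeyState and print a suggested next step.
# The hint text is illustrative; other key states fall through to a
# generic message.
# Arguments: $1 = key ID or key ARN
kms_key_next_step() {
  state="$(aws kms describe-key \
    --key-id "$1" \
    --query 'KeyMetadata.KeyState' \
    --output text)"
  case "$state" in
    Enabled)
      echo "enabled: check the key policy and the role's KMS permissions" ;;
    Disabled)
      echo "disabled: run 'aws kms enable-key'" ;;
    PendingDeletion)
      echo "pending deletion: run 'aws kms cancel-key-deletion', then 'aws kms enable-key'" ;;
    *)
      echo "state '$state': review the key manually" ;;
  esac
}
```

If `describe-key` itself fails with a not-found error, the key may already be gone or belong to a different account or Region, which this sketch does not distinguish.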
Step 8: Review CloudTrail Logs
Search for recent AccessDenied events from:
- SecretsManager
- KMS
- ACM-PCA
Reference: AmazonECSInfrastructureRolePolicyForServiceConnectTransportLayerSecurity
Use the CloudTrail filtering techniques from Step 3:
- Filter by User name = ECSServiceConnectForTLS to identify permission issues with KMS, Secrets Manager, or PCA
- Filter by Resource name = ECSServiceConnectForTLS to identify AssumeRole failures indicating trust policy or role existence problems
Resolution Steps
Step 1: Recreate the TLS Infrastructure Role (if deleted)
CRITICAL: The role must have the exact same name as originally configured.
Create the trust policy document (trust-policy.json):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAccessToECSForInfrastructureManagement",
      "Effect": "Allow",
      "Principal": {
        "Service": "ecs.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Create the role:
aws iam create-role \
--role-name <original-role-name> \
--assume-role-policy-document file://trust-policy.json
Step 2: Attach Required Managed Policy
aws iam attach-role-policy \
--role-name <role-name> \
--policy-arn arn:aws:iam::aws:policy/AmazonECSInfrastructureRolePolicyForServiceConnectTransportLayerSecurity
Note: Use the latest version of this managed policy for the most current permissions.
Reference: AmazonECSInfrastructureRolePolicyForServiceConnectTransportLayerSecurity
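If the attach call reports that the policy cannot be found, one option is to look up the policy's ARN by name before attaching it, which avoids hard-coding the path portion of the ARN. The following is a sketch under that assumption; the function name `attach_tls_managed_policy` is hypothetical.

```shell
# Sketch: resolve the managed policy's ARN by name, then attach it.
# Arguments: $1 = role name
attach_tls_managed_policy() {
  # Results may arrive in several pages with text output, so keep only
  # the first line that looks like an ARN.
  policy_arn="$(aws iam list-policies \
    --scope AWS \
    --query "Policies[?PolicyName=='AmazonECSInfrastructureRolePolicyForServiceConnectTransportLayerSecurity'].Arn" \
    --output text | grep '^arn:' | head -n 1)"
  if [ -z "$policy_arn" ]; then
    echo "managed policy not found" >&2
    return 1
  fi
  aws iam attach-role-policy \
    --role-name "$1" \
    --policy-arn "$policy_arn"
}
```

Run it as `attach_tls_managed_policy <role-name>`. Attaching an AWS managed policy by ARN always uses the policy's current default version.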
Step 3: Fix Trust Policy (if role exists but misconfigured)
Update the trust policy:
aws iam update-assume-role-policy \
--role-name <role-name> \
--policy-document file://trust-policy.json
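Step 4: Restore KMS Key (if deleted or disabled)
If the KMS key is disabled, re-enable it. If the key is scheduled for deletion, cancel the deletion first (this leaves the key in the Disabled state) and then enable it. A key whose deletion has already completed cannot be restored.
aws kms cancel-key-deletion --key-id <key-id>
aws kms enable-key --key-id <key-id>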
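Ensure the IAM role has KMS permissions. Note that put-key-policy replaces the entire key policy, so add the statement from the example to the key's existing policy document rather than submitting it on its own: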
aws kms put-key-policy \
--key-id <key-id> \
--policy-name default \
--policy file://kms-policy.json
Example: kms-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "id",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/role-name"
      },
      "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:GenerateDataKey",
        "kms:GenerateDataKeyPair"
      ],
      "Resource": "*"
    }
  ]
}
Reference: KMS policy
Step 5: Fix CloudMap/Service Discovery Tags (if applicable)
If errors mention Service Discovery or CloudMap:
aws servicediscovery tag-resource \
--resource-arn <service-discovery-arn> \
--tags Key=AmazonECSManaged,Value=true
Verify the AWSServiceRoleForECS service-linked role exists:
aws iam get-role --role-name AWSServiceRoleForECS
Step 6: Wait for Automatic Cleanup
Once the IAM role is correctly configured:
- ECS will automatically retry the cleanup operations
- Service will transition from DRAINING → INACTIVE (typically within minutes)
- Once the service reaches INACTIVE status, you can immediately recreate a service with the same name
Step 7: Verify Resolution
Monitor the service status:
aws ecs describe-services \
--cluster <cluster-name> \
--services <service-name> \
--region <region> \
--query 'services[0].status'
Check service events for successful cleanup messages.
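Because the transition can take a few minutes, the status check can be wrapped in a polling loop. The sketch below is one way to do that; the function name `wait_for_inactive`, the default of 30 attempts, and the 20-second delay are all assumptions to tune for your environment.

```shell
# Sketch: poll the service status until it reports INACTIVE or the attempt
# budget runs out. Defaults (30 attempts, 20 seconds apart) are illustrative.
# Arguments: $1 = cluster, $2 = service, $3 = region,
#            $4 = max attempts (optional), $5 = delay in seconds (optional)
wait_for_inactive() {
  attempts="${4:-30}"
  delay="${5:-20}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    status="$(aws ecs describe-services \
      --cluster "$1" \
      --services "$2" \
      --region "$3" \
      --query 'services[0].status' \
      --output text)"
    echo "attempt $i: $status"
    if [ "$status" = "INACTIVE" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

Run it as `wait_for_inactive <cluster-name> <service-name> <region>`. The AWS CLI also provides a built-in `aws ecs wait services-inactive` waiter, which may be sufficient if you do not need per-attempt output.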
Prevention Best Practices
For Infrastructure as Code Users
Terraform: Add explicit dependencies to ensure services are deleted before IAM roles:
resource "aws_ecs_service" "example" {
  # ... service configuration ...
  depends_on = [aws_iam_role.ecs_tls_role]
}
CloudFormation: Use the DependsOn attribute:
ECSService:
  Type: AWS::ECS::Service
  DependsOn: ECSInfrastructureTLSRole
General Recommendations
- Never delete infrastructure IAM roles while services are active
- Protect critical infrastructure roles from deletion (for example, with a service control policy that denies iam:DeleteRole on those roles)
- Monitor CloudTrail for unauthorized IAM role deletions
- Regularly audit the IAM roles used by ECS services
- Document role names and configurations in runbooks
- Test deletion order in non-production environments first
Additional Resources
- AWS Documentation: Service Connect TLS Configuration
- IAM Policy Reference: AmazonECSInfrastructureRolePolicyForServiceConnectTransportLayerSecurity
- ECS Service States: Service Lifecycle Task Status Pending:
- Terraform: depends_on
- CloudFormation: DependsOn