Why are ECS Container Instances No Longer Able to Register With ECS Cluster?


We have several EC2 / ECS clusters with ASG capacity providers and managed scaling. Sometime between 12:30am and ~7am US Pacific time on Dec 7, 2022, new instances provisioned by managed scaling stopped registering with their ECS clusters.

A deploy completed successfully at ~12:30am. It was performed via a CloudFormation stack update and turned over the live container instances as expected, per our capacity provider rules. Mid-morning, the ASG provisioned additional EC2 instances in response to managed scaling signals, but they never registered with the cluster.

Is anyone else experiencing problems with this? What could explain a regression in the absence of any CF resource changes?

What I tried:

I ran the AWSSupport-TroubleshootECSContainerInstance runbook in AWS Systems Manager against one of the instances that failed to register, per the troubleshooting docs. It reported that the IAM instance profile lacked every one of the required permissions. The same instance profile is attached to the instances that DID register successfully a little after midnight, and its permissions have not changed since.

Specifically, the AWSSupport-TroubleshootECSContainerInstance automation reported back the following:

{"Payload":{"stdout":"The container instance profile <REDACTED> is missing the following required permission(s): 
['ecs:RegisterContainerInstance', 'ecs:CreateCluster', 'ecs:DeregisterContainerInstance', 'ecs:Poll', 'ecs:StartTelemetrySession', 'ecs:UpdateContainerInstancesState', 'ecs:SubmitAttachmentStateChange', 'ecs:SubmitContainerStateChange', 'ecs:SubmitTaskStateChange']
Make sure that the container instance has all the recommended permissions.
See https://docs.aws.amazon.com/AmazonECS/latest/developerguide/security-iam-awsmanpol.html#instance-iam-role-permissions

","info_codes":["I002"]}}

When I examined the instance profile in question it had the following permissions, scoped to the correct cluster, as generated by CDK v2:

  • ecs:RegisterContainerInstance
  • ecs:DeregisterContainerInstance
  • ecs:Poll
  • ecs:StartTelemetrySession
  • ecs:Submit*

It lacked the following two permissions altogether:

  • ecs:CreateCluster
  • ecs:UpdateContainerInstancesState
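
For illustration, the gap between the two lists can be reproduced with a short Python sketch. It compares action names only, treating `*` as a wildcard and ignoring case. It does not model Resource scoping or conditions, and it is not how the runbook itself performs its check (that logic is not shown in its output); the helper name `missing_actions` is hypothetical.

```python
from fnmatch import fnmatchcase

# Actions the runbook output listed as required (copied from its message).
REQUIRED_ACTIONS = [
    "ecs:RegisterContainerInstance",
    "ecs:CreateCluster",
    "ecs:DeregisterContainerInstance",
    "ecs:Poll",
    "ecs:StartTelemetrySession",
    "ecs:UpdateContainerInstancesState",
    "ecs:SubmitAttachmentStateChange",
    "ecs:SubmitContainerStateChange",
    "ecs:SubmitTaskStateChange",
]

# Action patterns found on the instance profile, using the
# StartTelemetrySession spelling from the runbook output.
GRANTED_PATTERNS = [
    "ecs:RegisterContainerInstance",
    "ecs:DeregisterContainerInstance",
    "ecs:Poll",
    "ecs:StartTelemetrySession",
    "ecs:Submit*",
]


def missing_actions(required, granted_patterns):
    """Return the required actions that no granted pattern matches.

    Hypothetical helper: compares action names only (case-insensitive,
    '*' as wildcard). Resource scoping and conditions are ignored.
    """
    lowered = [pattern.lower() for pattern in granted_patterns]
    return [
        action
        for action in required
        if not any(fnmatchcase(action.lower(), pattern) for pattern in lowered)
    ]


print(missing_actions(REQUIRED_ACTIONS, GRANTED_PATTERNS))
```

On action names alone, only the two permissions listed above come back as missing, because `ecs:Submit*` covers all three Submit actions. That the runbook flagged all nine suggests its check also depends on how the permissions are scoped.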

Eventually I was able to resolve the IAM problems reported by AWSSupport-TroubleshootECSContainerInstance by:

  1. Adding the two missing permissions
  2. Widening the scope of the permissions I already had, from the cluster in question to "Resource": "*"
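
A sketch of what the resulting policy statement might look like after both changes, built in Python so it can be printed as JSON. The exact statement on the instance profile is not shown in this post, so the structure below (a single Allow statement, the `build_policy` helper name) is an assumption reconstructed from the two steps above.

```python
import json

# Actions after the fix: the five original entries plus the two that
# were missing. The grouping into one statement is an assumption.
ACTIONS = [
    "ecs:RegisterContainerInstance",
    "ecs:DeregisterContainerInstance",
    "ecs:Poll",
    "ecs:StartTelemetrySession",
    "ecs:Submit*",
    "ecs:CreateCluster",
    "ecs:UpdateContainerInstancesState",
]


def build_policy(actions, resource="*"):
    """Build an IAM policy document with one Allow statement.

    Hypothetical helper for illustration; resource defaults to "*",
    matching the widened scope described above.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": resource,
            }
        ],
    }


policy = build_policy(ACTIONS)
print(json.dumps(policy, indent=2))
```

Passing a cluster ARN as `resource` instead of the default would reproduce the original, narrower scoping for comparison.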

To me this looks like a bug pushed by the ECS team. I cannot see why a previously functioning instance profile would stop working with no changes on our side. The new requirement for "ecs:CreateCluster" is especially suspicious: why would a container instance need to create a cluster? The requirement for Resource access beyond the target cluster is also extremely suspicious.

Regardless, new EC2 instances still do not register with their cluster. The AWSSupport-TroubleshootECSContainerInstance automation now comes up blank, stating:

{"Payload":{"stdout":"Unable to identify the cause of issue.
If you are still experiencing issues while registering <REDACTED> in the cluster named vila-v0-prod, please open a case with Premium Support and attach the logs generated by the ECS Logs Collector script.
See: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-logs-collector.html
If it is not possible to upload the file while case creation, please ask the assigned engineer to provide the instruction to upload the file.

","info_codes":["I000"]}}
asked a year ago, 172 views
No Answers