ECS service deployment stuck in deadlock due to deleted target group


Hi, we recently came across a problem with ECS service deployments that, in our view, shows a lack of robustness.

Our setup roughly looks like this: we have an ECS service that is reachable via several domains, which may change (though not often). For technical reasons, requests for the different domains are routed to the service's task containers via separate target groups. Changes to the service are rolled out with a deployment configuration of minimum healthy percent 100% and maximum percent 200%.

When our automation switches domains, a target group associated with the service might be deleted before the deployment has deregistered the existing container targets from it. As a result, the deployment gets stuck in a state where it can no longer remove the old task. This can be observed in CloudTrail:

{
    ...
    "eventSource": "elasticloadbalancing.amazonaws.com",
    "eventName": "DescribeTargetGroups",
    "awsRegion": "eu-central-1",
    "sourceIPAddress": "ecs.amazonaws.com",
    "userAgent": "ecs.amazonaws.com",
    "errorCode": "TargetGroupNotFoundException",
    "errorMessage": "One or more target groups not found",
    ...
}
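As a rough illustration, records like the one above could be picked out of a CloudTrail export with a small filter. This is only a sketch in Python: the function names are made up for this post, and it assumes the records are already parsed into dictionaries carrying the fields shown in the excerpt.

```python
from typing import Iterable, List


def is_ecs_missing_target_group_error(record: dict) -> bool:
    """Return True if the record is ECS failing to find a target group.

    Matches only on fields visible in the excerpt above; anything else
    in the record is ignored.
    """
    return (
        record.get("eventSource") == "elasticloadbalancing.amazonaws.com"
        and record.get("userAgent") == "ecs.amazonaws.com"
        and record.get("errorCode") == "TargetGroupNotFoundException"
    )


def find_missing_target_group_errors(records: Iterable[dict]) -> List[dict]:
    """Filter parsed CloudTrail records down to the error described here."""
    return [r for r in records if is_ecs_missing_target_group_error(r)]
```

A steady stream of matching records for the same service would be a hint that a deployment is stuck in the way described here.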

We are aware that our solution should handle this situation better, i.e. the target groups should not be deleted too early, and we are already looking into this. However, we were a bit surprised that the deployment got completely stuck in this case, blocking all subsequent deployments because of the min/max configuration. Could this be handled in a more robust way on the AWS side? And are there any suggestions for how to handle this in our automation? We would prefer not to add polling that waits for the service to reach a steady state after each change, as we would like to keep this asynchronous.

Thanks in advance

  • Update: It seems that ECS handles this error correctly if only one target group is attached to the ECS service. It shows the error in the service's events view but does not get stuck. If, however, two (or more) target groups are attached and one of them is deleted prematurely, the situation described above occurs.
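One way the premature deletion described above might be avoided is a one-shot guard that the automation runs right before deleting a target group. The following is a minimal sketch, not a confirmed fix: `target_group_safe_to_delete` and its parameters are hypothetical names, the `ecs` and `elbv2` arguments are assumed to be boto3 clients (or compatible stand-ins), and it relies on the assumption that the service's `loadBalancers` list already reflects the new configuration when the check runs.

```python
def target_group_safe_to_delete(ecs, elbv2, cluster, service, target_group_arn):
    """Return True only if the target group looks unused by the service.

    ecs / elbv2: boto3-style clients, passed in by the caller (assumption).
    cluster / service: identifiers of the ECS service to inspect.
    """
    # 1. The service must no longer reference the target group.
    described = ecs.describe_services(cluster=cluster, services=[service])
    for svc in described.get("services", []):
        for lb in svc.get("loadBalancers", []):
            if lb.get("targetGroupArn") == target_group_arn:
                return False

    # 2. No targets may remain registered; targets that are still
    #    draining are also listed, so they block deletion too.
    health = elbv2.describe_target_health(TargetGroupArn=target_group_arn)
    return len(health.get("TargetHealthDescriptions", [])) == 0
```

With real clients this would be called as `target_group_safe_to_delete(boto3.client("ecs"), boto3.client("elbv2"), ...)`, and `delete_target_group` would only be issued when it returns True. To stay asynchronous, a failed check could simply defer the deletion to a later cleanup run, for example one triggered by an ECS deployment state change event, instead of waiting in place.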

No Answers
