ECS fails to remove a task from the load balancer target group?


We recently migrated our service to ECS, and we’ve seen a pattern of errors like this a few times:

  1. Our CPU usage is low, so as part of normal autoscaling, ECS starts to reduce the number of tasks by 1
  2. ECS claims to have deregistered 1 target and be draining connections
  3. 90 seconds later (after the deregistration delay) ECS stops the task
  4. Immediately after the task is stopped, the load balancer returns a flood of 502s, all directed at a single IP. We suspect that this is the IP of the task that was deregistered and stopped, but somehow never removed from the ALB target group

We don’t have any long-lived connections, so the 90 second deregistration delay should be long enough for the task to finish processing its requests before it’s stopped.

It seems that the task selected for removal isn’t actually removed from the ALB target group, even though the ECS logs include messages indicating that it is. The logs include:

* service \[our-service\] deregistered 1 targets in target-group \[our-target-group\]

followed by

* service \[our-service\] has begun draining connections on 1 tasks.

Events like this happen rarely (definitely not every time we scale down) but frequently enough to notice. Does anybody have ideas about why these errors might be happening, or how to get more information about what is going on? Thanks in advance.
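
One check we are considering is polling the target group during a scale-in to see whether the stopped task's IP is still listed, and in which state. Below is a minimal sketch, assuming boto3 with credentials configured and IP-type targets (so each target `Id` is a task IP). The helper names `group_targets_by_state` and `fetch_target_health` are hypothetical; the only AWS call used is the ELBv2 `describe_target_health` API.

```python
from collections import defaultdict


def group_targets_by_state(response):
    """Map each health state (e.g. 'healthy', 'draining') to a list of 'id:port' strings.

    `response` is expected to have the shape returned by the ELBv2
    describe_target_health API.
    """
    grouped = defaultdict(list)
    for desc in response.get("TargetHealthDescriptions", []):
        target = desc["Target"]
        state = desc["TargetHealth"]["State"]
        grouped[state].append(f"{target['Id']}:{target.get('Port')}")
    return dict(grouped)


def fetch_target_health(target_group_arn):
    """Fetch current target health for one target group (requires AWS access)."""
    import boto3  # imported lazily so the helper above works without boto3 installed

    client = boto3.client("elbv2")
    return client.describe_target_health(TargetGroupArn=target_group_arn)
```

Calling `group_targets_by_state(fetch_target_health(arn))` every few seconds around a scale-in, where `arn` is the target group ARN, should show whether the IP receiving the 502s moves to `draining` and then disappears, or stays registered after the task stops.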

No Answers
