Intermittent health check timeouts causing ECS to kill tasks


We have an ECS service running our API. Normally this service runs with ~12 tasks. The service is configured with an HTTP health check that returns a 200 if certain conditions are met - usually this returns within ~200ms. We have a scaling policy that starts new tasks based on the average CPU of our tasks.

Recently we have seen that ECS is terminating a large chunk of tasks at a time (often ~50% of the tasks) and then our service drops requests as we don't have the capacity to handle the inbound requests. I have noticed, at least on the most recent occurrence, that we had a spike in traffic of about 40% of our current traffic around the time that ECS terminated a bunch of tasks, however, the capacity should be there in our API to handle this without any issue. This issue has happened ~5 times in the past week or so but is very intermittent and doesn't seem to affect the entire service - only certain tasks.

I have checked all of our monitoring and logging and I can't see anything as to why the health check would be failing. The application logs for the tasks are completely normal. All I have to go on are the following messages in the ECS event log:

service my-service (port 8000) is unhealthy in target-group my-target-group due to (reason Request timed out).

Is there any further troubleshooting I can do to understand what is causing this? Also, if the issue is somehow triggered by an increase in load, is there a way we can prevent the ECS service from immediately terminating the tasks (which inevitably compounds the issue).

No Answers

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions