Intermittent health check timeouts causing ECS to kill tasks


We have an ECS service running our API. Normally this service runs with ~12 tasks. The service is configured with an HTTP health check that returns a 200 if certain conditions are met; this usually responds within ~200ms. We also have a scaling policy that starts new tasks based on the average CPU utilization across the tasks.
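For reference, the health check itself is configured on the ALB target group rather than on ECS. A rough sketch of the relevant knobs via boto3 is below; the ARN is a placeholder and the path, interval, timeout, and threshold values are approximations rather than our exact settings:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN; interval/timeout/threshold values are approximate.
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/my-target-group/abc123",
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/health",        # assumed path
    HealthCheckIntervalSeconds=30,
    HealthCheckTimeoutSeconds=5,
    HealthyThresholdCount=5,
    UnhealthyThresholdCount=2,
    Matcher={"HttpCode": "200"},
)
```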

Recently we have seen ECS terminating a large chunk of tasks at once (often ~50% of them), after which our service drops requests because we no longer have the capacity to handle the inbound traffic. On the most recent occurrence, I noticed a spike in traffic of about 40% above our normal level around the time ECS terminated the tasks; however, the API should have enough headroom to handle that without any issue. This has happened ~5 times in the past week or so, but it is very intermittent and doesn't seem to affect the entire service, only certain tasks.

I have checked all of our monitoring and logging and I can't see any indication of why the health check would be failing. The application logs for the affected tasks look completely normal. All I have to go on are messages like the following in the ECS event log:

service my-service (port 8000) is unhealthy in target-group my-target-group due to (reason Request timed out).

Is there any further troubleshooting I can do to understand what is causing this? Also, if the issue is somehow triggered by an increase in load, is there a way to prevent the ECS service from immediately terminating the tasks (which inevitably compounds the issue)?
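For what it's worth, the kind of check I'm planning to run the next time this happens is to poll the target group during the incident and capture the detailed state and reason code (e.g. Target.Timeout) for each target while the failure is in progress. A minimal sketch with boto3; the ARN is a placeholder:

```python
import time
import boto3

elbv2 = boto3.client("elbv2")
TG_ARN = "arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/my-target-group/abc123"  # placeholder

# Poll the target group and print any target that is not healthy, along with
# the ELB-provided reason code and description. Stop with Ctrl-C.
while True:
    resp = elbv2.describe_target_health(TargetGroupArn=TG_ARN)
    for desc in resp["TargetHealthDescriptions"]:
        health = desc["TargetHealth"]
        if health["State"] != "healthy":
            print(
                desc["Target"]["Id"],
                desc["Target"].get("Port"),
                health["State"],
                health.get("Reason"),
                health.get("Description"),
            )
    time.sleep(15)
```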
