Intermittent health check timeouts causing ECS to kill tasks


We have an ECS service running our API. Normally this service runs with ~12 tasks. The service is configured with an HTTP health check that returns a 200 when certain conditions are met; it usually responds within ~200ms. We have a scaling policy that starts new tasks based on the average CPU of our tasks.
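For reference, the scaling policy is roughly equivalent to the following sketch (cluster, service, and capacity/target values here are illustrative placeholders, not our exact configuration):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the ECS service's DesiredCount as a scalable target
# (cluster/service names and capacities are placeholders).
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=12,
    MaxCapacity=30,
)

# Target-tracking policy on average CPU across the service's tasks.
autoscaling.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,  # illustrative target, not our exact value
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)
```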

Recently we have seen ECS terminate a large chunk of tasks at a time (often ~50% of them), after which our service drops requests because we no longer have the capacity to handle the inbound traffic. On the most recent occurrence I noticed a traffic spike of about 40% over our usual load around the time ECS terminated the tasks; however, the API should have had more than enough capacity to absorb that. This has happened ~5 times in the past week or so, but it is very intermittent and doesn't seem to affect the entire service, only certain tasks.

I have checked all of our monitoring and logging and can't find any indication of why the health check would be failing. The application logs for the tasks are completely normal. All I have to go on are the following messages in the ECS event log:
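For completeness, this is roughly how I've been pulling the target-group health check settings, per-target health, and recent service events to cross-check against the application logs (the ARN and cluster/service names are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2")
ecs = boto3.client("ecs")

# Placeholder ARN for illustration only.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/my-target-group/abc123"

# Health check configuration on the target group (interval, timeout, thresholds, path).
tg = elbv2.describe_target_groups(TargetGroupArns=[TARGET_GROUP_ARN])["TargetGroups"][0]
print({k: tg[k] for k in (
    "HealthCheckIntervalSeconds",
    "HealthCheckTimeoutSeconds",
    "HealthyThresholdCount",
    "UnhealthyThresholdCount",
    "HealthCheckPath",
)})

# Current per-target health, including the reason a target is marked unhealthy.
for target in elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)["TargetHealthDescriptions"]:
    print(target["Target"]["Id"], target["TargetHealth"])

# Recent ECS service events (the same messages quoted below).
service = ecs.describe_services(cluster="my-cluster", services=["my-service"])["services"][0]
for event in service["events"][:10]:
    print(event["createdAt"], event["message"])
```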

service my-service (port 8000) is unhealthy in target-group my-target-group due to (reason Request timed out).

Is there any further troubleshooting I can do to understand what is causing this? Also, if the issue is somehow triggered by an increase in load, is there a way to prevent the ECS service from immediately terminating the tasks, which inevitably compounds the issue?
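One mitigation I'm considering (not applied yet) is relaxing the target-group health check settings and adding a health check grace period, so a brief slowdown under load doesn't immediately take tasks out of service. Would something along these lines be reasonable? The values below are illustrative only, and the names are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")
ecs = boto3.client("ecs")

# Placeholder ARN; the numbers below are illustrative, not recommendations.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/my-target-group/abc123"

# Give each health check more time to respond and require more consecutive
# failures before a target is marked unhealthy.
elbv2.modify_target_group(
    TargetGroupArn=TARGET_GROUP_ARN,
    HealthCheckTimeoutSeconds=10,
    HealthCheckIntervalSeconds=30,
    UnhealthyThresholdCount=5,
)

# Ignore health check results for a window after a task starts, so tasks
# launched during a traffic spike aren't killed while they warm up.
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    healthCheckGracePeriodSeconds=120,
)
```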
