Intermittent health check timeouts causing ECS to kill tasks


We have an ECS service running our API. Normally this service runs with ~12 tasks. The service is configured with an HTTP health check that returns a 200 when certain conditions are met; it usually responds within ~200ms. We have a scaling policy that starts new tasks based on the average CPU of our tasks.
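For reference, the scaling policy is roughly equivalent to the following sketch (cluster, service, and capacity/target values here are illustrative placeholders, not our exact configuration):

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the ECS service's DesiredCount as a scalable target
# (cluster/service names and capacities are placeholders).
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=12,
    MaxCapacity=30,
)

# Target-tracking policy on average CPU across the service's tasks.
autoscaling.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,  # illustrative target, not our exact value
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
    },
)
```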

Recently we have seen ECS terminate a large chunk of tasks at a time (often ~50% of them), after which our service drops requests because we no longer have the capacity to handle the inbound traffic. On the most recent occurrence I noticed a traffic spike of about 40% over our usual load around the time ECS terminated the tasks; however, the API should have had more than enough capacity to absorb that. This has happened ~5 times in the past week or so, but it is very intermittent and doesn't seem to affect the entire service, only certain tasks.

I have checked all of our monitoring and logging and can't find any indication of why the health check would be failing. The application logs for the tasks are completely normal. All I have to go on are the following messages in the ECS event log:
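For completeness, this is roughly how I've been pulling the target-group health check settings, per-target health, and recent service events to cross-check against the application logs (the ARN and cluster/service names are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2")
ecs = boto3.client("ecs")

# Placeholder ARN for illustration only.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/my-target-group/abc123"

# Health check configuration on the target group (interval, timeout, thresholds, path).
tg = elbv2.describe_target_groups(TargetGroupArns=[TARGET_GROUP_ARN])["TargetGroups"][0]
print({k: tg[k] for k in (
    "HealthCheckIntervalSeconds",
    "HealthCheckTimeoutSeconds",
    "HealthyThresholdCount",
    "UnhealthyThresholdCount",
    "HealthCheckPath",
)})

# Current per-target health, including the reason a target is marked unhealthy.
for target in elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)["TargetHealthDescriptions"]:
    print(target["Target"]["Id"], target["TargetHealth"])

# Recent ECS service events (the same messages quoted below).
service = ecs.describe_services(cluster="my-cluster", services=["my-service"])["services"][0]
for event in service["events"][:10]:
    print(event["createdAt"], event["message"])
```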

service my-service (port 8000) is unhealthy in target-group my-target-group due to (reason Request timed out).

Is there any further troubleshooting I can do to understand what is causing this? Also, if the issue is somehow triggered by an increase in load, is there a way to prevent the ECS service from immediately terminating the tasks, which inevitably compounds the issue?
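One mitigation I'm considering (not applied yet) is relaxing the target-group health check settings and adding a health check grace period, so a brief slowdown under load doesn't immediately take tasks out of service. Would something along these lines be reasonable? The values below are illustrative only, and the names are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")
ecs = boto3.client("ecs")

# Placeholder ARN; the numbers below are illustrative, not recommendations.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/my-target-group/abc123"

# Give each health check more time to respond and require more consecutive
# failures before a target is marked unhealthy.
elbv2.modify_target_group(
    TargetGroupArn=TARGET_GROUP_ARN,
    HealthCheckTimeoutSeconds=10,
    HealthCheckIntervalSeconds=30,
    UnhealthyThresholdCount=5,
)

# Ignore health check results for a window after a task starts, so tasks
# launched during a traffic spike aren't killed while they warm up.
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    healthCheckGracePeriodSeconds=120,
)
```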
