Last week, we started having problems after an external service we depend on crashed. We noticed that our ECS service was still not working even after the external service had recovered. We struggled to find out why one of our ECS services was having trouble keeping its tasks up and running. A simple "forced" redeployment with a single configuration change, raising the number of tasks from 2 to 5, fixed the problem.
The next day, we tried to figure out what happened. It seems the ECS service was continuously killing the tasks because it was unable to reach their healthcheck endpoint. This healthcheck returns a simple 200 response with no dependency on external services. In CloudWatch, we can see logs of the process manager starting the application and of the server starting to listen.
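For context, the healthcheck is functionally no more complex than the sketch below (the framework, route name, and port are illustrative, not our exact code):

```python
# Illustrative sketch only -- our real application differs, but the healthcheck
# is equally trivial: it returns 200 without touching any external service.
from flask import Flask

app = Flask(__name__)

@app.route("/health")
def health():
    # No database, cache, or third-party call involved.
    return "OK", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```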
The problem occurred from May 8th at 8:30 PM to May 9th at 6:45 PM. Before the incident, the ECS service had been running correctly for several days, and the load was no different from usual. Performing a "forced redeployment" without changing the container image or any configuration other than the number of tasks fixed the issue: no more healthcheck failures at all. So we don't understand why the service was unstable before.
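For reference, we applied the fix from the console, but it is roughly equivalent to the following boto3 call (cluster and service names are placeholders, not our real ones):

```python
# Rough boto3 equivalent of the fix applied from the console:
# bump the desired count and force a new deployment of the same task definition.
import boto3

ecs = boto3.client("ecs")

ecs.update_service(
    cluster="my-cluster",      # placeholder
    service="my-service",      # placeholder
    desiredCount=5,            # raised from 2 to 5
    forceNewDeployment=True,   # restart tasks without changing the task definition
)
```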
Do you have any hints about what could have caused this behavior?