How can I detect why a Fargate RunTask triggered by an EventBridge rule fails?

Question

We use EventBridge to trigger jobs in Fargate. This has been working well for a long time. Lately, it seems like starting the task in Fargate sometimes fails silently.

We run thousands of these jobs, and the failures seem to be random and rare. I have done some digging in CloudTrail and can see that RunTask is executed. However, there is no corresponding CreateLogStream after RunTask, and consequently there are no logs in CloudWatch either.

Since this happens rarely, I have not been able to look at a stopped task in Fargate since they tend to be cleaned up rapidly, but I'm on the lookout.

I have seen this happen when we have been way below our quota in Fargate so it should not be connected to any service quotas.

Accepted Answer

Hello Knut,

The error message `ResourceInitializationError: failed to configure ENI` could be due to a transient issue within the Fargate workflow. If this Fargate task was part of an ECS service, then the ECS Service Scheduler would have attempted to re-launch the task automatically.

However, when EventBridge launches an ECS task, it performs the *RunTask* API operation to trigger the creation of a new task. Starting a task through the *RunTask* API involves an asynchronous workflow.

If the workflow starts successfully, a success code is returned. However, this does not mean that the task has reached the RUNNING state. The *RunTask* caller is expected to verify that the task reaches the RUNNING state and, if it does not, to retry the operation.
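For a caller that invokes *RunTask* directly, the verify-and-retry loop described above could be sketched as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `run_task_with_retry`, its parameters, and the delay values are hypothetical, and `ecs` is assumed to be a boto3 ECS client. The `startedAt` field is used to tell a short task that ran and already finished apart from one that never started.

```python
import time


def run_task_with_retry(ecs, run_task_kwargs, max_attempts=4, base_delay=2.0,
                        poll_interval=5.0, max_polls=60, sleep=time.sleep):
    """Start a task and relaunch it if it stops before reaching RUNNING.

    `ecs` is assumed to be a boto3 ECS client, e.g. boto3.client("ecs").
    `run_task_kwargs` is passed unchanged to ecs.run_task().
    Returns the ARN of the task that started; raises RuntimeError otherwise.
    """
    reasons = []  # collected failure reasons, useful for diagnosing "why"
    cluster = run_task_kwargs.get("cluster", "default")
    for attempt in range(max_attempts):
        if attempt > 0:
            # Exponential backoff between launches: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))
        response = ecs.run_task(**run_task_kwargs)
        if not response.get("tasks"):
            # RunTask can succeed as an API call yet place no task.
            reasons.append(str(response.get("failures")))
            continue
        task_arn = response["tasks"][0]["taskArn"]
        for _ in range(max_polls):
            described = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
            task = (described.get("tasks") or [{}])[0]
            status = task.get("lastStatus")
            # startedAt is only set once the task has left PENDING for RUNNING.
            if status == "RUNNING" or (status == "STOPPED" and "startedAt" in task):
                return task_arn
            if status == "STOPPED":
                # Stopped without ever starting, e.g. ResourceInitializationError.
                reasons.append(task.get("stoppedReason", "unknown"))
                break
            sleep(poll_interval)
        else:
            # Still pending: do not relaunch, to avoid running the job twice.
            raise TimeoutError(f"{task_arn} did not reach RUNNING in time")
    raise RuntimeError(f"task did not start after {max_attempts} attempts: {reasons}")


# Hypothetical usage (cluster and task definition names are placeholders):
#   import boto3
#   run_task_with_retry(boto3.client("ecs"),
#                       {"cluster": "my-cluster", "taskDefinition": "my-job",
#                        "launchType": "FARGATE"})
```

The collected `stoppedReason` values end up in the raised error, so a rare start failure leaves a trace even after the stopped task is no longer visible in the console.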

Retries with exponential backoff can be automated by using [AWS Step Functions](https://aws.amazon.com/step-functions/).

Here is a knowledge-center [article](https://aws.amazon.com/premiumsupport/knowledge-center/ecs-fargate-network-interface-errors/) that explains how to use Step Functions to implement the retry-backoff functionality to mitigate your problem.
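As a rough illustration of the Step Functions approach, the sketch below builds an Amazon States Language definition whose single state runs the Fargate task through the `ecs:runTask.sync` integration and retries it with exponential backoff. It is not taken from the linked article; the function name, the state name `RunFargateTask`, the retry values, and all ARNs and subnet IDs are hypothetical placeholders. Note that retrying on `States.TaskFailed` is assumed here to also cover tasks whose application exits with an error, so the job should be safe to run more than once.

```python
import json


def build_retrying_run_task_definition(cluster_arn, task_definition_arn, subnets,
                                       max_attempts=5, interval_seconds=10,
                                       backoff_rate=2.0):
    """Return a state machine definition (as a dict) that runs a Fargate
    task and retries it with exponential backoff when the task fails."""
    return {
        "Comment": "Run a Fargate task and retry with backoff on failure",
        "StartAt": "RunFargateTask",
        "States": {
            "RunFargateTask": {
                "Type": "Task",
                # .sync makes Step Functions wait for the task to finish
                "Resource": "arn:aws:states:::ecs:runTask.sync",
                "Parameters": {
                    "LaunchType": "FARGATE",
                    "Cluster": cluster_arn,
                    "TaskDefinition": task_definition_arn,
                    "NetworkConfiguration": {
                        "AwsvpcConfiguration": {"Subnets": subnets}
                    },
                },
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    # Waits 10s, 20s, 40s, ... between attempts by default
                    "IntervalSeconds": interval_seconds,
                    "MaxAttempts": max_attempts,
                    "BackoffRate": backoff_rate,
                }],
                "End": True,
            }
        },
    }


# Placeholder values for illustration only.
definition = build_retrying_run_task_definition(
    "arn:aws:ecs:REGION:ACCOUNT_ID:cluster/my-cluster",
    "arn:aws:ecs:REGION:ACCOUNT_ID:task-definition/my-job",
    ["subnet-PLACEHOLDER"],
)
print(json.dumps(definition, indent=2))
```

With this design, the EventBridge rule would target the state machine instead of the ECS task, so every scheduled run gets the retry behaviour without changes to the job itself.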

I hope this is helpful to you. Please add a comment if you have any concerns with this approach.

Thank you!
