How can I detect why a Fargate RunTask triggered by an EventBridge rule fails?

We use EventBridge to trigger jobs in Fargate. This has been working well for a long time. Lately, it seems like starting the task in Fargate sometimes fails silently.

We run thousands of these jobs, and the failures seem to be random and rare. I have done some digging in CloudTrail and I can see that RunTask is executed. However, there is no corresponding CreateLogStream event after RunTask, and consequently there are no logs in CloudWatch for these runs.

Since this happens rarely, I have not been able to look at a stopped task in Fargate since they tend to be cleaned up rapidly, but I'm on the lookout.

I have seen this happen when we have been way below our quota in Fargate so it should not be connected to any service quotas.

  • I have been able to inspect the job in the console and found the stopped reason "ResourceInitializationError: failed to configure ENI: failed to setup regular eni: netplugin failed with no error message". This doesn't solve the problem, since the task still fails silently.

Knut
asked 2 years ago · 263 views
1 Answer
Accepted Answer

Hello Knut,

The error message ResourceInitializationError: failed to configure ENI could be due to a transient issue within the Fargate workflow. If this Fargate task was part of an ECS service, then the ECS Service Scheduler would have attempted to re-launch the task automatically.

However, when EventBridge launches an ECS task, it performs the RunTask API operation to trigger the creation of a new task. Starting a task through the RunTask API involves an asynchronous workflow.

If the workflow started successfully, then a success code is returned. However, this doesn't mean that the task is in the RUNNING state. The RunTask caller is expected to verify that the task reaches the RUNNING state, and if that does not happen, the caller needs to retry the operation.
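As a rough illustration of that verify-and-retry responsibility, here is a minimal Python sketch. It assumes `ecs` is an object that behaves like `boto3.client("ecs")` (only `run_task` and `describe_tasks` are used). The function names, the polling interval, and the backoff values are hypothetical choices for this example, not something prescribed by ECS or EventBridge.

```python
import time


def wait_for_start(ecs, cluster, task_arn, poll_interval, max_polls, sleep):
    """Poll DescribeTasks until the task has started or stopped.

    Returns (started, reason). `started` is True if the task reached RUNNING,
    including the case where it already finished after having started.
    """
    for _ in range(max_polls):
        tasks = ecs.describe_tasks(cluster=cluster, tasks=[task_arn]).get("tasks", [])
        if tasks:
            task = tasks[0]
            status = task.get("lastStatus")
            # `startedAt` is only set once the task has moved from PENDING to RUNNING.
            if status == "RUNNING" or (status == "STOPPED" and "startedAt" in task):
                return True, None
            if status == "STOPPED":
                # The task never started; stoppedReason carries the error text.
                return False, task.get("stoppedReason", "unknown")
        sleep(poll_interval)
    return False, "task did not reach RUNNING before the polling limit"


def run_task_with_retry(ecs, run_task_kwargs, max_attempts=4, base_delay=2.0,
                        poll_interval=5.0, max_polls=60, sleep=time.sleep):
    """Call RunTask, confirm the task starts, and retry with exponential backoff."""
    reasons = []
    for attempt in range(max_attempts):
        response = ecs.run_task(**run_task_kwargs)
        if response.get("failures") or not response.get("tasks"):
            # RunTask accepted the call but could not place the task.
            reasons.append(str(response.get("failures")))
        else:
            task_arn = response["tasks"][0]["taskArn"]
            started, reason = wait_for_start(
                ecs, run_task_kwargs["cluster"], task_arn,
                poll_interval, max_polls, sleep)
            if started:
                return task_arn
            reasons.append(reason)
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
    raise RuntimeError(f"task failed to start after {max_attempts} attempts: {reasons}")


# Example call (requires boto3 and AWS credentials; all values are placeholders):
#   import boto3
#   run_task_with_retry(boto3.client("ecs"), {
#       "cluster": "my-cluster", "taskDefinition": "my-job", "launchType": "FARGATE",
#       "networkConfiguration": {"awsvpcConfiguration": {"subnets": ["subnet-..."]}},
#   })
```

The collected `stoppedReason` values are included in the final error, so a message such as the ResourceInitializationError above would be surfaced instead of failing silently. Note that this sketch only retries tasks that never started; a task that starts and then exits with an error is treated as started.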

Reattempts can be automated with an exponential backoff and retry logic by using AWS Step Functions.
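For illustration only, the following Python sketch builds a state machine definition of that kind as a plain dictionary. It assumes the `ecs:runTask.sync` service integration, which waits for the task to stop, and a `Retry` block on `States.TaskFailed`. The function name, parameter names, and default retry values are hypothetical, and the network settings would need to match your own VPC.

```python
import json


def build_state_machine(cluster_arn, task_definition_arn, subnets,
                        max_attempts=3, interval_seconds=10, backoff_rate=2.0):
    """Return an Amazon States Language definition that runs one Fargate task
    and retries it with exponential backoff if the task fails."""
    return {
        "Comment": "Run a Fargate task and retry with backoff on failure",
        "StartAt": "RunFargateTask",
        "States": {
            "RunFargateTask": {
                "Type": "Task",
                # .sync makes Step Functions wait until the ECS task stops.
                "Resource": "arn:aws:states:::ecs:runTask.sync",
                "Parameters": {
                    "LaunchType": "FARGATE",
                    "Cluster": cluster_arn,
                    "TaskDefinition": task_definition_arn,
                    "NetworkConfiguration": {
                        "AwsvpcConfiguration": {
                            "Subnets": subnets,
                            "AssignPublicIp": "DISABLED",  # assumption: private subnets
                        }
                    },
                },
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": interval_seconds,  # first wait
                        "MaxAttempts": max_attempts,
                        "BackoffRate": backoff_rate,  # each wait is multiplied by this
                    }
                ],
                "End": True,
            }
        },
    }


# Placeholder values; substitute your own ARNs and subnet IDs.
definition = build_state_machine("my-cluster-arn", "my-task-definition-arn",
                                 ["my-subnet-id"])
definition_json = json.dumps(definition, indent=2)
```

The EventBridge rule would then target the state machine instead of the ECS task. One caveat with this sketch: `States.TaskFailed` is a broad error name, so it may also retry tasks that started and then failed for application reasons, which is only safe if the job is idempotent.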

Here is a knowledge-center article that explains how to use Step Functions to implement the retry-backoff functionality to mitigate your problem.

I hope this is helpful to you. Please add a comment if you have any concerns with this approach.

Thank you!

AWS
Support Engineer
answered 2 years ago
Expert
reviewed 5 months ago
  • Thanks for the response, Venkat. It is helpful in that it tells me that no one should ever use EventBridge rules to trigger Fargate tasks. Since all our Fargate jobs are managed through a controller to avoid hitting service quotas, we can at least implement the retry there instead of in the places where we would otherwise have done it. I'm currently worried about which Step Functions quotas we would struggle with if we chose that solution.
