How can I detect why a Fargate RunTask triggered by an EventBridge rule fails?


We use EventBridge to trigger jobs in Fargate. This has been working well for a long time. Lately, it seems like starting the task in Fargate sometimes fails silently.

We run thousands of these jobs, and the failures seem to be completely random and rare. I have done some digging in CloudTrail and can see that RunTask is executed. However, there is no corresponding CreateLogStream after RunTask, and consequently there are no logs in CloudWatch either.

Since this happens rarely, I have not been able to look at a stopped task in Fargate since they tend to be cleaned up rapidly, but I'm on the lookout.

I have seen this happen when we have been way below our quota in Fargate so it should not be connected to any service quotas.

  • I have since been able to inspect a failed task in the console and found the stopped reason "ResourceInitializationError: failed to configure ENI: failed to setup regular eni: netplugin failed with no error message". This doesn't solve the problem, since the task still fails silently.
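One way to make such failures visible is to route ECS "Task State Change" events to a handler and flag tasks that stopped without ever starting. The sketch below is only an illustration: the function name `start_failure_reason` is hypothetical, and it assumes the event's `detail` carries `lastStatus` and `stoppedReason`, and that `startedAt` is absent for a task that never reached RUNNING.

```python
from typing import Optional


def start_failure_reason(event: dict) -> Optional[str]:
    """Return the stopped reason if the event describes a task that
    stopped before it ever started; return None otherwise.

    Assumption: `event` is an EventBridge "ECS Task State Change" event
    already parsed into a dict.
    """
    if event.get("detail-type") != "ECS Task State Change":
        return None
    detail = event.get("detail", {})
    if detail.get("lastStatus") != "STOPPED":
        return None
    # Assumption: a task that never reached RUNNING has no startedAt field.
    if "startedAt" in detail:
        return None
    return detail.get("stoppedReason", "unknown")
```

A handler could log the returned reason or raise an alarm whenever the result is not None, so that the failure is no longer silent.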

Knut
asked 2 years ago · 263 views
1 Answer
Accepted Answer

Hello Knut,

The error message ResourceInitializationError: failed to configure ENI could be due to a transient issue within the Fargate workflow. If this Fargate task was part of an ECS service, then the ECS Service Scheduler would have attempted to re-launch the task automatically.

However, when EventBridge launches an ECS task, it performs the RunTask API operation to trigger the creation of a new task. Starting a task through the RunTask API involves an asynchronous workflow.

If the workflow starts successfully, RunTask returns a success code. However, this doesn't mean that the task has reached the RUNNING state. The RunTask caller is expected to verify that the task reaches the RUNNING state and, if it does not, to retry the operation.
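The verification step described above could look roughly like the following sketch. It assumes `ecs` is a boto3 ECS client (for example `boto3.client("ecs")`) passed in by the caller; `run_and_verify`, its parameters, and the polling interval are hypothetical choices for illustration, not a prescribed implementation.

```python
import time


def run_and_verify(ecs, run_task_kwargs, cluster, timeout_s=300, poll_s=5,
                   sleep=time.sleep):
    """Call RunTask, then poll DescribeTasks until the task has started.

    Returns (True, task_arn) if the task reached RUNNING (or ran and
    stopped between polls), otherwise (False, reason).
    """
    response = ecs.run_task(**run_task_kwargs)
    if response.get("failures") or not response.get("tasks"):
        return False, str(response.get("failures"))
    task_arn = response["tasks"][0]["taskArn"]

    waited = 0
    while waited <= timeout_s:
        described = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
        tasks = described.get("tasks", [])
        if tasks:
            task = tasks[0]
            status = task.get("lastStatus")
            if status == "RUNNING":
                return True, task_arn
            if status == "STOPPED":
                # Assumption: startedAt is only present if the task ran.
                if "startedAt" in task:
                    return True, task_arn
                return False, task.get("stoppedReason", "unknown")
        sleep(poll_s)
        waited += poll_s
    return False, "timed out waiting for RUNNING"
```

When the function returns False, the caller can decide whether the reason looks transient (such as the ENI error above) and call it again after a delay.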

Reattempts can be automated with an exponential backoff and retry logic by using AWS Step Functions.

Here is a knowledge-center article that explains how to use Step Functions to implement the retry-backoff functionality to mitigate your problem.
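As a rough illustration of the Step Functions approach (not the definition from the article), the sketch below builds a one-state state machine definition as a Python dict. It assumes the `ecs:runTask.sync` service integration and a `Retry` block with `BackoffRate`; the function name, state name, retry numbers, and network settings are placeholders to adapt.

```python
import json


def build_retry_state_machine(cluster_arn, task_definition_arn, subnets,
                              max_attempts=5):
    """Return a Step Functions definition that runs a Fargate task and
    retries with exponential backoff when the task fails."""
    return {
        "Comment": "Run a Fargate task with retry and backoff (sketch)",
        "StartAt": "RunFargateTask",
        "States": {
            "RunFargateTask": {
                "Type": "Task",
                # .sync makes the state wait until the task stops.
                "Resource": "arn:aws:states:::ecs:runTask.sync",
                "Parameters": {
                    "LaunchType": "FARGATE",
                    "Cluster": cluster_arn,
                    "TaskDefinition": task_definition_arn,
                    "NetworkConfiguration": {
                        "AwsvpcConfiguration": {
                            "Subnets": subnets,
                            "AssignPublicIp": "DISABLED",
                        }
                    },
                },
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 10,  # delay before first retry
                        "MaxAttempts": max_attempts,
                        "BackoffRate": 2.0,  # double the delay each time
                    }
                ],
                "End": True,
            }
        },
    }


if __name__ == "__main__":
    # Placeholder identifiers, for illustration only.
    print(json.dumps(build_retry_state_machine("my-cluster", "my-task-def", ["subnet-placeholder"]), indent=2))
```

Note that with this sketch a task that starts and then exits with an error would also be retried, so the job should be safe to run more than once, or the retry condition should be narrowed.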

I hope this is helpful to you. Please add a comment if you have any concerns with this approach.

Thank you!

AWS
SUPPORT ENGINEER
answered 2 years ago
EXPERT
reviewed 5 months ago
  • Thanks for the response, Venkat. It is helpful in that it tells me that no one should ever use EventBridge rules to trigger Fargate tasks. Since all our Fargate jobs are managed through a controller to avoid hitting service quotas, we can at least implement the retry there instead of in the other places where we would otherwise have done it. I'm currently worried about which Step Functions quotas we would struggle with if we chose that solution.
