Step Functions Glue task "startJobRun.sync" ignores the Glue job's own retries when determining success/failure of the task


We have a step function for orchestrating various Glue jobs. The Glue jobs have a built-in retry mechanism and we currently have them set to retry once. In the case where the job fails the first time but succeeds on the Glue Job retry (not the SFN task retry), the step function thinks the task has already failed.

Here's an example of the task as defined in SFN:

{
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName.$": ...,
    "Arguments": {
       ...
    }
  },
  "Next": "Notify Success",
  "ResultPath": null,
  "Catch": [
    {
      "ErrorEquals": [
        "States.ALL"
      ],
      "Next": "Notify Failure"
    }
  ]
}

The job fails, and even has "Attempt": 0 in the cause field of the failure event. Is there a way to "Catch" on this? Or another method of having the step function wait for the Glue Job to complete its retries?

We could have the SFN manage all of the retries, but I'd rather not do that as there's a lot of delay between SFN<>Glue.

asked a year ago · 222 views
1 Answer

Normally, when you use Step Functions, you handle retries in the state machine (with a Retry block on the task) rather than with the job's built-in retries.
Doing it that way also gives you better control, for instance exponential backoff.
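
As a rough sketch of what that could look like, the task from the question might carry a Retry block like the one below. The job name "my-glue-job", the interval, the attempt count, and the backoff rate are all placeholder values rather than anything taken from the original setup, and the sketch assumes the Glue job's own retry count has been set to 0 so that only the state machine retries.

```json
{
  "Comment": "Sketch only: job name and retry values are hypothetical; assumes the Glue job's built-in retries are set to 0",
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName": "my-glue-job"
  },
  "Retry": [
    {
      "ErrorEquals": [
        "States.TaskFailed"
      ],
      "IntervalSeconds": 30,
      "MaxAttempts": 2,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": [
        "States.ALL"
      ],
      "Next": "Notify Failure"
    }
  ],
  "ResultPath": null,
  "Next": "Notify Success"
}
```

With these values the first retry would start about 30 seconds after the failure and the second about 60 seconds after the next one. The Catch branch to "Notify Failure" only runs once the retries are exhausted, so a run that succeeds on a retry still goes on to "Notify Success".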

AWS EXPERT
answered a year ago
