Step Function GlueJob "startRunJob.sync" ignores retries by the glue job when determining success/failure of the task

0

We have a step function for orchestrating various Glue jobs. The Glue jobs have a built-in retry mechanism and we currently have them set to retry once. In the case where the job fails the first time but succeeds on the Glue Job retry (not the SFN task retry), the step function thinks the task has already failed.

Here's an example of the task as defined in SFN:

{
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName.$": ...,
    "Arguments": {
       ...
    }
  },
  "Next": "Notify Success",
  "ResultPath": null,
  "Catch": [
    {
      "ErrorEquals": [
        "States.ALL"
      ],
      "Next": "Notify Failure"
    }
  ]
}

The job fails, and even has "Attempt": 0 in the cause field of the failure event. Is there a way to "Catch" on this? Or another method of having the step function wait for the Glue Job to complete its retries?

We could have the SFN manage all of the retries, but I'd rather not do that as there's a lot of delay between SFN<>Glue.

posta un anno fa222 visualizzazioni
1 Risposta
0

Normally when you use Step functions, you handle the retries in the state engine and not using the job built-in retries.
Also, doing it that way allows you better control (for instance, exponential back off)

profile pictureAWS
ESPERTO
con risposta un anno fa

Accesso non effettuato. Accedi per postare una risposta.

Una buona risposta soddisfa chiaramente la domanda, fornisce un feedback costruttivo e incoraggia la crescita professionale del richiedente.

Linee guida per rispondere alle domande