Step Function GlueJob "startRunJob.sync" ignores retries by the glue job when determining success/failure of the task

0

We have a step function for orchestrating various Glue jobs. The Glue jobs have a built-in retry mechanism and we currently have them set to retry once. In the case where the job fails the first time but succeeds on the Glue Job retry (not the SFN task retry), the step function thinks the task has already failed.

Here's an example of the task as defined in SFN:

{
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName.$": ...,
    "Arguments": {
       ...
    }
  },
  "Next": "Notify Success",
  "ResultPath": null,
  "Catch": [
    {
      "ErrorEquals": [
        "States.ALL"
      ],
      "Next": "Notify Failure"
    }
  ]
}

The job fails, and even has "Attempt": 0 in the cause field of the failure event. Is there a way to "Catch" on this? Or another method of having the step function wait for the Glue Job to complete its retries?

We could have the SFN manage all of the retries, but I'd rather not do that as there's a lot of delay between SFN<>Glue.

asked a year ago215 views
1 Answer
0

Normally when you use Step functions, you handle the retries in the state engine and not using the job built-in retries.
Also, doing it that way allows you better control (for instance, exponential back off)

profile pictureAWS
EXPERT
answered a year ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions