Step Functions Glue task "startJobRun.sync" ignores the Glue job's own retries when determining success/failure of the task


We have a step function for orchestrating various Glue jobs. The Glue jobs have a built-in retry mechanism and we currently have them set to retry once. In the case where the job fails the first time but succeeds on the Glue Job retry (not the SFN task retry), the step function thinks the task has already failed.

Here's an example of the task as defined in SFN:

{
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName.$": ...,
    "Arguments": {
       ...
    }
  },
  "Next": "Notify Success",
  "ResultPath": null,
  "Catch": [
    {
      "ErrorEquals": [
        "States.ALL"
      ],
      "Next": "Notify Failure"
    }
  ]
}

The job fails, and even has "Attempt": 0 in the cause field of the failure event. Is there a way to "Catch" on this? Or another method of having the step function wait for the Glue Job to complete its retries?

We could have the SFN manage all of the retries, but I'd rather not do that as there's a lot of delay between SFN<>Glue.

Asked 1 year ago, viewed 222 times
1 answer

Normally, when you use Step Functions, you handle retries in the state machine rather than relying on the job's built-in retries.
Doing it that way also gives you finer control (for instance, exponential backoff).
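
As a rough sketch of what that could look like, a "Retry" field can sit next to the existing "Catch" field in the task definition from the question. The error name and the interval, attempt count, and backoff values below are illustrative assumptions, not values taken from the original setup:

{
  "Retry": [
    {
      "ErrorEquals": [
        "States.TaskFailed"
      ],
      "IntervalSeconds": 30,
      "MaxAttempts": 1,
      "BackoffRate": 2.0
    }
  ]
}

Step Functions works through the "Retry" entries first and only falls through to "Catch" once the attempts are used up, so "Notify Failure" would run only after the final attempt fails. With this approach you would probably also want to set the Glue job's own retry count to 0 so the two retry mechanisms do not stack.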

AWS Expert
Answered 1 year ago
