Step Functions Glue task "startJobRun.sync" ignores the Glue job's own retries when determining success/failure of the task

We have a step function that orchestrates various Glue jobs. The Glue jobs have a built-in retry mechanism, and we currently have them set to retry once. When a job fails on its first attempt but succeeds on the Glue job retry (not the SFN task retry), the step function has already marked the task as failed.

Here's an example of the task as defined in SFN:

{
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName.$": ...,
    "Arguments": {
       ...
    }
  },
  "Next": "Notify Success",
  "ResultPath": null,
  "Catch": [
    {
      "ErrorEquals": [
        "States.ALL"
      ],
      "Next": "Notify Failure"
    }
  ]
}

The task fails, and the cause field of the failure event even shows "Attempt": 0. Is there a way to "Catch" on this? Or is there another way to have the step function wait for the Glue job to complete its retries?

We could have the SFN manage all of the retries, but I'd rather not do that as there's a lot of delay between SFN<>Glue.

Asked 1 year ago · 222 views

1 Answer

Normally, when you use Step Functions, you handle retries in the state machine rather than relying on the job's built-in retries.
Doing it that way also gives you better control over the retry behavior (for instance, exponential backoff).
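
As a rough sketch of what that could look like, here is the task from the question with a "Retry" block added. The job name `my-glue-job` and the interval, attempt, and backoff values are placeholders chosen for illustration, not recommendations, and the sketch assumes the Glue job's own retry count is set to 0 so the two mechanisms don't stack:

```json
{
  "Comment": "Illustrative sketch only: the job name and retry values are assumptions",
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": {
    "JobName": "my-glue-job"
  },
  "Retry": [
    {
      "ErrorEquals": [
        "States.TaskFailed"
      ],
      "IntervalSeconds": 30,
      "MaxAttempts": 2,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": [
        "States.ALL"
      ],
      "Next": "Notify Failure"
    }
  ],
  "ResultPath": null,
  "Next": "Notify Success"
}
```

With these values, a failed run would be retried after about 30 seconds and then again after about 60 seconds. The "Catch" only routes to "Notify Failure" once those retries are exhausted.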

AWS Expert, answered 1 year ago
