Step Function Error Handling


Want to Redrive from Failed State SendCompletionNotification

Hello team,

Can we implement redrive logic to execute from a particular state?

My approach: if any state fails, it goes to the catch Lambda, which processes the error payload and checks whether the error is due to business validation. If it is, the execution is allowed to proceed. If the error is anything other than business validation, the execution fails.

Is it possible to redrive from a particular failed state? For instance, if the state 'CreateCandidateId' failed due to a dependency outage, is it possible to redrive from that specific state?

  • please accept the answer if it was useful

asked 2 months ago, 260 views
1 Answer
  1. Define catchers and retry policies in your state machine to handle different types of errors. For example, you can catch specific errors and decide whether to retry or proceed to a fallback state.

  2. Use a catch Lambda function to process the error payload and determine the type of error (e.g., business validation or dependency outage). Based on the error type, you can decide the next steps.

  3. Use a Choice state after the catch Lambda to determine the flow based on the error type. If the error is due to a dependency outage, you can redrive from the failed state. If the error is due to business validation, you can fail the execution or proceed as needed.

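  4. Pass the state information and error details to the catch Lambda so that it knows which state failed and why. Context object fields are referenced with a double-dollar prefix, for example $$.Execution.Input and $$.State.Name, in the Parameters of the catch Lambda task. Note that $$.State.Name resolves to the state that is currently running, so the name of the failed state has to be recorded separately, for example by giving each task its own catch path.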

  5. Implement a manual intervention mechanism to redrive from a particular state. This can be achieved using an external system or a Step Functions callback pattern where you trigger a new execution starting from the failed state with the required input.

Here is an example of how to structure your state machine JSON definition:

{
  "Comment": "A sample state machine to demonstrate redrive logic",
  "StartAt": "CreateCandidateId",
  "States": {
    "CreateCandidateId": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:region:account-id:function:createCandidateId",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "CatchLambda"
        }
      ],
      "End": true
    },
    "CatchLambda": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:region:account-id:function:catchLambda",
      "ResultPath": "$.catchResult",
      "Next": "ErrorHandler"
    },
    "ErrorHandler": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.catchResult.errorType",
          "StringEquals": "BusinessValidationError",
          "Next": "BusinessValidationState"
        },
        {
          "Variable": "$.catchResult.errorType",
          "StringEquals": "DependencyOutage",
          "Next": "RetryCreateCandidateId"
        }
      ],
      "Default": "FailState"
    },
    "RetryCreateCandidateId": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:region:account-id:function:createCandidateId",
      "Retry": [
        {
          "ErrorEquals": ["States.ALL"],
          "IntervalSeconds": 10,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "ResultPath": "$.error",
          "Next": "FailState"
        }
      ],
      "End": true
    },
    "BusinessValidationState": {
      "Type": "Pass",
      "Result": "Business validation passed, continuing workflow",
      "End": true
    },
    "FailState": {
      "Type": "Fail",
      "Error": "WorkflowFailed",
      "Cause": "State machine execution failed due to an error"
    }
  }
}

Explanation:

  1. CreateCandidateId: This is the initial state that might fail.
  2. CatchLambda: This Lambda function processes the error payload.
  3. ErrorHandler: This Choice state determines the next steps based on the error type.
  4. RetryCreateCandidateId: This state retries the CreateCandidateId task if the error is due to a dependency outage.
  5. BusinessValidationState: This state handles business validation errors.
  6. FailState: This state is reached if the error cannot be handled.
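
As a rough illustration of the CatchLambda step, here is a minimal Python sketch of a handler that reads the error stored at $.error by the Catch block and returns the errorType value that the ErrorHandler Choice state compares against. The set of business error names is hypothetical and only for illustration; the real names depend on what your tasks raise. The sketch also assumes the Cause field may or may not be a JSON string.

```python
import json

# Hypothetical business error names; replace with whatever your tasks raise.
BUSINESS_ERRORS = {"BusinessValidationError"}
# Error names treated as dependency outages in this sketch (an assumption).
DEPENDENCY_ERRORS = {
    "DependencyOutage",
    "States.Timeout",
    "Lambda.ServiceException",
    "Lambda.TooManyRequestsException",
}


def classify(error_name):
    """Map a raw error name to the values the Choice state checks."""
    if error_name in BUSINESS_ERRORS:
        return "BusinessValidationError"
    if error_name in DEPENDENCY_ERRORS:
        return "DependencyOutage"
    return "Unknown"  # falls through to the Choice state's Default


def parse_cause(cause):
    """Cause may be a JSON string or plain text; fall back to the raw value."""
    try:
        parsed = json.loads(cause)
    except (TypeError, ValueError):
        return {"errorMessage": cause}
    return parsed if isinstance(parsed, dict) else {"errorMessage": cause}


def lambda_handler(event, context):
    # The Catch block writes {"Error": ..., "Cause": ...} to $.error.
    error = event.get("error", {})
    details = parse_cause(error.get("Cause"))
    return {
        "errorType": classify(error.get("Error", "")),
        "originalError": error.get("Error"),
        "message": details.get("errorMessage"),
    }
```

Because CatchLambda uses "ResultPath": "$.catchResult", the returned dictionary ends up under $.catchResult, which is where the Choice state looks for errorType.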

EXPERT
answered 2 months ago
  • Hello Oleksii,

    From the First Catch Lambda, I will be pushing the message to SQS (Dead Letter Queue) with all the required details for redrive, such as stateName, executionId, executionArn, and stateInput. After I fix the issue, like a code bug or anything else that requires manual intervention, I want to perform redrive on this Dead Letter Queue, which will push the message into the main SQS. This main SQS will trigger a Lambda function, which will redrive the state machine.

    My concern is that there will be two failed states: one is the Fail state, and the second can be any Task state. Is it possible to redrive from a Task state? When I redrove the execution from the AWS console, it resumed from the Fail state.
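
For reference, the DLQ-to-redrive flow described in this comment could look roughly like the Python sketch below. It assumes the message body carries the fields listed above (stateName, executionId, executionArn, stateInput) and that the Lambda is triggered by the main SQS queue. The helper names are hypothetical. The Step Functions RedriveExecution call takes an execution ARN rather than a target state name, so where the execution resumes depends on where it stopped. The sketch does not change the behavior observed in the console.

```python
import json


def build_dlq_message(state_name, execution_id, execution_arn, state_input):
    """Serialize the redrive details the catch Lambda would push to the DLQ."""
    return json.dumps({
        "stateName": state_name,
        "executionId": execution_id,
        "executionArn": execution_arn,
        "stateInput": state_input,
    })


def redrive_from_sqs(event, sfn_client):
    """Call RedriveExecution for each execution named in an SQS event."""
    results = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        response = sfn_client.redrive_execution(
            executionArn=body["executionArn"]
        )
        results.append({
            "executionArn": body["executionArn"],
            "stateName": body.get("stateName"),
            "redriveDate": str(response.get("redriveDate")),
        })
    return results


def lambda_handler(event, context):
    # Imported lazily so the helpers above can be exercised without AWS access.
    import boto3

    return redrive_from_sqs(event, boto3.client("stepfunctions"))
```

Passing the client in as an argument keeps the SQS parsing logic testable with a stub before it is wired to a real queue.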
