SageMaker Pipeline FailStep Error Message Not Shown

I created a SageMaker pipeline using the Python SDK. It contains a FailStep with a custom error message, which is part of a ConditionStep, as seen at the bottom of this post.

Whenever the pipeline execution fails because the trained model's accuracy is lower than the threshold, the FailStep's custom error message is not displayed anywhere: not in the stdout of the console where I'm running the pipeline script, not in the CloudWatch logs, and nowhere in the AWS SageMaker console. The pipeline execution simply fails with a generic WaiterError message:

botocore.exceptions.WaiterError: Waiter PipelineExecutionComplete failed: Waiter encountered a terminal failure state: For expression "PipelineExecutionStatus" we matched expected path: "Failed"

As a result, I have no way of knowing why the pipeline failed. What am I missing here? Where can I find the FailStep message at runtime?

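from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.functions import Join

# accuracy_threshold, cond_lte and the model steps are defined elsewhere in the pipeline script (not shown).
step_fail = FailStep(
    name="AccuracyFailStep",
    error_message=Join(on=" ", values=["Execution failed due to binary accuracy < ", accuracy_threshold]),
)

step_cond = ConditionStep(
    name="CheckAccuracyEvaluationStep",
    conditions=[cond_lte],
    if_steps=[step_create_model, step_register_model, step_deploy_model],
    else_steps=[step_fail],
)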
asked a year ago · 1596 views
1 Answer

Once the FailStep is reached, the execution fails and the error message is set as the failure reason. To be more specific, this step first fails the pipeline execution, which is what causes the waiter to report a terminal failure state. It then records the message you provided as the failure reason in the metadata of this execution.

This failure reason field is available in the response when you call the DescribePipelineExecution API, as described in https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribePipelineExecution.html#API_DescribePipelineExecution_ResponseSyntax

AWS
answered a year ago
  • The describe method does not return any specific reason for the pipeline execution failure. The FailureReason field only has this value: "Step failure: One or multiple steps failed." There is no information whatsoever about which step failed or why. Where, then, is the metadata you mentioned that contains the error messages raised by the pipeline execution at runtime?

  • It seems the step failure reason is not propagated as the pipeline failure reason. Could you please try the list-pipeline-execution-steps API (https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-pipeline-execution-steps.html) to see whether the failed step shows the provided error message?

  • Hi, yes, I can confirm that. I was actually about to post an answer to my own question after finding out that the list_steps function does return the execution metadata for all steps, including the status and, in case of failure, the error message. Thank you.
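
For anyone landing here later, the approach from the comments above can be sketched roughly as follows. This is a minimal sketch, not an official recipe: the helper names `failed_step_messages` and `fetch_failed_step_messages` are made up for illustration, and it assumes that the Fail step is reported with `StepStatus` set to `"Failed"` and that its custom message appears under `Metadata.Fail.ErrorMessage` in the ListPipelineExecutionSteps response, falling back to the step's `FailureReason` otherwise.

```python
from typing import Any, Dict, List


def failed_step_messages(steps: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map each failed step's name to the most specific message available.

    `steps` is expected to look like the "PipelineExecutionSteps" list
    returned by ListPipelineExecutionSteps (hypothetical helper).
    """
    messages: Dict[str, str] = {}
    for step in steps:
        if step.get("StepStatus") != "Failed":
            continue
        # Assumption: a FailStep stores its custom message under
        # Metadata.Fail.ErrorMessage; other steps only have FailureReason.
        fail_meta = (step.get("Metadata") or {}).get("Fail") or {}
        messages[step["StepName"]] = (
            fail_meta.get("ErrorMessage")
            or step.get("FailureReason")
            or "no failure reason recorded"
        )
    return messages


def fetch_failed_step_messages(pipeline_execution_arn: str) -> Dict[str, str]:
    """Collect all steps of one execution via boto3 and summarize failures."""
    import boto3  # imported lazily so the helper above works without AWS access

    client = boto3.client("sagemaker")
    steps: List[Dict[str, Any]] = []
    kwargs: Dict[str, Any] = {"PipelineExecutionArn": pipeline_execution_arn}
    while True:
        page = client.list_pipeline_execution_steps(**kwargs)
        steps.extend(page["PipelineExecutionSteps"])
        token = page.get("NextToken")
        if not token:
            return failed_step_messages(steps)
        kwargs["NextToken"] = token  # follow pagination until exhausted
```

A typical use would be to catch the `WaiterError` raised while waiting on the execution and then call `fetch_failed_step_messages` with the execution's ARN. With the SageMaker Python SDK, the list returned by `execution.list_steps()` should be usable as the `steps` argument of `failed_step_messages` directly, though that is worth verifying against your SDK version.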
