SageMaker Pipeline FailStep Error Message Not Shown

I created a SageMaker pipeline with the Python SDK. It contains a FailStep with a custom error message, which is part of a ConditionStep, as shown at the bottom of this post.

Whenever the pipeline execution fails because the trained model's accuracy is below the threshold, the FailStep's custom error message is not displayed anywhere: not in the stdout console where I'm running the pipeline script, not in the CloudWatch logs, and nowhere in the AWS SageMaker console. The pipeline execution simply fails with a generic WaiterError message:

botocore.exceptions.WaiterError: Waiter PipelineExecutionComplete failed: Waiter encountered a terminal failure state: For expression "PipelineExecutionStatus" we matched expected path: "Failed"

As a result, I have no way of knowing why the pipeline failed. What am I missing here? Where can I find the FailStep message at runtime?

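from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.functions import Join

# accuracy_threshold, cond_lte and the three if-branch steps are defined
# earlier in the pipeline script (not shown here).
step_fail = FailStep(
    name="AccuracyFailStep",
    error_message=Join(on=" ", values=["Execution failed due to binary accuracy < ", accuracy_threshold]),
)

# Run the model steps when the condition holds; otherwise fail the pipeline.
step_cond = ConditionStep(
    name="CheckAccuracyEvaluationStep",
    conditions=[cond_lte],
    if_steps=[step_create_model, step_register_model, step_deploy_model],
    else_steps=[step_fail],
)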
Asked 1 year ago · 1613 views
1 Answer

Once the FailStep is reached, the execution fails and the error message is set as the failure reason. More specifically, the step first fails the pipeline execution, which puts the waiter into a terminal failure state and raises the WaiterError. It then records the message you provided as the failure reason in the metadata of that execution.

This failure reason field is available in the response when you call the DescribePipelineExecution API, as described in https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribePipelineExecution.html#API_DescribePipelineExecution_ResponseSyntax

AWS
Answered 1 year ago
  • The describe method does not return any specific reason for the pipeline execution failure. The FailureReason field only contains this value: "'Step failure: One or multiple steps failed.'" There is no information about which step failed or why. Where, then, is the metadata you mentioned that contains the error messages raised during the pipeline's execution?

  • It seems the step's failure reason is not propagated to the pipeline's failure reason. Could you please try the https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-pipeline-execution-steps.html API to see whether the failed step shows the error message you provided?

  • Hi, yes, I can confirm that. I was actually about to post an answer to my own question after finding out that the list_steps function does return the execution metadata of every step in the pipeline, such as its status and, in case of failure, its error message. Thank you.
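
For anyone landing here later, the approach confirmed in the comments could be sketched roughly as follows. This is a minimal sketch, not an official recipe: the helper name failed_step_messages is made up for illustration, and it assumes that a failed FailStep exposes its message under Metadata -> Fail -> ErrorMessage in the ListPipelineExecutionSteps response, falling back to the step's FailureReason field when that key is absent.

```python
def failed_step_messages(sm_client, execution_arn):
    """Return {step_name: message} for every failed step of one execution.

    sm_client is expected to behave like boto3.client("sagemaker");
    failed_step_messages itself is a hypothetical helper, not an SDK function.
    """
    messages = {}
    kwargs = {"PipelineExecutionArn": execution_arn}
    while True:
        response = sm_client.list_pipeline_execution_steps(**kwargs)
        for step in response.get("PipelineExecutionSteps", []):
            if step.get("StepStatus") != "Failed":
                continue
            # Assumption: a FailStep reports its custom message here.
            fail_meta = step.get("Metadata", {}).get("Fail", {})
            messages[step["StepName"]] = (
                fail_meta.get("ErrorMessage") or step.get("FailureReason", "")
            )
        # Follow pagination until the service stops returning a token.
        token = response.get("NextToken")
        if not token:
            return messages
        kwargs["NextToken"] = token
```

In the pipeline script this might be called from an except block around execution.wait(), for example failed_step_messages(boto3.client("sagemaker"), execution.arn), so that the custom message is printed instead of only the generic WaiterError. The SDK's execution.list_steps() returns the same step records if you prefer not to use boto3 directly.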
