SageMaker Pipeline FailStep Error Message Not Shown


I created a SageMaker pipeline with the Python SDK. It contains a FailStep with a custom error message, which is part of a ConditionStep, as shown at the bottom of this post.

Whenever the pipeline execution fails because the trained model's accuracy is below the threshold, the FailStep's custom error message is not displayed anywhere: not in the stdout console where I'm running the pipeline script, not in the CloudWatch logs, and nowhere in the AWS SageMaker console. The pipeline execution simply fails with a generic WaiterError message:

botocore.exceptions.WaiterError: Waiter PipelineExecutionComplete failed: Waiter encountered a terminal failure state: For expression "PipelineExecutionStatus" we matched expected path: "Failed"

As a result, I have no way to know why the pipeline failed. What am I missing here? Where can I find the FailStep message at runtime?

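from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.fail_step import FailStep
from sagemaker.workflow.functions import Join

# accuracy_threshold, cond_lte and the three if_steps are defined elsewhere in the pipeline script.
step_fail = FailStep(
    name="AccuracyFailStep",
    error_message=Join(on=" ", values=["Execution failed due to binary accuracy < ", accuracy_threshold]),
)

step_cond = ConditionStep(
    name="CheckAccuracyEvaluationStep",
    conditions=[cond_lte],
    if_steps=[step_create_model, step_register_model, step_deploy_model],
    else_steps=[step_fail],
)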
Asked a year ago, 1612 views
1 answer

Once the FailStep is reached, the execution fails and the error message is set as the failure reason. More specifically, the step first fails the pipeline execution, which is why the waiter reports a terminal "Failed" state. It then records your provided message as the failure reason in the metadata of this execution.

This failure reason field is available in the response when you call the DescribePipelineExecution API, as described in https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribePipelineExecution.html#API_DescribePipelineExecution_ResponseSyntax

AWS
Answered a year ago
  • The describe method does not return any specific reason why the pipeline execution failed. The FailureReason field only has this value: "Step failure: One or multiple steps failed." It gives no information about which step failed or why. Where, then, is the metadata you mentioned that contains the error messages raised during the pipeline's execution?

  • It seems the step failure reason is not propagated as the pipeline failure reason. Could you please try the ListPipelineExecutionSteps API (https://docs.aws.amazon.com/cli/latest/reference/sagemaker/list-pipeline-execution-steps.html) to see whether the fail step shows the provided error message?

  • Hi, yes, I can confirm that. I was actually about to post an answer to my own question after finding out that the list_steps function does return the execution metadata for all steps, such as the status and, in case of failure, the error message. Thank you.
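
For anyone landing here later, the resolution above can be sketched roughly as follows. This is a minimal sketch, not an official recipe: the helper name failed_step_messages and the sample data are made up for illustration, and it assumes each step entry carries "StepName", "StepStatus", "FailureReason" and, for a FailStep, "Metadata" -> "Fail" -> "ErrorMessage", as in the ListPipelineExecutionSteps response. Check those field names against the response you actually get.

```python
from typing import Any, Dict, List, Tuple


def failed_step_messages(steps: List[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Return (step name, message) for every step whose status is "Failed".

    Hypothetical helper: prefers the FailStep error message from the step
    metadata and falls back to the step-level FailureReason.
    """
    failures = []
    for step in steps:
        if step.get("StepStatus") != "Failed":
            continue
        fail_meta = step.get("Metadata", {}).get("Fail", {})
        message = fail_meta.get("ErrorMessage") or step.get("FailureReason", "")
        failures.append((step.get("StepName", "<unnamed>"), message))
    return failures


# Hand-written sample shaped like the step list (illustrative, not real output).
sample_steps = [
    {"StepName": "TrainStep", "StepStatus": "Succeeded", "Metadata": {}},
    {
        "StepName": "AccuracyFailStep",
        "StepStatus": "Failed",
        "Metadata": {"Fail": {"ErrorMessage": "accuracy below threshold"}},
    },
]

for name, message in failed_step_messages(sample_steps):
    print(f"{name}: {message}")

# Against a real execution (needs AWS credentials; `pipeline` is assumed to be
# your sagemaker.workflow.pipeline.Pipeline object):
#
#   from botocore.exceptions import WaiterError
#   execution = pipeline.start()
#   try:
#       execution.wait()
#   except WaiterError:
#       print(failed_step_messages(execution.list_steps()))
```

Catching the WaiterError and then listing the steps keeps the script from dying on the generic waiter message before the step-level reason can be read.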
