How can I pass my own failure information when a SageMaker Training Job fails?

Good day!

My main goal:

An easy way to collect information about different failure scenarios in a SageMaker Training Job.

What do I use currently?

SageMaker SKLearn Estimator (the Training Jobs run inside it)

Where will my model train?

On different datasets. So I need to control all training processes and collect information about each of them, including their final statuses on the different datasets.

Which failure scenarios do I have?

There are plenty of them. I have created my own Python exceptions for them.
For example:

  1. There are labels for only one class.
  2. The dataset is too small (by my own criteria).
  3. Data is missing for crucial columns.
  4. etc.
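
For illustration, scenarios like these could be modelled as a small exception hierarchy in which every error carries a short machine-readable code. This is only a sketch: `TrainingDataError`, `SingleClassLabels`, `DatasetTooSmall`, `MissingCrucialColumns`, the `code` attribute, and `validate_dataset` with its default thresholds are hypothetical names and values, not taken from the actual training script.

```python
class TrainingDataError(Exception):
    """Hypothetical base class for dataset problems that should fail the job."""
    code = "TRAINING_DATA_ERROR"

    def __init__(self, detail=""):
        # Put the code first so it survives if the message is truncated later.
        super().__init__(f"{self.code}: {detail}" if detail else self.code)
        self.detail = detail


class SingleClassLabels(TrainingDataError):
    code = "SINGLE_CLASS_LABELS"


class DatasetTooSmall(TrainingDataError):
    code = "DATASET_TOO_SMALL"


class MissingCrucialColumns(TrainingDataError):
    code = "MISSING_CRUCIAL_COLUMNS"


def validate_dataset(rows, label_key="label", required=("label",), min_rows=10):
    """Raise the error for the first failed check; return None if all pass.

    `rows` is assumed to be a list of dicts, one per record.
    """
    if len(rows) < min_rows:
        raise DatasetTooSmall(f"{len(rows)} rows, need at least {min_rows}")
    missing = [c for c in required if any(r.get(c) is None for r in rows)]
    if missing:
        raise MissingCrucialColumns(", ".join(missing))
    if len({r.get(label_key) for r in rows}) < 2:
        raise SingleClassLabels("labels contain a single class")
```

A message that starts with a stable code is easier to match later than a free-form traceback.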

Where am I stuck?

After a failed training job, I can't get my own errors from the training job response. All of them are reported as "ExecuteUserScriptError", and I can't pass my own information in FailureReason or ErrorMessage (it is always empty). I can see which error was raised in CloudWatch Logs and in the training job traceback (from the SageMaker notebook). So the bad solution is to parse all CloudWatch Logs whenever a job fails.
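
For context, this is roughly how the status and FailureReason of several jobs could be collected once they carry useful text. It is a sketch that relies on the boto3 SageMaker client's `describe_training_job` call; the helper name `collect_job_outcomes` and the job name in the usage comment are placeholders.

```python
def collect_job_outcomes(sagemaker_client, job_names):
    """Return {job_name: (status, failure_reason)} for each training job.

    `sagemaker_client` is expected to behave like boto3.client("sagemaker").
    """
    outcomes = {}
    for name in job_names:
        desc = sagemaker_client.describe_training_job(TrainingJobName=name)
        # FailureReason is only present when the job has failed.
        outcomes[name] = (desc["TrainingJobStatus"], desc.get("FailureReason", ""))
    return outcomes


# Usage (needs AWS credentials; the job name is a placeholder):
# import boto3
# print(collect_job_outcomes(boto3.client("sagemaker"), ["my-training-job"]))
```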

**Question: How can I provide my own ErrorMessage or FailureReason?**

Maybe I am digging in the wrong direction. Either way, I would appreciate your advice. Thank you for the opportunity to ask here.

4 Answers
In: sklearn_estimator.latest_training_job.describe()['FailureReason']

Out:

Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 39, in main
    train(environment.Environment())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 35, in train
    runner_type=runner.ProcessRunnerType)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
    wait, capture_error
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 291, in run
    cwd=environment.code_dir,
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 208, in check_error
    info=extra_info,
sagemaker_training.errors.ExecuteUserScriptError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/miniconda3/bin/python train.py"

ExecuteUserScriptErro
answered 2 years ago
  • Look, ErrorMessage "" is empty.
    The FailureReason output is limited to 1024 characters. I also looked at the full text, and it is useless too, so I have no way to get any failure information.

  • This happens even when the failure was caused by my own failure scenario and error. This is the trace from the failed training output in the SageMaker notebook:

        raise NoPositiveLabels
    custom_errors.NoPositiveLabels
    2022-05-18 10:11:09,864 sagemaker-containers ERROR    Reporting training FAILURE
    2022-05-18 10:11:09,865 sagemaker-containers ERROR    framework error:
Hello, is the reported problem similar to this issue reported on the SageMaker Python SDK project?

https://github.com/aws/sagemaker-python-sdk/issues/1952

AWS
answered 2 years ago
  • Hello, thank you for your response! :)

    No, it's a different problem.

    In that issue the author uses ValueError when reading input data in input_fn, but the failure there is unlikely to be due to that error (reading the data took too long, 10 hours). Even if it was, my question is different.

    How can I pass training job failure information via FailureReason, ErrorMessage, or other parameters? I cause the failure by raising my own error, and I want to understand how this information can be passed and collected.

Hi, did you try writing to /opt/ml/output/failure as per the doc here?

It's worth mentioning that in the past there was a bug that overwrote this file in the base training toolkit that powers "script mode" containers. This got resolved at source per the linked issue, but I guess there's a chance older containers or frameworks which customize this tool could still be affected? So may be worth upgrading your framework version if you're using an older one.
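
A minimal sketch of that idea: wrap the training entry point, write a short description of the exception to /opt/ml/output/failure, and exit with a non-zero code. The helper names `report_failure` and `run_training` are made up for illustration, and whether the text actually reaches FailureReason in a script mode container depends on the toolkit version, as noted above. The 1024-character cap mirrors the limit mentioned earlier in this thread.

```python
import traceback
from pathlib import Path

FAILURE_FILE = Path("/opt/ml/output/failure")
MAX_REASON_CHARS = 1024  # FailureReason limit reported earlier in this thread


def report_failure(exc, failure_file=FAILURE_FILE):
    """Write a description of `exc` to the failure file, most useful part first."""
    summary = f"{type(exc).__name__}: {exc}"
    details = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    # Summary goes first so it survives truncation of a long traceback.
    text = f"{summary}\n{details}"[:MAX_REASON_CHARS]
    failure_file.parent.mkdir(parents=True, exist_ok=True)
    failure_file.write_text(text)
    return text


def run_training(train_fn, failure_file=FAILURE_FILE):
    """Run `train_fn`; on any exception record it and return exit code 1."""
    try:
        train_fn()
    except Exception as exc:
        report_failure(exc, failure_file)
        return 1
    return 0


# In train.py this might be used as (main is your own training function):
# if __name__ == "__main__":
#     import sys
#     sys.exit(run_training(main))
```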

AWS
EXPERT
Alex_T
answered 2 years ago
  • Hi, Alex! Your comment is really worthwhile! I will add my answer with attachments below

This issue looks related to my problem. There are two sagemaker-scikit-learn versions: 0.23-1 and 0.20-0.
I use the image 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3, sagemaker version 2.86.0 (I tried to upgrade to 2.90.0 with pip in the terminal, but the notebooks still show the previous version, 2.86), and Python 3.7.

How can I tell which versions I should use to get rid of the ErrorMessage problem?
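
One possible reason the notebook still reports 2.86 is that pip in the terminal installs into a different Python environment than the one the notebook kernel runs; that is an assumption, not something confirmed in this thread. A small check that could be run in a notebook cell to see which interpreter and package version the kernel actually uses (`installed_version` is a hypothetical helper name):

```python
import sys
from importlib import metadata  # Python 3.8+; on 3.7 the importlib_metadata backport has the same API


def installed_version(package):
    """Return the version of `package` visible to this interpreter, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("interpreter:", sys.executable)
print("sagemaker:", installed_version("sagemaker"))
```

This only shows the SDK version on the notebook side; the toolkit inside the training container is determined by the image tag, not by the local pip install.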

answered 2 years ago
