How do I report my own failure information when a SageMaker training job fails?


Good day!

My main purpose:

An easy way to collect information about different failure scenarios in SageMaker training jobs.

What do I use currently?

The SageMaker SKLearn Estimator (the training jobs run inside it).

Where will my model train?

On different datasets. So I need to track and collect information about every training run and its final status on each dataset.

Which failure scenarios do I have?

There are plenty of them. I have created my own Python exceptions for them.
For example:

  1. Labels exist for only one class.
  2. The dataset is too small (by my own criteria).
  3. Data is missing in crucial columns.
  4. etc.
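
For illustration, the scenarios above could be modelled as a small exception hierarchy with a validation step that runs before training. This is only a sketch: `NoPositiveLabels` is the name that appears in the traceback further down, while `TrainingDataError`, `DatasetTooSmall`, `MissingCrucialData`, `validate` and its parameters are hypothetical names chosen for the example.

```python
class TrainingDataError(Exception):
    """Hypothetical base class for data problems that should fail the job."""


class NoPositiveLabels(TrainingDataError):
    """Labels contain only one class."""


class DatasetTooSmall(TrainingDataError):
    """Dataset has fewer rows than the chosen minimum."""


class MissingCrucialData(TrainingDataError):
    """A required column contains missing values."""


def validate(rows, label_key="label", required=("feature",), min_rows=10):
    """Raise a specific TrainingDataError if the rows are not trainable.

    `rows` is a list of dicts; the thresholds are illustrative assumptions.
    """
    if len(rows) < min_rows:
        raise DatasetTooSmall(f"{len(rows)} rows, need at least {min_rows}")
    for col in required:
        missing = sum(1 for r in rows if r.get(col) is None)
        if missing:
            raise MissingCrucialData(f"column {col!r} has {missing} missing values")
    if len({r[label_key] for r in rows}) < 2:
        raise NoPositiveLabels("labels contain only one class")
```

Giving every error a common base class makes it possible to catch all expected data failures in one place and still tell them apart by class name.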

Where am I stuck?

After a failed training run I can't get my own errors from the training job response. All of them are reported as "ExecuteUserScriptError", and I can't pass my own information in FailureReason or ErrorMessage (it is always empty). I can see which error was raised in CloudWatch Logs and in the training job traceback (from a SageMaker notebook). So the bad solution is to parse all the CloudWatch logs whenever a job fails.

**Question: How can I provide my own ErrorMessage or FailureReason?**

Maybe I am digging in the wrong direction. Either way, I need your advice. Thank you so much for the opportunity to ask here.

4 Answers

In: sklearn_estimator.latest_training_job.describe()['FailureReason']

Out:

Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 39, in main
    train(environment.Environment())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 35, in train
    runner_type=runner.ProcessRunnerType)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
    wait, capture_error
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 291, in run
    cwd=environment.code_dir,
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 208, in check_error
    info=extra_info,
sagemaker_training.errors.ExecuteUserScriptError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/miniconda3/bin/python train.py"

ExecuteUserScriptErro
answered 2 years ago
  • Look, ErrorMessage "" is empty.
    The FailureReason output is limited to 1024 characters. But I looked at the full text and it is useless too. So I have no way to get any failure information.

  • This happens even when the failure was caused by one of my own failure scenarios and errors. This trace is from the failed training output in a SageMaker notebook:

        raise NoPositiveLabels
    custom_errors.NoPositiveLabels
    2022-05-18 10:11:09,864 sagemaker-containers ERROR    Reporting training FAILURE
    2022-05-18 10:11:09,865 sagemaker-containers ERROR    framework error:

Hello, is the reported problem similar to this issue reported on the SageMaker Python SDK project?

https://github.com/aws/sagemaker-python-sdk/issues/1952

AWS
answered 2 years ago
  • Hello, thank you for your response! :)

    No, it's a different problem.

    In that issue the author raises a ValueError in input_fn while reading input data, but the failure there was probably not caused by that error (reading the data took too long, 10 hours). Even if it was, my question is different.

    My question is: how do I pass failure information for a training job via FailureReason, ErrorMessage or other parameters? I trigger the failure with my own exception and want to understand how this information can be passed on and collected.
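
    On the collecting side, a minimal sketch of gathering the final status and FailureReason of several jobs could look like the code below. `describe_training_job`, `TrainingJobStatus` and `FailureReason` are part of the SageMaker API; the helper name `collect_failures` and the job names are hypothetical. It is only useful once FailureReason actually carries the custom message.

```python
def collect_failures(sm_client, job_names):
    """Return {job_name: (status, failure_reason)} for the given training jobs.

    `sm_client` is expected to behave like boto3.client("sagemaker").
    FailureReason is absent for jobs that did not fail, so it defaults to "".
    """
    results = {}
    for name in job_names:
        desc = sm_client.describe_training_job(TrainingJobName=name)
        results[name] = (desc["TrainingJobStatus"], desc.get("FailureReason", ""))
    return results


# Usage (assumes AWS credentials and existing job names, both hypothetical here):
#   import boto3
#   report = collect_failures(boto3.client("sagemaker"), ["job-a", "job-b"])
```

    Passing the client in as an argument keeps the helper easy to try out with a stub before pointing it at a real account.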


Hi, did you try writing to /opt/ml/output/failure as described in the documentation?

It's worth mentioning that in the past there was a bug in the base training toolkit that powers "script mode" containers, which overwrote this file. This was resolved at source per the linked issue, but there's a chance that older containers, or frameworks which customize this toolkit, could still be affected. So it may be worth upgrading your framework version if you're using an older one.
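
A minimal sketch of that approach, assuming the training entry point can be wrapped in a single function call: catch the exception, write a short description to /opt/ml/output/failure, then exit with a non-zero code. The path comes from the SageMaker documentation; `write_failure`, `run_with_failure_report` and `train_fn` are hypothetical names. Whether the text then shows up in FailureReason depends on the container not overwriting the file, as noted above.

```python
import os
import sys
import traceback

FAILURE_PATH = "/opt/ml/output/failure"  # failure file read by SageMaker


def write_failure(exc, path=FAILURE_PATH):
    """Write '<ExceptionClass>: <message>' to the failure file."""
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    with open(path, "w") as f:
        f.write(f"{type(exc).__name__}: {exc}")


def run_with_failure_report(train_fn, path=FAILURE_PATH):
    """Run train_fn; on any exception, record it and exit with code 1."""
    try:
        train_fn()
    except Exception as exc:
        write_failure(exc, path)
        traceback.print_exc()  # keep the full traceback in the job logs
        sys.exit(1)
```

In train.py this would be called as `run_with_failure_report(main)` under `if __name__ == "__main__":`. Keeping the exception class name at the start of the message matters because only the first 1024 characters of the file are returned as FailureReason.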

AWS
EXPERT
Alex_T
answered 2 years ago
  • Hi, Alex! Your comment is really worthwhile! I will add my answer with attachments below


This issue looks related to my problem. There are 2 sagemaker-scikit-learn versions: 0.23-1 and 0.20-0.
I use the image 683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3, sagemaker version 2.86.0 (I tried to upgrade to 2.90.0 with pip in the terminal, but notebooks still have the previous version, 2.86), and Python 3.7.

How can I tell which versions I should use to get rid of the ErrorMessage problem?

answered 2 years ago
