How to send own failure info in case of failed SageMaker Training Job?
Good day!
My main purpose:
An easy way to collect information about different failure scenarios in SageMaker training jobs.
What do I use currently?
The SageMaker SKLearn Estimator (the training jobs run inside it).
Where will my model train?
On different datasets. So I need to control and collect all information about all training processes and their final statuses on those datasets.
Which failure scenarios do I have?
There are plenty of them. I have created my own Python exceptions for them.
For example:
- There are labels only for one class.
- The dataset is too small (by my own criteria).
- Data is missing for crucial columns.
- etc.
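For illustration, the scenarios above could map to exception classes roughly like this. It is only a sketch: `NoPositiveLabels` is the one name that appears in the logs in this thread, while `TrainingDataError`, `DatasetTooSmall`, `MissingCrucialData` and `check_dataset` are hypothetical names, and the checks themselves are assumed, not the actual criteria.

```python
class TrainingDataError(Exception):
    """Hypothetical base class for dataset problems that should fail the job."""


class NoPositiveLabels(TrainingDataError):
    """Labels contain only one class."""


class DatasetTooSmall(TrainingDataError):
    """Dataset has fewer rows than the chosen minimum."""


class MissingCrucialData(TrainingDataError):
    """A crucial column has missing values."""


def check_dataset(rows, label_key, crucial_columns, min_rows):
    """Raise a specific exception for the first failure scenario found.

    rows is a list of dicts (one per record); the thresholds are assumptions.
    """
    if len(rows) < min_rows:
        raise DatasetTooSmall(f"{len(rows)} rows, need at least {min_rows}")
    for col in crucial_columns:
        if any(row.get(col) is None for row in rows):
            raise MissingCrucialData(f"missing values in column {col!r}")
    classes = {row.get(label_key) for row in rows}
    if len(classes) < 2:
        raise NoPositiveLabels(f"labels contain a single class: {sorted(classes)}")
```

A shared base class makes it possible to catch all of these scenarios in one place in the training script.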
Where am I stuck?
After a failed training job I can't get my own errors from the training job response. All of them are "ExecuteUserScriptError", and I can't pass my own info in FailureReason or ErrorMessage (it is always empty). I can see which error was raised in CloudWatch Logs and in the training job traceback (from the SageMaker notebook). So the bad solution is to parse all CloudWatch Logs in case of failure.
**Question: How can I provide my own ErrorMessage or FailureReason?**
Maybe I am digging in the wrong direction. Anyway, I need your advice. Thank you so much for the opportunity to ask for advice here :)
In: sklearn_estimator.latest_training_job.describe()['FailureReason']
Out:
```
Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 39, in main
    train(environment.Environment())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 35, in train
    runner_type=runner.ProcessRunnerType)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
    wait, capture_error
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 291, in run
    cwd=environment.code_dir,
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 208, in check_error
    info=extra_info,
sagemaker_training.errors.ExecuteUserScriptError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/miniconda3/bin/python train.py"
ExecuteUserScriptErro
```
This happens even though the failure was caused by my own failure scenario and exception. This is the trace from the failed training output in the SageMaker notebook:
```
raise NoPositiveLabels
custom_errors.NoPositiveLabels
2022-05-18 10:11:09,864 sagemaker-containers ERROR Reporting training FAILURE
2022-05-18 10:11:09,865 sagemaker-containers ERROR framework error:
```
Hello, is the reported problem similar to this issue reported on the SageMaker Python SDK project?
Hello, thank you for your response! :)
Nope, it's a different problem.
In that issue the author raises a ValueError while reading input data in input_fn, but the failure there is unlikely to be caused by that error (reading the data took too long, 10 hours). And even if it was, my question is different.
How can I pass the training job failure information via FailureReason, ErrorMessage or other parameters? I cause the failure by raising my own exception, and I want to understand how this information can be passed and collected.
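On the collecting side, a minimal sketch could look like the following. It assumes a boto3 SageMaker client and uses the `describe_training_job` call with the `TrainingJobStatus` and `FailureReason` response fields; the helper name `collect_job_outcomes` and the job names in the usage comment are hypothetical.

```python
def collect_job_outcomes(sm_client, job_names):
    """Return status and FailureReason for each training job name.

    sm_client is expected to behave like boto3.client("sagemaker").
    FailureReason is only present for failed jobs, hence the .get().
    """
    outcomes = []
    for name in job_names:
        desc = sm_client.describe_training_job(TrainingJobName=name)
        outcomes.append(
            {
                "job": name,
                "status": desc["TrainingJobStatus"],
                "failure_reason": desc.get("FailureReason", ""),
            }
        )
    return outcomes


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# outcomes = collect_job_outcomes(boto3.client("sagemaker"), ["my-job-1", "my-job-2"])
# failed = [o for o in outcomes if o["status"] == "Failed"]
```

This only helps once FailureReason actually carries the custom message, which is the open part of the question.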
Hi, did you try writing to /opt/ml/output/failure as per the doc here?
It's worth mentioning that in the past there was a bug that overwrote this file in the base training toolkit that powers "script mode" containers. This got resolved at source per the linked issue, but I guess there's a chance that older containers, or frameworks which customize this tool, could still be affected. So it may be worth upgrading your framework version if you're using an older one.
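A minimal sketch of that idea, assuming the entry point is a plain Python script such as train.py: catch the custom exception at the top level, write a short message to /opt/ml/output/failure, and still exit with a non-zero code. The helper names `write_failure_file` and `run_training` are hypothetical, and whether the file survives depends on the container version, as noted above. The message is cut to 1024 characters because FailureReason is limited to that length.

```python
import os
import traceback

FAILURE_PATH = "/opt/ml/output/failure"  # failure file location inside the training container
MAX_REASON_CHARS = 1024  # FailureReason length limit


class NoPositiveLabels(Exception):
    """Example custom error (the name comes from the logs in this thread)."""


def write_failure_file(exc, path=FAILURE_PATH):
    """Write 'ExceptionName: message' to the failure file and return the text."""
    message = f"{type(exc).__name__}: {exc}"[:MAX_REASON_CHARS]
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w") as f:
        f.write(message)
    return message


def run_training(train_fn, failure_path=FAILURE_PATH):
    """Run train_fn; on any exception record it and return exit code 1."""
    try:
        train_fn()
    except Exception as exc:
        write_failure_file(exc, failure_path)
        traceback.print_exc()  # keep the full traceback in CloudWatch Logs
        return 1
    return 0


# In train.py this could be used as (train is your own function):
# import sys
# sys.exit(run_training(train))
```

Putting the exception class name first keeps the most useful part of the message inside the length limit, so jobs can be grouped by failure scenario later.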
Hi, Alex! Your comment is really worthwhile! I will add my answer with attachments below.
This issue looks related to my problem :)
There are 2 sagemaker-scikit-learn versions: 0.23-1, 0.20-0.
So, I use:
683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3
sagemaker version: 2.86.0 (I tried to upgrade to 2.90.0 with pip in the terminal, but the notebooks still show the previous version, 2.86)
Python - 3.7
How can I find out which versions I should use to get rid of the ErrorMessage problem?
Look, ErrorMessage "" is empty. The FailureReason output is limited to 1024 characters, but I looked at the full text and it's useless too. So I have no way to get any failure information.