How to send custom failure info from a failed SageMaker Training Job?
My main purpose:
An easy way to collect information about different failure scenarios in SageMaker training jobs.
What do I use currently?
SageMaker SKLearn Estimator (it launches the training jobs).
Where will my model train?
On different datasets. So I need to control and collect all information about every training process and its final status on each dataset.
Which failure scenarios do I have?
There are plenty of them. I have created my own Python errors for them.
- Labels exist for only one class
- Dataset too small (by my own criteria)
- Missing data in crucial columns
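For illustration, the scenarios above could be modeled as a small exception hierarchy, so that every failure carries a short, readable reason. This is only a sketch: `NoPositiveLabels` is the name that appears in the trace further down, while `TrainingDataError`, `DatasetTooSmall`, `MissingCrucialData`, `validate`, and the `min_rows` threshold are hypothetical names chosen for the example.

```python
class TrainingDataError(Exception):
    """Base class for data problems that should fail the training job."""


class NoPositiveLabels(TrainingDataError):
    """Labels are present for only one class."""


class DatasetTooSmall(TrainingDataError):
    """Dataset has fewer rows than the chosen minimum (hypothetical name)."""


class MissingCrucialData(TrainingDataError):
    """A crucial column contains missing values (hypothetical name)."""


def validate(rows, label_key, crucial_keys, min_rows=100):
    """Raise a TrainingDataError subclass if `rows` (a list of dicts) is unusable."""
    if len(rows) < min_rows:
        raise DatasetTooSmall(f"{len(rows)} rows, need at least {min_rows}")
    for key in crucial_keys:
        # Count rows where the crucial column is absent or None.
        missing = sum(1 for row in rows if row.get(key) is None)
        if missing:
            raise MissingCrucialData(f"column {key!r} has {missing} missing values")
    labels = {row.get(label_key) for row in rows}
    if len(labels) < 2:
        raise NoPositiveLabels(f"only one label value present: {sorted(labels)}")
```

Sharing one base class means the training script can catch `TrainingDataError` in a single place and report the subclass name as the failure reason.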
Where am I stuck?
After a failed training job I can't get my own errors from the training job response. All of them are "ExecuteUserScriptError", and I can't pass my own info in FailureReason or ErrorMessage (it is always empty). I can see which error was raised in CloudWatch Logs and in the training job traceback (from the SageMaker notebook). So the bad solution is to parse all the CloudWatch logs whenever a job fails.
**Question: How can I provide my own ErrorMessage or FailureReason?**
Maybe I am digging in the wrong direction. Anyway, I need your advice. Thank you so much for the opportunity to ask for advice here :)
```
Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 39, in main
    train(environment.Environment())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 35, in train
    runner_type=runner.ProcessRunnerType)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
    wait, capture_error
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 291, in run
    cwd=environment.code_dir,
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 208, in check_error
    info=extra_info,
sagemaker_training.errors.ExecuteUserScriptError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/miniconda3/bin/python train.py"
ExecuteUserScriptErro
```
ErrorMessage "" is empty.
The FailureReason output is limited to 1024 characters. I also looked at the full text, and it's useless too. So I have no way to get any failure information.
This happens even when the failure comes from my own failure scenario and error. This is the trace from the failed training output in the SageMaker notebook:
```
    raise NoPositiveLabels
custom_errors.NoPositiveLabels
2022-05-18 10:11:09,864 sagemaker-containers ERROR Reporting training FAILURE
2022-05-18 10:11:09,865 sagemaker-containers ERROR framework error:
```
Hello, is the reported problem similar to this issue reported on the SageMaker Python SDK project?
Hello, thank you for your response! :)
No, it's a different problem.
In that issue the author raises a ValueError while reading input data in input_fn, but the failure there is unlikely to be caused by that error (reading the data took too long, 10 hours). Even if it were, my question is different.
How can I pass training job failure information via FailureReason, ErrorMessage, or other parameters? I'm causing the failure with my own error and want to understand how this information can be passed and collected.
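On the collecting side, the FailureReason discussed here is returned by the DescribeTrainingJob API, so a small helper could gather the final status of many jobs in one place. The sketch below assumes a boto3 SageMaker client is passed in; `collect_statuses` and the job names are hypothetical.

```python
def collect_statuses(client, job_names):
    """Return {job_name: (status, failure_reason)} for each training job.

    `client` is expected to behave like boto3.client("sagemaker").
    """
    results = {}
    for name in job_names:
        desc = client.describe_training_job(TrainingJobName=name)
        # FailureReason is only present when the job has failed.
        results[name] = (desc["TrainingJobStatus"], desc.get("FailureReason", ""))
    return results


# Usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   statuses = collect_statuses(boto3.client("sagemaker"), ["my-job-1", "my-job-2"])
```

Passing the client in as an argument keeps the helper easy to try out with a stub before pointing it at real jobs.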
Hi, did you try writing to /opt/ml/output/failure as per the doc here?
It's worth mentioning that in the past there was a bug in the base training toolkit that powers "script mode" containers, which overwrote this file. This got resolved at source per the linked issue, but I guess there's a chance that older containers, or frameworks which customize this tool, could still be affected. So it may be worth upgrading your framework version if you're using an older one.
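A minimal sketch of this suggestion, assuming the entry point is a plain Python script: wrap the training logic, write a short description of the custom error to /opt/ml/output/failure, and exit with a non-zero code. `write_failure`, `run_training`, and the `failure_path` parameter are hypothetical names. The 1024-character cut mirrors the FailureReason limit mentioned in the question. Whether the message survives depends on the container version, given the overwrite bug described above.

```python
import os

FAILURE_PATH = "/opt/ml/output/failure"  # path suggested in the answer above
MAX_REASON_CHARS = 1024  # FailureReason limit mentioned in the question


def write_failure(message, failure_path=FAILURE_PATH):
    """Write a short failure description to the failure file."""
    os.makedirs(os.path.dirname(failure_path), exist_ok=True)
    with open(failure_path, "w") as f:
        f.write(message[:MAX_REASON_CHARS])


def run_training(train_fn, failure_path=FAILURE_PATH):
    """Run train_fn; on any exception, record it and return exit code 1."""
    try:
        train_fn()
    except Exception as exc:
        # Put the custom error class name first so it is easy to filter on.
        write_failure(f"{type(exc).__name__}: {exc}", failure_path)
        return 1
    return 0


# In train.py (hypothetical): sys.exit(run_training(main))
```

Because the exception is swallowed here, it may be worth also logging the full traceback before returning, so CloudWatch Logs still show it.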
Hi, Alex! Your comment is really helpful! I will add my answer with attachments below.
This issue looks related to my problem.
There are 2 sagemaker-scikit-learn versions: 0.23-1 and 0.20-0.
So, I use:
sagemaker version: 2.86.0 (I tried to upgrade with pip in the terminal to 2.90.0, but the notebooks still show the previous version, 2.86)
Python: 3.7
How can I find out which versions I should use to get rid of the ErrorMessage problem?