Deploying a previously trained model with the SageMaker Python SDK (StatusExceptionError)

Question

I am using a pretrained Random Forest model and trying to deploy it on Amazon SageMaker using the Python SDK:

```
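import boto3
from sagemaker import get_execution_role
from sagemaker.sklearn.estimator import SKLearn

# SageMaker API client, used below to look up the model artifact location
m_boto3 = boto3.client('sagemaker')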

sklearn_estimator = SKLearn(
    entry_point='script.py',
    role = get_execution_role(),
    instance_count=1,
    instance_type='ml.m4.xlarge',
    framework_version='0.20.0',
    base_job_name='rf-scikit')

sklearn_estimator.fit({'train':trainpath, 'test': testpath}, wait=False)

sklearn_estimator.latest_training_job.wait(logs='None')
artifact = m_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']

print('Model artifact persisted at ' + artifact)
```
I get the following UnexpectedStatusException:

```
2022-08-25 12:03:27 Starting - Starting the training job....
2022-08-25 12:03:52 Starting - Preparing the instances for training............
2022-08-25 12:04:55 Downloading - Downloading input data......
2022-08-25 12:05:31 Training - Downloading the training image.........
2022-08-25 12:06:22 Training - Training image download completed. Training in progress..
2022-08-25 12:06:32 Uploading - Uploading generated training model.
2022-08-25 12:06:43 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
 in 
----> 1 sklearn_estimator.latest_training_job.wait(logs='None')
      2 artifact = m_boto3.describe_training_job(
      3     TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']
      4 
      5 print('Model artifact persisted at ' + artifact)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   2109             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   2110         else:
-> 2111             self.sagemaker_session.wait_for_job(self.job_name)
   2112 
   2113     def describe(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in wait_for_job(self, job, poll)
   3226             lambda last_desc: _train_done(self.sagemaker_client, job, last_desc), None, poll
   3227         )
-> 3228         self._check_job_status(job, desc, "TrainingJobStatus")
   3229         return desc
   3230

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3390                 message=message,
   3391                 allowed_statuses=["Completed", "Stopped"],
-> 3392                 actual_status=status,
   3393             )
   3394

UnexpectedStatusException: Error for Training job rf-scikit-2022-08-25-12-03-25-931: Failed. Reason: AlgorithmError: framework error: 
Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 39, in main
    train(environment.Environment())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 35, in train
    runner_type=runner.ProcessRunnerType)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
    wait, capture_error
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 291, in run
    cwd=environment.code_dir,
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 208, in check_error
    info=extra_info,
sagemaker_training.errors.ExecuteUserScriptError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/miniconda3/bin/python script.py"

ExecuteUserScriptErr
```

The pretrained model works fine and I don't know what the problem is. Please help.

Answer

I would start by reviewing your `script.py` entry point. A training job can fail for many reasons, but judging from the description and the output, the most likely cause is where your script writes the model artifacts.
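One way to review the entry point is to run it outside SageMaker so that its own traceback becomes visible, since the job output above only reports `ExitCode 1` with an empty `ErrorMessage`. The following is a minimal sketch, not an official SageMaker tool: the helper name `run_entry_point_locally` is hypothetical, and it assumes the script reads its locations from the `SM_MODEL_DIR`, `SM_CHANNEL_TRAIN` and `SM_CHANNEL_TEST` environment variables that the training container normally sets for channels named `train` and `test`.

```python
import os
import subprocess
import sys
import tempfile


def run_entry_point_locally(script_path, train_dir, test_dir):
    """Run a training script with SageMaker-style environment variables.

    Hypothetical helper: it only mimics the variables the container
    would set, so the script's real error shows up in stderr.
    """
    # Temporary stand-in for /opt/ml/model
    model_dir = tempfile.mkdtemp(prefix="model-")

    env = dict(os.environ)
    env.update({
        "SM_MODEL_DIR": model_dir,
        "SM_CHANNEL_TRAIN": train_dir,
        "SM_CHANNEL_TEST": test_dir,
    })

    # Same shape as the failing command in the log: "python script.py"
    result = subprocess.run(
        [sys.executable, script_path],
        env=env,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr, model_dir
```

Calling `run_entry_point_locally("script.py", "data/train", "data/test")` with local copies of the data (the two directory names are placeholders) returns the exit code, the captured stderr, and the directory where the model file should have appeared.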

The SageMaker GitHub examples include a script that uses a RandomForestRegressor: https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-script-mode/sagemaker-script-mode.ipynb

I'm sharing this example because its Scikit-learn section references the script `train_deploy_scikitlearn_without_dependencies.py`, which dumps the model to the model_dir: `joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))`. If we changed that to some arbitrary location in the script, the example training job would also fail with an *AlgorithmError: framework error*. Assuming the roughly 10 seconds of training shown in your log (12:06:22 to 12:06:32) is expected, I see the output location as a likely cause.

For more details on this you can refer to the following two resources:

1. How Amazon SageMaker Processes Training Output - https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html
2. Using the SageMaker Python SDK - https://sagemaker.readthedocs.io/en/stable/overview.html

In the first resource, you'll find that your algorithm should write all final model artifacts to `/opt/ml/model`. In the second resource, you'll find more information on proper use of the SageMaker Python SDK and various implementations.
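The output-location pattern described above can be sketched as follows. This is an illustration under stated assumptions, not your actual `script.py`: the function names `parse_args` and `save_model` are hypothetical, and `pickle` stands in for the `joblib.dump` call used in the referenced example so that the sketch has no third-party dependencies. The key point is that the model directory comes from `SM_MODEL_DIR` (falling back to `/opt/ml/model`) rather than from a hard-coded path.

```python
import argparse
import os
import pickle


def parse_args(argv=None):
    """Read the model directory from SM_MODEL_DIR, with /opt/ml/model as fallback."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # Ignore any other arguments (e.g. hyperparameters) in this sketch
    args, _unknown = parser.parse_known_args(argv)
    return args


def save_model(model, model_dir):
    """Write the model inside model_dir so it ends up in the uploaded artifact."""
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, "model.pkl")
    # pickle is a stand-in here; the referenced example uses joblib.dump
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path
```

In a real entry point you would call `save_model(trained_model, parse_args().model_dir)` at the end of training, where `trained_model` is whatever estimator your script has fitted or loaded.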
