Deploying a previously trained model with the SageMaker Python SDK (UnexpectedStatusException)

I am using a pretrained Random Forest model and trying to deploy it on Amazon SageMaker using the Python SDK:

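import boto3
from sagemaker import get_execution_role
from sagemaker.sklearn.estimator import SKLearn

# Assumption: m_boto3 is a boto3 SageMaker client (its definition was not included in the post)
m_boto3 = boto3.client('sagemaker')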

sklearn_estimator = SKLearn(
    entry_point='script.py',
    role = get_execution_role(),
    instance_count=1,
    instance_type='ml.m4.xlarge',
    framework_version='0.20.0',
    base_job_name='rf-scikit')

sklearn_estimator.fit({'train':trainpath, 'test': testpath}, wait=False)

sklearn_estimator.latest_training_job.wait(logs='None')
artifact = m_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']

print('Model artifact persisted at ' + artifact)

I get the following UnexpectedStatusException:

2022-08-25 12:03:27 Starting - Starting the training job....
2022-08-25 12:03:52 Starting - Preparing the instances for training............
2022-08-25 12:04:55 Downloading - Downloading input data......
2022-08-25 12:05:31 Training - Downloading the training image.........
2022-08-25 12:06:22 Training - Training image download completed. Training in progress..
2022-08-25 12:06:32 Uploading - Uploading generated training model.
2022-08-25 12:06:43 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-37-628f942a78d3> in <module>
----> 1 sklearn_estimator.latest_training_job.wait(logs='None')
      2 artifact = m_boto3.describe_training_job(
      3     TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']
      4 
      5 print('Model artifact persisted at ' + artifact)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   2109             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   2110         else:
-> 2111             self.sagemaker_session.wait_for_job(self.job_name)
   2112 
   2113     def describe(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in wait_for_job(self, job, poll)
   3226             lambda last_desc: _train_done(self.sagemaker_client, job, last_desc), None, poll
   3227         )
-> 3228         self._check_job_status(job, desc, "TrainingJobStatus")
   3229         return desc
   3230 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3390                 message=message,
   3391                 allowed_statuses=["Completed", "Stopped"],
-> 3392                 actual_status=status,
   3393             )
   3394 

UnexpectedStatusException: Error for Training job rf-scikit-2022-08-25-12-03-25-931: Failed. Reason: AlgorithmError: framework error: 
Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 39, in main
    train(environment.Environment())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 35, in train
    runner_type=runner.ProcessRunnerType)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
    wait, capture_error
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 291, in run
    cwd=environment.code_dir,
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 208, in check_error
    info=extra_info,
sagemaker_training.errors.ExecuteUserScriptError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/miniconda3/bin/python script.py"

ExecuteUserScriptErr

The pretrained model works fine and I don't know what the problem is. Please help.

1 Answer

Consider reviewing your 'script.py' entry point. A training job can fail for many reasons, but judging from the description and output, the most likely one is where your script writes the model artifacts.

The SageMaker GitHub examples include a script that uses a RandomForestRegressor - https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-script-mode/sagemaker-script-mode.ipynb

I'm sharing this example because its Scikit-learn section references the "train_deploy_scikitlearn_without_dependencies.py" script, which dumps the model to the model_dir: joblib.dump(model, os.path.join(args.model_dir, "model.joblib")). If we changed that to some arbitrary location in the script, the example training job would also fail with an AlgorithmError: framework error. Assuming the roughly 10-second training time is expected, I see the output location as a likely cause.
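As a rough illustration of the point about the output location, here is a minimal sketch of a script resolving the model directory and writing an artifact into it. It assumes the script reads SM_MODEL_DIR (with /opt/ml/model as a fallback) and, to stay dependency-free, it uses pickle and a hypothetical file name "model.pkl" in place of the joblib.dump call from the referenced example; resolve_model_dir and save_model are illustrative names, not part of any SDK.

```python
import argparse
import os
import pickle


def resolve_model_dir(argv=None):
    """Return the directory the training script should write its model to."""
    parser = argparse.ArgumentParser()
    # SM_MODEL_DIR is set inside the training container; the fallback is the
    # path named in the SageMaker training-output documentation.
    parser.add_argument(
        "--model-dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    # parse_known_args ignores any extra hyperparameter arguments.
    args, _ = parser.parse_known_args(argv)
    return args.model_dir


def save_model(model, model_dir, filename="model.pkl"):
    """Serialize `model` into `model_dir` and return the file path.

    pickle is a stand-in here; the referenced example uses joblib.dump.
    """
    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, filename)
    with open(path, "wb") as f:
        pickle.dump(model, f)
    return path
```

In a real training script the call would look like save_model(model, resolve_model_dir()), and the file name has to match whatever model_fn later loads.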

For more details on this you can refer to the following two resources:

  1. How Amazon SageMaker Processes Training Output - https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html
  2. Using the SageMaker Python SDK - https://sagemaker.readthedocs.io/en/stable/overview.html

In the first resource, you'll find that your algorithm should write all final model artifacts to /opt/ml/model. In the second resource, you'll find more information on proper use of the SageMaker Python SDK and various implementations.

AWS
SUPPORT ENGINEER
answered 2 years ago
  • I found out that the error results from no files being found in SM_CHANNEL_TRAIN and SM_CHANNEL_TEST. I don't understand why, because the script works fine when I run it with:

    ! python script_rf.py --model-dir ./ --train ./ --test ./

  • The script looks like this:

    def model_fn(model_dir):
        clf = joblib.load(os.path.join(model_dir, "model.joblib"))
        return clf

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--max_depth', type=int, default=2)
        parser.add_argument('--n_estimators', type=int, default=100)
        parser.add_argument('--random_state', type=int, default=0)

        parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
        parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
        parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
        parser.add_argument('--train-file', type=str, default='revenue_train.csv')
        parser.add_argument('--test-file', type=str, default='revenue_test.csv')
        args, _ = parser.parse_known_args()

        train_df = pd.read_csv(os.path.join(args.train, args.train_file))
        test_df = pd.read_csv(os.path.join(args.test, args.test_file))
        if len(train_df) == 0:
            raise ValueError(('There are no files in {}.\n').format(args.train, "train"))

        X_train = train_df[attributes]
        X_test = test_df[attributes]
        y_train = train_df['target']
        y_test = test_df['target']
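    One detail visible in the script as posted: the --train default reads SM_CHANNEL_TEST rather than SM_CHANNEL_TRAIN, so inside the container the training file would be looked up in the test channel. Running locally with explicit --train ./ --test ./ bypasses those defaults, which could explain why the local run works. This may not be the only cause, since the S3 prefixes passed to fit() also need to contain files with the expected names. A minimal sketch of the argument setup with per-channel defaults and an explicit file check follows; build_parser and require_file are hypothetical helper names.

```python
import argparse
import os


def build_parser():
    """Argument parser whose defaults come from the SageMaker channel variables."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    # Each channel name passed to fit() ('train', 'test') maps to its own variable.
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--train-file", type=str, default="revenue_train.csv")
    parser.add_argument("--test-file", type=str, default="revenue_test.csv")
    return parser


def require_file(channel_dir, filename):
    """Return the full path to `filename`, or fail with a listing of the channel."""
    if not channel_dir:
        raise ValueError("No channel directory given (is the SM_CHANNEL_* variable set?)")
    path = os.path.join(channel_dir, filename)
    if not os.path.isfile(path):
        # Listing what is actually there makes the training log self-explanatory.
        found = sorted(os.listdir(channel_dir)) if os.path.isdir(channel_dir) else []
        raise FileNotFoundError(
            "{} not found in {}; found: {}".format(filename, channel_dir, found)
        )
    return path
```

    With these helpers, the script would call args, _ = build_parser().parse_known_args() and then pass require_file(args.train, args.train_file) to pd.read_csv, so a missing or misnamed file shows up in the job logs with the directory contents.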
    
