Deploying a previously trained model with the SageMaker Python SDK (UnexpectedStatusException)

I am using a pretrained Random Forest model and trying to deploy it on Amazon SageMaker using the Python SDK:

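import boto3
from sagemaker import get_execution_role
from sagemaker.sklearn.estimator import SKLearn

# Assumption: m_boto3 (used below) is a boto3 SageMaker client
m_boto3 = boto3.client('sagemaker')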

sklearn_estimator = SKLearn(
    entry_point='script.py',
    role = get_execution_role(),
    instance_count=1,
    instance_type='ml.m4.xlarge',
    framework_version='0.20.0',
    base_job_name='rf-scikit')

sklearn_estimator.fit({'train':trainpath, 'test': testpath}, wait=False)

sklearn_estimator.latest_training_job.wait(logs='None')
artifact = m_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']

print('Model artifact persisted at ' + artifact)

I get the following UnexpectedStatusException:

2022-08-25 12:03:27 Starting - Starting the training job....
2022-08-25 12:03:52 Starting - Preparing the instances for training............
2022-08-25 12:04:55 Downloading - Downloading input data......
2022-08-25 12:05:31 Training - Downloading the training image.........
2022-08-25 12:06:22 Training - Training image download completed. Training in progress..
2022-08-25 12:06:32 Uploading - Uploading generated training model.
2022-08-25 12:06:43 Failed - Training job failed
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-37-628f942a78d3> in <module>
----> 1 sklearn_estimator.latest_training_job.wait(logs='None')
      2 artifact = m_boto3.describe_training_job(
      3     TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']
      4 
      5 print('Model artifact persisted at ' + artifact)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   2109             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   2110         else:
-> 2111             self.sagemaker_session.wait_for_job(self.job_name)
   2112 
   2113     def describe(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in wait_for_job(self, job, poll)
   3226             lambda last_desc: _train_done(self.sagemaker_client, job, last_desc), None, poll
   3227         )
-> 3228         self._check_job_status(job, desc, "TrainingJobStatus")
   3229         return desc
   3230 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3390                 message=message,
   3391                 allowed_statuses=["Completed", "Stopped"],
-> 3392                 actual_status=status,
   3393             )
   3394 

UnexpectedStatusException: Error for Training job rf-scikit-2022-08-25-12-03-25-931: Failed. Reason: AlgorithmError: framework error: 
Traceback (most recent call last):
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_containers/_trainer.py", line 84, in train
    entrypoint()
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 39, in main
    train(environment.Environment())
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_sklearn_container/training.py", line 35, in train
    runner_type=runner.ProcessRunnerType)
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/entry_point.py", line 100, in run
    wait, capture_error
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 291, in run
    cwd=environment.code_dir,
  File "/miniconda3/lib/python3.7/site-packages/sagemaker_training/process.py", line 208, in check_error
    info=extra_info,
sagemaker_training.errors.ExecuteUserScriptError: ExecuteUserScriptError:
ExitCode 1
ErrorMessage ""
Command "/miniconda3/bin/python script.py"

ExecuteUserScriptErr

The pretrained model works fine, and I don't know what the problem is. Please help.

1 Answer

Consider reviewing your 'script.py' entry point. A training job can fail for many reasons, but from the description and output, the most likely one is where your script writes the model artifacts.

The SageMaker GitHub examples include a script that uses a RandomForestRegressor - https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-script-mode/sagemaker-script-mode.ipynb

I'm sharing this example because its Scikit-learn section references the "train_deploy_scikitlearn_without_dependencies.py" script, which dumps the model to the model_dir: joblib.dump(model, os.path.join(args.model_dir, "model.joblib")). If we changed that to some arbitrary location in the script, the example training job would also fail with an AlgorithmError: framework error. Assuming the roughly 10-second training time in your log is expected, I see the output location as a likely cause.

For more details on this you can refer to the following two resources:

  1. How Amazon SageMaker Processes Training Output - https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html
  2. Using the SageMaker Python SDK - https://sagemaker.readthedocs.io/en/stable/overview.html

In the first resource, you'll find that your algorithm should write all final model artifacts to /opt/ml/model. In the second resource, you'll find more information on using the SageMaker Python SDK and its various implementations.
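
As a rough illustration of the point about the output location, here is a minimal sketch of how an entry-point script might resolve the model directory and write the artifact there. The helper names parse_model_dir and save_model are hypothetical, the /opt/ml/model fallback is an assumption for runs where SM_MODEL_DIR is unset, and joblib is assumed to be installed in the container, as in the referenced example.

```python
import argparse
import os


def parse_model_dir(argv=None):
    """Return the directory the training script should write its model to."""
    parser = argparse.ArgumentParser()
    # SageMaker sets SM_MODEL_DIR inside the training container; the
    # /opt/ml/model fallback is an assumption for runs where it is unset.
    parser.add_argument(
        "--model-dir",
        type=str,
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    args, _ = parser.parse_known_args(argv)
    return args.model_dir


def save_model(model, model_dir, filename="model.joblib"):
    """Serialize the fitted model into model_dir and return the file path."""
    # joblib is third-party; it is imported here so that parse_model_dir
    # can be used even where joblib is not installed.
    import joblib

    os.makedirs(model_dir, exist_ok=True)
    path = os.path.join(model_dir, filename)
    joblib.dump(model, path)
    return path
```

In a training script, this would typically be called as save_model(model, parse_model_dir()) after fitting, so the artifact ends up in the directory that SageMaker packages and uploads.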

AWS
Support Engineer
answered 2 years ago
  • I found out that the error results from no files being found in SM_CHANNEL_TRAIN and SM_CHANNEL_TEST. I don't understand why, because when I run the script locally it works fine:

    ! python script_rf.py --model-dir ./ --train ./ --test ./

  • The script looks like this:

    def model_fn(model_dir):
        # Load the serialized model for inference
        clf = joblib.load(os.path.join(model_dir, "model.joblib"))
        return clf

    if __name__ == '__main__':
        parser = argparse.ArgumentParser()
        parser.add_argument('--max_depth', type=int, default=2)
        parser.add_argument('--n_estimators', type=int, default=100)
        parser.add_argument('--random_state', type=int, default=0)

        parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
        parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
        parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
        parser.add_argument('--train-file', type=str, default='revenue_train.csv')
        parser.add_argument('--test-file', type=str, default='revenue_test.csv')
        args, _ = parser.parse_known_args()

        train_df = pd.read_csv(os.path.join(args.train, args.train_file))
        test_df = pd.read_csv(os.path.join(args.test, args.test_file))
        if len(train_df) == 0:
            raise ValueError('There are no files in {}.\n'.format(args.train))

        X_train = train_df[attributes]
        X_test = test_df[attributes]
        y_train = train_df['target']
        y_test = test_df['target']
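
One detail in the script as posted may be worth checking: the --train argument defaults to SM_CHANNEL_TEST rather than SM_CHANNEL_TRAIN. The local run passes --train ./ explicitly, so that default is never used there, whereas inside the training job the channel directories come from the environment variables. The sketch below is one possible way to read each channel from its own variable and fail with a more descriptive message; the helper names parse_channels and require_file are hypothetical, and this is a guess at the cause rather than a confirmed diagnosis.

```python
import argparse
import os


def parse_channels(argv=None):
    """Resolve the train and test directories, each from its own channel."""
    parser = argparse.ArgumentParser()
    # The 'train' and 'test' keys passed to fit() are exposed to the script
    # as SM_CHANNEL_TRAIN and SM_CHANNEL_TEST respectively.
    parser.add_argument("--train", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", type=str,
                        default=os.environ.get("SM_CHANNEL_TEST"))
    args, _ = parser.parse_known_args(argv)
    return args.train, args.test


def require_file(directory, filename):
    """Return the full path to filename, or fail listing what is there."""
    if not directory or not os.path.isdir(directory):
        raise FileNotFoundError(
            "Channel directory not found: {!r}".format(directory))
    path = os.path.join(directory, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(
            "{} not found in {}; directory contains: {}".format(
                filename, directory, sorted(os.listdir(directory))))
    return path
```

Calling require_file(train_dir, 'revenue_train.csv') before pd.read_csv would make the job's failure reason show which directory was searched and what it actually contained, which should also reveal whether the S3 prefixes in trainpath and testpath hold the expected file names.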
    
