SageMaker training job fails with memory error on model upload after training completion


I am running into an issue where a SageMaker training job successfully completes the training script but fails during the model upload stage after training. The specific error shown in the SageMaker portal is the following:

ClientError: Please use an instance type with more memory, or reduce the size of training data processed on an instance.

I upgraded the instance used for training, but this did not resolve the issue. Furthermore, the issue seems to occur randomly across training jobs whose only difference is the dataset; all other characteristics are the same. It is strange to me that this affects the model upload stage when the training itself succeeds. Note that I do not store the training data in the model output directory. The CloudWatch log reports successful completion of the training job:

2023-06-14 07:50:10,625 sagemaker-training-toolkit INFO Reporting training SUCCESS

And the job status overview in the SageMaker portal shows where the issue arises:

Status History:
- Starting: 6/11/2023, 8:12:33 PM to 6/11/2023, 8:13:47 PM (Preparing the instances for training)
- Downloading: 6/11/2023, 8:13:47 PM to 6/11/2023, 8:15:26 PM (Downloading input data)
- Training: 6/11/2023, 8:15:26 PM to 6/14/2023, 1:51:15 AM (Training image download completed. Training in progress.)
- Uploading: 6/14/2023, 1:51:15 AM to 6/14/2023, 1:51:21 AM (Uploading generated training model)
- Failed: 6/14/2023, 1:51:21 AM to 6/14/2023, 1:51:21 AM (Training job failed)
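
For reference, the job is launched with the SageMaker Python SDK roughly as below. This is a minimal sketch, not my exact setup; the PyTorch estimator, entry point, instance type, and bucket paths are placeholders:

```python
import sagemaker
from sagemaker.pytorch import PyTorch  # assuming a PyTorch script; other framework estimators look similar

# Placeholder role; in my account this comes from the notebook/execution environment.
role = sagemaker.get_execution_role()

estimator = PyTorch(
    entry_point="train.py",               # training script (placeholder name)
    role=role,
    instance_count=1,
    instance_type="ml.m5.4xlarge",        # the knob I changed when upgrading the instance
    framework_version="2.0.1",
    py_version="py310",
    output_path="s3://my-bucket/model-artifacts",  # where model.tar.gz should land
)

estimator.fit({"training": "s3://my-bucket/training-data"})
```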

Has anyone else encountered a similar issue? Any ideas on how to address it?

user123
asked 10 months ago · 950 views
2 Answers

We had the same problem. The failure at the model upload stage turned out to be a red herring: a subprocess was being killed during the training stage because it ran out of memory, not during the model upload stage. Note that our main training process exited cleanly.
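
In case it helps others debug the same thing: one way to surface this is to log memory usage from inside the training script, so the growth shows up in the CloudWatch logs before the job dies. This is a minimal sketch of that idea, assuming psutil is available in the container (add it to requirements.txt if not); the probe name and interval are ours, not anything SageMaker provides:

```python
import os
import threading
import time

import psutil


def log_memory(interval_seconds=60):
    """Periodically log resident memory of this process and all of its children."""
    parent = psutil.Process(os.getpid())
    while True:
        total_rss = parent.memory_info().rss
        children = parent.children(recursive=True)
        for child in children:
            try:
                total_rss += child.memory_info().rss
            except psutil.NoSuchProcess:
                pass  # child exited between enumeration and inspection
        # stdout is shipped to CloudWatch by the training toolkit.
        print(f"[memory-probe] total RSS {total_rss / 1e9:.2f} GB across "
              f"{1 + len(children)} processes", flush=True)
        time.sleep(interval_seconds)


# Start the probe as a daemon thread before the training loop begins.
threading.Thread(target=log_memory, daemon=True).start()
```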

jmsmkn
answered 3 months ago
  • This turned out to be the actual issue for me as well. Increasing the instance memory and trying different SageMaker instance types solved the problem for me. Thanks for commenting!


As per this document, https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html, the algorithm should write the final model to /opt/ml/model so that SageMaker can upload it to S3 as a single object in compressed tar format.
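
For example, in a script-mode training job the output directory is exposed through the SM_MODEL_DIR environment variable, which points at /opt/ml/model. A minimal sketch assuming a PyTorch model (the model here is just a placeholder):

```python
import os

import torch
import torch.nn as nn

# SageMaker script mode exposes the model output directory as SM_MODEL_DIR,
# which points at /opt/ml/model inside the container.
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")

model = nn.Linear(10, 1)  # placeholder for the actual trained model

# Only final artifacts belong here: everything in this directory is packed
# into model.tar.gz and uploaded to S3 when the job finishes.
torch.save(model.state_dict(), os.path.join(model_dir, "model.pth"))
```

Keeping training data and large intermediate checkpoints out of that directory also keeps the uploaded model.tar.gz small.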

AWS
sqavi
answered 10 months ago
