SageMaker training job fails with memory error on model upload after training completion


I am running into an issue where a SageMaker training job successfully completes the training script but then fails during the model upload stage. The specific error shown in the SageMaker portal is the following:

ClientError: Please use an instance type with more memory, or reduce the size of training data processed on an instance.

I upgraded the training instance, but this did not resolve the issue. The problem also seems to occur at random across training jobs whose only difference is the dataset; all other settings are identical. It is strange to me that this should affect the model upload stage when the training itself succeeds. Note that I do not store the training data in the model output directory. The CloudWatch log reports successful completion of the training job:

2023-06-14 07:50:10,625 sagemaker-training-toolkit INFO Reporting training SUCCESS

And the job status overview in the SageMaker portal shows where the issue arises:

Status History:

Status       Started                 Ended                   Description
Starting     6/11/2023, 8:12:33 PM   6/11/2023, 8:13:47 PM   Preparing the instances for training
Downloading  6/11/2023, 8:13:47 PM   6/11/2023, 8:15:26 PM   Downloading input data
Training     6/11/2023, 8:15:26 PM   6/14/2023, 1:51:15 AM   Training image download completed. Training in progress.
Uploading    6/14/2023, 1:51:15 AM   6/14/2023, 1:51:21 AM   Uploading generated training model
Failed       6/14/2023, 1:51:21 AM   6/14/2023, 1:51:21 AM   Training job failed
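
For context, the jobs are configured roughly along these lines (a minimal sketch; the image URI, role, instance type, and S3 paths are placeholders rather than the exact values from the failing jobs):

```python
from sagemaker.estimator import Estimator

# Placeholder configuration only; the image URI, role, instance type, and S3
# paths are stand-ins, not the values used by the failing jobs.
estimator = Estimator(
    image_uri="<training-image-uri>",
    role="<execution-role-arn>",
    instance_count=1,
    instance_type="ml.m5.4xlarge",  # already tried larger instance types here
    volume_size=100,                # EBS volume size in GB, separate from instance RAM
    output_path="s3://<bucket>/output",
)
estimator.fit({"training": "s3://<bucket>/training-data"})
```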

Has anyone else encountered a similar issue? Any ideas on how to address it?

user123
Asked a year ago · 984 views
2 Answers

We had the same problem. The failure at the model upload stage turned out to be a red herring: a subprocess was being killed during the training stage because it ran out of memory, not during the upload. Note that our main training process exited cleanly.
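
One way to confirm this (a minimal sketch, assuming psutil is available in your training image; the helper name and interval are arbitrary) is to log host memory usage from a background thread in the training entry point, so the pressure shows up in CloudWatch before the subprocess is killed:

```python
import logging
import threading
import time

import psutil  # assumed to be installed in the training image

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("memory-monitor")


def log_memory(interval_seconds: int = 60) -> None:
    """Periodically log host memory usage so OOM pressure is visible in CloudWatch."""
    while True:
        mem = psutil.virtual_memory()
        logger.info("memory used: %.1f%%, available: %.1f GiB",
                    mem.percent, mem.available / 2**30)
        time.sleep(interval_seconds)


# Start the monitor in the background of the training entry point.
threading.Thread(target=log_memory, daemon=True).start()
```

A subprocess killed by the OOM killer dies with SIGKILL, so a return code of -9 from Python's subprocess module (or 137 when run through a shell) in the training logs is another strong hint.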

jmsmkn
Answered 4 months ago
  • This turned out to be the actual issue for me as well. Increasing the instance memory and trying different SageMaker instance types solved the problem for me. Thanks for commenting!


As per this document, https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-output.html, the algorithm should write the final model to /opt/ml/model so that SageMaker can upload it to S3 as a single object in compressed tar format.
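
A minimal sketch of that pattern, assuming a script-mode container where the training toolkit sets SM_MODEL_DIR (scikit-learn and joblib here are only stand-ins for whatever framework you actually use):

```python
import os

import joblib
from sklearn.linear_model import LogisticRegression  # stand-in model for the example

# SM_MODEL_DIR points to /opt/ml/model inside the training container; everything
# written there is tarred and uploaded to S3 as model.tar.gz after training.
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
os.makedirs(model_dir, exist_ok=True)

model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])  # placeholder training step
joblib.dump(model, os.path.join(model_dir, "model.joblib"))
```

Keep large artifacts such as copies of the training data out of this directory, since everything in it is compressed and uploaded.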

AWS
sqavi
Answered a year ago
