I want to troubleshoot errors that I receive when I run Amazon SageMaker AI training jobs.
Resolution
To identify why your SageMaker AI training job failed, check the failure reason on the SageMaker AI console or in the FailureReason field of the DescribeTrainingJob API response. Then, complete the resolution for your job error.
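The following is a minimal sketch of how you might read the failure reason programmatically. It assumes a client object that behaves like the boto3 SageMaker client. The helper name get_failure_reason and the job name in the usage comment are hypothetical and used only for illustration.

```python
def get_failure_reason(sagemaker_client, job_name):
    """Return (status, failure_reason) for a training job.

    `sagemaker_client` is assumed to behave like boto3.client("sagemaker").
    """
    response = sagemaker_client.describe_training_job(TrainingJobName=job_name)
    status = response["TrainingJobStatus"]
    # FailureReason is only present in the response when the job has failed,
    # so fall back to None instead of raising KeyError.
    reason = response.get("FailureReason")
    return status, reason


# Example usage (requires AWS credentials and a configured Region):
#   import boto3
#   client = boto3.client("sagemaker")
#   status, reason = get_failure_reason(client, "my-training-job")  # placeholder name
#   print(status, reason)
```

The text at the start of the returned reason usually indicates which of the following sections applies.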
Internal server error
To make sure that a transient issue doesn't cause the error, retry the job.
If the job fails when you retry it, then view the logs for training jobs on Amazon CloudWatch. Review job metrics, such as CPUUtilization, MemoryUtilization, and DiskUtilization, to check if the failure occurred because of a resource limitation. You can also view the training job logs and job metrics on the SageMaker AI console.
If CPUUtilization or MemoryUtilization is high, then use a larger training job instance size. If DiskUtilization is high, then increase the VolumeSizeInGB parameter when you create the training job.
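As an illustration of the disk adjustment, the following sketch builds a ResourceConfig dictionary with a larger VolumeSizeInGB value that you could pass to the CreateTrainingJob API. The helper name with_larger_volume, the instance type, and the sizes are assumptions chosen for the example, not recommended values.

```python
import copy


def with_larger_volume(resource_config, new_size_gb):
    """Return a copy of a ResourceConfig dict with a larger VolumeSizeInGB."""
    current = resource_config.get("VolumeSizeInGB", 0)
    if new_size_gb <= current:
        raise ValueError(
            f"new size {new_size_gb} GB must exceed the current {current} GB"
        )
    # Copy so that the caller's original configuration is left unchanged.
    updated = copy.deepcopy(resource_config)
    updated["VolumeSizeInGB"] = new_size_gb
    return updated


# Illustrative values only.
resource_config = {
    "InstanceType": "ml.m5.xlarge",
    "InstanceCount": 1,
    "VolumeSizeInGB": 30,
}
bigger = with_larger_volume(resource_config, 100)
```

You can then pass the updated dictionary as the ResourceConfig argument when you create the new training job.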
Instance capacity error
If the training job fails with an instance capacity error, then there isn't enough on-demand capacity to complete the job. For more information, see How do I troubleshoot an insufficient capacity error when launching my Amazon SageMaker AI resources?
To resolve the error, take one of the following actions:
- Wait, and then retry your request later. Capacity issues are often transient and might resolve when you retry.
- Switch to a different instance type or size that has available capacity.
- Launch the training job in a different AWS Region.
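The second action in the preceding list can be sketched as a small helper that picks an alternative instance type after a capacity failure. This is a sketch under stated assumptions: it assumes that the failure reason text contains the string CapacityError, and the helper name next_instance_type and the instance types shown are hypothetical.

```python
def next_instance_type(failure_reason, current_type, fallback_types):
    """Pick another instance type to try after a capacity failure.

    Returns None when the failure is not a capacity error or when no
    alternative instance type is left in `fallback_types`.
    """
    # Assumption: capacity failures include "CapacityError" in the reason text.
    if not failure_reason or "CapacityError" not in failure_reason:
        return None
    remaining = [t for t in fallback_types if t != current_type]
    return remaining[0] if remaining else None


# Illustrative values only.
choice = next_instance_type(
    "CapacityError: Unable to provision requested ML compute capacity.",
    "ml.p3.2xlarge",
    ["ml.p3.2xlarge", "ml.g5.2xlarge"],
)
```

In this example, choice is the first fallback type that differs from the instance type that failed. You would still need to confirm that the alternative instance type suits your workload and quotas.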
MaxRuntimeExceeded error
The default maximum runtime for a training job is 1 day. You can increase the runtime to a maximum of 28 days. To increase the maximum runtime value, set the MaxRuntimeInSeconds parameter in the StoppingCondition of the CreateTrainingJob API or the max_run parameter in your SageMaker AI Python SDK Estimator. For more information, see Estimators on the Amazon SageMaker Python SDK website.
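The following sketch shows one way to convert a runtime in days into the seconds value that MaxRuntimeInSeconds expects and to check it against the 28-day limit described above. The helper name stopping_condition is hypothetical. The resulting dictionary is intended as the StoppingCondition argument of CreateTrainingJob, and the same seconds value could be passed as max_run to an Estimator.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MAX_RUNTIME_DAYS = 28  # upper limit stated in this article


def stopping_condition(days):
    """Build a StoppingCondition dict for a runtime given in days."""
    if not 0 < days <= MAX_RUNTIME_DAYS:
        raise ValueError(
            f"runtime must be greater than 0 and at most {MAX_RUNTIME_DAYS} days"
        )
    return {"MaxRuntimeInSeconds": int(days * SECONDS_PER_DAY)}


# Three days is an illustrative value, not a recommendation.
condition = stopping_condition(3)
print(condition)  # {'MaxRuntimeInSeconds': 259200}
```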
Related information
Logs for built-in algorithms