How do I troubleshoot errors that I receive when I run SageMaker AI training jobs?

I want to troubleshoot errors that I receive when I run Amazon SageMaker AI training jobs.

Resolution

To identify why your SageMaker AI training job failed, check the failure reason on the SageMaker AI console or the FailureReason field in the DescribeTrainingJob API response. Then, complete the resolution steps for the error that you received.
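As a minimal sketch of the API route, the following helper reads the job status and failure reason from a DescribeTrainingJob response. The function name get_failure_reason is hypothetical, and the client is passed in as an argument so that the helper stays independent of how you create it.

```python
def get_failure_reason(sagemaker_client, job_name):
    """Return (status, failure_reason) for a SageMaker AI training job.

    sagemaker_client is expected to behave like boto3.client("sagemaker").
    """
    response = sagemaker_client.describe_training_job(TrainingJobName=job_name)
    status = response["TrainingJobStatus"]
    # FailureReason is present only when the job failed, so use .get().
    return status, response.get("FailureReason")
```

To use the helper, create the client with boto3.client("sagemaker") and pass your training job name. For a job that didn't fail, the second value is None.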

Internal server error

To rule out a transient issue as the cause of the error, retry the job.

If the job fails when you retry it, then view the logs for training jobs on Amazon CloudWatch. Review job metrics, such as CPUUtilization, MemoryUtilization, and DiskUtilization, to check if the failure occurred because of a resource limitation. You can also view the training job logs and job metrics on the SageMaker AI console.

If CPUUtilization or MemoryUtilization is high, then use a larger training job instance size. If DiskUtilization is high, then increase the VolumeSizeInGB parameter when you create the training job.
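The metric review can also be scripted. The following sketch collects the peak value of each metric through the CloudWatch GetMetricStatistics API. It assumes that the instance metrics are published in the /aws/sagemaker/TrainingJobs namespace with a Host dimension of the form job-name/algo-1, so verify both against your own account before you rely on it. The function name peak_utilization and its default values are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: namespace and dimension format used for training instance metrics.
NAMESPACE = "/aws/sagemaker/TrainingJobs"
METRICS = ("CPUUtilization", "MemoryUtilization", "DiskUtilization")


def peak_utilization(cloudwatch_client, job_name, host="algo-1", hours=24):
    """Return {metric_name: peak value or None} for one training host.

    cloudwatch_client is expected to behave like boto3.client("cloudwatch").
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    peaks = {}
    for metric in METRICS:
        response = cloudwatch_client.get_metric_statistics(
            Namespace=NAMESPACE,
            MetricName=metric,
            Dimensions=[{"Name": "Host", "Value": f"{job_name}/{host}"}],
            StartTime=start,
            EndTime=end,
            Period=300,  # 5-minute buckets keep the datapoint count small
            Statistics=["Maximum"],
        )
        values = [point["Maximum"] for point in response["Datapoints"]]
        # None means CloudWatch returned no datapoints for the time window.
        peaks[metric] = max(values) if values else None
    return peaks
```

A peak DiskUtilization value close to 100 suggests that you should increase VolumeSizeInGB. High CPU or memory peaks suggest that you should use a larger instance size.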

Instance capacity error

If the training job fails with an instance capacity error, then there isn't enough on-demand capacity to complete the job. For more information, see How do I troubleshoot an insufficient capacity error when launching my Amazon SageMaker AI resources?

To resolve the error, take one of the following actions:

  • Wait, and then retry your request. Capacity issues are often transient and might resolve when you retry.
  • Switch to a different instance type or size with more capacity.
  • Launch the training job in a different AWS Region.
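The first two actions can be combined into a retry loop, sketched below. The run_job callable is hypothetical: it stands for your own code that starts a training job on the given instance type, waits for it to finish, and returns the status and failure reason. The check for the text "CapacityError" in the failure reason is also an assumption, so confirm it against the failure reason that your job reports.

```python
import time


def is_capacity_error(failure_reason):
    # Assumption: capacity failures mention "CapacityError" in the failure reason.
    return "CapacityError" in (failure_reason or "")


def run_with_fallback(run_job, instance_types, retries_per_type=2,
                      delay_seconds=600, sleep=time.sleep):
    """Retry on capacity errors, then fall back to the next instance type.

    run_job(instance_type) is a hypothetical callable that returns
    (status, failure_reason) after the job reaches a terminal state.
    Returns (instance_type, status, failure_reason) for the first attempt
    that doesn't fail with a capacity error, or (None, "Failed", message)
    if every candidate runs out of retries.
    """
    for instance_type in instance_types:
        for attempt in range(retries_per_type):
            status, reason = run_job(instance_type)
            if status == "Completed" or not is_capacity_error(reason):
                # Success, or a failure that retrying for capacity won't fix.
                return instance_type, status, reason
            if attempt < retries_per_type - 1:
                # Wait before retrying the same instance type.
                sleep(delay_seconds)
    return None, "Failed", "No capacity on any candidate instance type"
```

The loop returns immediately for failures that aren't capacity errors, because waiting or switching instance types doesn't address them. To try a different AWS Region, run the same loop with a run_job callable that targets that Region.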

MaxRuntimeExceeded error

The default maximum runtime for a training job is 1 day. You can adjust the runtime to a maximum of 28 days. To increase the maximum runtime value, pass the MaxRuntimeInSeconds parameter in the CreateTrainingJob API or the max_run parameter in your SageMaker AI Python SDK Estimator. For more information, see Estimators on the Amazon SageMaker Python SDK website.
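As a small sketch, the following helper converts a runtime in days to seconds and checks it against the 28-day limit stated above. The helper name stopping_condition is hypothetical. The comments show where the value goes in each interface.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MAX_RUNTIME_LIMIT_SECONDS = 28 * SECONDS_PER_DAY  # 28-day limit from this article


def stopping_condition(days):
    """Build a StoppingCondition dictionary for a runtime given in days."""
    seconds = int(days * SECONDS_PER_DAY)
    if not 0 < seconds <= MAX_RUNTIME_LIMIT_SECONDS:
        raise ValueError("Maximum runtime must be more than 0 and at most 28 days")
    return {"MaxRuntimeInSeconds": seconds}


# CreateTrainingJob API (boto3):
#   client.create_training_job(..., StoppingCondition=stopping_condition(3))
# SageMaker Python SDK:
#   Estimator(..., max_run=stopping_condition(3)["MaxRuntimeInSeconds"])
```

For example, a 3-day runtime is 259,200 seconds.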

Related information

Logs for built-in algorithms

AWS OFFICIAL
Updated 2 months ago