How do I troubleshoot the "ResourceLimitExceeded" error in SageMaker?

I want to troubleshoot the "ResourceLimitExceeded" error in Amazon SageMaker.

Resolution

When you create a SageMaker resource, you might get the ResourceLimitExceeded error. These resources include a SageMaker training job, a processing job, an endpoint for hosting, or a Studio app. You might also receive the error when you change the instance configuration of an existing resource.

Example error:

"The account-level service limit 'ml.m5.xlarge for endpoint usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit."
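The three numbers in this message are the quota value, your current usage, and the amount that the failed request tried to add. As a minimal sketch, the following Python snippet extracts those fields with a regular expression. The pattern is based only on the example message shown here, and the names `LimitError` and `parse_limit_error` are hypothetical. Other variants of the error (for example, a generic "Resource limits for this account have been exceeded" message) don't match and return `None`.

```python
import re
from typing import NamedTuple, Optional


class LimitError(NamedTuple):
    """Fields extracted from a ResourceLimitExceeded message (hypothetical helper type)."""
    quota_name: str
    limit: int
    utilization: int
    request_delta: int


# Pattern derived from the example message only; other message formats won't match.
_PATTERN = re.compile(
    r"service limit '(?P<name>[^']+)' is (?P<limit>\d+) \w+, "
    r"with current utilization of (?P<used>\d+) \w+ "
    r"and a request delta of (?P<delta>\d+) \w+"
)


def parse_limit_error(message: str) -> Optional[LimitError]:
    """Return the quota name and counts from the message, or None if it doesn't match."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    return LimitError(
        quota_name=match["name"],
        limit=int(match["limit"]),
        utilization=int(match["used"]),
        request_delta=int(match["delta"]),
    )


EXAMPLE = (
    "The account-level service limit 'ml.m5.xlarge for endpoint usage' is "
    "0 Instances, with current utilization of 0 Instances and a request "
    "delta of 1 Instances. Please contact AWS support to request an "
    "increase for this limit."
)

parsed = parse_limit_error(EXAMPLE)
# The request fails when utilization plus the requested delta is above the limit.
over_limit = parsed.utilization + parsed.request_delta > parsed.limit
```

In the example, the quota value is 0, so a request for a single instance exceeds it even though nothing is currently in use.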

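This error occurs when a request exceeds the account-level service quota that applies to the SageMaker resource. Quotas are specific to your AWS account, AWS Region, and usage type, so a quota for an instance type under one usage type (for example, endpoint usage) doesn't apply to another usage type (for example, training job usage) or to another Region. To resolve this error, complete the following steps: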

  1. Open the Service Quotas console.
    Note: To use the Service Quotas console, your user or role must have corresponding AWS Identity and Access Management (IAM) permissions.
  2. In the navigation pane, choose AWS services.
  3. In the Search bar, enter Amazon SageMaker. Then, choose Amazon SageMaker.
  4. Select the quota that you want to increase. For the example error message, select ml.m5.xlarge for endpoint usage.
  5. Choose Request increase at account-level.
  6. For Increase quota value, enter the desired value.
  7. Choose Request.

This sends your request to AWS Support. Based on your use case and current usage, AWS Support approves, partially approves, or denies your request.
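The console steps above can also be approximated with the Service Quotas API. The following is a minimal sketch that uses boto3, not an official procedure. It assumes that boto3 is installed, that your credentials have Service Quotas permissions, that the service code for SageMaker is `sagemaker`, and that the quota name in Service Quotas matches the name quoted in the error message. The functions `find_quota` and `request_increase` are hypothetical helpers.

```python
from typing import Iterable, Optional


def find_quota(quotas: Iterable[dict], quota_name: str) -> Optional[dict]:
    """Return the first quota whose QuotaName equals quota_name, or None."""
    for quota in quotas:
        if quota.get("QuotaName") == quota_name:
            return quota
    return None


def request_increase(quota_name: str, desired_value: float, region: str) -> str:
    """Request a quota increase in one Region and return the request ID.

    Assumes the service code "sagemaker" and that quota_name matches the
    name shown in the error message.
    """
    import boto3  # imported here so find_quota works without boto3 installed

    client = boto3.client("service-quotas", region_name=region)
    paginator = client.get_paginator("list_service_quotas")
    # Quotas are returned in pages, so flatten them into one stream.
    all_quotas = (
        quota
        for page in paginator.paginate(ServiceCode="sagemaker")
        for quota in page["Quotas"]
    )
    quota = find_quota(all_quotas, quota_name)
    if quota is None:
        raise LookupError(f"No quota named {quota_name!r} in {region}")

    response = client.request_service_quota_increase(
        ServiceCode="sagemaker",
        QuotaCode=quota["QuotaCode"],
        DesiredValue=desired_value,
    )
    return response["RequestedQuota"]["Id"]
```

For the example error message, a call might look like `request_increase("ml.m5.xlarge for endpoint usage", 1, "us-east-1")`, where the Region is only an illustration. Because quotas are Region-specific, use the Region where the error occurred.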

Related information

AWS service quotas

SageMaker service quotas

CreateTrainingJob

InstanceGroup

AWS OFFICIAL
Updated 5 months ago
11 Comments

I followed these steps for ml.g5.4xlarge for notebook instance usage. However, in step 6 I see that the quota is already at 1, contradicting the error message I get when trying to spin up the respective notebook instance.

How can I fix this?

replied 2 years ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied 2 years ago

Hello, I have a similar error. However, I am not sure which quota I should increase, because the error message doesn't specify it. Here is the error: "botocore.errorfactory.ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: Resource limits for this account have been exceeded. Please contact Customer Support for assistance."

replied 2 years ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
EXPERT
replied 2 years ago

How can we catch the "ResourceLimitExceeded" event and convert it into a CloudWatch alarm?

replied 2 years ago

I have a similar issue. My account quota is 4 for the selected instance type, but I still get the error stating that it is 0.

AWS
EXPERT
replied 2 years ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
EXPERT
replied 2 years ago

I have a similar issue: ResourceLimitExceeded

The account-level service limit 'ml.p3.2xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please use AWS Service Quotas to request an increase for this quota. If AWS Service Quotas is not available, contact AWS support to request an increase for this quota.

My problem is that I haven't used any resources yet. It seems a bit odd to request more.

replied a year ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied a year ago

The above solutions seem like good options, but try this first:

  1. Delete the SageMaker endpoint (from the SageMaker dashboard).
  2. Restart the kernel. It should automatically recreate an endpoint and reset the resource pool count.

replied a year ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied a year ago