
SageMaker pipeline stopped working with error "ClientError: Failed to invoke sagemaker:CreateTrainingJob. Error Details: Instance type is not supported by algorithm sagemaker-distribution"

We have a SageMaker pipeline triggered by a Lambda function that sets up and calls the pipeline using the SageMaker SDK. It has been working for months and last ran without problems 3 days ago. Since its next run, less than 20 hours ago, it fails every time with the error: ClientError: Failed to invoke sagemaker:CreateTrainingJob. Error Details: Instance type ml.c6i.12xlarge is not supported by algorithm sagemaker-distribution-prod.

There has been no change to any of the code or configuration in that time.

The code specifies the image as image_uri: str = "542918446943.dkr.ecr.us-west-2.amazonaws.com/sagemaker-distribution-prod:1-cpu"

We are at a loss to figure out how to fix this.

1 Answer
Accepted Answer

This issue appears to be related to a sudden incompatibility between the specified instance type and the SageMaker distribution algorithm. Even though your code and configuration haven't changed, it's possible that there have been updates on the AWS side that have affected the compatibility of certain instance types with specific algorithms.

To resolve this issue, you can try the following steps:

  1. Check if the instance type 'ml.c6i.12xlarge' is still supported for training jobs in your region. It's possible that AWS has made changes to the available instance types for SageMaker.

  2. Verify if there have been any recent updates to the SageMaker distribution algorithm that might have changed its instance type compatibility.

  3. Try using a different compatible instance type. For example, you could try using an ml.c5 instance instead of ml.c6i.

  4. Ensure that your SageMaker SDK is up to date. Sometimes, updating to the latest version can resolve compatibility issues.

  5. Check your IAM permissions to make sure you still have the necessary permissions to create training jobs with the specified instance type.

  6. Review the SageMaker quotas for your account to ensure you haven't reached any limits that might prevent you from using certain instance types.

If the issue persists after trying these steps, it would be advisable to contact AWS support for further assistance. They can provide more specific information about any recent changes that might have affected your pipeline and help you find a suitable solution.
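Steps 3 and 6 above can be checked from a script. The following is a minimal sketch, not a confirmed diagnosis for this error: it lists the account's SageMaker quotas through the Service Quotas API and picks out the training-job entry for one instance type. The helper names are hypothetical, and the quota name format ("<instance type> for training job usage") is an assumption about how the entries are labelled in your account.

```python
def fetch_sagemaker_quotas(region_name):
    """Return all SageMaker quota entries for a region (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the filtering helper works without boto3 installed

    client = boto3.client("service-quotas", region_name=region_name)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    for page in paginator.paginate(ServiceCode="sagemaker"):
        quotas.extend(page["Quotas"])
    return quotas


def training_quotas_for(quotas, instance_type):
    """Map quota name to value for training-job quotas of one instance type.

    Assumes names look like 'ml.c6i.12xlarge for training job usage'.
    """
    prefix = instance_type + " "
    return {
        q["QuotaName"]: q["Value"]
        for q in quotas
        if q["QuotaName"].startswith(prefix)
        and "training job" in q["QuotaName"].lower()
    }


if __name__ == "__main__":
    # Region taken from the image URI in the question.
    found = training_quotas_for(fetch_sagemaker_quotas("us-west-2"), "ml.c6i.12xlarge")
    print(found or "No matching training quota found; the name format may differ.")
```

An empty result or a value of 0 would point at a quota problem rather than an SDK or image problem, which helps narrow down which of the steps above is worth pursuing.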

answered a year ago
EXPERT
reviewed a year ago
  • Turns out it was step 4:

    4. Ensure that your SageMaker SDK is up to date. Sometimes, updating to the latest version can resolve compatibility issues.

    Updated the sagemaker python dependency to "^2.232.3" (latest) from "^2.220.0" and it started working again.

    Not quite sure why the working code stopped working even though the image hadn't changed...
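Since the fix here was an SDK upgrade, a guard in the Lambda can make a stale SDK fail loudly before the pipeline is submitted. This is a minimal sketch under assumptions: 2.232.3 is simply the version the asker upgraded to, not a documented minimum, and the helper names are hypothetical.

```python
import re
from importlib import metadata

# Version the asker reported as working; treat it as an assumed floor, not an official one.
MIN_SAGEMAKER_VERSION = "2.232.3"


def parse_version(text):
    """Turn '2.232.3' into (2, 232, 3), stopping at the first non-numeric component."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_SAGEMAKER_VERSION):
    """True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_sagemaker_version():
    """Return the installed sagemaker package version, or None if it is not installed."""
    try:
        return metadata.version("sagemaker")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_sagemaker_version()
    if version is None:
        print("sagemaker is not installed in this environment")
    elif not meets_minimum(version):
        print(f"sagemaker {version} is older than {MIN_SAGEMAKER_VERSION}; consider upgrading")
    else:
        print(f"sagemaker {version} is at or above {MIN_SAGEMAKER_VERSION}")
```

Comparing integer tuples rather than raw strings matters here: as strings, "2.99.0" would sort after "2.232.3" even though it is the older release.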
