SageMaker Inference Endpoint Creation Failure

I'm trying to launch an inference endpoint on an ml.g4dn.2xlarge instance with a ~44GB model image. Every time I create the endpoint, it stays in the Creating state for 30-40 minutes and then fails with the error "Request to service failed. If failure persists after retry, contact customer support." There are no logs available in CloudWatch. I get the same error for both real-time and async endpoints.

I can use the same image in a Batch Transform job and the service starts up without throwing errors.

I'd love to know how to solve this issue, but I'd settle for a way to get debugging information.
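As a starting point for collecting debugging information, here is a minimal boto3 sketch that reads the endpoint's status and `FailureReason` from `DescribeEndpoint` and checks whether any log streams exist under the endpoint's CloudWatch log group. The endpoint name and region are placeholders, and the helper names `summarize_endpoint` and `fetch_debug_info` are made up for illustration. It may well return nothing beyond the generic error already quoted, but it at least confirms whether the container ever produced logs.

```python
from typing import Any, Dict, Optional


def summarize_endpoint(description: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Extract the debugging-relevant fields from a DescribeEndpoint response."""
    return {
        "status": description.get("EndpointStatus"),
        # FailureReason is only present once the endpoint has failed.
        "failure_reason": description.get("FailureReason"),
        # Log group where SageMaker hosting writes container logs, if any exist.
        "log_group": "/aws/sagemaker/Endpoints/" + description["EndpointName"],
    }


def fetch_debug_info(endpoint_name: str, region: str) -> Dict[str, Any]:
    """Ask SageMaker and CloudWatch Logs what they know about the endpoint."""
    import boto3  # lazy import: summarize_endpoint stays usable without boto3

    sagemaker = boto3.client("sagemaker", region_name=region)
    logs = boto3.client("logs", region_name=region)

    info: Dict[str, Any] = summarize_endpoint(
        sagemaker.describe_endpoint(EndpointName=endpoint_name)
    )
    try:
        response = logs.describe_log_streams(logGroupName=info["log_group"])
        info["log_streams"] = [s["logStreamName"] for s in response["logStreams"]]
    except logs.exceptions.ResourceNotFoundException:
        # A missing log group likely means the container never started logging.
        info["log_streams"] = []
    return info


# Example call (placeholder values, not taken from the question):
# print(fetch_debug_info("my-endpoint", "us-east-1"))
```

If `log_streams` comes back empty, the failure is probably happening before the container's process starts writing output, which would be consistent with the missing CloudWatch logs described above.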

  • Have you tried a different instance type or a different AZ?

  • I have tried ml.g4dn.2xlarge and ml.g4dn.xlarge. In my case, I need to use the g4dn family of machines.

  • How are you deploying your model? Can you share example code?

  • I'm just using the console to create the Model, EndpointConfig, and Endpoint right now, so no code to share for those steps.

    I spent time analyzing the CloudTrail events, and I can see that SageMaker downloads all of the image layers four times before finally failing. None of the API calls recorded in CloudTrail report errors, but I think a failure must be occurring after the image layers have been downloaded, and that failure is triggering a series of retries. I'm stumped as to what that failure might be, since there are no events or logs associated with it.
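For anyone who wants to reproduce the console steps in code, the Model, EndpointConfig, and Endpoint described above could be created with boto3 roughly as sketched below. All names, the image URI, and the role ARN are placeholders. The sketch also sets `ContainerStartupHealthCheckTimeoutInSeconds` on the production variant; that a large image needs a longer startup window is only a guess worth ruling out, not a confirmed cause of the failure. The helper names `build_requests` and `create_all` are hypothetical.

```python
from typing import Any, Dict


def build_requests(
    name: str,
    image_uri: str,
    role_arn: str,
    instance_type: str = "ml.g4dn.2xlarge",
    startup_timeout_s: int = 3600,  # assumed to be within the allowed range
) -> Dict[str, Dict[str, Any]]:
    """Build the three request payloads that the console flow creates."""
    model = {
        "ModelName": name,
        "PrimaryContainer": {"Image": image_uri},
        "ExecutionRoleArn": role_arn,
    }
    endpoint_config = {
        "EndpointConfigName": name + "-config",
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                # Assumption: a longer health-check window may help a large image.
                "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout_s,
            }
        ],
    }
    endpoint = {
        "EndpointName": name + "-endpoint",
        "EndpointConfigName": name + "-config",
    }
    return {"model": model, "endpoint_config": endpoint_config, "endpoint": endpoint}


def create_all(requests: Dict[str, Dict[str, Any]], region: str) -> None:
    """Send the three requests in the same order as the console flow."""
    import boto3  # lazy import: build_requests stays usable without boto3

    sagemaker = boto3.client("sagemaker", region_name=region)
    sagemaker.create_model(**requests["model"])
    sagemaker.create_endpoint_config(**requests["endpoint_config"])
    sagemaker.create_endpoint(**requests["endpoint"])
```

Keeping the payload construction separate from the API calls makes it easy to print or diff the exact configuration before anything is created.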

Jeremy
asked a year ago, 113 views
No Answers
