SageMaker Inference Endpoint Creation Failure

I'm trying to launch an inference endpoint on an ml.g4dn.2xlarge with a ~44 GB model image. Every time I create the endpoint, it stays in the Creating state for 30-40 minutes and then fails with the error "Request to service failed. If failure persists after retry, contact customer support." No logs are available in CloudWatch. I get the same error for both real-time and async endpoints.

I can use the same image in a Batch Transform job and the service starts up without throwing errors.

I'd love to know how to solve this issue, but I'd settle for a way to get debugging information.
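For anyone looking for the same debugging information, here is a minimal sketch of two things that can be checked from code, assuming boto3 is installed and credentials are configured. It reads the endpoint's status and `FailureReason` from `describe_endpoint`, and lists any log streams under the log group SageMaker normally uses for endpoints (`/aws/sagemaker/Endpoints/<endpoint-name>`). The helper names and the endpoint name `my-endpoint` are hypothetical, and the failure reason may well be the same generic message shown in the console.

```python
def endpoint_failure_summary(sm_client, endpoint_name):
    """Return the endpoint status and the failure reason, if one is reported."""
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    return {
        "status": resp["EndpointStatus"],
        # FailureReason is only present when the endpoint has failed.
        "failure_reason": resp.get("FailureReason", "(none reported)"),
    }


def endpoint_log_streams(logs_client, endpoint_name):
    """List CloudWatch log stream names for the endpoint, or [] if no log group exists."""
    group = f"/aws/sagemaker/Endpoints/{endpoint_name}"
    try:
        resp = logs_client.describe_log_streams(logGroupName=group)
    except logs_client.exceptions.ResourceNotFoundException:
        # The container never started far enough to create the log group.
        return []
    return [s["logStreamName"] for s in resp["logStreams"]]


def main():
    import boto3  # imported here so the helpers can be used with stub clients

    name = "my-endpoint"  # hypothetical endpoint name
    print(endpoint_failure_summary(boto3.client("sagemaker"), name))
    print(endpoint_log_streams(boto3.client("logs"), name))


if __name__ == "__main__":
    main()
```

An empty list from `endpoint_log_streams` would be consistent with the container never starting, which matches the "no logs in CloudWatch" symptom described above.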

  • Have you tried a different instance type or a different AZ?

  • I have tried ml.g4dn.2xlarge and ml.g4dn.xlarge. In my case, I need to use the g4dn family of machines.

  • How are you deploying your model? Can you share example code?

  • I'm just using the console to create the Model, EndpointConfig, and Endpoint right now, so no code to share for those steps.

    I spent time analyzing the CloudTrail events, and I can see that SageMaker downloads all of the image layers four times before finally failing. None of the API calls visible in CloudTrail report errors, so I think a failure must occur after the image layers have been downloaded, and that failure triggers a series of retries. I'm stumped as to what that failure might be, since there are no events or logs associated with it.
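The CloudTrail check described in the last comment can be sketched in code, assuming boto3 and that ECR image pulls appear in the account's CloudTrail event history as `BatchGetImage` and `GetDownloadUrlForLayer` events. The function names and the two-hour window are hypothetical choices for illustration, not something from the original analysis.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# ECR event names that correspond to pulling an image and its layers.
PULL_EVENT_NAMES = {"BatchGetImage", "GetDownloadUrlForLayer"}


def fetch_ecr_events(ct_client, hours=2):
    """Fetch CloudTrail events from ECR over the last `hours` hours."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    paginator = ct_client.get_paginator("lookup_events")
    pages = paginator.paginate(
        LookupAttributes=[
            {"AttributeKey": "EventSource", "AttributeValue": "ecr.amazonaws.com"}
        ],
        StartTime=start,
        EndTime=end,
    )
    events = []
    for page in pages:
        events.extend(page["Events"])
    return events


def count_pull_events(events):
    """Count pull-related events by name; repeated full pulls suggest retries."""
    return Counter(
        e["EventName"] for e in events if e.get("EventName") in PULL_EVENT_NAMES
    )


def main():
    import boto3  # imported here so the helpers can be used with stub clients

    events = fetch_ecr_events(boto3.client("cloudtrail"))
    for name, count in count_pull_events(events).items():
        print(name, count)


if __name__ == "__main__":
    main()
```

If the `GetDownloadUrlForLayer` count is a multiple of the number of layers in the image, that would be consistent with the repeated downloads described above, though it does not say why each attempt failed.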

Jeremy
asked a year ago · 117 views
No answers
