SageMaker Inference Endpoint Creation Failure

I'm trying to launch an inference endpoint on an ml.g4dn.2xlarge instance with a ~44 GB model image. Every time I create the endpoint, it stays in the Creating state for 30-40 minutes and then fails with the error "Request to service failed. If failure persists after retry, contact customer support." No logs are available in CloudWatch. I get the same error for both real-time and async endpoints.

I can use the same image in a Batch Transform job and the service starts up without throwing errors.

I'd love to know how to solve this issue, but I'd settle for a way to get debugging information.
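One possible way to gather whatever debugging information exists is to read the endpoint's status and FailureReason from the DescribeEndpoint API and to check whether any CloudWatch log streams were created. The sketch below assumes boto3-style clients are passed in; the function name `collect_endpoint_diagnostics` and the returned dictionary keys are hypothetical, and `/aws/sagemaker/Endpoints/<endpoint-name>` is the log group SageMaker normally uses for endpoint containers.

```python
def collect_endpoint_diagnostics(endpoint_name, sagemaker_client, logs_client):
    """Return the endpoint status, failure reason, and CloudWatch log stream names.

    Both clients are expected to behave like boto3.client("sagemaker") and
    boto3.client("logs"). The helper name and return shape are illustrative.
    """
    description = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)

    # Endpoint containers normally log to this group once they start.
    log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"
    try:
        response = logs_client.describe_log_streams(logGroupName=log_group)
        streams = [s["logStreamName"] for s in response.get("logStreams", [])]
    except logs_client.exceptions.ResourceNotFoundException:
        # No log group at all suggests the container never got far enough to log.
        streams = []

    return {
        "status": description.get("EndpointStatus"),
        "failure_reason": description.get("FailureReason"),
        "log_group": log_group,
        "log_streams": streams,
    }
```

With real credentials, this could be called with `boto3.client("sagemaker")` and `boto3.client("logs")`. An empty `log_streams` list would be consistent with the missing CloudWatch logs described above, but it does not by itself explain the failure.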

Hemant Chugh
a year ago
Have you tried a different instance type or a different AZ?
Jeremy
a year ago
I have tried ml.g4dn.2xlarge and ml.g4dn.xlarge. In my case, I need to use the g4dn family of machines.
Arun Lokanatha
a year ago
How are you deploying your model? Can you share example code?
Jeremy
a year ago
I'm just using the console to create the Model, EndpointConfig, and Endpoint right now, so I have no code to share for those steps.
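For reference, the three console steps correspond roughly to the CreateModel, CreateEndpointConfig, and CreateEndpoint API calls. The sketch below is an assumption about how they might be scripted with a boto3-style client, not the poster's actual setup; the helper names, the `-config` suffix, and the variant name are hypothetical. It also sets the two per-variant timeout fields that CreateEndpointConfig accepts, on the unconfirmed guess that a large image might need more startup time.

```python
def build_endpoint_config_request(config_name, model_name,
                                  instance_type="ml.g4dn.2xlarge",
                                  startup_timeout_s=3600,
                                  download_timeout_s=3600):
    """Build keyword arguments for create_endpoint_config (illustrative helper)."""
    # To my knowledge both timeouts must lie between 60 and 3600 seconds.
    for label, value in (("startup_timeout_s", startup_timeout_s),
                         ("download_timeout_s", download_timeout_s)):
        if not 60 <= value <= 3600:
            raise ValueError(f"{label} must be between 60 and 3600, got {value}")

    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",  # hypothetical variant name
            "ModelName": model_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout_s,
            "ModelDataDownloadTimeoutInSeconds": download_timeout_s,
        }],
    }


def deploy(sagemaker_client, model_name, image_uri, role_arn, endpoint_name):
    """Create the Model, EndpointConfig, and Endpoint in order."""
    config_name = f"{endpoint_name}-config"  # hypothetical naming scheme
    sagemaker_client.create_model(
        ModelName=model_name,
        PrimaryContainer={"Image": image_uri},
        ExecutionRoleArn=role_arn,
    )
    sagemaker_client.create_endpoint_config(
        **build_endpoint_config_request(config_name, model_name)
    )
    sagemaker_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_name,
    )
```

Scripting the steps makes the configuration reproducible and easy to share when asking for help. Whether longer timeouts change the outcome for this image is untested here.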

I spent time analyzing the CloudTrail events, and I can see that SageMaker downloads all of the image layers four times before finally failing. None of the API calls recorded in CloudTrail report errors, so I think a failure must be occurring after the image layers have been downloaded, and that failure is triggering a series of retries. I'm stumped as to what it might be, since there are no events or logs associated with it.
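The CloudTrail analysis described above could be sketched as follows: pull events for one event source over the deployment window, count them by name, and list any that carry an `errorCode`. The function names are hypothetical, and the choice of event source (for example `ecr.amazonaws.com` for layer downloads or `sagemaker.amazonaws.com` for endpoint calls) is an assumption about where the relevant events are recorded. Note that `lookup_events` only covers management events from roughly the last 90 days.

```python
import json
from collections import Counter


def fetch_events(cloudtrail_client, event_source, start_time, end_time):
    """Collect CloudTrail events for one event source in a time window."""
    paginator = cloudtrail_client.get_paginator("lookup_events")
    pages = paginator.paginate(
        LookupAttributes=[
            {"AttributeKey": "EventSource", "AttributeValue": event_source}
        ],
        StartTime=start_time,
        EndTime=end_time,
    )
    return [event for page in pages for event in page["Events"]]


def summarize_events(events):
    """Count events by name and list those whose record contains an errorCode."""
    counts = Counter()
    errors = []
    for event in events:
        counts[event["EventName"]] += 1
        # CloudTrailEvent holds the full event record as a JSON string.
        detail = json.loads(event.get("CloudTrailEvent") or "{}")
        if "errorCode" in detail:
            errors.append(
                (event["EventName"], detail["errorCode"], detail.get("errorMessage", ""))
            )
    return counts, errors
```

Repeated bursts of the same layer-download events in `counts` alongside an empty `errors` list would match the pattern reported here: several retries with no recorded API failure.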
