SageMaker Inference Endpoint Creation Failure

I'm trying to launch an inference endpoint on an ml.g4dn.2xlarge with a ~44GB model image. Every time I try to create the endpoint, it stays in the Creating state for 30-40 minutes and then fails with the error "Request to service failed. If failure persists after retry, contact customer support." There are no logs available in CloudWatch. I get the same error for both real-time and async endpoints.

I can use the same image in a Batch Transform job and the service starts up without throwing errors.

I'd love to know how to solve this issue, but I'd settle for a way to get debugging information.
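For anyone looking for the same debugging information, here is a minimal sketch of where I would expect it to show up, assuming boto3 and configured AWS credentials. It reads the `FailureReason` field from `describe_endpoint` and lists any log streams under the conventional endpoint log group, `/aws/sagemaker/Endpoints/<endpoint-name>`. The helper names (`endpoint_log_group`, `summarize_endpoint`, `fetch_debug_info`) are made up for illustration, and there is no guarantee that either source holds more detail than the console error.

```python
def endpoint_log_group(endpoint_name):
    """Return the CloudWatch log group that SageMaker endpoints conventionally write to."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def summarize_endpoint(description):
    """Pick the status and failure reason out of a describe_endpoint response dict."""
    return {
        "status": description.get("EndpointStatus"),
        # FailureReason is only present when the endpoint has failed.
        "failure_reason": description.get("FailureReason", "(none reported)"),
    }


def fetch_debug_info(endpoint_name):
    """Query SageMaker and CloudWatch Logs; requires boto3 and AWS credentials."""
    import boto3  # imported here so the helpers above work without boto3 installed

    sm = boto3.client("sagemaker")
    logs = boto3.client("logs")

    info = summarize_endpoint(sm.describe_endpoint(EndpointName=endpoint_name))
    try:
        streams = logs.describe_log_streams(
            logGroupName=endpoint_log_group(endpoint_name)
        )
        info["log_streams"] = [s["logStreamName"] for s in streams["logStreams"]]
    except logs.exceptions.ResourceNotFoundException:
        # The log group is never created if the container did not start.
        info["log_streams"] = []
    return info
```

Calling `fetch_debug_info("my-endpoint")` (a placeholder name) after a failed creation returns the status, the failure reason, and any log stream names that exist.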

  • Have you tried a different instance type or a different AZ?

  • I have tried ml.g4dn.2xlarge and ml.g4dn.xlarge. In my case, I need to use the g4dn family of machines.

  • How are you deploying your model? Can you share example code?

  • I'm just using the console to create the Model, EndpointConfig, and Endpoint right now, so no code to share for those steps.

    I spent time analyzing the CloudTrail events, and I can see that SageMaker goes through the process of downloading all of the image layers four times before finally failing. None of the API calls available in CloudTrail report errors, but I think there must be a failure that's happening after the image layers have been downloaded that is triggering a series of retries. I'm stumped as to what that failure might be, since there are no events or logs associated with it.
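Since the console steps leave no code to share, here is a rough boto3 equivalent of the EndpointConfig and Endpoint steps, as a sketch only. It assumes the Model already exists, and it sets `ContainerStartupHealthCheckTimeoutInSeconds` on the production variant on the guess that a large image needs longer to start than the default allows. Whether that is the actual cause of the retries described above is unknown. The function names and the `AllTraffic` variant name are illustrative.

```python
def build_endpoint_config(config_name, model_name,
                          instance_type="ml.g4dn.2xlarge",
                          startup_timeout_s=3600):
    """Build the request parameters for create_endpoint_config.

    startup_timeout_s is passed as ContainerStartupHealthCheckTimeoutInSeconds;
    3600 is assumed here to be the maximum the service accepts.
    """
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                "ContainerStartupHealthCheckTimeoutInSeconds": startup_timeout_s,
            }
        ],
    }


def create_endpoint(endpoint_name, config_name, model_name):
    """Create the endpoint config and endpoint; requires boto3 and AWS credentials."""
    import boto3  # imported here so build_endpoint_config works without boto3

    sm = boto3.client("sagemaker")
    sm.create_endpoint_config(**build_endpoint_config(config_name, model_name))
    sm.create_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)
```

Scripting the steps this way also makes it easier to retry with a different instance type or timeout and compare the outcomes.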

Jeremy
Asked a year ago · 117 views
No answers
