SageMaker endpoint running but constantly restarting


I have deployed a model to a SageMaker endpoint using BentoML/bentoctl, which are tools for building APIs and containerizing models. To test, I send a request with curl and a JSON payload. When I run the generated Docker container on my local machine, I can invoke it successfully and get responses back, so I don't think the problem is in the Docker image.

When I deploy to SageMaker, I receive {"message":"Service Unavailable"} in response to my curl request. The endpoint shows as running in the SageMaker Endpoints dashboard. In the CloudWatch logs, however, it appears that the endpoint is constantly restarting: messages that are printed at startup (e.g. TensorFlow loading messages) are written to the log over and over.

I thought that this might be due to using an instance type with low memory (t2.medium) so I switched to m5.4xlarge as a test, but the result is the same.

What can I do? How can I determine what's causing the endless restarts?

2 Answers

What do you mean by restart? Is the endpoint going into the "Updating" state? Do you have an autoscaling policy attached to the endpoint? Do you see any errors in the CloudWatch logs?
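To look for errors in the logs, it may help to pull the endpoint's log events and check what was written just before each repeated startup message. The following is a minimal sketch, not a definitive diagnostic. It assumes boto3 is installed with credentials configured, and that the endpoint logs to the usual `/aws/sagemaker/Endpoints/<endpoint-name>` log group. The endpoint name, the marker string, and the helper names (`count_startups`, `lines_before_restarts`, `fetch_messages_by_stream`) are all hypothetical, chosen here for illustration.

```python
from collections import defaultdict


def endpoint_log_group(endpoint_name):
    # Assumption: SageMaker's default log group naming for endpoints.
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def count_startups(messages, marker):
    """Count messages containing `marker`, a string printed once per container start."""
    return sum(1 for message in messages if marker in message)


def lines_before_restarts(messages, marker, context=5):
    """Return the `context` messages preceding every startup after the first one.

    The last lines before a restart are often where an error or
    out-of-memory message shows up.
    """
    tails = []
    seen_first = False
    for i, message in enumerate(messages):
        if marker not in message:
            continue
        if seen_first:
            tails.append(messages[max(0, i - context):i])
        seen_first = True
    return tails


def fetch_messages_by_stream(endpoint_name, region_name=None):
    """Return {log_stream_name: [message, ...]} for the endpoint's log group."""
    import boto3  # imported lazily so the helpers above work without AWS access

    logs = boto3.client("logs", region_name=region_name)
    paginator = logs.get_paginator("filter_log_events")
    by_stream = defaultdict(list)
    for page in paginator.paginate(logGroupName=endpoint_log_group(endpoint_name)):
        for event in page["events"]:
            by_stream[event["logStreamName"]].append(event["message"])
    return dict(by_stream)


# Example usage (hypothetical endpoint name and marker string):
# for stream, messages in fetch_messages_by_stream("my-endpoint").items():
#     print(stream, count_startups(messages, "Starting server"))
#     for tail in lines_before_restarts(messages, "Starting server"):
#         print("\n".join(tail), "\n---")
```

Messages are grouped by log stream so that startup messages from different instances or containers are not mistaken for restarts of a single one. If the lines before each restart show no error at all, that would suggest the container is being stopped from outside, for example because it is failing SageMaker's health check, rather than crashing on its own.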

answered 2 years ago
answered a year ago

