Hi, I wanted to raise this issue (please redirect me if this is not the right place). I created a SageMaker endpoint and passed an image through it, which causes the error I've attached below. I've also attached a CloudWatch screenshot, which indicates that a function is missing in the pynvml library. I created a requirements.txt that installs nvgpu and pynvml, but the log showed that they were already installed. The most relevant topic I could find is here: https://github.com/pytorch/serve/issues/1813. For completeness, I checked the logs and the TorchServe version is 0.7.1. The last activity on that GitHub issue was last year, so I was curious whether anyone has found a solution. I appreciate any help!
I created the endpoint in SageMaker as follows:
from sagemaker.pytorch.model import PyTorchModel

pytorch_model = PyTorchModel(
    model_data=model_bucket,
    role=role,
    entry_point='inference.py',
    source_dir='code',
    py_version="py39",
    framework_version="1.13",
)
predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)
I then call the endpoint to predict:
import base64

# Load and encode the image
with open('zebra.jpg', 'rb') as img:
    image = img.read()
image_base64 = base64.b64encode(image).decode('utf-8')

response = predictor.predict(image_base64, initial_args={'ContentType': 'application/x-image'})
The specific error message I receive is the following, which directs me to CloudWatch:
An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from primary and could not load the entire response body.
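Since the 500 response only points to CloudWatch, here is a minimal sketch of how the matching log lines could be pulled programmatically. It assumes boto3 is installed with credentials that can read CloudWatch Logs, and that the endpoint writes to the usual /aws/sagemaker/Endpoints/<endpoint-name> log group. The helper names endpoint_log_group and find_log_messages are hypothetical, not part of any SDK.

```python
def endpoint_log_group(endpoint_name):
    """Build the CloudWatch log group name SageMaker uses for an endpoint."""
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"


def find_log_messages(endpoint_name, pattern="pynvml", limit=50):
    """Return up to `limit` log messages containing `pattern`.

    Assumption: boto3 is available and configured; it is imported here
    so the name helper above works without it.
    """
    import boto3

    logs = boto3.client("logs")
    response = logs.filter_log_events(
        logGroupName=endpoint_log_group(endpoint_name),
        filterPattern=pattern,  # plain term match on the log text
        limit=limit,
    )
    return [event["message"] for event in response["events"]]
```

With the predictor from above, something like find_log_messages(predictor.endpoint_name) should list the lines that mention pynvml, which makes it easier to see which function is reported as missing.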
Thank you for the fast reply. I will look into how to update the GPU driver for AWS SageMaker instances. Do you have any suggestions on how to go about this?
It is easier to change your pynvml version. Activate your env/kernel with
conda activate <env_name>
and then run
pip install pynvml==<version>
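Before pinning a version, it may help to confirm what is actually installed in the environment that serves the model. The following is a rough sketch using only the standard library; the function name installed_versions and the default package list are assumptions for illustration. It could be run in the activated env, or called from inference.py (for example inside model_fn) so the result shows up in CloudWatch.

```python
from importlib import metadata


def installed_versions(packages=("pynvml", "nvgpu", "torchserve")):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one line per package so it is easy to spot in a log stream.
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

If the reported pynvml version differs from the one you pinned, the requirements.txt entry was probably skipped because the package already existed in the container, which matches what the original log showed.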
Did you manage to fix this? I am running into the same issue.