SageMaker PyTorch Endpoint: NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)


Hi, I wanted to raise awareness of this (please redirect me if this is not the right place). I created a SageMaker endpoint and passed an image through it, which causes the error shown below. The attached CloudWatch screenshot indicates that a function is missing in the pynvml library. I created a requirements.txt that installs nvgpu and pynvml, but the log showed that both were already installed. The only relevant discussion I could find is https://github.com/pytorch/serve/issues/1813. For completeness, I checked the logs and the TorchServe version is 0.7.1. The last activity on that GitHub issue was last year, so I am curious whether anyone has found a solution. I appreciate any help!

I created the endpoint in SageMaker as follows:

from sagemaker.pytorch.model import PyTorchModel

pytorch_model = PyTorchModel(
    model_data=model_bucket,
    role=role,
    entry_point='inference.py',
    source_dir='code',
    py_version="py39",
    framework_version="1.13",
)

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)

I then call the endpoint to predict:

# Load and encode the image
import base64

with open('zebra.jpg', 'rb') as img:
    image = img.read()

image_base64 = base64.b64encode(image).decode('utf-8')

response = predictor.predict(image_base64, initial_args={'ContentType': 'application/x-image'})

The error message I receive is the following, which directs me to CloudWatch:

An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from primary and could not load the entire response body.

[Screenshot: CloudWatch error message]

Samuel
asked a year ago, 649 views
1 Answer

Hi, have a look at the comment on https://stackoverflow.com/questions/73591281/nvml-cannot-load-methods-nvmlerror-functionnotfound

"update your GPU driver to the latest. pynvml 11.4.1 expects a driver install that is consistent with CUDA 11.4"

You may need to do the same with your own configuration.

AWS EXPERT
answered a year ago
  • Thank you for the fast reply. I will look into how to update the GPU driver for AWS SageMaker instances. Do you have any suggestions on how to go about this?

  • It is easier to change your pynvml version. Activate your env/kernel with conda activate <env_name>, then run pip install pynvml==<version>.

  • Did you manage to fix this? I am running into the same issue.
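
Before changing the driver or pinning a pynvml version, it can help to confirm which versions the container actually sees. The following is a minimal diagnostic sketch, not a confirmed fix. The helper names `installed_version` and `probe_nvml` are hypothetical and introduced only for illustration. It relies on `importlib.metadata` plus the pynvml calls `nvmlInit`, `nvmlSystemGetDriverVersion`, and `nvmlShutdown`. It may not reproduce NVML_ERROR_FUNCTION_NOT_FOUND itself, because that error is raised only when a specific NVML function missing from the driver is called.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def probe_nvml():
    """Try to initialise NVML; return (driver_version, error_message).

    Exactly one of the two values is None. Hypothetical helper, not part of
    SageMaker or TorchServe.
    """
    try:
        import pynvml
    except ImportError as exc:
        return None, f"pynvml not importable: {exc}"

    try:
        pynvml.nvmlInit()
    except Exception as exc:  # e.g. no NVIDIA driver or library on this host
        return None, f"{type(exc).__name__}: {exc}"

    try:
        driver = pynvml.nvmlSystemGetDriverVersion()
        # Older pynvml releases return bytes rather than str.
        if isinstance(driver, bytes):
            driver = driver.decode()
        return driver, None
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"
    finally:
        try:
            pynvml.nvmlShutdown()
        except Exception:
            pass  # shutdown failures are not interesting for this probe


if __name__ == "__main__":
    for name in ("pynvml", "nvgpu", "torchserve"):
        print(f"{name}: {installed_version(name)}")
    driver, error = probe_nvml()
    print(f"driver version: {driver}, error: {error}")
```

One possible use, assuming the entry point from the question, is to call these helpers near the top of inference.py so the printed versions appear in the endpoint's CloudWatch logs. The reported driver version can then be compared with what the installed pynvml release expects.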
