SageMaker PyTorch Endpoint: NVMLError(NVML_ERROR_FUNCTION_NOT_FOUND)


Hi, I wanted to raise awareness of this issue (please redirect me if this is not the right place). I created a SageMaker endpoint and passed an image through it, which produces the error below. The attached CloudWatch screenshot indicates that a function is missing in the pynvml library. I added a requirements.txt that installs nvgpu and pynvml, but the log showed that both were already installed. The only relevant thread I could find is https://github.com/pytorch/serve/issues/1813. For completeness, I checked the logs and the TorchServe version is 0.7.1. The last activity on that GitHub issue was last year, so I was curious whether anyone has found a solution. I appreciate any help!

I created the endpoint in SageMaker as follows:

from sagemaker.pytorch.model import PyTorchModel

pytorch_model = PyTorchModel(
    model_data=model_bucket,
    role=role,
    entry_point='inference.py',
    source_dir='code',
    py_version="py39",
    framework_version="1.13",
)

predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",
)

I then call the endpoint to predict:

# Load and encode the image
import base64

with open('zebra.jpg', 'rb') as img:
    image = img.read()

image_base64 = base64.b64encode(image).decode('utf-8')

response = predictor.predict(image_base64, initial_args={'ContentType': 'application/x-image'})

The specific error message I receive is the following, which directs me to CloudWatch:

An error occurred (ModelError) when calling the InvokeEndpoint operation: Received server error (500) from primary and could not load the entire response body.

CloudWatch Error Message

Samuel
asked a year ago, 664 views
1 Answer

Hi, take a look at the comment on https://stackoverflow.com/questions/73591281/nvml-cannot-load-methods-nvmlerror-functionnotfound:

"update your GPU driver to the latest. pynvml 11.4.1 expects a driver install that is consistent with CUDA 11.4"

You may need to do the same with your own configuration.
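To check whether such a mismatch is in play, it may help to log the installed pynvml version next to the driver version that NVML reports, for example from inference.py so the values show up in CloudWatch. The following is only a rough sketch: the helper names installed_version and driver_version are made up for illustration, and it assumes that pynvml can at least be imported in the container.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def driver_version():
    """Return the NVIDIA driver version reported by NVML, or None if unavailable."""
    try:
        import pynvml  # imported lazily so this also runs on hosts without pynvml
    except ImportError:
        return None
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        # No driver, no GPU, or the NVML library could not be loaded.
        return None
    try:
        version = pynvml.nvmlSystemGetDriverVersion()
        # Some pynvml releases return bytes here, others return str.
        return version.decode() if isinstance(version, bytes) else version
    except pynvml.NVMLError:
        return None
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    print("pynvml:", installed_version("pynvml"))
    print("nvgpu:", installed_version("nvgpu"))
    print("driver:", driver_version())
```

If the driver version prints fine but the endpoint still fails with NVML_ERROR_FUNCTION_NOT_FOUND, that would be consistent with pynvml calling an NVML function that the installed driver does not provide.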

AWS
EXPERT
answered a year ago
  • Thank you for the fast reply. I will look into how to update the GPU driver for AWS SageMaker instances. Do you have any suggestions on how to go about this?

  • It is easier to change your pynvml version: activate your environment or kernel with conda activate <env_name>, then run pip install pynvml==<version>.

  • Did you manage to fix this? I am running into the same issue.
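For a deployed endpoint there is no shell to run conda or pip in, so one possible route is to pin the version in the requirements.txt that the original post already places alongside inference.py. Below is a minimal sketch of a helper that rewrites that file before deployment. The name pin_requirement is hypothetical, the path code/requirements.txt is assumed from source_dir='code' in the question, and the thread does not say which pynvml version actually works.

```python
import re
from pathlib import Path


def _requirement_name(line):
    """Extract the lowercase package name from a requirements line."""
    return re.split(r"[=<>!~\[;\s]", line.strip(), maxsplit=1)[0].lower()


def pin_requirement(requirements_path, package, version):
    """Write or replace a `package==version` line in a requirements file.

    Returns the resulting list of lines.
    """
    path = Path(requirements_path)
    lines = path.read_text().splitlines() if path.exists() else []
    # Drop any existing entry for the package, pinned or unpinned.
    kept = [line for line in lines if _requirement_name(line) != package.lower()]
    kept.append(f"{package}=={version}")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(kept) + "\n")
    return kept
```

It would be called as pin_requirement("code/requirements.txt", "pynvml", "<version>") before pytorch_model.deploy(...), with <version> replaced by a release that matches the driver on the instance.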
