SageMaker g4 and g5 instances do not have working NVIDIA drivers

I am a heavy user of g4 and g5 instances on SageMaker (notebook instances). Today, when I tried to use the same instances as I always do, running nvidia-smi returned the following:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

These are exactly the same instances and workloads I have used before. I got the same message when running on EC2 directly as well.
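
One way to narrow this down, sketched here under stated assumptions, is to check whether the nvidia kernel module is loaded at all. The helper name nvidia_module_loaded is hypothetical, and the check assumes the Linux /proc/modules format, where each line starts with the module name followed by a space.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any NVIDIA or AWS tooling):
# reads /proc/modules-style text on stdin and succeeds only if a module
# named exactly "nvidia" is listed (nvidia_uvm, nvidia_drm etc. do not count).
nvidia_module_loaded() {
  grep -q -E '^nvidia '
}

# Usage on a live instance:
#   if nvidia_module_loaded < /proc/modules; then
#     echo "nvidia module is loaded"
#   else
#     echo "nvidia module is NOT loaded for kernel $(uname -r)"
#   fi
```

If the module is not loaded, nvidia-smi has no driver to talk to, which would be consistent with the error message above.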

2 Answers

Hi,

This is an NVIDIA driver issue that affects all NVIDIA functionality.

Could you please try the following commands to unblock yourself:

sudo dkms remove nvidia/510.47.03 --all

sudo dkms install nvidia/510.47.03 -k $(uname -r)

Please let me know whether this works.
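
The version 510.47.03 in the commands above may differ on other AMIs. As a minimal sketch, the version can be read from the output of dkms status instead of being hard-coded. The helper names nvidia_dkms_version and print_rebuild_commands are hypothetical, and the parsing assumes one of the two dkms status line formats shown in the comments.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes `dkms status` prints lines in one of these forms:
#   nvidia, 510.47.03, 5.10.149-133.644.amzn2.x86_64, x86_64: installed   (older dkms)
#   nvidia/510.47.03, 5.10.149-133.644.amzn2.x86_64, x86_64: installed    (newer dkms)

# Hypothetical helper: read `dkms status` text on stdin and print the
# version of the first module named "nvidia".
nvidia_dkms_version() {
  sed -n -E 's@^nvidia[/,] *([0-9][0-9.]*).*@\1@p' | head -n 1
}

# Hypothetical helper: print the two dkms commands for a given driver
# version and kernel so they can be reviewed before being run.
print_rebuild_commands() {
  local version="$1" kernel="$2"
  echo "sudo dkms remove nvidia/${version} --all"
  echo "sudo dkms install nvidia/${version} -k ${kernel}"
}

# Usage on a live instance:
#   version="$(dkms status | nvidia_dkms_version)"
#   print_rebuild_commands "$version" "$(uname -r)"
```

Printing the commands rather than executing them keeps the sketch free of side effects; if the detected version looks right, the printed lines can be run as they are.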

AWS
answered a year ago

It seems the same issue arises when using g5 instances with AWS Batch and the latest Amazon ECS GPU-optimized AMI (2022-11-18). The ECS agent can't start because of driver issues. Is there a timeline for a driver fix?

answered a year ago
