SageMaker g4 and g5 instances do not have working NVIDIA drivers

I am a heavy user of g4 and g5 instances on SageMaker (notebook instances). Today, when I tried to use the same instances as I always do, running nvidia-smi returned the following error:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

These are exactly the same instances and workloads I have used before. I get the same message when running directly on EC2 as well.

2 Answers

Hi,

This is an NVIDIA driver issue that affects all NVIDIA functionality.

Could you please try the following commands to unblock yourself:

sudo dkms remove nvidia/510.47.03 --all

sudo dkms install nvidia/510.47.03 -k $(uname -r)
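The two commands above hard-code driver version 510.47.03, which may not match the version installed on a given instance. As a minimal sketch, assuming the driver is registered with DKMS under the module name `nvidia`, the version could be read from `dkms status` instead of typed by hand. The helper names `parse_nvidia_version` and `rebuild_nvidia_module` are hypothetical, and the parser assumes `dkms status` prints lines shaped like `nvidia, <version>, ...` or `nvidia/<version>, ...`, which can vary between DKMS releases.

```shell
#!/bin/sh
# Sketch only: rebuild the NVIDIA kernel module for the running kernel.
# Assumption: the driver was installed through DKMS as module "nvidia".

# Hypothetical helper: read `dkms status` output on stdin and print the
# first NVIDIA driver version found (empty output if there is none).
parse_nvidia_version() {
  sed -n 's|^nvidia[,/] *\([0-9][0-9.]*\).*|\1|p' | head -n 1
}

# Hypothetical helper: remove and reinstall the detected driver version
# for the current kernel, then check that nvidia-smi can reach the driver.
rebuild_nvidia_module() {
  version="$(dkms status | parse_nvidia_version)"
  if [ -z "$version" ]; then
    echo "No NVIDIA module is registered with DKMS" >&2
    return 1
  fi
  sudo dkms remove "nvidia/${version}" --all
  sudo dkms install "nvidia/${version}" -k "$(uname -r)"
  nvidia-smi
}
```

After sourcing the script, running `rebuild_nvidia_module` performs the same remove and install steps as the commands above, using whichever version DKMS reports.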

Please let me know whether this works.

AWS
answered a year ago

It seems that the same issue arises when using g5 instances with AWS Batch and the latest Amazon ECS GPU-optimized AMI (2022-11-18). The ECS agent can't start because of driver issues. Is there any timeline for a driver fix?

answered a year ago
