SageMaker g4 and g5 instances do not have working NVIDIA drivers


I am a heavy user of g4 and g5 instances on SageMaker (notebook instances). Today, when I tried to use the same instances I always use, running nvidia-smi returned the following:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

These are exactly the same instances and workloads I have used before. I get the same message when running on EC2 directly.

Asked a year ago, 921 views
2 Answers

Hi,

This is an NVIDIA driver issue that affects all NVIDIA functionality.

Could you please try the following commands to unblock yourself:

sudo dkms remove nvidia/510.47.03 --all

sudo dkms install nvidia/510.47.03 -k $(uname -r)
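If the driver version on your instance is not 510.47.03, the two commands above will not match anything DKMS knows about. Here is a rough sketch of the same remove/reinstall step that reads the version from `dkms status` instead of hard-coding it. The helper names `nvidia_dkms_version` and `rebuild_nvidia_module` are made up for illustration, and the parsing assumes `dkms status` prints lines starting with either `nvidia, <version>` or `nvidia/<version>`, so check the output on your own instance first.

```shell
#!/usr/bin/env bash
# Sketch only: rebuild the NVIDIA DKMS module for the running kernel,
# using the version DKMS reports rather than a hard-coded one.

# Hypothetical helper: read `dkms status` output on stdin and print the
# version of the first nvidia entry. Assumes lines look like
# "nvidia, 510.47.03, ..." or "nvidia/510.47.03, ...".
nvidia_dkms_version() {
  sed -n 's|^nvidia[,/] *\([0-9][0-9.]*\).*|\1|p' | head -n 1
}

# Hypothetical helper: remove the module for all kernels, reinstall it for
# the running kernel, then check that nvidia-smi can reach the driver.
rebuild_nvidia_module() {
  local version
  version="$(dkms status | nvidia_dkms_version)"
  if [ -z "$version" ]; then
    echo "no nvidia module registered with DKMS" >&2
    return 1
  fi
  sudo dkms remove "nvidia/${version}" --all
  sudo dkms install "nvidia/${version}" -k "$(uname -r)"
  nvidia-smi
}
```

After sourcing the file, run `rebuild_nvidia_module` on the affected instance. Nothing is executed until you call it.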

Please let me know if this works.

AWS
answered a year ago

It seems that the same issue arises when using g5 instances with AWS Batch on the latest Amazon ECS GPU-optimized AMI (2022-11-18). The ECS agent can't start because of driver issues. Is there a timeline for a driver fix?

answered a year ago
