SageMaker g4 and g5 instances do not have working NVIDIA drivers

3

I am a heavy user of g4 and g5 instances on SageMaker (notebook instances). Today, when I tried to use the same instances as I always do, I was met with the following when running nvidia-smi:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

These are exactly the same instances and workloads I have used before. I got the same message when running on EC2 natively as well.

2 Answers
0

Hi,

This is an NVIDIA driver issue that affects all NVIDIA functionality.

Could you please try the following commands to unblock yourself:

sudo dkms remove nvidia/510.47.03 --all

sudo dkms install nvidia/510.47.03 -k $(uname -r)

Please let me know whether this works.
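
The two commands above hard-code driver version 510.47.03, which may not match what is registered on a given instance. As a minimal sketch, the helpers below read the version from `dkms status` output and print (without running) the same remove/install pair for review. The function names `nvidia_dkms_version` and `nvidia_rebuild_commands` are hypothetical, and the parsing assumes `dkms status` lines begin with either `nvidia, <version>` or `nvidia/<version>`, which can differ between dkms releases.

```shell
#!/bin/sh
# Hypothetical helpers; not part of dkms or any AWS tooling.

# Read `dkms status` output on stdin and print the first NVIDIA module version.
# Assumes lines start with "nvidia, <version>, ..." or "nvidia/<version>, ...".
nvidia_dkms_version() {
    sed -n 's|^nvidia[,/] *\([0-9][0-9.]*\).*|\1|p' | head -n 1
}

# Print (do not execute) the rebuild commands from the answer
# for the given driver version and kernel (default: the running kernel).
nvidia_rebuild_commands() {
    version="$1"
    kernel="${2:-$(uname -r)}"
    printf 'sudo dkms remove nvidia/%s --all\n' "$version"
    printf 'sudo dkms install nvidia/%s -k %s\n' "$version" "$kernel"
}
```

On an affected instance, something like `nvidia_rebuild_commands "$(dkms status | nvidia_dkms_version)"` would print the commands to check before running them by hand. If the version comes back empty, the output format assumption does not hold and the version should be read from `dkms status` manually.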

AWS
answered a year ago
0

It seems that the same issue arises when using g5 instances with AWS Batch and the latest Amazon ECS GPU-optimized AMI (2022-11-18). The ECS agent can't start because of driver issues. Is there a timeline for this driver fix?

answered a year ago
