SageMaker g4 and g5 instances do not have working NVIDIA drivers


I am a heavy user of g4 and g5 instances on SageMaker (notebook instances). Today, when I tried to use the same instances as always, running nvidia-smi returned the following:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

These are the exact same instances and workloads I have used before. I saw the same message when running on EC2 directly.

Asked 1 year ago, 911 views
2 Answers

Hi,

This is an NVIDIA driver issue that affects all NVIDIA functionality.

Could you please try the following commands to unblock yourself:

sudo dkms remove nvidia/510.47.03 --all

sudo dkms install nvidia/510.47.03 -k $(uname -r)
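
If the installed driver version differs from 510.47.03, the same two commands can be driven by whatever version dkms reports. The following is a minimal sketch, not an official AWS fix: the function names nvidia_dkms_version and rebuild_nvidia_module are hypothetical, and it assumes `dkms status` prints lines beginning with either "nvidia, <version>" or "nvidia/<version>".

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `dkms status` output on stdin and print the
# version of the first registered nvidia module.
# Assumed line formats: "nvidia, 510.47.03, <kernel>, ..." or
# "nvidia/510.47.03, <kernel>, ...".
nvidia_dkms_version() {
  sed -n -E 's|^nvidia[,/] *([0-9][0-9.]*).*|\1|p' | head -n 1
}

# Hypothetical wrapper around the two commands above: remove the module
# and reinstall it for the running kernel. Assumes dkms and sudo exist.
rebuild_nvidia_module() {
  local version
  version="$(dkms status | nvidia_dkms_version)"
  if [ -z "$version" ]; then
    echo "no nvidia module registered with dkms" >&2
    return 1
  fi
  sudo dkms remove "nvidia/${version}" --all
  sudo dkms install "nvidia/${version}" -k "$(uname -r)"
}

# Example: rebuild only when nvidia-smi cannot reach the driver.
# nvidia-smi >/dev/null 2>&1 || rebuild_nvidia_module
```

Sourcing the file only defines the functions; nothing is rebuilt until rebuild_nvidia_module is called.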

Please let me know whether this works.

AWS
Answered 1 year ago

It seems that the same issue arises when using g5 instances with AWS Batch and the latest Amazon ECS GPU-optimized AMI (2022-11-18). The ECS agent cannot start because of driver issues. Is there a timeline for a driver fix?

Answered 1 year ago
