
[DLAMI - ISSUE] NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


I'm using a fresh g6f instance and chose the AWS Single CUDA DLAMI. It's supposed to come with the drivers pre-installed, but to my surprise it isn't working.

AMI Name: Deep Learning Base AMI with Single CUDA (Amazon Linux 2023)
Supported EC2 instances: G4dn, G5, G6, Gr6, G6e, P4d, P4de, P5, P5e, P5en, P6-B200, P6-B300
NVIDIA driver version: 580.105.08
CUDA versions available: cuda-13.0
Default CUDA version is 13.0

[ec2-user@ip-~]$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

When I checked the logs:

$ sudo dmesg | grep -i nvidia
[    2.379460] systemd[1]: /etc/systemd/system/nvidia-cdi-refresh.service:26: Ignoring unknown escape sequences: "/(nvidia|nvidia-current).ko[:]"
[    2.920665] nvidia-nvlink: Nvlink Core is being initialized, major device number 240
[    2.923767] NVRM: The NVIDIA vGPU 0000:31:00.0 (PCI ID: 10de:27b8)
               NVRM: nvidia.ko.
               NVRM: Firmware' sections in the NVIDIA Virtual GPU (vGPU)
               NVRM: Software documentation, available at docs.nvidia.com.
[    2.923974] nvidia: probe of 0000:31:00.0 failed with error -1

The same thing happens with the Ubuntu Single CUDA DLAMI. Could someone help, please? I've tried multiple times on fresh instances, including updating with:

sudo dnf install -y kernel-modules-extra-$(uname -r)

sudo depmod -a
sudo modprobe nvidia

sudo reboot

Nothing works.

asked 6 days ago · 66 views
2 Answers
Accepted Answer

As per the AL2023 DLAMI release notes, g6f is not in the list of supported instance types; this DLAMI ships the Tesla driver.

Supported EC2 instances: G4dn, G5, G6, Gr6, G6e, P4d, P4de, P5, P5e, P5en, P6-B200, P6-B300

From "Available drivers by instance type", you need the NVIDIA GRID driver instead. This article can help.
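As a rough sketch of what the GRID driver install looks like, the EC2 User Guide documents pulling the installer from the `s3://ec2-linux-nvidia-drivers` bucket (the instance needs AWS credentials with read access to that bucket). The exact steps and installer filename may differ for your region and driver release, so verify against the current AWS documentation before running:

```shell
# Sketch of the AWS-documented GRID driver install on a Linux instance.
# Assumes the AWS CLI is configured with permission to read the
# ec2-linux-nvidia-drivers bucket; paths/filenames may vary by release.

# Download the latest GRID driver package published by AWS:
aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/latest/ .

# Run the NVIDIA installer (name matches the downloaded .run file):
chmod +x NVIDIA-Linux-x86_64*.run
sudo ./NVIDIA-Linux-x86_64*.run

# Reboot so the new kernel modules load cleanly:
sudo reboot
```

After the reboot, `nvidia-smi` should be able to talk to the driver if the GRID driver matches the vGPU profile exposed to the g6f instance.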

AWS EXPERT · answered 6 days ago
EXPERT · reviewed 5 days ago

It appears you're experiencing an issue with the NVIDIA driver on your G6f instance using the Deep Learning AMI. The error message indicates that the NVIDIA driver is not communicating properly with the GPU.

Looking at your logs, I can see the key issue: nvidia: probe of 0000:31:00.0 failed with error -1 and the message about "NVIDIA Virtual GPU (vGPU)". This suggests that there might be a compatibility issue between the driver and the specific GPU in your G6f instance.

G6f instances expose a fractional NVIDIA L4 GPU through NVIDIA vGPU, so they require specific driver support. Here are some steps to troubleshoot:

  1. First, verify that the G6f instance type is fully supported by the specific DLAMI version you're using. While the MOTD mentions G6 support, there might be specific requirements for the G6f variant.

  2. Consider using the AWS Deep Learning OSS Nvidia Driver AMI instead of the Base AMI with Single CUDA. The OSS Nvidia Driver AMI typically has better compatibility with various GPU instance types.

  3. If you continue to have issues, you could try installing the NVIDIA driver manually, ensuring it's compatible with the L4 GPU in your G6f instance.

  4. Another approach would be to use a different DLAMI version that might have better compatibility with G6f instances.

It's worth noting that sometimes newly released instance types or GPU variants might require updated drivers or specific configurations that aren't yet included in all DLAMI versions. In such cases, using the latest available DLAMI or manually updating the drivers might be necessary.
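Before reinstalling anything, a few quick checks can confirm which GPU the instance exposes and whether any driver is currently bound to it. This is a generic diagnostic sketch using standard Linux tools, not a DLAMI-specific procedure:

```shell
# Confirm the GPU's PCI ID (the asker's log shows 10de:27b8, a vGPU profile):
lspci -nn | grep -i nvidia

# Check whether an nvidia kernel module is loaded at all:
lsmod | grep nvidia

# Show the driver version staged on the AMI, even if it failed to bind:
modinfo nvidia | grep -i ^version

# If a driver is actually running, this file exists:
cat /proc/driver/nvidia/version 2>/dev/null || echo "no driver loaded"
```

If `lsmod` shows the module loaded but `/proc/driver/nvidia/version` is missing, the module probed and failed (matching the `probe ... failed with error -1` message), which points to a driver/GPU mismatch rather than a missing install.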
Sources
Resolve NVIDIA GPU and GPU driver issues in Amazon EC2 | AWS re:Post
Issues gdscheck -p - receiving: CUDA_ERROR_SYSTEM_NOT_READY | AWS re:Post
Using the Deep Learning AMI with Conda - Deep Learning AMI

answered 6 days ago
