By now, I know of three other people having the same problem, but nobody has found a solution yet. Can we get a statement from AWS on this?
Same issue here :/
nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
lsmod | grep nvidia
(no output)
dmesg | grep "Linux version"
[    0.000000] Linux version 4.4.0-1077-aws (buildd@lcy01-amd64-021) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #87-Ubuntu SMP Wed Mar 6 00:03:05 UTC 2019 (Ubuntu 4.4.0-1077.87-aws 4.4.170)
Edit:
Same after updating packages with:
sudo apt-get update
sudo apt-get upgrade
Edited by: aureliencluzeau on Mar 26, 2019 5:51 AM
I'm having the same issue:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver
The solution seems to be to manually download and update the kernel as well as the NVIDIA driver, but I don't want to mess with my setup unless I absolutely have to.
I'm having the same issue on p3.2xlarge instances.
Hello sgasse,
As you have correctly observed, the Deep Learning Ubuntu AMI family you are using autonomously performs updates in the background.
You can see the systemd timers/services responsible for this by running:
systemctl list-units --all | grep apt
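On a typical Ubuntu-based image, the output includes, among others, entries along these lines (exact states and descriptions may differ slightly by release):
apt-daily.timer            loaded active   waiting  Daily apt download activities
apt-daily-upgrade.timer    loaded active   waiting  Daily apt upgrade and clean activities
apt-daily.service          loaded inactive dead     Daily apt download activities
apt-daily-upgrade.service  loaded inactive dead     Daily apt upgrade and clean activities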
This happens independently of the instance type used.
In the case you and others in this thread described, the unattended upgrade must have included a newer kernel which became active when the instance was rebooted. Normally, the Dynamic Kernel Module Support (DKMS) framework would automatically ensure that external kernel modules are recompiled to work with the updated kernel. However, with the NVIDIA driver's DKMS configuration used in older versions of the Deep Learning Ubuntu AMI, the recompilation would fail if gcc was also updated.
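If you want to confirm that this is what happened on your instance, a quick check (a sketch only; the module version on your instance will differ) is to compare the running kernel against the kernels DKMS has actually built the nvidia module for:

# Kernel currently running
uname -r

# Kernels for which the nvidia module was successfully built/installed;
# if the running kernel is missing from this list, the DKMS rebuild failed after the update
dkms status nvidia

When the rebuild fails, DKMS typically leaves a build log under /var/lib/dkms/nvidia/<version>/build/make.log that shows the gcc version mismatch error.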
This has been addressed in "Deep Learning AMI (Ubuntu 18.04) Version 28.0" (and later) by including the --no-cc-version-check option when installing the NVIDIA driver as part of the image build process, e.g.:
sudo sh ./nvidia_driver.run --no-drm --disable-nouveau --dkms --silent --no-cc-version-check --install-libglvnd
On an older AMI, you can apply a workaround without having to re-install the driver (replace "440.33.01" with your actual version); a shell sketch of these steps is shown below:
Open /usr/src/nvidia-440.33.01/dkms.conf
Add IGNORE_CC_MISMATCH=1 to the MAKE command
Run dkms install -m nvidia/440.33.01
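Put together as shell commands, the workaround looks roughly like the following. This is a sketch only: the exact contents of the MAKE line in dkms.conf differ between driver versions, so verify the edit manually rather than relying blindly on the sed pattern.

DRIVER_VERSION=440.33.01   # replace with the version present under /usr/src

# 1. Tell the NVIDIA DKMS build to ignore the gcc version mismatch
sudo sed -i "s/^MAKE\[0\]=\"'make'/MAKE[0]=\"'make' IGNORE_CC_MISMATCH=1/" /usr/src/nvidia-${DRIVER_VERSION}/dkms.conf

# 2. Rebuild and install the module for the running kernel
sudo dkms install nvidia/${DRIVER_VERSION}

# 3. Load the module and confirm the driver responds
sudo modprobe nvidia
nvidia-smi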
Note: We have made NVIDIA aware that this option may still lead to recompilation failures in the future if an updated gcc introduces a change that is not backwards-compatible with an older driver's build process. There is not yet a perfect solution here that is guaranteed to always be fully transparent to customers while keeping instances as secure as possible.
- Stanislav