EC2 p2.xlarge instance losing GPU support after reboot

0

Since Tuesday, I have been facing the odd problem that some of my p2.xlarge instances lose GPU support with no apparent cause. Today, I was able to recreate the problem with a minimal example that should be reproducible for everyone.

Basically, after shutdown and reboot, the instance no longer has the nvidia module loaded in the kernel. Furthermore, according to dmesg, there seems to be a different kernel loaded. All of this happens without me actively causing it.

Here are the steps to reproduce the problem using a fresh instance and no custom code. I am working in Ireland (eu-west-1); the instance was launched in Availability Zone eu-west-1a:

Launched an instance with the "Deep Learning AMI (Ubuntu) Version 21.2" (ami-0e9085a8d461c2d01)
Instance type: p2.xlarge, all defaults
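
For anyone who prefers to reproduce this from the CLI, a roughly equivalent launch to the console steps above might look like this (the key pair name is a placeholder):

aws ec2 run-instances \
    --region eu-west-1 \
    --image-id ami-0e9085a8d461c2d01 \
    --instance-type p2.xlarge \
    --placement AvailabilityZone=eu-west-1a \
    --key-name <your-key-pair>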

Logged into the instance and ran only the following four commands:

ubuntu@...:~$ lsmod | grep nvidia
nvidia              16592896  0
ipmi_msghandler        49152  1 nvidia
ubuntu@...:~$ dmesg | less
...
[    0.000000] Linux version 4.4.0-1075-aws (buildd@lgw01-amd64-035) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #85-Ubuntu SMP Thu Jan 17 17:15:12 UTC 2019 (Ubuntu 4.4.0-1075.85-aws 4.4.167)
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-1075-aws root=UUID=96950bba-70e8-4a4b-9d78-d2bc1c767e04 ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
...
ubuntu@...:~$ nvidia-smi
Tue Mar 19 16:41:53 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   42C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
ubuntu@...:~$ sudo shutdown now

The instance does not shut down right away; it may be installing updates, but I have NOT actively triggered any.
After the state showed "stopped", started the instance again via the AWS Management Console
Ran the first three commands again:

ubuntu@...:~$ lsmod | grep nvidia
(no output)
ubuntu@...:~$ dmesg | less
...
[    0.000000] Linux version 4.4.0-1077-aws (buildd@lcy01-amd64-021) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #87-Ubuntu SMP Wed Mar 6 00:03:05 UTC 2019 (Ubuntu 4.4.0-1077.87-aws 4.4.170)
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-1077-aws root=UUID=96950bba-70e8-4a4b-9d78-d2bc1c767e04 ro console=tty1 console=ttyS0 nvme.io_timeout=4294967295
...
ubuntu@...:~$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

The kernel appears to have changed from version 4.4.0-1075-aws to version 4.4.0-1077-aws. What is causing this? How can I prevent it?
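
For reference, on a stock Ubuntu install the apt logs should show whether a new kernel package was installed around the time of the shutdown (standard log locations, listed here only as a pointer for anyone diagnosing the same issue):

less /var/log/apt/history.log                               # packages installed/upgraded, with timestamps
less /var/log/unattended-upgrades/unattended-upgrades.log   # actions taken by unattended-upgrades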

sgasse
asked 5 years ago · 1286 views
5 Answers
0

By now, I know of three other people having the same problem, but nobody has found a solution yet. Can we get a statement from AWS on this?

sgasse
answered 5 years ago
0

Same issue here :/

nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

lsmod | grep nvidia
(no output)

dmesg | grep "Linux version"
[    0.000000] Linux version 4.4.0-1077-aws (buildd@lcy01-amd64-021) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.10) ) #87-Ubuntu SMP Wed Mar 6 00:03:05 UTC 2019 (Ubuntu 4.4.0-1077.87-aws 4.4.170)

Edit:
The same happens after updating packages with:
sudo apt-get update
sudo apt-get upgrade

Edited by: aureliencluzeau on Mar 26, 2019 5:51 AM

answered 5 years ago
0

I'm having the same issue:

"NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver"

The solution seems to be to manually download and update the kernel as well as the NVIDIA driver, but I don't want to mess with my setup unless I absolutely have to.
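
For anyone who does want to go that route, here is a rough, untested sketch, assuming the driver was originally installed from NVIDIA's .run installer (as on the Deep Learning AMI); the installer filename is a placeholder:

sudo apt-get update
sudo apt-get install -y linux-headers-$(uname -r)        # headers for the newly booted kernel
sudo sh ./NVIDIA-Linux-x86_64-<version>.run --dkms --silent
sudo modprobe nvidia
nvidia-smi                                               # should see the GPU again if the rebuild worked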

answered 5 years ago
0
papjuli
answered 5 years ago
0

Hello sgasse,

As you have correctly observed, the Deep Learning Ubuntu AMI family you are using autonomously performs updates in the background.
You can see the systemd timers/services responsible for this by running:

systemctl list-units --all | grep apt

This happens independently of the instance type used.
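
If you would rather control when updates (and therefore kernel changes) are applied, one option is to switch the unattended upgrades off and patch on your own schedule. A minimal sketch, assuming the stock Ubuntu 16.04 unattended-upgrades configuration and timer names:

sudo tee /etc/apt/apt.conf.d/20auto-upgrades <<'EOF'
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
EOF

# alternatively, stop the systemd timers listed by the command above
sudo systemctl disable --now apt-daily.timer apt-daily-upgrade.timer

Note that this also disables automatic security updates, so you would then need to run apt-get update / apt-get upgrade (and reboot into new kernels) deliberately.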

In the case you and others in this thread described, the unattended upgrade must have included a newer kernel which became active when the instance was rebooted. Normally, the Dynamic Kernel Module Support (DKMS) framework would automatically ensure that external kernel modules are recompiled to work with the updated kernel. However, with the NVIDIA driver's DKMS configuration used in older versions of the Deep Learning Ubuntu AMI, the recompilation would fail if gcc was also updated.
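
As a quick check of whether this is what happened on a given instance, DKMS can report the module state per kernel; the output below is illustrative, assuming driver 410.79 as in the question:

dkms status
# e.g.  nvidia, 410.79, 4.4.0-1075-aws, x86_64: installed
#       (no corresponding "installed" entry for the newly booted 4.4.0-1077-aws kernel)

# build log of the failed rebuild, if one was attempted (path varies with driver/kernel version)
sudo less /var/lib/dkms/nvidia/410.79/build/make.log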

This has been addressed in "Deep Learning AMI (Ubuntu 18.04) Version 28.0" (and later) by including the --no-cc-version-check option when installing the NVIDIA driver as part of the image build process, e.g.:

sudo sh ./nvidia_driver.run --no-drm --disable-nouveau --dkms --silent --no-cc-version-check --install-libglvnd

On an older AMI, you can apply a workaround without having to re-install the driver (replace "440.33.01" with your actual version; see the command sketch after the steps):

Open /usr/src/nvidia-440.33.01/dkms.conf
Add IGNORE_CC_MISMATCH=1 to the MAKE command
Run dkms install -m nvidia/440.33.01
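
A command-level sketch of these three steps; the dkms.conf path and the exact MAKE[0] line vary by driver version, so treat it as illustrative rather than copy-paste:

DRIVER_VERSION=440.33.01                                   # replace with the version from dkms status

grep '^MAKE' /usr/src/nvidia-${DRIVER_VERSION}/dkms.conf   # inspect the build command DKMS will run
sudo nano /usr/src/nvidia-${DRIVER_VERSION}/dkms.conf      # append IGNORE_CC_MISMATCH=1 to the MAKE[0] make invocation

sudo dkms install -m nvidia -v ${DRIVER_VERSION}           # rebuild and install for the running kernel
sudo modprobe nvidia
nvidia-smi                                                 # should communicate with the driver again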

Note: We have made NVIDIA aware that this option may also lead to recompilation failures in the future if an updated gcc introduces a change that is not backwards-compatible with an older driver's build process. There is not yet a perfect solution that is guaranteed to always be fully transparent to customers while keeping instances as secure as possible.

Stanislav
AWS
answered 4 years ago
