GPU fails to initialize on g5.xlarge instance


Hello,

I have tried to create several g5.xlarge instances from various "quickstart" AMIs (Deep Learning AMI GPU TensorFlow 2.7.0 (Amazon Linux 2) 20211111 - ami-0850c76a5926905fb, Deep Learning AMI (Ubuntu 18.04) Version 54.0, ...).

In all cases the instance boots fine and both status checks pass, but the GPU is not accessible.

For example, with Deep Learning AMI (Ubuntu 18.04) Version 54.0:

nvidia-smi gives the following error:

nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

With 'dmesg' we can see the following errors:

[  308.148743] nvidia: probe of 0000:00:1e.0 failed with error -1
[  308.148756] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  308.148756] NVRM: None of the NVIDIA devices were initialized.
[  308.148969] nvidia-nvlink: Unregistered the Nvlink Core, major device number 239

The NVIDIA packages installed are:

apt list --installed | grep -i nvidia

libnvidia-container-tools/bionic,now 1.7.0-1 amd64 [installed,automatic]
libnvidia-container1/bionic,now 1.7.0-1 amd64 [installed,automatic]
nvidia-container-toolkit/bionic,now 1.7.0-1 amd64 [installed]
nvidia-docker2/bionic,now 2.8.0-1 all [installed]
nvidia-fabricmanager-450/now 450.142.00-1 amd64 [installed,upgradable to: 450.156.00-0ubuntu0.18.04.1]

The drivers are not updated by a system update (I tried unholding the package and updating the system, but that does not solve the issue). The held packages are:

apt-mark showhold
linux-aws
linux-headers-aws
linux-image-aws
nvidia-fabricmanager-450
tensorflow-model-server-neuron

Any idea what I could try to solve the issue?

Or do you know of another Deep Learning AMI that works with the g5.xlarge?

Thanks!

Asked 2 years ago, 5027 views
1 Answer

For EC2 G5 instances, you will need to use a Deep Learning AMI with CUDA 11.4 or later. References to those can be found in the Deep Learning AMI documentation.
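
One way to sanity-check a candidate AMI after launch is to compare the CUDA version reported in the nvidia-smi header against the 11.4 minimum mentioned above. The following is only a rough sketch under a few assumptions: the helper names `version_ge` and `cuda_version_from_smi` are made up for illustration, `sort -V` requires GNU coreutils, and the "CUDA Version" field in the nvidia-smi header reflects what the installed driver supports rather than the CUDA toolkit bundled with the AMI, so treat it as a proxy.

```shell
# Minimum CUDA version for G5, taken from the answer above.
MIN_CUDA="11.4"

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on GNU sort -V (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Reads nvidia-smi output on stdin and prints the first "CUDA Version: X.Y" value.
# Prints nothing when the field is absent (for example when the driver failed to load).
cuda_version_from_smi() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n1
}

if command -v nvidia-smi >/dev/null 2>&1; then
  found="$(nvidia-smi 2>/dev/null | cuda_version_from_smi)" || true
  if [ -z "$found" ]; then
    echo "nvidia-smi reported no CUDA version; the driver is probably not loaded"
  elif version_ge "$found" "$MIN_CUDA"; then
    echo "Driver reports CUDA $found (>= $MIN_CUDA)"
  else
    echo "Driver reports CUDA $found, older than $MIN_CUDA"
  fi
else
  echo "nvidia-smi not found on this machine"
fi
```

On an instance in the state described in the question, nvidia-smi cannot reach the driver, so the script would take the "no CUDA version" branch; that is a hint to switch to a newer Deep Learning AMI rather than to update packages in place.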

AWS EXPERT, answered 2 years ago
Reviewed by AWS EXPERT 2 years ago
