CUDA install fails on Amazon Linux, g4dn.xlarge

Hi, I am trying to install CUDA on an AWS EC2 g4dn.xlarge instance, following the instructions here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/. I did this about a month ago and it worked fine, but now it fails :/ I've already spent some time investigating, but I cannot resolve the problem. Any help would be greatly appreciated!

I'm starting from a fresh basic Amazon Linux image (ami-0669b163befffbdfc). I do the pre-installation actions:

[ec2-user@i-03018febbaff59efb ~]$ lspci | grep -i nvidia
00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

[ec2-user@i-03018febbaff59efb ~]$ uname -m && cat /etc/*release
x86_64
Amazon Linux release 2023 (Amazon Linux)
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
SUPPORT_END="2028-03-15"
Amazon Linux release 2023 (Amazon Linux)

The image doesn't come with gcc, so I install it and then check the version:

[ec2-user@i-03018febbaff59efb ~]$ gcc --version
gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Then I follow the instructions for Fedora (Section 3.6.1).

I have

[ec2-user@i-03018febbaff59efb ~]$ uname -r
6.1.61-85.141.amzn2023.x86_64

and I install the kernel-devel and kernel-headers packages that match this kernel.
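
For reference, the install step might look like the sketch below. It assumes the packages follow the `kernel-devel-<release>` / `kernel-headers-<release>` naming used in the Fedora section of the NVIDIA guide; `kernel_pkgs` is a hypothetical helper name, not something from the guide.

```shell
# Hypothetical helper: build the package names that match a kernel release.
# Defaults to the running kernel as reported by `uname -r`.
kernel_pkgs() {
    rel="${1:-$(uname -r)}"
    printf 'kernel-devel-%s kernel-headers-%s\n' "$rel" "$rel"
}
```

With that in place, `sudo dnf install -y $(kernel_pkgs)` would request both packages for the running kernel.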

Removing the outdated key fails, apparently because it was never installed on this fresh image:

[ec2-user@i-03018febbaff59efb ~]$ sudo rpm --erase gpg-pubkey-7fa2af80*
error: package gpg-pubkey-7fa2af80* is not installed

Then I follow the network repo installation for fedora37. Installing nvidia-drivers and cuda-toolkit seems to go fine (the logs are long, so I posted them here: https://github.com/jedreky/cloud-model-deployment/tree/main/tmp). Then I follow the final instructions (3.6.4). libcuda.so seems to be where it's supposed to be:

[ec2-user@i-03018febbaff59efb lib64]$ ls /usr/lib64/libcuda* -l
lrwxrwxrwx. 1 root root       20 Nov  7 05:22 /usr/lib64/libcuda.so -> libcuda.so.545.23.08
lrwxrwxrwx. 1 root root       20 Nov  7 05:22 /usr/lib64/libcuda.so.1 -> libcuda.so.545.23.08
-rwxr-xr-x. 1 root root 29453200 Nov  7 00:49 /usr/lib64/libcuda.so.545.23.08
lrwxrwxrwx. 1 root root       28 Nov  7 05:22 /usr/lib64/libcudadebugger.so.1 -> libcudadebugger.so.545.23.08
-rwxr-xr-x. 1 root root 10593576 Nov  7 00:14 /usr/lib64/libcudadebugger.so.545.23.08

And then I do the post-installation actions to get:

[ec2-user@i-03018febbaff59efb local]$ echo $PATH
/usr/local/cuda-12.3/bin:/home/ec2-user/.local/bin:/home/ec2-user/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin

[ec2-user@i-03018febbaff59efb local]$ echo $LD_LIBRARY_PATH
/usr/local/cuda-12.3/lib64
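
The two values above are consistent with exports along these lines, which follow the pattern in the post-installation section of the NVIDIA guide. `CUDA_HOME` is only a convenience variable assumed here, and the version directory is taken from the output above.

```shell
# Assumed install prefix, taken from the PATH output above.
CUDA_HOME=/usr/local/cuda-12.3

# Prepend the CUDA directories; the ${VAR:+:...} form avoids a trailing
# colon when the variable was previously empty or unset.
export PATH="${CUDA_HOME}/bin${PATH:+:${PATH}}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
```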

After all of this, nvidia-smi fails:

[ec2-user@i-03018febbaff59efb ~]$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

This is what I find in dmesg:

[ec2-user@i-03018febbaff59efb local]$ dmesg
...
[    2.690763] nvidia: loading out-of-tree module taints kernel.
[    2.691647] nvidia: module license 'NVIDIA' taints kernel.
[    2.692444] Disabling lock debugging due to kernel taint
[    2.742885] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    2.744378] nvidia: Unknown symbol drm_gem_object_free (err -2)
[    2.838904] zram_generator::config[1757]: zram0: system has too much memory (15779MB), limit is 800MB, ignoring.
[    2.839269] systemd-sysv-generator[1755]: SysV service '/etc/rc.d/init.d/cfn-hup' lacks a native systemd unit file. Automatically generating a unit file for compatibility. Please update package to include a native systemd unit file, in order to make it more safe and robust.
[    2.845374] nvidia: Unknown symbol drm_gem_object_free (err -2)
[    3.229924] RPC: Registered named UNIX socket transport module.
[    3.230597] RPC: Registered udp transport module.
[    3.231125] RPC: Registered tcp transport module.
[    3.231645] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    3.353325] ena 0000:00:05.0 ens5: Local page cache is disabled for less than 16 channels
[   21.054526] nvidia: Unknown symbol drm_gem_object_free (err -2)
[   21.157303] nvidia: Unknown symbol drm_gem_object_free (err -2)
[   21.286533] nvidia: Unknown symbol drm_gem_object_free (err -2)
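
The repeated `Unknown symbol` lines are the interesting part of this log. As a small sketch, the helper below pulls the distinct unresolved symbol names for one module out of kernel log text; `unknown_symbols` is a hypothetical name, and the pattern assumes the message format shown above.

```shell
# Hypothetical helper: list the distinct symbols a module failed to resolve.
# Reads kernel log text on stdin, e.g. from `dmesg` or a saved log file.
unknown_symbols() {
    module="$1"
    sed -n "s/.*${module}: Unknown symbol \([A-Za-z0-9_]*\).*/\1/p" | sort -u
}
```

Used as `dmesg | unknown_symbols nvidia`, it reduces the lines above to the single name `drm_gem_object_free`, which suggests the `nvidia` module needs DRM code that the running kernel cannot provide.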

lsmod output:

[ec2-user@i-03018febbaff59efb ~]$ lsmod
Module                  Size  Used by
nls_ascii              16384  1
sunrpc                692224  1
nls_cp437              20480  1
vfat                   24576  1
fat                    86016  1 vfat
ghash_clmulni_intel    16384  0
aesni_intel           393216  0
wmi                    36864  0
crypto_simd            16384  1 aesni_intel
i8042                  45056  0
cryptd                 28672  2 crypto_simd,ghash_clmulni_intel
i2c_core              106496  0
serio                  28672  3 i8042
ena                   163840  0
button                 24576  0
sch_fq_codel           20480  5
dm_mod                188416  0
fuse                  163840  1
loop                   32768  0
configfs               57344  1
dax                    45056  1 dm_mod
dmi_sysfs              20480  0
crc32_pclmul           16384  0
crc32c_intel           24576  0
efivarfs               24576  1

If I try to start the NVIDIA persistence daemon, I get this in the system log:

Nov 23 10:21:50 i-03018febbaff59efb.eu-central-1.compute.internal nvidia-persistenced[2378]: Shutdown (2378)
Nov 23 10:21:50 i-03018febbaff59efb.eu-central-1.compute.internal nvidia-persistenced[2378]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permis>
Nov 23 10:21:50 i-03018febbaff59efb.eu-central-1.compute.internal kernel: nvidia: Unknown symbol drm_gem_object_free (err -2)
Nov 23 10:21:50 i-03018febbaff59efb.eu-central-1.compute.internal nvidia-persistenced[2378]: Started (2378)

Any help will be very much appreciated :)

  • OK, so installing the extra kernel modules:

    sudo dnf install kernel-modules-extra.x86_64

    solves the problem :)
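
This fits the `Unknown symbol drm_gem_object_free` errors: the symbol belongs to the kernel's DRM code, and on this image the DRM modules appear to ship in `kernel-modules-extra` rather than in the base kernel package (an inference from the fix, not something confirmed in the post). A minimal sketch for checking whether a module file exists on disk follows; `module_on_disk` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if a module file (possibly compressed, e.g.
# drm.ko.xz) exists anywhere under the given modules directory.
module_on_disk() {
    dir="$1"
    name="$2"
    [ -n "$(find "$dir" -type f -name "${name}.ko*" 2>/dev/null)" ]
}
```

Running `module_on_disk "/lib/modules/$(uname -r)" drm && echo "drm module present"` before and after the install would show whether the package made the difference.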

jedreky
asked 6 months ago, 421 views
No answers