
How do I install the NVIDIA GPU driver, CUDA Toolkit, and NVIDIA Container Toolkit on Amazon EC2 instances running Amazon Linux 2023 (AL2023)?

Content level: Expert

I want to install the NVIDIA driver, CUDA Toolkit, and NVIDIA Container Toolkit on AL2023 (Amazon Linux 2023) (x86_64/arm64).

Overview

This article describes how to install the NVIDIA GPU driver, CUDA Toolkit, and NVIDIA Container Toolkit on NVIDIA GPU EC2 instances running AL2023 (Amazon Linux 2023).

Note that by using this method, you agree to the NVIDIA Driver License Agreement, End User License Agreement, and other related license agreements. If you are doing development, you may want to register for the NVIDIA Developer Program.

Pre-built AMIs

If you need AMIs preconfigured with TensorFlow, PyTorch, and NVIDIA CUDA drivers and libraries, consider AWS Deep Learning AMIs. Refer to the Release notes for DLAMIs for currently supported options.

For container workloads, consider Amazon ECS-optimized Linux AMIs and Amazon EKS optimized AMIs.

Note: the instructions in this article are not applicable to pre-built AMIs.

About CUDA toolkit

The CUDA Toolkit is generally optional when a GPU instance is used to run applications (as opposed to develop them), because CUDA applications typically package the CUDA runtime and libraries they need by statically or dynamically linking against them.
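If you are unsure whether a given application needs the toolkit, one quick (if imperfect) check is to inspect its dynamic library dependencies; an application that dynamically links the CUDA runtime will reference libcudart, while a statically linked one will not.

# Check whether an application dynamically links the CUDA runtime
# ./my_cuda_app is a placeholder for your application binary
ldd ./my_cuda_app | grep -i cudart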

Prepare Amazon Linux 2023

Launch a new NVIDIA GPU instance running Amazon Linux 2023, preferably with at least 20 GB of storage, and connect to the instance.
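For example, you can launch such an instance with the AWS CLI. The following is only a sketch; the AMI ID, key pair name, and security group are placeholders you need to replace with your own values.

# Sketch: launch a g4dn.xlarge AL2023 instance with a 20 GB root volume
# ami-xxxxxxxx, my-key, and sg-xxxxxxxx are placeholders
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type g4dn.xlarge \
  --key-name my-key \
  --security-group-ids sg-xxxxxxxx \
  --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":20,"VolumeType":"gp3"}}]'

Once connected to the instance, update the OS and install build prerequisites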

dnf check-release-update
sudo dnf update -y
sudo dnf install -y dkms kernel-devel kernel-modules-extra
sudo systemctl enable --now dkms

You may want to switch to a newer version of the DNF repository if prompted by check-release-update.

sudo dnf upgrade --releasever=latest -y

More information is available at Checking for newer repository versions with dnf check-release-update.

Restart your AL2023 instance if the kernel was updated.

sudo reboot
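If you are unsure whether the kernel was updated, one simple heuristic is to compare the running kernel with the newest installed kernel package and reboot only if they differ.

# Sketch: reboot only if the running kernel differs from the newest installed one
latest=$(rpm -q --last kernel | head -n1 | sed 's/^kernel-//;s/ .*//')
if [ "$(uname -r)" != "$latest" ]; then
  sudo reboot
fi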

Install NVIDIA driver and CUDA toolkit

Method 1: Package Manager Installation (x86_64)

CUDA 12.5 and higher supports package manager installation on Amazon Linux 2023, but on x86_64 only. Use Method 2 for Graviton (arm64).

Add repo

You can choose either the NVIDIA or the AL2023 repository.

Option 1: NVIDIA repo

sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo

Option 2: AL2023 repo

The nvidia-release package was added in the 2023.6.20241031 release.

sudo dnf install -y nvidia-release
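If dnf cannot find nvidia-release, verify that your instance is on the 2023.6.20241031 release or later. One way to check, assuming the standard AL2023 release file:

# Show the current AL2023 release
cat /etc/amazon-linux-release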

Install NVIDIA driver

Option 1: NVIDIA repo

sudo dnf module install -y nvidia-driver:latest-dkms

To install a specific version, e.g. 565

sudo dnf module install -y nvidia-driver:565-dkms

The above installs the NVIDIA proprietary kernel module. Refer to the Driver Installation Guide for details about NVIDIA kernel modules and installation options.
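After a DKMS-based install, a quick sanity check is to confirm that the kernel module was built and installed for your running kernel.

# Verify the nvidia module is built and installed for the running kernel
dkms status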

Option 2: AL2023 repo

sudo dnf install -y nvidia-driver

Install CUDA toolkit

sudo dnf install -y cuda-toolkit

To install a specific version, e.g. 12.6

sudo dnf install -y cuda-toolkit-12-6

Refer to the CUDA documentation for installation options.

Method 2: Runfile Installation (x86_64 and arm64)

NVIDIA currently supports AL2023 only through the x86_64 RPM installation (CUDA 12.5 and higher). The runfile installer is not officially supported on AL2023 and may not work.

Ensure the EC2 instance has more than 10 GB of free disk space.
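You can check the free space on the filesystems the installer uses before proceeding.

# Check free space where the runfile is downloaded and extracted
df -h /tmp /var/tmp /usr/local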

Install development libraries

sudo dnf install -y vulkan-devel libglvnd-devel elfutils-libelf-devel xorg-x11-server-Xorg

Option 1: NVIDIA driver only

To install NVIDIA driver version 565.57.01

cd /tmp
DRIVER_VERSION=565.57.01
curl -L -O https://us.download.nvidia.com/tesla/$DRIVER_VERSION/NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run
chmod +x ./NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run
sudo ./NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run -s

To install a different version, refer to the Driver Release Notes and modify the DRIVER_VERSION value set in the line above.

Option 2: NVIDIA driver and/or CUDA toolkit

You can go to the CUDA Toolkit download page to obtain the latest runfile (local) installer download URL for RHEL 9 on x86_64 and arm64 (sbsa).

cd /tmp
if arch | grep -q x86; then
  wget https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run
else
  wget https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux_sbsa.run
fi
chmod +x ./cuda*.run

To install another version, refer to CUDA Toolkit Archive for runfile (local) download link.

Option 2a: NVIDIA driver and CUDA toolkit

sudo ./cuda_*.run --driver --toolkit --tmpdir=/var/tmp --silent

Option 2b: CUDA toolkit only

sudo ./cuda_*.run --toolkit --tmpdir=/var/tmp --silent

To troubleshoot installation issues, view the contents of /var/log/nvidia-installer.log and /var/log/cuda-installer.log (if applicable).
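For example, to surface recent errors from the installer logs:

# Show recent errors from the installer logs, if present
sudo grep -i error /var/log/nvidia-installer.log /var/log/cuda-installer.log 2>/dev/null | tail -n 20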

Refer to the CUDA documentation for installation options.

Runfile Uninstallation

To uninstall the CUDA Toolkit, run the uninstallation script provided in the bin directory of the toolkit. For version 12.6

sudo /usr/local/cuda-12.6/bin/cuda-uninstaller

To remove the NVIDIA driver

sudo /usr/bin/nvidia-uninstall

Post installation

Restart your OS

sudo reboot

Verify NVIDIA driver

nvidia-smi

Output should be similar to the following

Fri Nov 22 07:04:05 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T4G                     Off |   00000000:00:1F.0 Off |                    0 |
| N/A   77C    P0             34W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                        
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Verify CUDA toolkit

/usr/local/cuda/bin/nvcc -V

Output should be similar to the following

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Oct_30_00:08:18_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0

Post-installation Actions

Refer to the NVIDIA CUDA Installation Guide for Linux for post-installation actions required before the CUDA Toolkit can be used. For example, you may want to modify your PATH environment variable to include /usr/local/cuda/bin. For runfile installations, also modify LD_LIBRARY_PATH to include /usr/local/cuda/lib64.
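For example, you could append the following to your shell profile (e.g. ~/.bashrc); the paths assume the default install location.

# Make CUDA binaries and libraries visible to the shell (default install paths)
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}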

Optional: NVIDIA Container Toolkit

The NVIDIA Container Toolkit supports AL2023 on both x86_64 and arm64, and is available from either the NVIDIA repository or the AL2023 nvidia-release repository.

For arm64, use a g5g.2xlarge or larger instance size, as g5g.xlarge may cause failures due to limited system memory.

# Add the NVIDIA repository only if the package is not already available
if ! dnf search nvidia | grep -q nvidia-container-toolkit; then
  sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
fi
sudo dnf install -y nvidia-container-toolkit

Refer to the NVIDIA Container Toolkit documentation for supported platforms, prerequisites, and installation options.

Verify Container Toolkit

nvidia-container-cli -V

Output should be similar to the following

cli-version: 1.17.2
lib-version: 1.17.2
build date: 2024-11-15T18:08+0000
build revision: 63d366ee3b4183513c310ac557bf31b05b83328f
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: aarch64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Container engine configuration

Refer to the NVIDIA Container Toolkit site for container engine configuration instructions.

Docker

To install and configure Docker

sudo dnf install -y docker
sudo systemctl enable docker
sudo usermod -aG docker ec2-user

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
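Note that the docker group membership added above only takes effect at your next login. To apply it in the current shell:

# Apply the new group membership without logging out
newgrp docker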

Verify Docker engine configuration

To verify the Docker configuration

sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi

Output should be similar to the following

Unable to find image 'public.ecr.aws/amazonlinux/amazonlinux:2023' locally
2023: Pulling from amazonlinux/amazonlinux
aa4cd91a1805: Pull complete 
Digest: sha256:5faca3faac3f514a7b8da1801caf87acec0b53623675de4c72f346fa4d1790ea
Status: Downloaded newer image for public.ecr.aws/amazonlinux/amazonlinux:2023
Fri Nov 22 07:05:30 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T4G                     Off |   00000000:00:1F.0 Off |                    0 |
| N/A   68C    P0             32W /   70W |       1MiB /  15360MiB |      9%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Install NVIDIA driver, CUDA Toolkit, and Container Toolkit on an EC2 instance at launch

To install all of the above, including Docker, when launching a new AL2023 GPU instance, you can use the following user data script.

#!/bin/bash
# Prepare AL2023: update packages and install kernel module build prerequisites
dnf check-release-update
sudo dnf update -y
sudo dnf install -y dkms kernel-devel kernel-modules-extra
sudo systemctl enable dkms

cd /tmp
if arch | grep -q x86; then
  # x86_64: install the driver and CUDA Toolkit from the AL2023 repository
  sudo dnf install -y nvidia-release
  sudo dnf install -y nvidia-driver
  sudo dnf install -y cuda-toolkit
else
  # arm64: install the driver and CUDA Toolkit with the runfile installer
  sudo dnf install -y vulkan-devel libglvnd-devel elfutils-libelf-devel xorg-x11-server-Xorg
  curl -L -O https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux_sbsa.run
  chmod +x ./cuda*.run
  sudo ./cuda_*.run --driver --toolkit --tmpdir=/var/tmp --silent
fi

# Add the NVIDIA Container Toolkit repository only if the package is not already available
if ! dnf search nvidia | grep -q nvidia-container-toolkit; then
  sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
fi
sudo dnf install -y nvidia-container-toolkit

# Install Docker and configure the NVIDIA container runtime
sudo dnf install -y docker
sudo systemctl enable docker
sudo usermod -aG docker ec2-user

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

sudo reboot

Verify

Connect to your EC2 instance

nvidia-smi
/usr/local/cuda/bin/nvcc -V
nvidia-container-cli -V
sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi

View /var/log/cloud-init-output.log to troubleshoot any installation issues.
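For example, to surface errors from the user data run:

# Scan the user data output for errors
sudo grep -iE 'error|fail' /var/log/cloud-init-output.log | tail -n 20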

Perform the post-installation actions in order to use the CUDA Toolkit. To verify the integrity of the installation, you can download, compile, and run CUDA samples such as deviceQuery.
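A sketch of building deviceQuery from the NVIDIA cuda-samples repository follows; note that the sample path and build system (Makefile vs CMake) vary between releases, so adjust as needed.

# Sketch: build and run deviceQuery (path/build system may differ by release)
sudo dnf install -y git make gcc-c++
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
make
./deviceQuery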

[Image: Amazon Linux 2023 on g4dn]

5 Comments

This is great Mike!
Are there options for Graviton/ARM?

AWS EXPERT, replied 6 months ago

Hello, I get an ERROR when running the sample workload

[root@ip bin]# docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-sm
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: driver rpc error: failed to process request: unknown.

I'm using AL2023 (ami-0b17ca9fb2a39a659) on Graviton ARM (g5g.xlarge), any advice?

replied 4 months ago

Worked perfectly to build an ECS-optimized GPU-ready AMI based on AL2023 (ami-01c1ede61c128dc37)! Thank you so much for this post!

replied 4 months ago

Been trying to do exactly that on a g4dn.xlarge machine, using these steps and also a bunch of other variations.

Keep getting:

[ec2-user@ ~]$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

[ec2-user@ ~]$ lsmod | grep nvidia
[ec2-user@ ~]$ sudo modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': Unknown symbol in module, or unknown parameter (see dmesg)
[ec2-user@ ~]$ sudo dmesg | grep -i nvidia
[    4.918126] nvidia: loading out-of-tree module taints kernel.
[    4.918717] nvidia: module license 'NVIDIA' taints kernel.
[    4.944984] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    4.946115] nvidia: Unknown symbol drm_gem_object_free (err -2)
[    5.054328] nvidia: Unknown symbol drm_gem_object_free (err -2)
[  547.271166] nvidia: Unknown symbol drm_gem_object_free (err -2)
[  547.370785] nvidia: Unknown symbol drm_gem_object_free (err -2)
[  547.449667] nvidia: Unknown symbol drm_gem_object_free (err -2)
[  845.532310] nvidia: Unknown symbol drm_gem_object_free (err -2)

Apparently one might get this if the driver doesn't match the kernel (makes sense), but at this point I'm pretty sure there's something else going on.

My goal is to run a fairly straightforward Stable Diffusion setup, and I possibly need a newer Python than the 3.7 (I think) that the preconfigured "Deep Learning" AL2 AMIs come with.

replied 2 months ago

Confirmed that this works with the latest AL2023 AMI as long as you have at least 15GB of storage.

replied 7 days ago