Overview
This article describes how to install the NVIDIA GPU driver, CUDA Toolkit, and NVIDIA Container Toolkit on NVIDIA GPU EC2 instances running Amazon Linux 2023 (AL2023).
Note that by using this method, you agree to the NVIDIA Driver License Agreement, End User License Agreement, and other related license agreements. If you are doing development, you may want to register for the NVIDIA Developer Program.
Pre-built AMIs
If you need AMIs preconfigured with TensorFlow, PyTorch, and NVIDIA CUDA drivers and libraries, consider AWS Deep Learning AMIs.
Refer to Release notes for DLAMIs for currently supported options.
For container workloads, consider Amazon ECS-optimized Linux AMIs and Amazon EKS optimized AMIs.
Note: instructions in this article are not applicable to pre-built AMIs.
About CUDA toolkit
The CUDA Toolkit is generally optional when a GPU instance is used only to run applications (as opposed to developing them), because a CUDA application typically packages the CUDA runtime and libraries it needs (by statically or dynamically linking against them).
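For example, to check whether an application binary dynamically links against the CUDA runtime and related libraries (my_app below is a placeholder for your application), you can inspect its shared library dependencies:
ldd ./my_app | grep -i -E 'cuda|cublas|cudnn'
If the application links the CUDA runtime statically, it will not appear in this list.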
Prepare Amazon Linux 2023
Launch a new NVIDIA GPU instance running Amazon Linux 2023, preferably with at least 20 GB of storage, and connect to the instance.
dnf check-release-update
sudo dnf update -y
sudo dnf install -y dkms kernel-devel kernel-modules-extra
sudo systemctl enable --now dkms
You may want to use a newer DNF repository release if prompted by check-release-update.
sudo dnf upgrade --releasever=latest -y
More information at Checking for newer repository versions with dnf check-release-update
Restart your AL2023 instance if the kernel was updated.
sudo reboot
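If you prefer to reboot only when the running kernel differs from the newest installed kernel, one optional check is:
LATEST_KERNEL=$(rpm -q --queryformat '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel | sort -V | tail -1)
[ "$(uname -r)" != "$LATEST_KERNEL" ] && sudo reboot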
Install NVIDIA driver and CUDA toolkit
Method 1: Package Manager Installation (x86_64)
CUDA version 12.5 and higher supports Amazon Linux 2023 package manager installation on x86_64 only. Use Method 2 for Graviton (arm64) instances.
Add repo
You can use either the NVIDIA repository or the AL2023 repository.
Option 1: NVIDIA repo
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/x86_64/cuda-amzn2023.repo
Option 2: AL2023 repo
The nvidia-release package was added to the 2023.6.20241031 release.
sudo dnf install -y nvidia-release
Install NVIDIA driver
Option 1: NVIDIA repo
sudo dnf module install -y nvidia-driver:latest-dkms
To install a specific version, e.g. 565
sudo dnf module install -y nvidia-driver:565-dkms
The above installs the NVIDIA proprietary kernel module. Refer to the Driver Installation Guide for information about NVIDIA kernel modules and installation options.
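To see which driver module streams the NVIDIA repository currently provides, you can list them before choosing a version:
sudo dnf module list nvidia-driver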
Option 2: AL2023 repo
sudo dnf install -y nvidia-driver
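For DKMS-based installations, you can confirm that the NVIDIA kernel module was built and registered for the running kernel:
sudo dkms status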
Install CUDA toolkit
sudo dnf install -y cuda-toolkit
To install a specific version, e.g. 12.6
sudo dnf install -y cuda-toolkit-12-6
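To see which CUDA Toolkit versions the configured repository offers, you can list the matching packages:
sudo dnf list available 'cuda-toolkit-*'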
Refer to CUDA documentation for installation options
Method 2: Runfile Installation (x86_64 and arm64)
CUDA Toolkit 12.5 currently supports AL2023 only via x86_64 rpm installation. The runfile installer is not officially supported on AL2023 and may not work.
Ensure the EC2 instance has more than 10 GB of free disk space.
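You can check the available space on the root volume with:
df -h /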
Install development libraries
sudo dnf install -y vulkan-devel libglvnd-devel elfutils-libelf-devel xorg-x11-server-Xorg
Option 1: NVIDIA driver only
To install NVIDIA driver version 565.57.01
cd /tmp
DRIVER_VERSION=565.57.01
curl -L -O https://us.download.nvidia.com/tesla/$DRIVER_VERSION/NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run
chmod +x ./NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run
sudo ./NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run -s
To install a specific version, refer to the Driver Release Notes and modify the line above that sets the DRIVER_VERSION value.
Option 2: NVIDIA driver and/or CUDA toolkit
You can go to the CUDA Toolkit download page to obtain the latest runfile (local) installer download URL for RHEL 9 on x86_64 and arm64 (sbsa).
cd /tmp
if (arch | grep -q x86); then
wget https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux.run
else
wget https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux_sbsa.run
fi
chmod +x ./cuda*.run
To install another version, refer to the CUDA Toolkit Archive for the runfile (local) download link.
Option 2a: NVIDIA driver and CUDA toolkit
sudo ./cuda_*.run --driver --toolkit --tmpdir=/var/tmp --silent
Option 2b: CUDA toolkit only
sudo ./cuda_*.run --toolkit --tmpdir=/var/tmp --silent
To troubleshoot installation and kernel module compilation issues, view the contents of /var/log/nvidia-installer.log and /var/log/cuda-installer.log (if applicable).
Refer to CUDA documentation for installation options
Runfile Uninstallation
To uninstall CUDA Toolkit, run the uninstallation script provided in the bin directory of the toolkit. For version 12.6
sudo /usr/local/cuda-12.6/bin/cuda-uninstaller
To remove NVIDIA driver
sudo /usr/bin/nvidia-uninstall
Post installation
Restart your OS
sudo reboot
Verify NVIDIA driver
nvidia-smi
Output should be similar to below
Fri Nov 22 07:04:05 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA T4G Off | 00000000:00:1F.0 Off | 0 |
| N/A 77C P0 34W / 70W | 1MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Verify CUDA toolkit
/usr/local/cuda/bin/nvcc -V
Output should be similar to below
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Oct_30_00:08:18_PDT_2024
Cuda compilation tools, release 12.6, V12.6.85
Build cuda_12.6.r12.6/compiler.35059454_0
Post-installation Actions
Refer to NVIDIA CUDA Installation Guide for Linux for post-installation actions before the CUDA Toolkit can be used. For example, you may want to modify your PATH environment variable to include /usr/local/cuda/bin. For runfile installation, modify LD_LIBRARY_PATH to include /usr/local/cuda/lib64.
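As a minimal example (paths assume the default install location), you could add lines like the following to your shell profile, e.g. ~/.bashrc:
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}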
Optional: NVIDIA Container Toolkit
NVIDIA Container Toolkit supports AL2023 on both x86_64 and arm64, and is available from either the NVIDIA repository or the AL2023 nvidia-release repository.
For arm64, use g5g.2xlarge or a larger instance size, as g5g.xlarge may cause failures due to limited system memory.
if (! dnf search nvidia | grep -q nvidia-container-toolkit); then
sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
fi
sudo dnf install -y nvidia-container-toolkit
Refer to NVIDIA Container toolkit documentation about supported platforms, prerequisites and installation options
Verify Container Toolkit
nvidia-container-cli -V
Output should be similar to below
cli-version: 1.17.2
lib-version: 1.17.2
build date: 2024-11-15T18:08+0000
build revision: 63d366ee3b4183513c310ac557bf31b05b83328f
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: aarch64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
Container engine configuration
Refer to NVIDIA Container Toolkit site for container engine configuration instructions.
Docker
To install and configure docker
sudo dnf install -y docker
sudo systemctl enable docker
sudo usermod -aG docker ec2-user
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Verify Docker engine configuration
To verify docker configuration
sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi
Output should be similar to below
Unable to find image 'public.ecr.aws/amazonlinux/amazonlinux:2023' locally
2023: Pulling from amazonlinux/amazonlinux
aa4cd91a1805: Pull complete
Digest: sha256:5faca3faac3f514a7b8da1801caf87acec0b53623675de4c72f346fa4d1790ea
Status: Downloaded newer image for public.ecr.aws/amazonlinux/amazonlinux:2023
Fri Nov 22 07:05:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05 Driver Version: 560.35.05 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA T4G Off | 00000000:00:1F.0 Off | 0 |
| N/A 68C P0 32W / 70W | 1MiB / 15360MiB | 9% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Install NVIDIA driver, CUDA toolkit and Container Toolkit on EC2 instance at launch
To install the above, including docker, when launching a new AL2023 GPU instance, you can use the following as the user data script.
#!/bin/bash
dnf check-release-update
sudo dnf update -y
sudo dnf install -y dkms kernel-devel kernel-modules-extra
sudo systemctl enable dkms
cd /tmp
if (arch | grep -q x86); then
sudo dnf install -y nvidia-release
sudo dnf install -y nvidia-driver
sudo dnf install -y cuda-toolkit
else
sudo dnf install -y vulkan-devel libglvnd-devel elfutils-libelf-devel xorg-x11-server-Xorg
curl -L -O https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda_12.6.3_560.35.05_linux_sbsa.run
chmod +x ./cuda*.run
sudo ./cuda_*.run --driver --toolkit --tmpdir=/var/tmp --silent
fi
if (! dnf search nvidia | grep -q nvidia-container-toolkit); then
sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
fi
sudo dnf install -y nvidia-container-toolkit
sudo dnf install -y docker
sudo systemctl enable docker
sudo usermod -aG docker ec2-user
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
sudo reboot
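If you launch the instance with the AWS CLI, one way to pass the script above as user data is shown below; the AMI ID, instance type, key pair name and script file name are placeholders for your own values:
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type g4dn.xlarge \
  --key-name my-key-pair \
  --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":20}}]' \
  --user-data file://install-gpu.sh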
Verify
Connect to your EC2 instance
nvidia-smi
/usr/local/cuda/bin/nvcc -V
nvidia-container-cli -V
sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi
View /var/log/cloud-init-output.log to troubleshoot any installation issues.
Perform post-installation actions in order to use the CUDA toolkit. To verify the integrity of the installation, you can download, compile and run CUDA samples such as deviceQuery.
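For example, a minimal sketch of building and running deviceQuery from the cuda-samples repository (the sample layout and build system vary by release; recent releases use CMake at the repository root, while older tags ship per-sample Makefiles):
sudo dnf install -y git
cd /tmp
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
make
./deviceQuery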