How do I install the NVIDIA GPU driver, CUDA Toolkit, and NVIDIA Container Toolkit on Amazon EC2 instances running RHEL/Rocky Linux 8/9?
I want to install NVIDIA driver, CUDA Toolkit, NVIDIA Container Toolkit, and other NVIDIA software on RHEL/Rocky 8/9 (x86_64/arm64)
Overview
This article describes how to install the NVIDIA GPU driver, CUDA Toolkit, NVIDIA Container Toolkit, and other NVIDIA software directly from the NVIDIA repository on NVIDIA GPU EC2 instances running RHEL (Red Hat Enterprise Linux) or Rocky Linux.
Note that by using this method, you agree to the NVIDIA Driver License Agreement, End User License Agreement, and other related license agreements. If you are doing development, you may want to register for the NVIDIA Developer Program.
Pre-built AMIs
If you need AMIs preconfigured with TensorFlow, PyTorch, NVIDIA CUDA drivers and libraries, consider AWS Deep Learning AMIs. Refer to Release notes for DLAMIs for currently supported options.
For container workloads, consider Amazon ECS-optimized Linux AMIs and Amazon EKS optimized AMIs.
Note: the instructions in this article do not apply to pre-built AMIs.
GUI (graphical desktop) remote access
If you need remote graphical desktop access, refer to How do I install GUI (graphical desktop) on Amazon EC2 instances running RHEL/Rocky Linux 8/9?
Note that this article installs the NVIDIA Tesla driver (also known as the NVIDIA Datacenter Driver), which is intended primarily for GPU compute workloads. If configured in xorg.conf, Tesla drivers support one display of up to 2560x1600 resolution. GRID drivers provide access to four 4K displays per GPU and are certified to provide optimal performance for professional visualization applications.
About CUDA toolkit
CUDA Toolkit is generally optional when a GPU instance is used to run applications (as opposed to developing them), because a CUDA application typically packages the CUDA runtime and libraries it needs by statically or dynamically linking against them.
System Requirements
This article covers the following platforms
- Red Hat Enterprise Linux (RHEL) 9 (x86_64 and arm64)
- Red Hat Enterprise Linux (RHEL) 8 (x86_64 and arm64)
- Rocky Linux 9 (x86_64)
- Rocky Linux 8 (x86_64)
While it may work, NVIDIA does not support Rocky Linux on the arm64 architecture or other RHEL-compatible Linux distributions such as AlmaLinux. Refer to the Driver installation guide for supported kernel versions, compilers and libraries.
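The platform list above can be turned into a quick pre-flight check. The following is a minimal sketch, not an official support test: the function name is_covered_platform is hypothetical, and it assumes the ID field in /etc/os-release is rhel on RHEL and rocky on Rocky Linux.

```shell
# Sketch: report whether an OS/architecture pair is one of the platforms
# listed in this article. The function name is illustrative only.
is_covered_platform() {
  local id="$1" major="${2%%.*}" machine="$3"
  case "$id:$major:$machine" in
    rhel:8:x86_64|rhel:9:x86_64|rhel:8:aarch64|rhel:9:aarch64) return 0 ;;
    rocky:8:x86_64|rocky:9:x86_64) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage on a live instance (ID and VERSION_ID come from os-release):
#   . /etc/os-release
#   is_covered_platform "$ID" "$VERSION_ID" "$(uname -m)" || echo "platform not covered by this article"
```

Only the major version is compared, so 9.4 and 9.5 are treated the same way.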
Prepare Rocky Linux / RHEL
Launch a new NVIDIA GPU instance, preferably with at least 20 GB of storage, and connect to the instance
Update the OS, add the EPEL repository, and install DKMS, kernel headers and development packages
sudo dnf update -y
OS_VERSION=$(. /etc/os-release;echo $VERSION_ID | sed -e 's/\..*//g')
if ( cat /etc/os-release | grep -q Red ); then
sudo subscription-manager repos --enable codeready-builder-for-rhel-$OS_VERSION-$(arch)-rpms
elif ( echo $OS_VERSION | grep -q 8 ); then
sudo dnf config-manager --set-enabled powertools
else
sudo dnf config-manager --set-enabled crb
fi
sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-$OS_VERSION.noarch.rpm
sudo dnf install -y dkms kernel-devel kernel-modules-extra unzip gcc make vulkan-devel libglvnd-devel elfutils-libelf-devel xorg-x11-server-Xorg
sudo systemctl enable --now dkms
Restart your EC2 instance if the kernel was updated
sudo reboot
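A reboot is only needed when the update installed a kernel newer than the one currently running. The sketch below shows one way to check that, assuming the kernel is packaged as kernel-core and that uname -r prints VERSION-RELEASE.ARCH; the function names are hypothetical.

```shell
# Sketch: decide whether a reboot is needed by comparing the running kernel
# with the newest installed kernel. Function names are illustrative only.
newest_version() {
  # Reads version strings on stdin and prints the highest by version sort.
  sort -V | tail -n 1
}

reboot_needed() {
  local running="$1" newest="$2"
  # True only when a newest version is known and differs from the running one.
  [ -n "$newest" ] && [ "$running" != "$newest" ]
}

# Example usage on a live instance:
#   newest=$(rpm -q kernel-core --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' | newest_version)
#   reboot_needed "$(uname -r)" "$newest" && sudo reboot
```

Booting into the newest kernel before installing the driver matters because DKMS builds the module against the kernel headers that were just installed.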
Add NVIDIA repository
Configure Network Repo installation
DISTRO=$(. /etc/os-release;echo rhel$VERSION_ID | sed -e 's/\..*//g')
if (arch | grep -q x86); then
ARCH=x86_64
else
ARCH=sbsa
fi
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$DISTRO/$ARCH/cuda-$DISTRO.repo
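To make the mapping in the commands above explicit, the sketch below builds the repository URL from an OS version and a machine architecture. The function name nvidia_repo_url is hypothetical; the host and path are the ones used above, with the https scheme, and arm64 maps to the sbsa repository directory.

```shell
# Sketch: build the NVIDIA network repo URL for a RHEL-compatible OS.
# nvidia_repo_url is an illustrative helper, not an NVIDIA tool.
nvidia_repo_url() {
  local version_id="$1" machine="$2" distro arch
  distro="rhel${version_id%%.*}"        # 9.4 -> rhel9
  case "$machine" in
    x86_64)  arch=x86_64 ;;
    aarch64) arch=sbsa ;;               # arm64 server repo directory
    *) echo "unsupported architecture: $machine" >&2; return 1 ;;
  esac
  echo "https://developer.download.nvidia.com/compute/cuda/repos/$distro/$arch/cuda-$distro.repo"
}

# Example usage on a live instance:
#   . /etc/os-release
#   sudo dnf config-manager --add-repo "$(nvidia_repo_url "$VERSION_ID" "$(uname -m)")"
```

Failing on an unknown architecture avoids silently adding an sbsa repository on a machine that is not arm64.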
Install NVIDIA Driver
To install latest Tesla driver
sudo dnf module install -y nvidia-driver:latest-dkms
To install a specific version, e.g. 565
sudo dnf module install -y nvidia-driver:565-dkms
The commands above install the NVIDIA proprietary kernel module. Refer to the Driver Installation Guide for details about NVIDIA kernel modules and installation options.
Verify
Restart your instance, reconnect, and run
nvidia-smi
Output should be similar to below
Sat Nov 9 05:06:35 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA T4G Off | 00000000:00:1F.0 Off | 0 |
| N/A 72C P8 13W / 70W | 1MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
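For scripted checks, the table above is awkward to parse. A minimal sketch follows that works on the driver version string instead, assuming nvidia-smi is queried with --query-gpu=driver_version --format=csv,noheader; the helper names are hypothetical.

```shell
# Sketch: compare an installed driver version with an expected branch.
# Helper names are illustrative only.
driver_major() {
  # Prints the leading numeric component of a version such as 565.57.01.
  local version="$1"
  echo "${version%%.*}"
}

driver_matches_branch() {
  local version="$1" branch="$2"
  [ "$(driver_major "$version")" = "$branch" ]
}

# Example usage on a live instance:
#   version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_matches_branch "$version" 565 || echo "unexpected driver branch: $version" >&2
```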
Optional: CUDA Toolkit
To install latest CUDA Toolkit
sudo dnf install -y cuda-toolkit
To install a specific version, e.g. 12.6
sudo dnf install -y cuda-toolkit-12-6
Refer to CUDA Toolkit documentation about supported platforms and installation options
Verify
/usr/local/cuda/bin/nvcc -V
Output should be similar to below
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Sep_12_02:27:38_PDT_2024
Cuda compilation tools, release 12.6, V12.6.77
Build cuda_12.6.r12.6/compiler.34841621_0
Post-installation Actions
Refer to the NVIDIA CUDA Installation Guide for Linux for the post-installation actions required before CUDA Toolkit can be used. For example, you may want to add /usr/local/cuda/bin to your PATH variable as per Post-installation Actions: Mandatory Actions.
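One way to add the directory without duplicating it on every login is sketched below. The function name prepend_path_once is hypothetical, and placing the export in ~/.bashrc is an assumption about your shell setup rather than a requirement from the NVIDIA guide.

```shell
# Sketch: prepend a directory to a PATH-style string only if it is absent.
# prepend_path_once is an illustrative helper.
prepend_path_once() {
  local dir="$1" current="$2"
  case ":$current:" in
    *":$dir:"*) echo "$current" ;;              # already present, unchanged
    *) echo "$dir${current:+:$current}" ;;      # prepend, no trailing colon
  esac
}

# Example usage (for instance in ~/.bashrc):
#   export PATH=$(prepend_path_once /usr/local/cuda/bin "$PATH")
```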
Optional: NVIDIA Container Toolkit
NVIDIA Container Toolkit supports RHEL on both x86_64 and arm64. For arm64, use g5g.2xlarge or larger instance size as g5g.xlarge may cause failures due to the limited system memory.
To install latest NVIDIA Container Toolkit
sudo dnf install -y nvidia-container-toolkit
Refer to NVIDIA Container toolkit documentation about supported platforms, prerequisites and installation options
Verify
nvidia-container-cli -V
Output should be similar to below
cli-version: 1.17.0
lib-version: 1.17.0
build date: 2024-10-31T09:20+0000
build revision: 63d366ee3b4183513c310ac557bf31b05b83328f
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: aarch64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
Container engine configuration
Refer to NVIDIA Container Toolkit documentation about container engine configuration.
Install and configure Docker
To install and configure docker
# Pick the default login user for the AMI; a separate variable leaves the shell's own USER untouched
if (cat /etc/os-release | grep -q Rocky); then
DOCKER_USER="rocky"
else
DOCKER_USER="ec2-user"
fi
sudo dnf config-manager --add-repo https://download.docker.com/linux/rhel/docker-ce.repo
sudo dnf install -y docker-ce docker-ce-cli containerd.io
sudo usermod -aG docker $DOCKER_USER
sudo systemctl enable docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
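Before running a GPU container, it can help to confirm that the configure step registered the runtime. The sketch below assumes that nvidia-ctk runtime configure --runtime=docker writes a runtimes entry named nvidia to /etc/docker/daemon.json; the function name has_nvidia_runtime is hypothetical, and the check is a plain text match rather than a JSON parse.

```shell
# Sketch: check whether a Docker daemon config mentions an "nvidia" runtime key.
# has_nvidia_runtime is an illustrative helper; it does not validate the JSON.
has_nvidia_runtime() {
  local config="${1:-/etc/docker/daemon.json}"
  [ -r "$config" ] && grep -q '"nvidia"[[:space:]]*:' "$config"
}

# Example usage on a live instance:
#   has_nvidia_runtime || echo "nvidia runtime not found in /etc/docker/daemon.json" >&2
```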
Verify Docker engine configuration
To verify docker configuration
sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/docker/library/rockylinux:9 nvidia-smi
Output should be similar to below
Unable to find image 'public.ecr.aws/docker/library/rockylinux:9' locally
9: Pulling from docker/library/rockylinux
4c81ef64b3e1: Pull complete
Digest: sha256:d7be1c094cc5845ee815d4632fe377514ee6ebcf8efaed6892889657e5ddaaa6
Status: Downloaded newer image for public.ecr.aws/docker/library/rockylinux:9
Sat Nov 9 05:15:02 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA T4G Off | 00000000:00:1F.0 Off | 0 |
| N/A 48C P8 11W / 70W | 1MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Install NVIDIA driver, CUDA Toolkit and NVIDIA Container Toolkit on an EC2 instance at launch
To install the NVIDIA driver, CUDA Toolkit and NVIDIA Container Toolkit, including Docker, when launching a new GPU instance, you can use the following as a user data script.
#!/bin/bash
sudo dnf update -y
OS_VERSION=$(. /etc/os-release;echo $VERSION_ID | sed -e 's/\..*//g')
if ( cat /etc/os-release | grep -q Red ); then
sudo subscription-manager repos --enable codeready-builder-for-rhel-$OS_VERSION-$(arch)-rpms
elif ( echo $OS_VERSION | grep -q 8 ); then
sudo dnf config-manager --set-enabled powertools
else
sudo dnf config-manager --set-enabled crb
fi
sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-$OS_VERSION.noarch.rpm
sudo dnf install -y dkms kernel-devel kernel-modules-extra unzip gcc make vulkan-devel libglvnd-devel elfutils-libelf-devel xorg-x11-server-Xorg
sudo systemctl enable --now dkms
DISTRO=$(. /etc/os-release;echo rhel$VERSION_ID | sed -e 's/\..*//g')
if (arch | grep -q x86); then
ARCH=x86_64
else
ARCH=sbsa
fi
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/$DISTRO/$ARCH/cuda-$DISTRO.repo
sudo dnf module install -y nvidia-driver:latest-dkms
sudo dnf install -y cuda-toolkit
if (cat /etc/os-release | grep -q Rocky); then
DOCKER_USER="rocky"
else
DOCKER_USER="ec2-user"
fi
sudo dnf config-manager --add-repo https://download.docker.com/linux/rhel/docker-ce.repo
sudo dnf install -y docker-ce docker-ce-cli containerd.io
sudo systemctl enable docker
sudo usermod -aG docker $DOCKER_USER
sudo dnf install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
sudo reboot
Verify
Connect to your EC2 instance
nvidia-smi
/usr/local/cuda/bin/nvcc -V
nvidia-container-cli -V
sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/docker/library/rockylinux:9 nvidia-smi
View /var/log/cloud-init-output.log to troubleshoot any installation issues.
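The log can be long, so a first pass that lists only suspicious lines may save time. This is a rough sketch: the function name scan_install_log is hypothetical, and the search patterns are assumptions about typical dnf and script failure messages, so a clean result does not prove the installation succeeded.

```shell
# Sketch: list log lines that look like installation failures, with line numbers.
# scan_install_log is an illustrative helper; the patterns are assumptions.
scan_install_log() {
  local log="${1:-/var/log/cloud-init-output.log}"
  grep -inE 'error|fail|no match for argument' "$log"
}

# Example usage on a live instance:
#   sudo cat /var/log/cloud-init-output.log > /tmp/cloud-init-output.log
#   scan_install_log /tmp/cloud-init-output.log || echo "no obvious failures found"
```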
Perform post-installation actions in order to use CUDA toolkit. To verify integrity of installation, you can download, compile and run CUDA samples such as deviceQuery.
Other software
AWS CLI
To install AWS CLI (AWS Command Line Interface) v2 through Snap
sudo dnf install -y snapd
sudo systemctl enable --now snapd snapd.socket
sudo ln -s /var/lib/snapd/snap /snap
sudo snap install aws-cli --classic
Verify
Log off and log in so that your PATH variables are updated correctly.
aws --version
Output should be similar to below
aws-cli/2.19.4 Python/3.12.6 Linux/5.14.0-427.42.1.el9_4.aarch64 exe/aarch64.rhel.9
SSM Agent
To install SSM agent for Session Manager access
if (arch | grep -q x86); then
sudo dnf install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
else
sudo dnf install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_arm64/amazon-ssm-agent.rpm
fi
This requires the EC2 instance to have an attached IAM role with the AmazonSSMManagedInstanceCore managed policy.
EC2 Instance Connect
To install EC2 Instance Connect for secure SSH access
cd /tmp
if (arch | grep -q x86); then
ARCH=amd64
else
ARCH=arm64
fi
# Select the RHEL/Rocky 8 package when the major version is 8
if ( . /etc/os-release; echo $VERSION_ID | grep -q '^8' ); then
curl -L -O https://amazon-ec2-instance-connect-us-west-2.s3.us-west-2.amazonaws.com/latest/linux_$ARCH/ec2-instance-connect.rhel8.rpm
curl -L -O https://amazon-ec2-instance-connect-us-west-2.s3.us-west-2.amazonaws.com/latest/linux_amd64/ec2-instance-connect-selinux.noarch.rpm
sudo dnf install -y ./ec2-instance-connect.rhel8.rpm ./ec2-instance-connect-selinux.noarch.rpm
else
curl -L -O https://amazon-ec2-instance-connect-us-west-2.s3.us-west-2.amazonaws.com/latest/linux_$ARCH/ec2-instance-connect.rpm
curl -L -O https://amazon-ec2-instance-connect-us-west-2.s3.us-west-2.amazonaws.com/latest/linux_amd64/ec2-instance-connect-selinux.noarch.rpm
sudo dnf install -y ./ec2-instance-connect.rpm ./ec2-instance-connect-selinux.noarch.rpm
fi
sudo systemctl restart sshd
Allow inbound SSH traffic in your security group
cuDNN (CUDA Deep Neural Network library)
To install cuDNN for the latest available CUDA version.
sudo dnf install -y zlib cudnn
Refer to cuDNN documentation about installation options and support matrix
NCCL (NVIDIA Collective Communication Library)
To install latest NCCL
sudo dnf install -y libnccl libnccl-devel libnccl-static
Refer to NCCL documentation about installation options
DCGM (NVIDIA Data Center GPU Manager)
To install latest DCGM
sudo dnf install -y datacenter-gpu-manager
Refer to DCGM documentation for more information
Verify
dcgmi -v
Output should be similar to below
Version : 3.3.8
Build ID : 43
Build Date : 2024-09-03
Build Type : Release
Commit ID : be8d66b4318e1d5d6e31b67759dc924d1bc18681
Branch Name : rel_dcgm_3_3
CPU Arch : aarch64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 93724fdcffc34a2656865a161c2d79df
Fabric Manager
To install latest Fabric Manager
sudo dnf install -y nvidia-fabric-manager
To install a specific version, e.g. 565
sudo dnf install -y nvidia-fabricmanager-565
Refer to Fabric Manager documentation for supported platforms and installation options
Verify
nv-fabricmanager -v
Output should be similar to below
Fabric Manager version is : 565.57.01