
Install NVIDIA GPU driver, CUDA toolkit, NVIDIA Container Toolkit on Amazon EC2 instances running Amazon Linux 2023 (AL2023)

11 minute read
Content level: Expert

Steps to install NVIDIA driver, CUDA Toolkit, NVIDIA Container Toolkit, and other NVIDIA software on AL2023 (Amazon Linux 2023) (x86_64/arm64)

Overview

This article describes how to install the NVIDIA GPU driver, CUDA Toolkit, and NVIDIA Container Toolkit on NVIDIA GPU EC2 instances running AL2023 (Amazon Linux 2023).

Note that by using this method, you agree to the NVIDIA Driver License Agreement, End User License Agreement, and other related license agreements. If you are doing development, you may want to register for the NVIDIA Developer Program.

Pre-built AMIs

If you need AMIs preconfigured with NVIDIA GPU driver, CUDA, other NVIDIA software, and optionally PyTorch or TensorFlow framework, consider AWS Deep Learning AMIs. Refer to Release notes for DLAMIs for currently supported options, and Deep Learning graphical desktop on Amazon Linux 2023 (AL2023) with AWS Deep Learning AMI (DLAMI) for graphical desktop setup guidance.
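For example, one way to look up the most recent Deep Learning AMI for AL2023 is an AWS CLI describe-images query similar to the following sketch (the name filter is an assumption; adjust it to the DLAMI variant you need):

# Find the newest Amazon-owned AMI whose name matches the Deep Learning AL2023 pattern
aws ec2 describe-images --owners amazon \
  --filters "Name=name,Values=Deep Learning*Amazon Linux 2023*" \
  --query 'sort_by(Images, &CreationDate)[-1].[ImageId,Name]' \
  --output text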

For container workloads, consider Amazon ECS-optimized Linux AMIs and Amazon EKS optimized AMIs.

Note: instructions in this article are not applicable to pre-built AMIs.

Custom ECS GPU-optimized AMI

If you wish to build your own custom Amazon ECS GPU-optimized AMI, install the NVIDIA driver, Docker, and the NVIDIA Container Toolkit, then refer to How do I create and use custom AMIs in Amazon ECS?
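As a minimal illustration (not part of the referenced article), the ECS container agent advertises GPUs to ECS when ECS_ENABLED_GPU_SUPPORT is set in its configuration file:

# Enable GPU support in the ECS container agent configuration
# (the /etc/ecs directory exists once the ECS agent is installed; create it if needed)
sudo mkdir -p /etc/ecs
echo "ECS_ENABLED_GPU_SUPPORT=true" | sudo tee -a /etc/ecs/ecs.config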

About CUDA toolkit

The CUDA Toolkit is generally optional when a GPU instance is used to run applications (as opposed to developing them), because a CUDA application typically packages the CUDA runtime and libraries it needs by statically or dynamically linking against them.
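For example, you can check whether an existing application binary dynamically links against the CUDA runtime or driver libraries (my_app below is a placeholder for your own binary); if nothing is listed, the application likely links the runtime statically or loads it at run time:

# List CUDA-related shared library dependencies of a binary
ldd ./my_app | grep -Ei 'libcudart|libcuda'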

Prepare Amazon Linux 2023

Launch a new NVIDIA GPU instance running Amazon Linux 2023, preferably with at least 20 GB of storage, and connect to the instance.

Kernel 6.12

If your AL2023 is running kernel 6.1, update to kernel 6.12 for improvements in scheduling, networking, security, and system tracing.

sudo dnf update -y
# If the instance is still on the 6.1 kernel series, install kernel 6.12 and boot into it
if (uname -r | grep -q ^6\\.1\\.); then
  sudo dnf clean all
  # Determine the newest available kernel version, then append the architecture suffix
  VER=$(dnf list kernel-headers --showduplicates | grep -E "^\s*kernel-headers" | awk '{print $2}' | sort -V | tail -1)
  VER=$VER.$(arch)
  sudo dnf install -y kernel-headers-$VER kernel-devel-$VER kernel6.12-modules-extra-$VER kernel-modules-extra-common-$VER kernel6.12-$VER
  if [ -f /boot/vmlinuz-$VER ]; then
    sudo grubby --set-default "/boot/vmlinuz-$VER"
    sudo reboot
  fi
fi

Refer to Updating the Linux kernel on AL2023 for details.

Prepare AL2023

Install DKMS and kernel headers

sudo dnf clean all
sudo dnf install -y dkms
sudo systemctl enable --now dkms
# The modules-extra package is named kernel6.12-modules-extra on the 6.12 kernel series
if (uname -r | grep -q ^6\\.12\\.); then
  sudo dnf install -y kernel-headers-$(uname -r) kernel-devel-$(uname -r) kernel6.12-modules-extra-$(uname -r) kernel-modules-extra-common-$(uname -r)
else
  sudo dnf install -y kernel-headers-$(uname -r) kernel-devel-$(uname -r) kernel-modules-extra-$(uname -r) kernel-modules-extra-common-$(uname -r)
fi

Install NVIDIA driver and CUDA toolkit

Method 1: Package Manager Installation

CUDA version 12.5 and higher supports Amazon Linux 2023 package manager installation on x86_64. CUDA version 12.9 and NVIDIA driver 570.148.08 add arm64 support.

NVIDIA driver version 560 or higher from the NVIDIA repository supports compute-only/headless mode but not desktop mode. If you need NVIDIA graphical desktop drivers and libraries, refer to the GUI (graphical desktop) remote access section later in this article.

Add repo

You can choose either the NVIDIA repository or the AL2023 repository.

Option 1: NVIDIA repo

# NVIDIA publishes arm64 packages under the "sbsa" (server base system architecture) path
if (arch | grep -q x86); then
  ARCH=x86_64
else
  ARCH=sbsa
fi
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/$ARCH/cuda-amzn2023.repo

Option 2: AL2023 repo (x86_64 only)

The nvidia-release package was added in the 2023.6.20241031 release and enables a yum repository with NVIDIA drivers.

sudo dnf install -y nvidia-release

Install NVIDIA driver

Option 1: NVIDIA repo

sudo dnf module install -y nvidia-driver:open-dkms

To install a specific version, e.g. 575

sudo dnf module install -y nvidia-driver:575-open

The above installs the NVIDIA open-source kernel module. Refer to the Driver Installation Guide for details about NVIDIA kernel modules and installation options.
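For example, to list the driver module streams available from the NVIDIA repository before choosing one:

dnf module list nvidia-driver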

Option 2: AL2023 repo (x86_64 only)

sudo dnf install -y nvidia-open

Install CUDA toolkit

sudo dnf install -y cuda-toolkit

To install a specific version, e.g. 12.9

sudo dnf install -y cuda-toolkit-12-9

Refer to CUDA documentation for installation options
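For example, to list the CUDA Toolkit versions available from the repository:

dnf list --showduplicates "cuda-toolkit*"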

Method 2: Runfile Installation

The runfile installer is not supported on AL2023 and may not work.

Ensure the EC2 instance has more than 10 GB of free disk space.

Install development libraries

sudo dnf install -y vulkan-devel libglvnd-devel elfutils-libelf-devel xorg-x11-server-Xorg

Option 1: NVIDIA driver only

To install NVIDIA driver version 570.148.08

cd /tmp
DRIVER_VERSION=570.148.08
curl -L -O https://us.download.nvidia.com/tesla/$DRIVER_VERSION/NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run
chmod +x ./NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run
sudo ./NVIDIA-Linux-$(arch)-$DRIVER_VERSION.run -s

To install a specific version, refer to Driver Release Notes and modify the above line that sets DRIVER_VERSION value

Option 2: NVIDIA driver and/or CUDA toolkit

You can go to the CUDA Toolkit download page to obtain the latest runfile (local) installer download URL for RHEL 9 on x86_64 and arm64 (sbsa).

cd /var/tmp
if (arch | grep -q x86); then
  wget https://developer.download.nvidia.com/compute/cuda/12.9.0/local_installers/cuda_12.9.0_575.51.03_linux.run
else 
  wget https://developer.download.nvidia.com/compute/cuda/12.9.0/local_installers/cuda_12.9.0_575.51.03_linux_sbsa.run
fi
chmod +x ./cuda*.run

To install another version, refer to CUDA Toolkit Archive for runfile (local) download link.

Option 2a: NVIDIA driver and CUDA toolkit

sudo ./cuda_*.run --driver --toolkit --tmpdir=/var/tmp --silent

Option 2b: CUDA toolkit only

sudo ./cuda_*.run  --toolkit --tmpdir=/var/tmp --silent

To troubleshoot installation issues, view the contents of /var/log/nvidia-installer.log and /var/log/cuda-installer.log (if applicable)

Refer to CUDA documentation for installation options

Runfile Uninstallation

To uninstall CUDA Toolkit, run the uninstallation script provided in the bin directory of the toolkit. For version 12.9

sudo /usr/local/cuda-12.9/bin/cuda-uninstaller

To remove NVIDIA driver

sudo /usr/bin/nvidia-uninstall

Post installation

Restart your OS

sudo reboot

Verify NVIDIA driver

nvidia-smi

Output should be similar to below

Fri May 23 14:45:57 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T4G                     Off |   00000000:00:1F.0 Off |                    0 |
| N/A   70C    P0             33W /   70W |       0MiB /  15360MiB |      8%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                        
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Verify CUDA toolkit

/usr/local/cuda/bin/nvcc -V

Output should be similar to below

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Fri_Feb_21_20:26:18_PST_2025
Cuda compilation tools, release 12.8, V12.8.93
Build cuda_12.8.r12.8/compiler.35583870_0

Post-installation Actions

Refer to the NVIDIA CUDA Installation Guide for Linux for post-installation actions required before the CUDA Toolkit can be used. For example, you may want to modify your PATH environment variable to include /usr/local/cuda/bin. For runfile installations, also modify LD_LIBRARY_PATH to include /usr/local/cuda/lib64
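A minimal sketch of those environment changes for the current shell (append the same lines to ~/.bashrc to make them persistent):

# Add the CUDA compiler and libraries to the environment
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}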

NVIDIA Container Toolkit

The NVIDIA Container Toolkit supports AL2023 on both x86_64 and arm64.

For arm64, use a g5g.2xlarge or larger instance size, as g5g.xlarge may cause failures due to limited system memory.

# Add the NVIDIA Container Toolkit repository if the package is not already available
if (! dnf search nvidia | grep -q nvidia-container-toolkit); then
  sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
fi
sudo dnf install -y nvidia-container-toolkit

Refer to NVIDIA Container toolkit documentation about supported platforms, prerequisites and installation options

Verify Container Toolkit

nvidia-container-cli -V

Output should be similar to below

cli-version: 1.17.7
lib-version: 1.17.7
build date: 2025-05-16T13:28+0000
build revision: d26524ab5db96a55ae86033f53de50d3794fb547
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: aarch64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Container engine configuration

Refer to NVIDIA Container Toolkit site for container engine configuration instructions.

Docker

To install and configure docker

sudo dnf install -y docker
sudo systemctl enable docker
sudo usermod -aG docker ec2-user

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Verify Docker engine configuration

To verify docker configuration

sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi

Output should be similar to below

Unable to find image 'public.ecr.aws/amazonlinux/amazonlinux:2023' locally
2023: Pulling from amazonlinux/amazonlinux
b9b2e8e61af6: Pull complete 
Digest: sha256:ff1fad724e2ef77b8851124cbc35204d1defe63128f077021a2b3e459fcd866f
Status: Downloaded newer image for public.ecr.aws/amazonlinux/amazonlinux:2023
Fri May 23 14:46:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.51.03              Driver Version: 575.51.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA T4G                     Off |   00000000:00:1F.0 Off |                    0 |
| N/A   68C    P0             32W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Install on EC2 instance at launch

To install the NVIDIA driver and NVIDIA Container Toolkit (including Docker) using Method 1 when launching a new AL2023 GPU instance, preferably with kernel 6.12 and at least 20 GB of storage, you can use the following as a user data script. Uncomment the line ending with cuda-toolkit to install the CUDA Toolkit.

#!/bin/bash
sudo dnf clean all
sudo dnf install -y dkms
sudo systemctl enable dkms
if (uname -r | grep -q ^6\\.12\\.); then
  sudo dnf install -y kernel-headers-$(uname -r) kernel-devel-$(uname -r) kernel6.12-modules-extra-$(uname -r) kernel-modules-extra-common-$(uname -r)
else
  sudo dnf install -y kernel-headers-$(uname -r) kernel-devel-$(uname -r) kernel-modules-extra-$(uname -r) kernel-modules-extra-common-$(uname -r)
fi

cd /tmp

if (arch | grep -q x86); then
  ARCH=x86_64
else
  ARCH=sbsa
fi
sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/amzn2023/$ARCH/cuda-amzn2023.repo

sudo dnf module install -y nvidia-driver:open-dkms

# sudo dnf install -y cuda-toolkit

if (! dnf search nvidia | grep -q nvidia-container-toolkit); then
  sudo dnf config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
fi
sudo dnf install -y nvidia-container-toolkit

sudo dnf install -y docker
sudo systemctl enable docker
sudo usermod -aG docker ec2-user

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

sudo reboot

Verify

Connect to your EC2 instance

nvidia-smi
/usr/local/cuda/bin/nvcc -V
nvidia-container-cli -V
sudo docker run --rm --runtime=nvidia --gpus all public.ecr.aws/amazonlinux/amazonlinux:2023 nvidia-smi

View /var/log/cloud-init-output.log to troubleshoot any installation issues.
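For example, to view the last part of the log:

sudo tail -n 100 /var/log/cloud-init-output.log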

Perform post-installation actions in order to use the CUDA Toolkit. To verify the integrity of the installation, you can download, compile, and run CUDA samples such as deviceQuery, as shown in the sketch below.
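A minimal sketch of building deviceQuery from the cuda-samples repository, assuming git and a C++ toolchain are installed and /usr/local/cuda/bin is on your PATH. The tag below is only an example; check out a release tag that matches your installed CUDA version (newer releases build with CMake instead of make):

sudo dnf install -y git make gcc-c++
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
git checkout v12.4.1   # example tag; match your CUDA version
cd Samples/1_Utilities/deviceQuery
make
./deviceQuery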

[Screenshot: Amazon Linux 2023 on g4dn]

If Docker and the NVIDIA Container Toolkit (but not the CUDA Toolkit) are installed and configured, you can use the CUDA samples container image to validate the CUDA driver.

sudo docker run --rm --runtime=nvidia --gpus all nvcr.io/nvidia/k8s/cuda-sample:devicequery

[Screenshot: AL2023 CUDA driver]

GUI (graphical desktop) remote access

If you need remote graphical desktop access, refer to How do I install GUI (graphical desktop) on Amazon EC2 instances running Amazon Linux 2023 (AL2023)?

This article installs the NVIDIA Tesla driver (also known as the NVIDIA Data Center driver), which is intended primarily for GPU compute workloads. GRID drivers provide access to four 4K displays per GPU and are certified to provide optimal performance for professional visualization applications. Refer to GPU-accelerated graphical desktop on Amazon Linux 2023 (AL2023) with NVIDIA GRID and Amazon DCV for setup guidance.

Other Software

NVIDIA GPUDirect Storage

If you use Method 1 to install the NVIDIA driver only, you can install NVIDIA Magnum IO GPUDirect® Storage (GDS) and libcufile

sudo dnf install -y nvidia-gds

To install GDS only

sudo dnf install -y nvidia-fs

Reboot

Reboot after installation is complete

sudo reboot

Verify

To verify installation

lsmod | grep nvidia_fs

Output should be similar to below

nvidia_fs             262144  0
nvidia              11481088  3 nvidia_uvm,nvidia_fs,nvidia_modeset

If the nvidia-gds meta-package is installed

/usr/local/cuda/gds/tools/gdscheck -p

Output should be similar to below

 GDS release version: 1.14.1.1
 nvidia_fs version:  2.25 libcufile version: 2.12
 Platform: x86_64
...
...
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Nvidia Driver Info Status: Supported(Nvidia Open Driver Installed)
 Cuda Driver Version Installed:  12090
 Platform: g4dn.xlarge, Arch: x86_64(Linux 6.1.141-155.222.amzn2023.x86_64)
 Platform verification succeeded

Refer to GDS documentation and Driver installation guide for more information

5 Comments

This is great Mike!
Are there options for Graviton/ARM?

AWS EXPERT · replied a year ago

Hello, I get ERROR when run the sample workload

[root@ip bin]# docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-sm
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: driver rpc error: failed to process request: unknown.

I'm using AL2023 (ami-0b17ca9fb2a39a659) on a Graviton ARM (g5g.xlarge). Any advice?

replied a year ago

Worked perfectly to build an ECS-optimized GPU-ready AMI based on Al2023 (ami-01c1ede61c128dc37)! Thank you so much for this post!

replied a year ago

Been trying to do exactly that on a g4dn.xlarge machine, using these steps and also a bunch of other variations.

Keep getting:

[ec2-user@ ~]$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

[ec2-user@ ~]$ lsmod | grep nvidia
[ec2-user@ ~]$ sudo modprobe nvidia
modprobe: ERROR: could not insert 'nvidia': Unknown symbol in module, or unknown parameter (see dmesg)
[ec2-user@ ~]$ sudo dmesg | grep -i nvidia
[    4.918126] nvidia: loading out-of-tree module taints kernel.
[    4.918717] nvidia: module license 'NVIDIA' taints kernel.
[    4.944984] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    4.946115] nvidia: Unknown symbol drm_gem_object_free (err -2)
[    5.054328] nvidia: Unknown symbol drm_gem_object_free (err -2)
[  547.271166] nvidia: Unknown symbol drm_gem_object_free (err -2)
[  547.370785] nvidia: Unknown symbol drm_gem_object_free (err -2)
[  547.449667] nvidia: Unknown symbol drm_gem_object_free (err -2)
[  845.532310] nvidia: Unknown symbol drm_gem_object_free (err -2)

Apparently one might get this if the driver doesn't match the kernel (makes sense), but at this point I'm pretty sure there's something else going on.

My goal is to run a fairly straightforward Stable Diffusion setup, and I possibly need a newer Python than the 3.7 (I think) that the preconfigured "Deep Learning" AL 2 AMIs come with.

replied 9 months ago

Confirmed that this works with the latest AL2023 AMI as long as you have at least 15GB of storage.

replied 8 months ago