How do I resolve NVIDIA GPU and GPU driver communication issues in Amazon EC2?

My Amazon Elastic Compute Cloud (Amazon EC2) instance uses a GPU instance type. However, I can't communicate with the NVIDIA GPU or GPU drivers.

Resolution

You might encounter the following error messages based on the command or tool that you use to communicate with the GPU or GPU drivers.

For nvidia-smi, you receive the following error message:

"NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."

For jax.devices(), you receive the following error message in the terminal:

"WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.) [CpuDevice(id=0)]"

Typically, you receive the preceding error messages because of compatibility issues between your hardware, drivers, and libraries. To use instance types with GPUs, you must install GPU drivers and related libraries on your operating system (OS). Also, the tools and libraries that communicate with the GPU and GPU drivers must be GPU-compatible.

To check whether you received a jax.devices() error, run the following commands:

python
import jax
jax.devices()

Example output:

$ python
Python 3.9.22 (main, Apr 29 2025, 00:00:00) 
[GCC 11.5.0 20240719 (Red Hat 11.5.0-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> jax.devices()
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
[CpuDevice(id=0)]
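
The same check can be scripted so that it reports which platforms JAX sees, rather than relying on reading the warning. The following is a minimal sketch and isn't part of the original procedure. The helper names summarize_platforms and has_accelerator are hypothetical, and the sketch assumes that each device object exposes a platform attribute, as JAX devices do.

```python
from collections import Counter


def summarize_platforms(devices):
    """Count devices per platform name, for example {'cpu': 1} or {'gpu': 1}."""
    return dict(Counter(d.platform for d in devices))


def has_accelerator(devices):
    """Return True if at least one device is not a CPU device."""
    return any(d.platform != "cpu" for d in devices)


if __name__ == "__main__":
    try:
        import jax
    except ImportError:
        print("JAX is not installed in this environment.")
    else:
        devices = jax.devices()
        print("Platforms:", summarize_platforms(devices))
        if not has_accelerator(devices):
            # Assumption: a CPU-only list means JAX fell back to CPU.
            print("JAX sees only CPU devices. Check the driver and the JAX install.")
```

If the script reports only CPU devices on a GPU instance, continue with the following checks.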

Verify that your AMI supports the instance type that you used

It's a best practice to use AWS Deep Learning AMIs (DLAMI) for NVIDIA GPU drivers. Check the release notes for the DLAMI that you want to use to make sure that it's compatible with your configuration.

Note: There are two types of DLAMIs for NVIDIA drivers: one type uses proprietary drivers, and the other uses open source drivers. Each DLAMI supports specific instance types.

Verify that you installed the CUDA-compatible library

If you tried to run jax.devices() and encountered an error, then you might not have installed the CUDA-compatible JAX library.

To check whether you installed the library, run the following command:

pip list|grep jax

Example output:

$ pip list|grep jax
jax                              0.4.30
jaxlib                           0.4.30

If the CUDA-compatible libraries aren't in the command's output, then run the following command to install the JAX library:

pip install -U "jax[cuda12]"

Note: Replace cuda12 in jax[cuda12] with the value for your CUDA version.

Then, rerun the pip list command to verify that you correctly installed the library.

Example output:

$ pip list|grep jax
jax                              0.4.30
jax-cuda12-pjrt                  0.4.30
jax-cuda12-plugin                0.4.30
jaxlib                           0.4.30
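
As an alternative to reading the pip list output by eye, the check can be done from Python. The following is a minimal sketch that uses only the standard library. The helper names find_jax_packages and has_cuda_plugin are hypothetical, and the sketch assumes that a CUDA-enabled install is recognizable by "cuda" in a JAX-related package name, as in the preceding example output.

```python
from importlib import metadata


def find_jax_packages(names):
    """Return the sorted, lowercased package names that start with 'jax'."""
    return sorted({n.lower() for n in names if n and n.lower().startswith("jax")})


def has_cuda_plugin(jax_packages):
    """Return True if any name looks like a CUDA plugin, such as jax-cuda12-plugin."""
    return any("cuda" in name for name in jax_packages)


if __name__ == "__main__":
    # Collect the names of all distributions installed in this environment.
    installed = [dist.metadata["Name"] for dist in metadata.distributions()]
    jax_packages = find_jax_packages(installed)
    print("JAX-related packages:", jax_packages)
    if not has_cuda_plugin(jax_packages):
        print("No CUDA plugin package found. JAX will likely use the CPU only.")
```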

Verify that the CUDA versions in "nvcc --version" and "nvidia-smi" are the same

First, run the following command:

nvcc --version

Example output:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0

Then, run the following command:

nvidia-smi

Example output:

$ nvidia-smi
Mon Apr 21 10:52:36 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+

Compare the CUDA versions in the outputs of the commands. In the preceding example, NVIDIA driver 535.183.01 supports CUDA version 12.2 or earlier, but the installed CUDA toolkit is version 12.5. This mismatch causes the issue.
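
The comparison can also be sketched in code. The following example is illustrative only: the function names are hypothetical, and the sketch assumes that the nvcc output contains "release X.Y" and the nvidia-smi output contains "CUDA Version: X.Y", as in the preceding example outputs. It flags the case where the toolkit version is newer than the version that the driver reports.

```python
import re


def parse_nvcc_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' text in nvcc --version output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def parse_smi_cuda(smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' text in nvidia-smi output."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_exceeds_driver(nvcc_output, smi_output):
    """Return True when the toolkit release is newer than the driver's CUDA version."""
    toolkit = parse_nvcc_release(nvcc_output)
    driver = parse_smi_cuda(smi_output)
    if toolkit is None or driver is None:
        raise ValueError("Could not find a CUDA version in one of the outputs.")
    return toolkit > driver


# Values taken from the example outputs in this section.
nvcc_text = "Cuda compilation tools, release 12.5, V12.5.40"
smi_text = "| NVIDIA-SMI 535.183.01   Driver Version: 535.183.01   CUDA Version: 12.2 |"
print(toolkit_exceeds_driver(nvcc_text, smi_text))  # True: 12.5 is newer than 12.2
```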

To resolve the issue, run the following commands based on your OS to uninstall CUDA 12.5 and install CUDA 12.2.

Amazon Linux 2023 (AL2023), Red Hat Enterprise Linux (RHEL) 8, or RHEL 9:

sudo dnf remove cuda-toolkit-12-5
sudo dnf install cuda-toolkit-12-2

Amazon Linux 2 (AL2) or RHEL 7:

sudo yum remove cuda-toolkit-12-5
sudo yum install cuda-toolkit-12-2

Ubuntu:

sudo apt remove cuda-toolkit-12-5
sudo apt install cuda-toolkit-12-2

After you install CUDA 12.2, run the following commands in a Python session to verify that you can use jax.devices() to communicate with the GPU:

import jax
jax.devices()

Example output:

[cuda(id=0)]

To avoid GPU compatibility issues, it's a best practice to use DLAMIs, which include optimized NVIDIA drivers, a configured CUDA toolkit, and better support and compatibility.

(Optional) Upgrade your kernel and driver

Important: When you use a supported Amazon Machine Image (AMI), you don't need to manually install NVIDIA drivers because they're already installed on the AMI.

To maintain compatibility with the installed driver and package versions, it's a best practice not to update your kernel version. However, if you must update your kernel version because of a security patch, then run the following commands to update the kernel:

sudo dnf versionlock delete kernel*
sudo dnf update -y

Note: The preceding example commands are for the AWS Deep Learning Base AMI (Amazon Linux 2023).

To manually install or update the NVIDIA drivers, see Installation options.

Verify that communication works as expected

To test that your instance can communicate with the GPU, run the following command in a terminal:

nvidia-smi

-or-

Run the following commands in a Python session:

import jax
jax.devices()

Example outputs:

$ nvidia-smi
Wed May 21 11:04:43 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
|  0%   32C    P8             10W /  300W |       0MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

>>> import jax
>>> jax.devices()
[cuda(id=0)]
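
To run the nvidia-smi check from a script, for example as part of an instance health check, a wrapper such as the following might help. This is a minimal sketch under stated assumptions: the function name check_driver is hypothetical, and the sketch assumes that nvidia-smi exits with a nonzero status when it can't communicate with the driver.

```python
import shutil
import subprocess


def check_driver(command="nvidia-smi"):
    """Run the command and return (ok, message) that describes the result."""
    if shutil.which(command) is None:
        return False, f"{command} was not found on PATH."
    result = subprocess.run([command], capture_output=True, text=True)
    if result.returncode != 0:
        # Assumption: a nonzero exit status means the driver is unreachable.
        return False, (result.stdout or result.stderr).strip()
    return True, "The command completed successfully."


if __name__ == "__main__":
    ok, message = check_driver()
    print("Driver reachable:", ok)
    print(message)
```

The function returns a tuple rather than raising an exception so that a caller can log the message and decide how to react.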

Related information

NVIDIA drivers for your Amazon EC2 instance

AWS OFFICIAL
Updated a year ago