
How to Resolve NVML Library Loading Errors When Installing NVIDIA Device Plugin on Amazon EKS

Content level: Intermediate

This article explains how to configure Amazon EKS GPU nodes by selecting the correct AMI, and how to troubleshoot issues that can occur along the way.

Overview

When using GPU nodes in an Amazon EKS cluster, you may encounter the following error during NVIDIA Device Plugin installation:

main.go:279] Retrieving plugins.
factory.go:31] No valid resources detected, creating a null CDI handler
factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
factory.go:112] Incompatible platform detected
factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
main.go:132] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed


Resolution

1. Verify the Currently Used AMI

First, verify that the AMI you are currently using is suitable for EKS GPU nodes. Check the AMI information with the following command:

aws ec2 describe-images \
  --image-ids {ami-id} \
  --region {region-code} \
  --query 'Images[0].[Name,Description]' \
  --output text

Example output:

Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.8 (Amazon Linux 2023) 20251103 
Supported EC2 instances: G4dn, G5, G6, Gr6, G6e, P4, P4de, P5, P5e, P5en, P6-B200.

This AMI is not designed for EKS nodes. Deep Learning AMIs target standalone EC2 instances running ML workloads; they do not include the kubelet, the containerd configuration, or the other components required for a node to join an EKS cluster.

2. Use Amazon EKS Optimized Accelerated AMI

For EKS GPU nodes, you must use the Amazon EKS optimized accelerated Amazon Linux AMI. This AMI comes pre-installed with: [1]

  • NVIDIA drivers
  • NVIDIA Container Toolkit
  • AWS Neuron drivers (for Inferentia/Trainium instances)
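If you want to confirm these components are present, you can connect to a running node (for example, with SSM Session Manager) and run a quick check. This is a sketch; it assumes nvidia-smi and nvidia-ctk are on the PATH, as they are in the EKS optimized accelerated AMI:

```shell
# Run these on the GPU node itself (e.g., via SSM Session Manager or SSH).

# Driver: should print the driver/CUDA versions and list the GPUs
nvidia-smi

# NVIDIA Container Toolkit: should print the CLI version
nvidia-ctk --version

# containerd should already have nvidia as its default runtime
grep default_runtime_name /etc/containerd/config.toml
```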

How to get the recommended AMI ID:

aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/{cluster-version}/{os-version}/{cpu-architecture}/nvidia/recommended/image_id \
  --region {region-code} \
  --query "Parameter.Value" \
  --output text

Parameter descriptions:

  • cluster-version: EKS cluster version (e.g., 1.33)
  • os-version: Operating system version (amazon-linux-2023 for the path shown above; accelerated Amazon Linux 2 AMIs use a different parameter path that ends in amazon-linux-2-gpu/recommended/image_id)
  • cpu-architecture: CPU architecture (x86_64 or arm64)
  • region-code: AWS region (e.g., us-east-1)

Example:

aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.33/amazon-linux-2023/x86_64/nvidia/recommended/image_id \
  --region us-east-1 \
  --query "Parameter.Value" \
  --output text

Example output: ami-0efa341496d305795

Verify the AMI:

aws ec2 describe-images \
  --image-ids ami-0efa341496d305795 \
  --region us-east-1 \
  --query 'Images[0].[Name,Description]' \
  --output text

Output:

amazon-eks-node-al2023-x86_64-nvidia-1.33-v20251120
EKS-optimized Kubernetes node based on Amazon Linux 2023, (k8s: 1.33.5, containerd: 2.1.*)

AMI release information can be found on GitHub: [2]

3. Deploy NVIDIA Device Plugin

After joining GPU nodes to the cluster with the correct AMI, deploy the NVIDIA Device Plugin as a DaemonSet, replacing vX.X.X below with the plugin release you want to install. [3]

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/vX.X.X/deployments/static/nvidia-device-plugin.yml
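Before checking node resources, it can help to confirm the plugin pods started cleanly. The commands below assume the static manifest above, which deploys into kube-system; the label selector may differ between plugin versions:

```shell
# List the device plugin pods created by the DaemonSet
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds

# Inspect the pod logs for NVML/driver detection messages
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds --tail=20
```

If the logs still show "could not load NVML library", continue to the Further Troubleshooting section below.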

Verify GPU allocatable resources:

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

4. Test GPU Functionality

Create a test Pod (nvidia-smi.yaml):

apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:tag
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1

Deploy and check logs:

kubectl apply -f nvidia-smi.yaml
kubectl logs nvidia-smi

Expected output:

+-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   46C    P0    47W / 300W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Further Troubleshooting

Issue: NVML errors occur even when using the Amazon EKS optimized accelerated AMI.

Cause: The containerd configuration may have been overwritten by UserData scripts, changing the default runtime from nvidia to runc.
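For reference, on a correctly configured node the CRI section of /etc/containerd/config.toml looks roughly like the sketch below. The section names shown follow the containerd 1.x configuration schema; containerd 2.x renames these sections, so treat this as illustrative rather than exact:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
```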

Resolution:

  1. Connect to the node and check the current runtime:

    grep default_runtime_name /etc/containerd/config.toml
    

    If the output is "runc", proceed with the following steps.

  2. Check UserData

    If /etc/eks/bootstrap.sh (AL2) or nodeadm init (AL2023) is invoked a second time through UserData, the containerd configuration may be regenerated, overwriting the NVIDIA runtime default.

    In this case, remove the redundant call, or add a line to the UserData that reconfigures the NVIDIA runtime: [4]

    /usr/bin/nvidia-ctk runtime configure --runtime=containerd --set-as-default
    
  3. Create new nodes using the updated UserData

  4. Connect to the new node and verify the configuration change:

    grep default_runtime_name /etc/containerd/config.toml
    

    If the output is "nvidia", the configuration is correct.
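As an illustration, if your UserData must keep its existing bootstrap steps, appending the following at the end restores nvidia as the default containerd runtime. This is a sketch, not the only valid approach; adapt it to your bootstrap method:

```shell
# Appended at the end of the node UserData, after any bootstrap.sh or
# nodeadm steps that may have regenerated /etc/containerd/config.toml.
/usr/bin/nvidia-ctk runtime configure --runtime=containerd --set-as-default

# Restart containerd so the updated default runtime takes effect
systemctl restart containerd
```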

Conclusion

When using GPU nodes in Amazon EKS, NVML library loading errors are mostly caused by incorrect AMI selection or containerd runtime configuration issues.

Key Points:

  • Use Amazon EKS optimized accelerated AMI, not Deep Learning AMI
  • Be careful not to overwrite containerd configuration in UserData
  • When issues occur, verify that default_runtime_name is set to "nvidia"
  • After changing runtime configuration, create new nodes with the correct UserData

By following this guide to select the correct AMI and maintain proper configuration, you can run GPU workloads stably in your EKS cluster.

References

[1] Amazon EKS optimized accelerated Amazon Linux AMI

[2] Amazon EKS AMI release notes

[3] Running machine learning inference on Amazon EKS

[4] NVIDIA Container Toolkit installation guide

AWS
SUPPORT ENGINEER