How to Resolve NVML Library Loading Errors When Installing NVIDIA Device Plugin on Amazon EKS
This article explains how to configure Amazon EKS GPU nodes by selecting the correct AMI, and covers troubleshooting methods for related issues.
Overview
When using GPU nodes in an Amazon EKS cluster, you may encounter the following error during NVIDIA Device Plugin installation:
main.go:279] Retrieving plugins.
factory.go:31] No valid resources detected, creating a null CDI handler
factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
factory.go:112] Incompatible platform detected
factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
main.go:132] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
Resolution
1. Verify the Currently Used AMI
First, verify that the AMI you are currently using is suitable for EKS GPU nodes. Check the AMI information with the following command:
aws ec2 describe-images \
--image-ids {ami-id} \
--region {region-code} \
--query 'Images[0].[Name,Description]' \
--output text
Example output:
Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.8 (Amazon Linux 2023) 20251103
Supported EC2 instances: G4dn, G5, G6, Gr6, G6e, P4, P4de, P5, P5e, P5en, P6-B200.
This AMI is not designed for EKS nodes. Deep Learning AMIs are built for standalone EC2 instances running ML workloads and do not include kubelet, the containerd configuration, or the other components required to join an EKS cluster.
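If you are unsure whether an AMI was built for EKS, one quick check is whether the node's filesystem contains the usual EKS bootstrap components. A minimal sketch you could run on the instance (the paths below are the typical locations on EKS-optimized Amazon Linux 2 AMIs, which is an assumption; AL2023 replaces the bootstrap script with nodeadm):

```shell
# Report whether common EKS node components exist on this host.
# Paths are typical for EKS-optimized AL2 AMIs (an assumption);
# on AL2023, nodeadm replaces /etc/eks/bootstrap.sh.
check_eks_components() {
  for f in /usr/bin/kubelet /etc/eks/bootstrap.sh /etc/containerd/config.toml; do
    if [ -e "$f" ]; then
      echo "found: $f"
    else
      echo "missing: $f"
    fi
  done
}
check_eks_components
```

On a Deep Learning AMI, most of these will typically be reported as missing, which is a strong hint the image was not built for EKS node use.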
2. Use Amazon EKS Optimized Accelerated AMI
For EKS GPU nodes, you must use the Amazon EKS optimized accelerated Amazon Linux AMI. This AMI comes pre-installed with: [1]
- NVIDIA drivers
- NVIDIA Container Toolkit
- AWS Neuron drivers (for Inferentia/Trainium instances)
How to get the recommended AMI ID:
aws ssm get-parameter \
--name /aws/service/eks/optimized-ami/{cluster-version}/{os-version}/{cpu-architecture}/nvidia/recommended/image_id \
--region {region-code} \
--query "Parameter.Value" \
--output text
Parameter descriptions:
- cluster-version: EKS cluster version (e.g., 1.33)
- os-version: Operating system version (amazon-linux-2 or amazon-linux-2023)
- cpu-architecture: CPU architecture (x86_64 or arm64)
- region-code: AWS region (e.g., us-east-1)
Example:
aws ssm get-parameter \
--name /aws/service/eks/optimized-ami/1.33/amazon-linux-2023/x86_64/nvidia/recommended/image_id \
--region us-east-1 \
--query "Parameter.Value" \
--output text
Example output: ami-0efa341496d305795
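When scripting this lookup across clusters or regions, the parameter path can be composed from variables. A sketch, assuming the same illustrative version, OS, and architecture values as above:

```shell
# Compose the SSM parameter path for the EKS-optimized accelerated AMI.
CLUSTER_VERSION="1.33"          # your EKS cluster version
OS_VERSION="amazon-linux-2023"  # amazon-linux-2 or amazon-linux-2023
CPU_ARCH="x86_64"               # x86_64 or arm64
PARAM="/aws/service/eks/optimized-ami/${CLUSTER_VERSION}/${OS_VERSION}/${CPU_ARCH}/nvidia/recommended/image_id"
echo "$PARAM"

# Then resolve it (requires AWS credentials and a region):
# aws ssm get-parameter --name "$PARAM" --query "Parameter.Value" --output text
```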
Verify the AMI:
aws ec2 describe-images \
--image-ids ami-0efa341496d305795 \
--region us-east-1 \
--query 'Images[0].[Name,Description]' \
--output text
Output:
amazon-eks-node-al2023-x86_64-nvidia-1.33-v20251120
EKS-optimized Kubernetes node based on Amazon Linux 2023, (k8s: 1.33.5, containerd: 2.1.*)
AMI release information can be found on GitHub: [2]
3. Deploy NVIDIA Device Plugin
After joining GPU nodes to the cluster with the correct AMI, deploy the NVIDIA Device Plugin as a DaemonSet. [3]
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/vX.X.X/deployments/static/nvidia-device-plugin.yml
Verify GPU allocatable resources:
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
4. Test GPU Functionality
Create a test Pod (nvidia-smi.yaml):
apiVersion: v1
kind: Pod
metadata:
name: nvidia-smi
spec:
restartPolicy: OnFailure
containers:
- name: nvidia-smi
image: nvidia/cuda:tag
args:
- "nvidia-smi"
resources:
limits:
nvidia.com/gpu: 1
Deploy and check logs:
kubectl apply -f nvidia-smi.yaml
kubectl logs nvidia-smi
Expected output:
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 46C P0 47W / 300W | 0MiB / 16160MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Further Troubleshooting
Issue: NVML errors occur even when using the Amazon EKS optimized accelerated AMI.
Cause: The containerd configuration may have been redefined by userdata scripts, changing the default runtime from nvidia to runc.
Resolution:
- Connect to the node and check the current runtime:
cat /etc/containerd/config.toml | grep default_runtime_name
If the output is "runc", proceed with the following steps.
- Check the UserData.
If /etc/eks/bootstrap.sh (AL2) or nodeadm init (AL2023) is called redundantly through UserData, the containerd configuration may have been redefined. In this case, remove the redundant call or add a line to reconfigure the NVIDIA runtime in UserData: [4]
/usr/bin/nvidia-ctk runtime configure --runtime=containerd --set-as-default
- Create nodes using the new UserData.
- Connect to the new node and verify the configuration change:
cat /etc/containerd/config.toml | grep default_runtime_name
If the output is "nvidia", the configuration is correct.
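For reference, a minimal UserData sketch that restores the NVIDIA runtime as containerd's default after any custom bootstrap steps. This is a config fragment, not a definitive implementation: it assumes nvidia-ctk is preinstalled at /usr/bin/nvidia-ctk, which is true on the EKS-optimized accelerated AMI but should be verified for your image.

```shell
#!/bin/bash
# Sketch: place after any custom containerd configuration in UserData.
# Assumes nvidia-ctk is preinstalled (EKS-optimized accelerated AMI).
/usr/bin/nvidia-ctk runtime configure --runtime=containerd --set-as-default
systemctl restart containerd
```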
Conclusion
When using GPU nodes in Amazon EKS, NVML library loading errors are mostly caused by incorrect AMI selection or containerd runtime configuration issues.
Key Points:
- Use Amazon EKS optimized accelerated AMI, not Deep Learning AMI
- Be careful not to overwrite containerd configuration in UserData
- When issues occur, verify that default_runtime_name is set to "nvidia"
- After changing runtime configuration, create new nodes with the correct UserData
By following this guide to select the correct AMI and maintain proper configuration, you can run GPU workloads stably in your EKS cluster.
References
[1] Amazon EKS optimized accelerated Amazon Linux AMI
[2] Amazon EKS AMI release notes
[3] NVIDIA device plugin for Kubernetes (GitHub: NVIDIA/k8s-device-plugin)
[4] NVIDIA Container Toolkit documentation