Based on the information available, P6-B200 instances have the following minimum software requirements:
- NVIDIA CUDA Toolkit version 12.8 or higher
- NVIDIA Driver version R570 or higher
- NVLink 5 version R570 or higher
- Linux Kernel version 6.1 or higher
- Elastic Fabric Adapter (EFA) version 1.41.0 or higher
- AWS OFI NCCL Plugin version 1.15.0 or higher
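As a rough way to compare an image against these minimums, the sketch below parses version strings and checks them against the values from the list. The dictionary keys and function names are hypothetical, chosen only for illustration, and the example kernel strings in the comments are sample values rather than output from your cluster.

```python
import platform
import re

# Minimum versions copied from the list above.
# The keys are illustrative names, not official identifiers.
MINIMUMS = {
    "cuda": (12, 8),
    "driver": (570,),
    "kernel": (6, 1),
    "efa": (1, 41, 0),
    "aws_ofi_nccl": (1, 15, 0),
}


def parse_version(text):
    """Return the first dotted numeric run in a version string as a tuple of ints."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(component, found):
    """True if the found version string is at least the minimum for the component."""
    return parse_version(found) >= MINIMUMS[component]


def kernel_ok():
    """Check the running kernel (for example '6.1.112-...') against the 6.1 minimum."""
    return meets_minimum("kernel", platform.release())
```

On a node you could call `kernel_ok()` directly, or pass in version strings gathered from other tools, for example `meets_minimum("driver", "570.133.20")`.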
The error you're experiencing suggests that your current RHEL 8 image may not meet these requirements, particularly regarding the NVIDIA drivers and related components. Note also that RHEL 8 ships a 4.18-based kernel, which is below the 6.1 minimum.
AWS ParallelCluster supports multiple operating systems, including Amazon Linux 2, Amazon Linux 2023, Ubuntu 22.04, Ubuntu 24.04, RHEL 8, RHEL 9, Rocky 8, and Rocky 9. However, not every one of these images necessarily ships the driver versions required for P6-B200 instances.
Amazon Linux 2023 or Ubuntu 24.04 may be better options to try: both ship newer kernels and are more likely to meet the minimum requirements for P6-B200 instances, particularly the Linux kernel 6.1 requirement.
To confirm GPU functionality after switching to a different OS, run the NVIDIA GPU Device Query Test and check that NVIDIA B200 GPUs appear in the output. Also check the driver version reported by nvidia-smi to confirm that it meets the R570 minimum.
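The nvidia-smi check can be scripted along these lines. This is a minimal sketch: it assumes nvidia-smi is on the PATH, that the GPU name it reports contains "B200", and that the driver version starts with the major release number. The function names are hypothetical.

```python
import subprocess

MIN_DRIVER_MAJOR = 570  # R570 minimum from the requirements listed earlier


def query_gpus():
    """Run nvidia-smi and return CSV text, one 'name, driver_version' line per GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def check_gpus(csv_text):
    """Parse the CSV text into a list of (name, driver_version, ok) tuples."""
    report = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so the driver version is always the final field.
        name, driver = (field.strip() for field in line.rsplit(",", 1))
        major = int(driver.split(".")[0])
        ok = "B200" in name and major >= MIN_DRIVER_MAJOR
        report.append((name, driver, ok))
    return report


# On the instance itself:
#   for name, driver, ok in check_gpus(query_gpus()):
#       print(name, driver, "OK" if ok else "check failed")
```

Keeping the parsing separate from the subprocess call means you can also paste saved nvidia-smi output into `check_gpus` from a machine that has no GPU.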
Sources
P6 Supported DLAMI - AWS Deep Learning AMIs
Operating systems - AWS ParallelCluster
Image section - AWS ParallelCluster
High-performance AI training, inference, and HPC – Amazon EC2 P6-B200 – AWS
