
Do any of the AWS ParallelCluster images support p6-b200?


I tried using a p6-b200 instance with my AWS ParallelCluster cluster (version 3.13.1). I'm using the rhel8 image. The compute node boots up, but there seems to be a problem with the NVIDIA drivers. My application gets an error from CUDA when it tries to initialize:

Failed to get device count: system not yet initialized

nvidia-smi doesn't show any errors, but I can see that the NVIDIA Fabric Manager service isn't running, and it reports an error if I try to start it.

Does anyone know if any of the AWS ParallelCluster AMIs will work with p6-b200? Maybe rhel9 or alinux2023?

asked 5 months ago, 163 views
1 Answer

Based on the information available, P6-B200 instances have specific software requirements that must be met for proper operation:

  • NVIDIA CUDA Toolkit version 12.8 or higher
  • NVIDIA Driver version R570 or higher
  • NVLink 5 version R570 or higher
  • Linux Kernel version 6.1 or higher
  • Elastic Fabric Adapter (EFA) version 1.41.0 or higher
  • AWS OFI NCCL Plugin version 1.15.0 or higher
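
As a rough aid, the minimums above can be compared against whatever versions a node reports. The following is a minimal sketch, not an official AWS or NVIDIA tool: the dictionary keys and function names are made up for illustration, and the driver minimum is written as "570" because the "R" prefix is a branch label rather than part of the numeric version. Note that enterprise kernels often carry backported features, so a plain version comparison is only a first indication.

```python
import re

# Minimums copied from the list above. The keys are illustrative names,
# not identifiers used by AWS ParallelCluster or NVIDIA tooling.
MINIMUMS = {
    "cuda_toolkit": "12.8",
    "nvidia_driver": "570",
    "linux_kernel": "6.1",
    "efa": "1.41.0",
    "aws_ofi_nccl": "1.15.0",
}


def parse_version(text):
    """Return the leading dotted-numeric part of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no leading version number in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found, minimum):
    """True if `found` is at least `minimum`, comparing numeric components."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad the shorter tuple with zeros so "570" compares equal to "570.0.0".
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_all(found_versions):
    """Map each component in `found_versions` to whether it meets its minimum."""
    return {
        name: meets_minimum(version, MINIMUMS[name])
        for name, version in found_versions.items()
    }
```

For example, `check_all({"linux_kernel": "4.18.0", "nvidia_driver": "570.133.20"})` flags the kernel as too old and the driver as acceptable; both version strings here are hypothetical inputs, not values taken from the question.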

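The error you're experiencing suggests that your current RHEL 8 image may not meet these requirements, particularly regarding the NVIDIA drivers and related components. On NVSwitch-based systems, CUDA typically reports "system not yet initialized" while the Fabric Manager service is not running, so the service failing to start is likely the direct cause of the error.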

AWS ParallelCluster supports multiple operating systems including Amazon Linux 2, Amazon Linux 2023, Ubuntu 22.04, Ubuntu 24.04, RHEL 8, RHEL 9, Rocky 8, and Rocky 9. However, the official AMIs for these operating systems may not all ship the driver versions required for P6-B200 instances.

Amazon Linux 2023 or Ubuntu 24.04 might be better options to try, as they ship newer kernels (6.1 and 6.8 respectively) and are more likely to have recent driver support. RHEL 8 is based on a 4.18 kernel and RHEL 9 on a 5.14 kernel, so neither meets the Linux Kernel version 6.1 requirement by version number.

To confirm GPU functionality after switching to a different OS, run the NVIDIA GPU Device Query Test and check for NVIDIA B200 GPUs in the output. Also verify the NVIDIA driver version using nvidia-smi to ensure it meets the minimum R570 requirement.
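
The nvidia-smi part of that check can be scripted. Below is a minimal sketch, assuming nvidia-smi is on the PATH and that the Fabric Manager systemd unit is named nvidia-fabricmanager (the usual name, but an assumption here). The function names are made up for illustration. It does not replace the Device Query Test, which exercises CUDA initialization itself.

```python
import subprocess


def parse_gpu_query(csv_text):
    """Parse 'name, driver_version' rows into a list of (name, driver_version) tuples."""
    rows = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so a comma inside the GPU name would not break parsing.
        name, driver = (field.strip() for field in line.rsplit(",", 1))
        rows.append((name, driver))
    return rows


def driver_major(driver_version):
    """Return the major driver version, e.g. 570 for '570.x.y'."""
    return int(driver_version.split(".")[0])


def find_problems(rows, min_major=570, expected="B200"):
    """Return human-readable problems; an empty list means the basic checks passed."""
    problems = []
    if not rows:
        problems.append("no GPUs reported")
    for name, driver in rows:
        if expected not in name:
            problems.append(f"unexpected GPU model: {name}")
        if driver_major(driver) < min_major:
            problems.append(f"driver {driver} is older than R{min_major}")
    return problems


def query_gpus():
    """Run nvidia-smi and return parsed (name, driver_version) rows."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_query(result.stdout)


def fabric_manager_active(unit="nvidia-fabricmanager"):
    """True if systemd reports the unit as active; the unit name is an assumption."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0
```

On a compute node, `find_problems(query_gpus())` together with `fabric_manager_active()` gives a quick pass/fail summary before running the application.
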
Sources
P6 Supported DLAMI - AWS Deep Learning AMIs
Operating systems - AWS ParallelCluster
Image section - AWS ParallelCluster
High-performance AI training, inference, and HPC – Amazon EC2 P6-B200 – AWS

answered 5 months ago
