Why does EFA on p4d.24xlarge support only RDMA read and not RDMA write?

Why do some Nitro instances support only RDMA read but not RDMA write?

asked 3 months ago, 78 views
1 Answer

The p4d.24xlarge instance type supports RDMA read but not RDMA write due to specific hardware and driver implementation decisions made by AWS for their Nitro v3 architecture.

According to the EFA documentation listed under Sources, p4d.24xlarge and p4de.24xlarge instances both support RDMA read but not RDMA write. This is a hardware-level limitation of the Nitro v3 architecture that these instances use. As a general rule, RDMA read is supported on all instances with Nitro version 4 and later, and RDMA write is supported on most of them.
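To make the support matrix concrete, here is a minimal Python sketch that records the two instance types discussed above and picks a one-sided operation accordingly. The names `RdmaSupport`, `EFA_RDMA_SUPPORT`, and `pick_transfer_mode`, as well as the returned labels, are hypothetical and only illustrate the logic; they are not part of any AWS or EFA API, and the table deliberately lists nothing beyond what this answer states.

```python
from typing import NamedTuple


class RdmaSupport(NamedTuple):
    read: bool
    write: bool


# Hypothetical lookup table. It restates only what this answer says;
# consult the EFA documentation for any other instance type.
EFA_RDMA_SUPPORT = {
    "p4d.24xlarge": RdmaSupport(read=True, write=False),
    "p4de.24xlarge": RdmaSupport(read=True, write=False),
}


def pick_transfer_mode(instance_type: str) -> str:
    """Return an illustrative label for the one-sided operation to prefer."""
    support = EFA_RDMA_SUPPORT.get(instance_type)
    if support is None:
        return "unknown"      # not covered by this answer
    if support.write:
        return "rdma_write"
    if support.read:
        return "rdma_read"    # the p4d/p4de case
    return "send_recv"        # neither one-sided operation is available


print(pick_transfer_mode("p4d.24xlarge"))  # rdma_read
```

An instance type missing from the table returns `"unknown"` rather than a guess, which keeps the sketch from implying anything about hardware this answer does not cover.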

The p4d.24xlarge instances are designed with GPUDirect RDMA capabilities to enable efficient communication between GPUs and the Elastic Fabric Adapter (EFA) for distributed training workloads. These instances feature 8 NVIDIA A100 GPUs with NVSwitch for high-bandwidth GPU-to-GPU communication, and they use EFA for efficient node-to-node communication.

Despite lacking RDMA write, these instances still perform well for AI/ML and HPC workloads thanks to their powerful GPUs, high memory capacity, and EFA networking with RDMA read support. The 400 Gbps of network bandwidth, combined with EFA and GPUDirect RDMA, still enables efficient distributed training even with this limitation.
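For a rough sense of what 400 Gbps means in practice, the following Python sketch converts the line rate into an ideal transfer time. The function name `ideal_transfer_seconds` and the 10 GB payload are hypothetical, and the result is a lower bound that ignores protocol overhead, congestion, and GPU-side costs.

```python
def ideal_transfer_seconds(payload_bytes: int, link_gbps: float = 400.0) -> float:
    """Lower bound on transfer time at full line rate, ignoring all overhead."""
    bytes_per_second = link_gbps * 1e9 / 8  # gigabits per second -> bytes per second
    return payload_bytes / bytes_per_second


# Hypothetical example: moving 10 GB (10 * 10**9 bytes) between two nodes.
print(ideal_transfer_seconds(10 * 10**9))  # 0.2
```

At 400 Gbps the link carries at most 50 GB per second, so the 10 GB payload needs at least 0.2 seconds; real transfers will take longer.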
Sources
Elastic Fabric Adapter for AI/ML and HPC workloads on Amazon EC2 - Amazon Elastic Compute Cloud
Important NVIDIA driver changes to DLAMIs - AWS Deep Learning AMIs
Cloud Compute Instances – Amazon EC2 Instance Types – AWS

answered 3 months ago
AWS
EXPERT
reviewed a month ago
