- I have 2 EC2 instances of size p4de.24xlarge.
- Each was created with a single EFA-enabled network interface.
- I can see that the EFA interface exists and that libfabric detects it, e.g.:
$ fi_info -p efa
provider: efa
    fabric: efa
    domain: rdmap16s27-rdm
    version: 121.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap16s27-dgrm
    version: 121.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
In addition, I've followed the steps at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl.html and verified that the libraries mentioned there are installed. Most of them are built into the Ubuntu AMI I'm using anyway: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.3.0 (Ubuntu 20.04) 20240611.
When I attempt to test the EFA interface with an nccl-tests run, e.g.:
mpirun --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 -x FI_EFA_USE_DEVICE_RDMA=1 -x NCCL_DEBUG=TRACE --hostfile hostfile all_gather_perf
the test eventually times out with a message like this logged on all nodes:
This error is detected locally. The connection status is unknown or was never established via handshake. This typically indicates one or more misconfigured EC2 instances; most often due to incorrect inbound/outbound security group rules and/or instances placed in different subnets. Refer to the public AWS documentation for EFA for up-to-date configuration requirements. This error can also be encountered when a peer process is no longer present.
I've verified that both EC2 nodes are in the same subnet (matching subnet IDs) and that their security group allows all traffic within the security group. I'm at a loss as to how to debug this further. What should I check next?
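For concreteness, the security group check mentioned above can be sketched roughly as follows. As I understand the EFA documentation, the group has to allow all traffic to and from itself in both the inbound and the outbound direction. The function name `self_referencing_all_traffic` and the example group ID are made up for illustration; the input is assumed to have the shape of one entry of the `SecurityGroups` list returned by the EC2 DescribeSecurityGroups API.

```python
def self_referencing_all_traffic(group: dict) -> dict:
    """Report whether a security group allows all traffic to/from itself.

    `group` is assumed to look like one entry of the "SecurityGroups"
    list in an EC2 DescribeSecurityGroups response.
    """
    group_id = group["GroupId"]

    def allows_self(permissions: list) -> bool:
        for perm in permissions:
            # "-1" is how the API encodes "all protocols, all ports".
            if perm.get("IpProtocol") != "-1":
                continue
            # A self-referencing rule names the group's own ID as the peer.
            for pair in perm.get("UserIdGroupPairs", []):
                if pair.get("GroupId") == group_id:
                    return True
        return False

    return {
        "inbound": allows_self(group.get("IpPermissions", [])),
        "outbound": allows_self(group.get("IpPermissionsEgress", [])),
    }


# Hypothetical data: inbound references the group itself, while outbound
# only has an "all traffic to 0.0.0.0/0" rule and no self-reference.
example = {
    "GroupId": "sg-0123456789abcdef0",
    "IpPermissions": [
        {
            "IpProtocol": "-1",
            "UserIdGroupPairs": [{"GroupId": "sg-0123456789abcdef0"}],
        }
    ],
    "IpPermissionsEgress": [
        {"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
    ],
}

print(self_referencing_all_traffic(example))
```

With this example data the outbound side is reported as missing a self-referencing rule, which is the case I would want to rule out. The same function could be run on the real group description (fetched with the AWS CLI or an SDK) for the security group attached to each instance.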