My EFA-enabled EC2 instances are unable to handshake with each other, despite what appears to be a correct setup.

  • I have 2 EC2 instances of type p4de.24xlarge.
  • Each was created with a single EFA-enabled network interface.
  • I can see that the EFA interface exists and that libfabric detects it, e.g.:
$ fi_info -p efa
provider: efa
    fabric: efa
    domain: rdmap16s27-rdm
    version: 121.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: efa
    domain: rdmap16s27-dgrm
    version: 121.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA

In addition, I've followed the steps at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl.html and verified that the libraries mentioned there are installed. Most of them are built into the Ubuntu AMI I'm using anyway: Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.3.0 (Ubuntu 20.04) 20240611.

When I attempt to test the EFA interface via a nccl-tests run, e.g.

mpirun --mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 -x FI_EFA_USE_DEVICE_RDMA=1 -x NCCL_DEBUG=TRACE --hostfile hostfile all_gather_perf

the test eventually times out with a message like this logged on all nodes:

This error is detected locally. The connection status is unknown or was never established via handshake. This typically indicates one or more misconfigured EC2 instances; most often due to incorrect inbound/outbound security group rules and/or instances placed in different subnets. Refer to the public AWS documentation for EFA for up-to-date configuration requirements. This error can also be encountered when a peer process is no longer present.

I've verified that the EC2 nodes are on the same subnet (matching subnet IDs) and that their security group allows all traffic within the security group. I'm at a loss for how to debug this further. How can I do so?

asked a year ago · 429 views
2 Answers
Accepted Answer

Answering my own question here. The primary problem was that:

  • I had attached only 1 EFA interface to each instance, whereas these instance types support and require 4 EFA interfaces.
  • To attach 4 EFA interfaces, I also had to place the instances on a purely private subnet, as AWS does not support instances that have multiple network interfaces and a public IP address.
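
A minimal sketch for checking how many EFA devices libfabric exposes on a node. It assumes the `fi_info -p efa` output format quoted in the question, where each EFA device appears as one `domain: <name>-rdm` line. The helper name `count_efa_rdm_domains` is hypothetical, not part of libfabric or any AWS tooling.

```shell
# Hypothetical helper: count distinct EFA RDM domains in `fi_info -p efa`
# output supplied on stdin. Assumes each EFA device shows up as one
# "domain: <name>-rdm" line, as in the output quoted in the question.
count_efa_rdm_domains() {
    awk '$1 == "domain:" && $2 ~ /-rdm$/ { seen[$2] = 1 }
         END { n = 0; for (d in seen) n++; print n }'
}

# Example usage on each node (requires libfabric's fi_info):
#   fi_info -p efa | count_efa_rdm_domains
```

With the single-interface setup from the question this prints 1. Going by the answer above, a p4de.24xlarge with all of its EFA interfaces attached would be expected to report 4.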
answered a year ago
EXPERT
reviewed a year ago

@aviv Why won't 1 EFA interface work here? Would it work on a g4dn.8xlarge?

@olekssi What is the cheapest option for running NCCL with EFA? I guess g4dn.8xlarge?

answered 8 months ago
