I have been trying to resolve this problem all week.
I followed all the steps in the guide below to create a two-instance cluster with the EFA driver installed.
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html
I started with the official CentOS 7 HVM image from the Marketplace and updated it before installing the EFA driver.
I have tried both Intel MPI 2019.5 and 2019.6, separately and on different images (I installed them with the provided aws_impi.sh script, which points to the Amazon EFA library).
Any test I run hangs with EFA enabled: the jobs start on both instances, but no MPI communication occurs.
export FI_PROVIDER=efa
Everything runs fine using the sockets provider, though:
export FI_PROVIDER=sockets
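For what it's worth, this is how I have been double-checking which libfabric provider Intel MPI actually selects at startup (same hosts file as in the test script further down):

module load mpi
export FI_PROVIDER=efa
export I_MPI_DEBUG=5      # should print the selected OFI provider during MPI startup
export FI_LOG_LEVEL=warn  # enable libfabric warnings in case the EFA provider errors out
mpirun -np 2 -ppn 1 -f $PWD/hosts hostname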
The instances are c5n.18xlarge.
I am using an updated CentOS 7 image with kernel 3.10.0-1062.4.1.el7.x86_64.
(I was also unable to get it working on the previous kernel, the 957 revision.)
I'm running in a subnet in a VPC with an attached EFS volume.
EFA driver version: 1.4.1
I did not install or test with Open MPI.
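For reference, this is how I checked the kernel and EFA module versions listed above:

uname -r                       # running kernel
modinfo efa | grep -i version  # EFA kernel module version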
The efa_test.sh script fails, but it works if I use "-p sockets" instead of "-p efa".
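As I understand it, efa_test.sh is essentially a fi_pingpong run, so the manual equivalent I have also tried looks like this (the address is a placeholder for instance 1's private IP):

# on instance 1 (server side)
fi_pingpong -p efa
# on instance 2 (client side)
fi_pingpong -p efa 10.0.0.10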
Another test example:
#!/bin/bash
module load mpi
export FI_PROVIDER=efa
#export FI_PROVIDER=sockets
mpirun -np 2 -ppn 1 -f $PWD/hosts $PWD/mpi/pt2pt/osu_latency
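The hosts file just lists the two instances' private hostnames, one per line (placeholders shown):

ip-10-0-0-10.ec2.internal
ip-10-0-0-11.ec2.internal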
Additional details:
fi_info:

Instance 1:
provider: efa
    fabric: EFA-fe80::a0:1eff:febf:9407
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::a0:1eff:febf:9407
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::a0:1eff:febf:9407
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD

Instance 2:
provider: efa
    fabric: EFA-fe80::70:3bff:fee1:52ad
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::70:3bff:fee1:52ad
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::70:3bff:fee1:52ad
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD
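For completeness, the output above was collected with the fi_info binary shipped by the EFA installer (I believe this is its default install location):

/opt/amazon/efa/bin/fi_info -p efa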
ibstat:

Instance 1:
CA 'efa_0'
    CA type:
    Number of ports: 1
    Firmware version:
    Hardware version:
    Node GUID: 0x0000000000000000
    System image GUID: 0x0000000000000000
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 1
        SM lid: 0
        Capability mask: 0x00000000
        Port GUID: 0x00a01efffebf9407
        Link layer: Unknown

Instance 2:
CA 'efa_0'
    CA type:
    Number of ports: 1
    Firmware version:
    Hardware version:
    Node GUID: 0x0000000000000000
    System image GUID: 0x0000000000000000
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 0
        LMC: 1
        SM lid: 0
        Capability mask: 0x00000000
        Port GUID: 0x00703bfffee152ad
        Link layer: Unknown
Some other errors I have seen, which I am not familiar with:
ibsysstat -G 0x00703bfffee152ad
ibwarn: [24340] mad_rpc_open_port: can't open UMAD port ((null):0)
ibsysstat: iberror: failed: Failed to open '(null)' port '0'
ibhosts
ibwarn: [11769] mad_rpc_open_port: can't open UMAD port ((null):0)
/codebuild/output/src613026668/src/rdmacore_build/BUILD/rdma-core-25.0/libibnetdisc/ibnetdisc.c:802; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
Any ideas?
Thanks - Patrick