EFA not working with CentOS 7 and Intel MPI

I have been trying to resolve this problem all week.

I followed all the steps to create a two-instance system with the EFA drivers installed:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html

I started with the official CentOS 7 HVM image in the marketplace and updated it before installing the EFA driver.

I have tried both Intel MPI 2019.5 and 2019.6, separately and on different images. (I installed them with the provided aws_impi.sh, which points to the Amazon EFA driver library.)

Any test I run hangs with EFA enabled. The jobs start on both instances, but no MPI communication occurs.
export FI_PROVIDER=efa

Things run fine using the sockets fabric though.
export FI_PROVIDER=sockets

I am using c5n.18xlarge instances.

I am using an updated CentOS7 image with Kernel revision:
3.10.0-1062.4.1.el7.x86_64
(I was also unable to get it working on the previous kernel version, 957.)

I'm running in a subnet in a VPC with an attached EFS volume.
EFA driver version: 1.4.1

I did not install or test with Open MPI.

The efa_test.sh fails, but it works if I use "-p sockets" instead of "-p efa".

Another test example:
#!/bin/bash
module load mpi
export FI_PROVIDER=efa
#export FI_PROVIDER=sockets
mpirun -np 2 -ppn 1 -f $PWD/hosts $PWD/mpi/pt2pt/osu_latency
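
Since the symptom is a hang rather than an error, it can help to run the same command under each provider with a time limit, so a stalled run is reported instead of blocking the terminal. The following is only a minimal sketch: the helper name run_with_provider and the 60-second limit are assumptions for illustration, and it relies on the coreutils timeout command, which exits with status 124 when the limit is reached.

```shell
#!/bin/bash
# Hypothetical helper (not part of Intel MPI or the EFA installer):
# run a command with FI_PROVIDER set and a time limit, then report the result.
# Usage: run_with_provider PROVIDER TIMEOUT_SECS CMD [ARGS...]
run_with_provider() {
  local provider=$1 limit=$2
  shift 2
  # timeout(1) exits with status 124 if the command exceeded the limit.
  FI_PROVIDER=$provider timeout "$limit" "$@"
  local rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "$provider: timed out after ${limit}s (possible hang)"
  elif [ "$rc" -eq 0 ]; then
    echo "$provider: ok"
  else
    echo "$provider: failed with exit code $rc"
  fi
  return "$rc"
}

# Only attempt the real benchmark when mpirun and the hosts file exist.
# The 60-second limit is an arbitrary assumption.
if command -v mpirun >/dev/null 2>&1 && [ -f "$PWD/hosts" ]; then
  for p in sockets efa; do
    run_with_provider "$p" 60 \
      mpirun -np 2 -ppn 1 -f "$PWD/hosts" "$PWD/mpi/pt2pt/osu_latency"
  done
fi
```

If sockets reports "ok" and efa reports a timeout, the MPI installation itself is probably fine and the problem is specific to the EFA path.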

Additional details:
fi_info:
Instance 1:
provider: efa
    fabric: EFA-fe80::a0:1eff:febf:9407
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::a0:1eff:febf:9407
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::a0:1eff:febf:9407
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD

Instance 2:
provider: efa
    fabric: EFA-fe80::70:3bff:fee1:52ad
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::70:3bff:fee1:52ad
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::70:3bff:fee1:52ad
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD
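
The listings above show that each instance exposes an efa provider with an FI_EP_RDM endpoint, which suggests the driver and libfabric are installed correctly on both nodes. A rough way to script that check is sketched below; the function name has_efa_rdm is hypothetical, and the parsing assumes fi_info prints "provider:" and "type:" lines as in the output above.

```shell
#!/bin/bash
# Hypothetical check: read fi_info output on stdin and succeed only if an
# entry whose provider is exactly "efa" has endpoint type FI_EP_RDM.
has_efa_rdm() {
  awk '
    $1 == "provider:" { prov = $2 }          # remember the current provider
    $1 == "type:" && $2 == "FI_EP_RDM" && prov == "efa" { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Run against the live system only if fi_info is installed.
if command -v fi_info >/dev/null 2>&1; then
  if fi_info -p efa | has_efa_rdm; then
    echo "efa RDM endpoint found"
  else
    echo "no efa RDM endpoint found"
  fi
fi
```

A passing check only says the local device is visible. It says nothing about whether the two instances can actually reach each other over EFA.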

ibstat
CA 'efa_0'
CA type:
Number of ports: 1
Firmware version:
Hardware version:
Node GUID: 0x0000000000000000
System image GUID: 0x0000000000000000
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 1
SM lid: 0
Capability mask: 0x00000000
Port GUID: 0x00a01efffebf9407
Link layer: Unknown

CA 'efa_0'
CA type:
Number of ports: 1
Firmware version:
Hardware version:
Node GUID: 0x0000000000000000
System image GUID: 0x0000000000000000
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 1
SM lid: 0
Capability mask: 0x00000000
Port GUID: 0x00703bfffee152ad
Link layer: Unknown

Some other errors I have seen but am not familiar with:

ibsysstat -G 0x00703bfffee152ad
ibwarn: [24340] mad_rpc_open_port: can't open UMAD port ((null):0)
ibsysstat: iberror: failed: Failed to open '(null)' port '0'

ibhosts
ibwarn: [11769] mad_rpc_open_port: can't open UMAD port ((null):0)
/codebuild/output/src613026668/src/rdmacore_build/BUILD/rdma-core-25.0/libibnetdisc/ibnetdisc.c:802; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed

Any ideas?

Thanks - Patrick

asked 4 years ago, 533 views
3 Answers

Hi Patrick - I suspect the problem lies with your Security Group rules. As mentioned in this document:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security

EFA requires a security group that allows inbound and outbound traffic to and from itself. In other words, you need explicit inbound and outbound rules that allow traffic to sg-xxxx. A rule that allows All Traffic to 0.0.0.0/0 does not suffice.
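
As a rough illustration of the self-referencing rules described above, the sketch below uses the AWS CLI. It is not taken from the linked document: the group ID sg-xxxx is a placeholder, the run and allow_self_traffic helpers and the APPLY switch are hypothetical names, and by default the script only prints the commands so they can be reviewed first.

```shell
#!/bin/bash
# Sketch: print (or, with APPLY=1, execute) AWS CLI calls that add
# self-referencing inbound and outbound rules to a security group.
# SG_ID is a placeholder assumption; replace it with your own group ID.
SG_ID="${SG_ID:-sg-xxxx}"

# Hypothetical wrapper: echo the command unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

allow_self_traffic() {
  local sg=$1
  # Same permission for both directions: all protocols, peer is the group itself.
  local perm="IpProtocol=-1,UserIdGroupPairs=[{GroupId=$sg}]"
  # Inbound: allow all traffic from members of the same group.
  run aws ec2 authorize-security-group-ingress \
    --group-id "$sg" --ip-permissions "$perm"
  # Outbound: allow all traffic to members of the same group.
  run aws ec2 authorize-security-group-egress \
    --group-id "$sg" --ip-permissions "$perm"
}

allow_self_traffic "$SG_ID"
```

Running it as `SG_ID=<your group> APPLY=1 ./script.sh` would apply the rules, assuming the CLI is configured with credentials that are allowed to modify the group.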

Let us know if this resolves your issue.

answered 4 years ago

Hi Raghu,

Thank you. Adding the outbound rule resolved the issue.

- Patrick
answered 4 years ago

Hi Raghu,
I am facing the same problem, but my security group already has inbound and outbound rules that allow traffic from itself. Would you mind giving some other suggestions I could try?

BTW, an additional question about the capabilities of the EFA device: is it OK to have 0x0 in the "Capability mask" field?

issue:

$ ibping -S
ibwarn: [20355] mad_rpc_open_port: can't open UMAD port ((null):0)
ibping: iberror: failed: Failed to open '(null)' port '0'

Edited by: zarzen on Mar 2, 2020 11:30 AM

zarzen
answered 4 years ago
