EFA not working with CentOS 7 and Intel MPI

I have been trying to resolve this problem all week.

I followed all the steps to create a two-instance system with the EFA driver installed:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html

I started with the official CentOS 7 HVM image in the marketplace and updated it before installing the efa driver.

I have tried both Intel MPI 2019.5 and 2019.6, separately and on different images. I installed them with the provided aws_impi.sh, which points to the Amazon EFA driver library.

Any test I run hangs with EFA enabled. The jobs start on both instances, but no MPI communication occurs.
export FI_PROVIDER=efa

Things run fine using the sockets fabric though.
export FI_PROVIDER=sockets

c5n.18xlarge instances.

I am using an updated CentOS7 image with Kernel revision:
3.10.0-1062.4.1.el7.x86_64
(I was also unable to get it working on the previous kernel version, 957.)

I'm running in a subnet in a VPC with an attached EFS volume.
efa driver version 1.4.1

I did not install or test with openmpi.

The efa_test.sh fails, but it works if I use "-p sockets" instead of "-p efa".

Another test example:
#!/bin/bash
module load mpi
export FI_PROVIDER=efa
#export FI_PROVIDER=sockets
mpirun -np 2 -ppn 1 -f $PWD/hosts $PWD/mpi/pt2pt/osu_latency
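
A hang like this can be turned into a clear pass/fail result by running the same command under a time limit, once per provider. The following is a minimal sketch, assuming GNU `timeout` is available (it ships with coreutils on CentOS 7); `run_with_provider` is a hypothetical helper name, not part of Intel MPI or libfabric.

```shell
#!/bin/bash
# Run a command with FI_PROVIDER set to the given provider and give up after
# <limit> seconds, so a hang is reported instead of blocking the shell.
# Usage: run_with_provider <provider> <limit-seconds> <command...>
run_with_provider() {
  local provider="$1" limit="$2"
  shift 2
  if FI_PROVIDER="$provider" timeout "$limit" "$@"; then
    echo "$provider: ok"
  else
    local rc=$?
    if [ "$rc" -eq 124 ]; then
      # GNU timeout exits with 124 when the time limit is reached
      echo "$provider: timed out after ${limit}s (hang)"
    else
      echo "$provider: failed (exit $rc)"
    fi
    return "$rc"
  fi
}

# Example (same benchmark as above, 60-second limit is an arbitrary choice):
#   run_with_provider efa 60 mpirun -np 2 -ppn 1 -f "$PWD/hosts" "$PWD/mpi/pt2pt/osu_latency"
#   run_with_provider sockets 60 mpirun -np 2 -ppn 1 -f "$PWD/hosts" "$PWD/mpi/pt2pt/osu_latency"
```

If the efa run times out while the sockets run completes, the problem is in the EFA path rather than in the MPI job itself.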

Additional details:
fi_info:
Instance 1:
provider: efa
fabric: EFA-fe80::a0:1eff:febf:9407
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::a0:1eff:febf:9407
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::a0:1eff:febf:9407
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD

Instance2:
provider: efa
fabric: EFA-fe80::70:3bff:fee1:52ad
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::70:3bff:fee1:52ad
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::70:3bff:fee1:52ad
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD

ibstat
CA 'efa_0'
CA type:
Number of ports: 1
Firmware version:
Hardware version:
Node GUID: 0x0000000000000000
System image GUID: 0x0000000000000000
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 1
SM lid: 0
Capability mask: 0x00000000
Port GUID: 0x00a01efffebf9407
Link layer: Unknown

CA 'efa_0'
CA type:
Number of ports: 1
Firmware version:
Hardware version:
Node GUID: 0x0000000000000000
System image GUID: 0x0000000000000000
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 1
SM lid: 0
Capability mask: 0x00000000
Port GUID: 0x00703bfffee152ad
Link layer: Unknown

Some other errors I have seen but I'm not familiar with:

ibsysstat -G 0x00703bfffee152ad
ibwarn: [24340] mad_rpc_open_port: can't open UMAD port ((null):0)
ibsysstat: iberror: failed: Failed to open '(null)' port '0'

ibhosts
ibwarn: [11769] mad_rpc_open_port: can't open UMAD port ((null):0)
/codebuild/output/src613026668/src/rdmacore_build/BUILD/rdma-core-25.0/libibnetdisc/ibnetdisc.c:802; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed

Any ideas?

Thanks - Patrick

asked 5 years ago, 657 views
3 Answers

Hi Patrick - I suspect the problem lies with your Security Group rules. As mentioned in this document:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-security

EFA requires a security group that allows inbound and outbound traffic to and from itself. In other words, you need explicit inbound and outbound rules whose source and destination are the security group itself (sg-xxxx). A rule that allows All Traffic to 0.0.0.0/0 does not suffice.
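
The self-referencing rules could be added with the AWS CLI roughly as follows. This is a sketch under assumptions, not an official procedure: `sg-xxxx` is the placeholder from the text above, and the helper names `self_rule` and `efa_sg_commands` are made up for illustration. The script only prints the two commands so they can be reviewed before being run.

```shell
#!/bin/bash
# Placeholder: replace with the ID of the security group attached to both instances.
SG_ID="${SG_ID:-sg-xxxx}"

# Build an --ip-permissions value (AWS CLI shorthand syntax):
# all protocols (-1), with the peer set to the security group itself.
self_rule() {
  printf 'IpProtocol=-1,UserIdGroupPairs=[{GroupId=%s}]' "$1"
}

# Print one inbound and one outbound command for the given group (dry run).
efa_sg_commands() {
  local sg="$1"
  echo "aws ec2 authorize-security-group-ingress --group-id $sg --ip-permissions '$(self_rule "$sg")'"
  echo "aws ec2 authorize-security-group-egress --group-id $sg --ip-permissions '$(self_rule "$sg")'"
}

efa_sg_commands "$SG_ID"
```

Once the printed commands look right, they can be pasted into a shell that has AWS credentials configured. The rules should be checked in the console afterwards, since the CLI reports an error if an identical rule already exists.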

Let us know if this resolves your issue.

answered 5 years ago

Hi Raghu,

Thank you. Adding the outbound rule resolved the issue.

- Patrick
answered 5 years ago

Hi Raghu,
I am facing the same problem, but my security group already has inbound and outbound rules that allow traffic from itself. Could you suggest anything else I could try?

BTW, I have an additional question about the capabilities of the EFA device: is it OK to have 0x00000000 in the "Capability mask" field?

issue:

$ ibping -S
ibwarn: [20355] mad_rpc_open_port: can't open UMAD port ((null):0)
ibping: iberror: failed: Failed to open '(null)' port '0'

Edited by: zarzen on Mar 2, 2020 11:30 AM

zarzen
answered 5 years ago
