Today I followed the EFA getting-started guide (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html#efa-start-enable) to install the EFA software on Amazon Linux 2. The installation failed at the last step, "Testing EFA device", with the error below:
$ sudo ./efa_installer.sh -y
= Starting Amazon Elastic Fabric Adapter Installation Script =
= EFA Installer Version: 1.17.2 =
......
......
== Testing EFA device ==
Starting server...
Starting client...
Error: fi_pingpong test timed out.
==============================================================================
An EFA device has been detected but a ping test has failed. Please consult the
EFA documentation to verify your configuration.
===========================================================================
===================================================
EFA installation complete.
- Please logout/login to complete the installation.
- Libfabric was installed in /opt/amazon/efa
- Open MPI was installed in /opt/amazon/openmpi
===================================================
Log from dmesg:
[ 47.835056] efa 0000:00:06.0 rdmap0s6: Unregister ib device
[ 47.966388] Elastic Fabric Adapter (EFA) v1.16.0g
[ 48.080918] efa 0000:00:06.0: Setup irq:27 name:efa-mgmnt@pci:0000:00:06.0
[ 48.085540] efa 0000:00:06.0 efa_0: IB device registered
[ 50.633468] efa 0000:00:06.0 rdmap0s6: Failed to process command DEREG_MR (opcode 8) comp_status 7 err -22
[ 50.637880] efa 0000:00:06.0 rdmap0s6: Failed to de-register mr(lkey-4) [-22]
[ 50.647230] efa 0000:00:06.0 rdmap0s6: Failed to process command DEREG_MR (opcode 8) comp_status 7 err -22
[ 50.651981] efa 0000:00:06.0 rdmap0s6: Failed to de-register mr(lkey-3) [-22]
However, the same test worked just a few days ago (last week). Has there been any update to EFA in the last few days?
I have the same issue: efa_test.sh fails with the error below:
Starting server...
Starting client...
Error: fi_pingpong test timed out.
dmesg:
[ 803.733432] efa 0000:00:06.0 rdmap0s6: Failed to process command DEREG_MR (opcode 8) comp_status 7 err -22
[ 803.737575] efa 0000:00:06.0 rdmap0s6: Failed to de-register mr(lkey-0) [-22]
[ 892.965223] efa 0000:00:06.0 rdmap0s6: Failed to process command DEREG_MR (opcode 8) comp_status 7 err -22
[ 892.969332] efa 0000:00:06.0 rdmap0s6: Failed to de-register mr(lkey-0) [-22]
My security group allows all inbound and outbound traffic to and from
So what is the problem?
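
When comparing runs across instances, it may help to count how often the DEREG_MR failure appears in the kernel log. The helper below is only a sketch: the function name `count_efa_dereg_failures` is made up for illustration, and it assumes the kernel message text matches the dmesg lines quoted above. It does not diagnose the cause of the timeout.

```shell
# Hypothetical helper (not part of the EFA installer): reads kernel log text
# on stdin and prints how many lines report a failed DEREG_MR command.
# Assumes the message wording matches the dmesg output quoted in this thread.
count_efa_dereg_failures() {
  # grep -c prints the number of matching lines; it exits non-zero when the
  # count is 0, so "|| true" keeps the function from failing in that case.
  grep -c 'Failed to process command DEREG_MR' || true
}
```

It could be used as `dmesg | count_efa_dereg_failures` before and after running the ping test. Separately, the getting-started guide suggests `fi_info -p efa -t FI_EP_RDM` to confirm that the EFA provider is visible to Libfabric.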