How to make sure EFA is set up correctly
I followed the guide https://www.hpcworkshops.com/07-efa/01-create-efa-cluster.html to create an HPC cluster and ran the MPI hello world application (git clone https://github.com/mpitutorial/mpitutorial). To make sure EFA is set up correctly, I then followed the steps in https://www.youtube.com/watch?v=Wq8EMMXsvyo&t=9s to verify EFA with both Open MPI and Intel MPI.
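For reference, libfabric can also be queried directly, independent of either MPI library. The sketch below assumes the `fi_info` utility that ships with libfabric is on the `PATH` (the AWS EFA installer is assumed to place it under `/opt/amazon/efa/bin`); `check_efa` is a hypothetical helper name, not part of any tool.

```shell
# check_efa: ask libfabric whether it can find an EFA provider endpoint.
# Returns 2 if fi_info is not installed or not on PATH; otherwise
# returns fi_info's own exit status.
check_efa() {
    if ! command -v fi_info >/dev/null 2>&1; then
        echo "fi_info not found"
        return 2
    fi
    # -p selects the provider by name, -t selects the endpoint type.
    fi_info -p efa -t FI_EP_RDM
}
```

Running `check_efa` on a compute node (for example through `srun`) should list one or more `provider: efa` entries if libfabric can see the EFA device; an error from `fi_info` suggests that it cannot.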
**The Intel MPI output looks fine; it reports `efa` as the libfabric provider:**
[ec2-user@ip]$ cat hello-world-job_1.out
Loading intelmpi version 2021.4.0
[0] MPI startup(): Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.0-impi
**[0] MPI startup(): libfabric provider: efa**
[0] MPI startup(): File "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa_100.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa.dat"
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 10425 libhe-dy-c5n18xlarge-1 {0}
[0] MPI startup(): 1 10404 libhe-dy-c5n18xlarge-2 {0}
Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors
Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors
**However, the output of the Open MPI job doesn't mention EFA; the provider is reported as `rdmap0s6-rdm`:**
[ec2-user@ip]$ cat hello-world-job_2.out
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: registering framework mtl components
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: found loaded component ofi
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: component ofi register function successful
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: opening mtl components
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: found loaded component ofi
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: component ofi open function successful
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: registering framework mtl components
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: found loaded component ofi
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: component ofi register function successful
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: opening mtl components
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: found loaded component ofi
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: component ofi open function successful
[libhe-dy-c5n18xlarge-1:11319] mca:base:select: Auto-selecting mtl components
[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Querying component [ofi]
[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Selected component [ofi]
[libhe-dy-c5n18xlarge-1:11319] select: initializing mtl component ofi
[libhe-dy-c5n18xlarge-1:11319] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm
[libhe-dy-c5n18xlarge-2:11203] mca:base:select: Auto-selecting mtl components
[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Querying component [ofi]
[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Selected component [ofi]
[libhe-dy-c5n18xlarge-2:11203] select: initializing mtl component ofi
[libhe-dy-c5n18xlarge-2:11203] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm
[libhe-dy-c5n18xlarge-1:11319] select: init returned success
[libhe-dy-c5n18xlarge-1:11319] select: component ofi selected
[libhe-dy-c5n18xlarge-2:11203] select: init returned success
[libhe-dy-c5n18xlarge-2:11203] select: component ofi selected
Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors
Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors
[libhe-dy-c5n18xlarge-1:11319] mca: base: close: component ofi closed
[libhe-dy-c5n18xlarge-1:11319] mca: base: close: unloading component ofi
[libhe-dy-c5n18xlarge-2:11203] mca: base: close: component ofi closed
[libhe-dy-c5n18xlarge-2:11203] mca: base: close: unloading component ofi
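To compare the two logs without reading every line, the provider lines can be pulled out with `grep`. This is a minimal sketch; `show_provider` is a hypothetical helper name, and the two patterns are simply the strings that appear in the logs above.

```shell
# show_provider: print only the lines that name the libfabric provider,
# prefixed with the file they came from (-H), for one or more job logs.
show_provider() {
    grep -HE 'libfabric provider:|mtl:ofi:provider:' "$@"
}
```

Called as `show_provider hello-world-job_1.out hello-world-job_2.out`, it prints the `libfabric provider: efa` line from the Intel MPI job and the two `mtl:ofi:provider: rdmap0s6-rdm` lines from the Open MPI job.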
**Below are the Open MPI job details:**
[ec2-user@ip]$ which mpirun
/opt/amazon/openmpi/bin/mpirun
[ec2-user@ip]$ cat openmpi_job
#!/bin/bash
#SBATCH --job-name=hello-world-job
#SBATCH --ntasks=2 --nodes=2
#SBATCH --output=%x_%j.out
mpirun ./mpi_hello_world
[ec2-user@ip]$ export OMPI_MCA_mtl_base_verbose=100
[ec2-user@ip]$ sbatch openmpi_job
**Any suggestions as to why the Open MPI job output doesn't indicate that EFA is in use?**