Questions tagged with Elastic Fabric Adapter

How to make sure EFA is set up correctly

I followed the guide https://www.hpcworkshops.com/07-efa/01-create-efa-cluster.html to create an HPC cluster and ran the MPI hello world application (git clone https://github.com/mpitutorial/mpitutorial). To make sure EFA is set up correctly, I then followed the steps in https://www.youtube.com/watch?v=Wq8EMMXsvyo&t=9s to verify EFA with Open MPI and Intel MPI.

**The Intel MPI output looks fine; the "libfabric provider: efa" line indicates that EFA is running:**

    [ec2-user@ip]$ cat hello-world-job_1.out
    Loading intelmpi version 2021.4.0
    [0] MPI startup(): Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)
    [0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
    [0] MPI startup(): library kind: release
    [0] MPI startup(): libfabric version: 1.13.0-impi
    [0] MPI startup(): libfabric provider: efa
    [0] MPI startup(): File "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa_100.dat" not found
    [0] MPI startup(): Load tuning file: "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa.dat"
    [0] MPI startup(): Rank Pid Node name Pin cpu
    [0] MPI startup(): 0 10425 libhe-dy-c5n18xlarge-1 {0}
    [0] MPI startup(): 1 10404 libhe-dy-c5n18xlarge-2 {0}
    Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors
    Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors

**However, the output of the Open MPI job doesn't indicate that EFA is running:**

    [ec2-user@ip]$ cat hello-world-job_2.out
    [libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: registering framework mtl components
    [libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: found loaded component ofi
    [libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: component ofi register function successful
    [libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: opening mtl components
    [libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: found loaded component ofi
    [libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: component ofi open function successful
    [libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: registering framework mtl components
    [libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: found loaded component ofi
    [libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: component ofi register function successful
    [libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: opening mtl components
    [libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: found loaded component ofi
    [libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: component ofi open function successful
    [libhe-dy-c5n18xlarge-1:11319] mca:base:select: Auto-selecting mtl components
    [libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Querying component [ofi]
    [libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Query of component [ofi] set priority to 25
    [libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Selected component [ofi]
    [libhe-dy-c5n18xlarge-1:11319] select: initializing mtl component ofi
    [libhe-dy-c5n18xlarge-1:11319] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm
    [libhe-dy-c5n18xlarge-2:11203] mca:base:select: Auto-selecting mtl components
    [libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Querying component [ofi]
    [libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Query of component [ofi] set priority to 25
    [libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Selected component [ofi]
    [libhe-dy-c5n18xlarge-2:11203] select: initializing mtl component ofi
    [libhe-dy-c5n18xlarge-2:11203] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm
    [libhe-dy-c5n18xlarge-1:11319] select: init returned success
    [libhe-dy-c5n18xlarge-1:11319] select: component ofi selected
    [libhe-dy-c5n18xlarge-2:11203] select: init returned success
    [libhe-dy-c5n18xlarge-2:11203] select: component ofi selected
    Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors
    Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors
    [libhe-dy-c5n18xlarge-1:11319] mca: base: close: component ofi closed
    [libhe-dy-c5n18xlarge-1:11319] mca: base: close: unloading component ofi
    [libhe-dy-c5n18xlarge-2:11203] mca: base: close: component ofi closed
    [libhe-dy-c5n18xlarge-2:11203] mca: base: close: unloading component ofi

**Below are the Open MPI job details:**

    [ec2-user@ip]$ which mpirun
    /opt/amazon/openmpi/bin/mpirun
    [ec2-user@ip]$ cat openmpi_job
    #!/bin/bash
    #SBATCH --job-name=hello-world-job
    #SBATCH --ntasks=2 --nodes=2
    #SBATCH --output=%x_%j.out
    mpirun ./mpi_hello_world
    [ec2-user@ip]$ export OMPI_MCA_mtl_base_verbose=100
    [ec2-user@ip]$ sbatch openmpi_job

**Any suggestions as to why the Open MPI job output doesn't indicate that EFA is running?**
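
For readers comparing the two job outputs, here is a minimal Python sketch that pulls the reported provider string out of each log. It is not part of the original question. It assumes only the two line formats visible in the outputs above (`libfabric provider: ...` for Intel MPI and `mtl:ofi:provider: ...` for Open MPI); the function name `reported_providers` and the sample strings are hypothetical, trimmed copies of the posted logs.

```python
import re

# Line formats taken from the two job outputs in the question.
# Other MPI or libfabric versions may print the provider differently.
INTEL_MPI_PROVIDER = re.compile(r"libfabric provider:\s*([\w-]+)")
OPEN_MPI_PROVIDER = re.compile(r"mtl:ofi:provider:\s*([\w-]+)")


def reported_providers(log_text):
    """Return the set of provider strings found in a job's output text."""
    found = set()
    for pattern in (INTEL_MPI_PROVIDER, OPEN_MPI_PROVIDER):
        found.update(pattern.findall(log_text))
    return found


# Trimmed samples copied from the outputs posted in the question.
intel_sample = (
    "[0] MPI startup(): libfabric version: 1.13.0-impi\n"
    "[0] MPI startup(): libfabric provider: efa\n"
)
openmpi_sample = (
    "[libhe-dy-c5n18xlarge-1:11319] mtl_ofi_component.c:366: "
    "mtl:ofi:provider: rdmap0s6-rdm\n"
    "[libhe-dy-c5n18xlarge-2:11203] mtl_ofi_component.c:366: "
    "mtl:ofi:provider: rdmap0s6-rdm\n"
)

print("Intel MPI log reports:", reported_providers(intel_sample))
print("Open MPI log reports:", reported_providers(openmpi_sample))
```

The sketch only extracts what each log prints. The Open MPI line reports the string `rdmap0s6-rdm` rather than the literal word `efa`, and the script does not decide whether that string refers to an EFA device. One possible cross-check, assuming libfabric's `fi_info` utility is installed on the compute nodes, is to compare the string against the domain names listed by `fi_info -p efa`.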
2 answers, 0 votes, 34 views, asked 13 days ago