How to make sure EFA is set up correctly


I followed the guide https://www.hpcworkshops.com/07-efa/01-create-efa-cluster.html to create an HPC cluster and ran the MPI hello world application (git clone https://github.com/mpitutorial/mpitutorial). To make sure EFA is set up correctly, I then followed the steps in https://www.youtube.com/watch?v=Wq8EMMXsvyo&t=9s to verify EFA with Open MPI and Intel MPI.

The Intel MPI output looks fine; it indicates that EFA is running:

[ec2-user@ip]$ cat hello-world-job_1.out

Loading intelmpi version 2021.4.0

[0] MPI startup(): Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)

[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.

[0] MPI startup(): library kind: release

[0] MPI startup(): libfabric version: 1.13.0-impi

[0] MPI startup(): libfabric provider: efa

[0] MPI startup(): File "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa_100.dat" not found

[0] MPI startup(): Load tuning file: "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa.dat"

[0] MPI startup(): Rank Pid Node name Pin cpu

[0] MPI startup(): 0 10425 libhe-dy-c5n18xlarge-1 {0}

[0] MPI startup(): 1 10404 libhe-dy-c5n18xlarge-2 {0}

Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors

Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors
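
The check on the Intel MPI output can also be scripted. This is only a minimal sketch, assuming the job output contains a `libfabric provider:` line like the one above; the helper name `impi_provider` is hypothetical and not part of Intel MPI.

```shell
# Print the unique libfabric provider name(s) found in an Intel MPI
# job output file. impi_provider is a hypothetical helper name.
impi_provider() {
    # Matches lines such as:
    #   [0] MPI startup(): libfabric provider: efa
    sed -n 's/.*libfabric provider: *\([^ ]*\).*/\1/p' "$1" | sort -u
}

# Example: impi_provider hello-world-job_1.out
```

If the function prints `efa`, the startup banner reported the EFA provider; an empty result means no such line was found in the file.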

However, the output of the Open MPI job doesn't indicate that EFA is running:

[ec2-user@ip]$ cat hello-world-job_2.out

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: registering framework mtl components

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: found loaded component ofi

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: component ofi register function successful

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: opening mtl components

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: found loaded component ofi

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: component ofi open function successful

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: registering framework mtl components

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: found loaded component ofi

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: component ofi register function successful

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: opening mtl components

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: found loaded component ofi

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: component ofi open function successful

[libhe-dy-c5n18xlarge-1:11319] mca:base:select: Auto-selecting mtl components

[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Querying component [ofi]

[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Query of component [ofi] set priority to 25

[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Selected component [ofi]

[libhe-dy-c5n18xlarge-1:11319] select: initializing mtl component ofi

[libhe-dy-c5n18xlarge-1:11319] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm

[libhe-dy-c5n18xlarge-2:11203] mca:base:select: Auto-selecting mtl components

[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Querying component [ofi]

[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Query of component [ofi] set priority to 25

[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Selected component [ofi]

[libhe-dy-c5n18xlarge-2:11203] select: initializing mtl component ofi

[libhe-dy-c5n18xlarge-2:11203] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm

[libhe-dy-c5n18xlarge-1:11319] select: init returned success

[libhe-dy-c5n18xlarge-1:11319] select: component ofi selected

[libhe-dy-c5n18xlarge-2:11203] select: init returned success

[libhe-dy-c5n18xlarge-2:11203] select: component ofi selected

Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors

Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors

[libhe-dy-c5n18xlarge-1:11319] mca: base: close: component ofi closed

[libhe-dy-c5n18xlarge-1:11319] mca: base: close: unloading component ofi

[libhe-dy-c5n18xlarge-2:11203] mca: base: close: component ofi closed

[libhe-dy-c5n18xlarge-2:11203] mca: base: close: unloading component ofi
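
To pull the provider string out of a verbose Open MPI log like the one above, something along these lines may help. It is a sketch that assumes the `mtl:ofi:provider:` line format shown in this output; `ompi_ofi_provider` is a hypothetical helper name.

```shell
# Print the unique OFI provider strings logged by Open MPI's ofi MTL
# when OMPI_MCA_mtl_base_verbose is set. Hypothetical helper name.
ompi_ofi_provider() {
    # Matches lines such as:
    #   [host:pid] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm
    sed -n 's/.*mtl:ofi:provider: *\([^ ]*\).*/\1/p' "$1" | sort -u
}

# Example: ompi_ofi_provider hello-world-job_2.out
```

For the log above this would print a single line, `rdmap0s6-rdm`, since both ranks report the same string.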

Below are the Open MPI job details:

[ec2-user@ip]$ which mpirun

/opt/amazon/openmpi/bin/mpirun

[ec2-user@ip]$ cat openmpi_job

#!/bin/bash

#SBATCH --job-name=hello-world-job

#SBATCH --ntasks=2 --nodes=2

#SBATCH --output=%x_%j.out

mpirun ./mpi_hello_world

[ec2-user@ip]$ export OMPI_MCA_mtl_base_verbose=100

[ec2-user@ip]$ sbatch openmpi_job

Any suggestions as to why the Open MPI job output doesn't indicate that EFA is running?

Asked 2 years ago, 661 views
2 Answers

Hello, thank you for your post. Before you run the mpirun command, please make sure you have added the EFA library directory to the library path. Depending on which operating system you are using, you may use one of the following commands [1].

Amazon Linux, Amazon Linux 2, RHEL, and CentOS

$ export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH

Ubuntu 18.04/20.04

$ export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
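
The two cases above could be folded into one snippet that picks whichever directory exists. This is a minimal sketch, not an official AWS script; `efa_lib_dir` is a hypothetical helper, and its optional install-root argument exists only so the function can be tried against another directory.

```shell
# Print the EFA library directory that exists under the install root.
# efa_lib_dir is a hypothetical helper; the root defaults to the
# /opt/amazon/efa location used in the commands above.
efa_lib_dir() {
    root="${1:-/opt/amazon/efa}"
    for d in "$root/lib64" "$root/lib"; do
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
            return 0
        fi
    done
    return 1
}

# Prepend the directory to LD_LIBRARY_PATH only when one was found.
if dir=$(efa_lib_dir); then
    export LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
fi
```

Checking `lib64` first mirrors the Amazon Linux, RHEL, and CentOS layout; on Ubuntu only `lib` should be present, so the second candidate is used.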

References:
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl-base.html#nccl-start-base-tests

AWS
Support Engineer
Answered 2 years ago
  • Thanks for your reply, @AWS_SamM.

    I am using Amazon Linux 2. After setting LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH, the output is still the same and gives no indication of EFA:

    [ec2-user@ip code]$ export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH

    [ec2-user@ip]$ export OMPI_MCA_mtl_base_verbose=100

    [ec2-user@ip]$ sbatch submit_job_openmpi

    Submitted batch job 5

    [ec2-user@ip]$ cat hello-world-job_5.out

    [libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: registering framework mtl components

    [libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: found loaded component ofi

    [libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: component ofi register function successful

    [libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: opening mtl components

    [libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: found loaded component ofi

    [libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: component ofi open function successful


Thanks for your reply, @AWS_SamM.

After setting LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH, the output is still the same and gives no indication of EFA:

[ec2-user@ip code]$ export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH

[ec2-user@ip]$ export OMPI_MCA_mtl_base_verbose=100

[ec2-user@ip]$ sbatch submit_job_openmpi

Submitted batch job 5

[ec2-user@ip]$ cat hello-world-job_5.out

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: registering framework mtl components

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: found loaded component ofi

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: component ofi register function successful

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: opening mtl components

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: found loaded component ofi

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: component ofi open function successful

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_register: registering framework mtl components

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_register: found loaded component ofi

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_register: component ofi register function successful

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_open: opening mtl components

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_open: found loaded component ofi

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_open: component ofi open function successful

[libhe-dy-c5n18xlarge-2:11589] mca:base:select: Auto-selecting mtl components

[libhe-dy-c5n18xlarge-2:11589] mca:base:select:( mtl) Querying component [ofi]

[libhe-dy-c5n18xlarge-2:11589] mca:base:select:( mtl) Query of component [ofi] set priority to 25

[libhe-dy-c5n18xlarge-2:11589] mca:base:select:( mtl) Selected component [ofi]

[libhe-dy-c5n18xlarge-2:11589] select: initializing mtl component ofi

[libhe-dy-c5n18xlarge-2:11589] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm

[libhe-dy-c5n18xlarge-1:11601] mca:base:select: Auto-selecting mtl components

[libhe-dy-c5n18xlarge-1:11601] mca:base:select:( mtl) Querying component [ofi]

[libhe-dy-c5n18xlarge-1:11601] mca:base:select:( mtl) Query of component [ofi] set priority to 25

[libhe-dy-c5n18xlarge-1:11601] mca:base:select:( mtl) Selected component [ofi]

[libhe-dy-c5n18xlarge-1:11601] select: initializing mtl component ofi

[libhe-dy-c5n18xlarge-1:11601] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm

[libhe-dy-c5n18xlarge-2:11589] select: init returned success

[libhe-dy-c5n18xlarge-2:11589] select: component ofi selected

[libhe-dy-c5n18xlarge-1:11601] select: init returned success

[libhe-dy-c5n18xlarge-1:11601] select: component ofi selected

Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors

Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors

[libhe-dy-c5n18xlarge-2:11589] mca: base: close: component ofi closed

[libhe-dy-c5n18xlarge-2:11589] mca: base: close: unloading component ofi

[libhe-dy-c5n18xlarge-1:11601] mca: base: close: component ofi closed

[libhe-dy-c5n18xlarge-1:11601] mca: base: close: unloading component ofi
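
The thread does not say what the string `rdmap0s6-rdm` denotes. If it is a libfabric domain name (an assumption, not confirmed here), one way to test that would be to compare it with the `domain:` entries that libfabric's `fi_info -p efa` prints on a compute node. The sketch below assumes that output format; `domain_in_fi_info` is a hypothetical helper, not a libfabric tool.

```shell
# Succeed if NAME appears as a "domain:" value in fi_info output read
# from stdin. domain_in_fi_info is a hypothetical helper name.
domain_in_fi_info() {
    awk -v want="$1" '
        $1 == "domain:" && $2 == want { found = 1 }
        END { exit !found }
    '
}

# Usage on a compute node (requires libfabric's fi_info utility):
#   fi_info -p efa | domain_in_fi_info rdmap0s6-rdm && echo "listed under the efa provider"
```

A match would suggest that the Open MPI log is naming a domain served by the EFA provider rather than printing the provider name itself; no match would leave the question open.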

Answered 2 years ago
