How to make sure EFA is set up correctly

I followed the guide at https://www.hpcworkshops.com/07-efa/01-create-efa-cluster.html to create an HPC cluster and ran the MPI hello world application (git clone https://github.com/mpitutorial/mpitutorial). To make sure EFA is set up correctly, I then followed the steps in https://www.youtube.com/watch?v=Wq8EMMXsvyo&t=9s to verify EFA with Open MPI and Intel MPI.

The Intel MPI output looks fine; it indicates that EFA is in use:

[ec2-user@ip]$ cat hello-world-job_1.out

Loading intelmpi version 2021.4.0

[0] MPI startup(): Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)

[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.

[0] MPI startup(): library kind: release

[0] MPI startup(): libfabric version: 1.13.0-impi

[0] MPI startup(): libfabric provider: efa

[0] MPI startup(): File "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa_100.dat" not found

[0] MPI startup(): Load tuning file: "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa.dat"

[0] MPI startup(): Rank    Pid      Node name               Pin cpu

[0] MPI startup(): 0       10425    libhe-dy-c5n18xlarge-1  {0}

[0] MPI startup(): 1       10404    libhe-dy-c5n18xlarge-2  {0}

Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors

Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors
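
The check on this output can be scripted instead of read by eye. Below is a minimal sketch, assuming the job output file contains the "MPI startup(): libfabric provider:" line shown above. The function name impi_provider is hypothetical and not part of Intel MPI.

```shell
#!/bin/bash
# impi_provider FILE
# Print the libfabric provider reported in an Intel MPI job output file.
# Hypothetical helper: it only parses the "libfabric provider:" line
# shown in the output above and prints nothing if that line is absent.
impi_provider() {
    sed -n 's/.*MPI startup(): libfabric provider: *//p' "$1" | head -n 1
}
```

For example, `[ "$(impi_provider hello-world-job_1.out)" = "efa" ] && echo "efa provider reported"` would print a message only when the provider line names efa.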

However, the output of the Open MPI job doesn't indicate that EFA is in use:

[ec2-user@ip]$ cat hello-world-job_2.out

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: registering framework mtl components

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: found loaded component ofi

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: component ofi register function successful

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: opening mtl components

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: found loaded component ofi

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: component ofi open function successful

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: registering framework mtl components

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: found loaded component ofi

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: component ofi register function successful

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: opening mtl components

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: found loaded component ofi

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: component ofi open function successful

[libhe-dy-c5n18xlarge-1:11319] mca:base:select: Auto-selecting mtl components

[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Querying component [ofi]

[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Query of component [ofi] set priority to 25

[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Selected component [ofi]

[libhe-dy-c5n18xlarge-1:11319] select: initializing mtl component ofi

[libhe-dy-c5n18xlarge-1:11319] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm

[libhe-dy-c5n18xlarge-2:11203] mca:base:select: Auto-selecting mtl components

[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Querying component [ofi]

[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Query of component [ofi] set priority to 25

[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Selected component [ofi]

[libhe-dy-c5n18xlarge-2:11203] select: initializing mtl component ofi

[libhe-dy-c5n18xlarge-2:11203] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm

[libhe-dy-c5n18xlarge-1:11319] select: init returned success

[libhe-dy-c5n18xlarge-1:11319] select: component ofi selected

[libhe-dy-c5n18xlarge-2:11203] select: init returned success

[libhe-dy-c5n18xlarge-2:11203] select: component ofi selected

Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors

Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors

[libhe-dy-c5n18xlarge-1:11319] mca: base: close: component ofi closed

[libhe-dy-c5n18xlarge-1:11319] mca: base: close: unloading component ofi

[libhe-dy-c5n18xlarge-2:11203] mca: base: close: component ofi closed

[libhe-dy-c5n18xlarge-2:11203] mca: base: close: unloading component ofi
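
In this verbose output, the only line that names what the ofi MTL picked is the one containing "mtl:ofi:provider:". The sketch below pulls that value out of a job output file so it can be compared across ranks or runs. The helper name ompi_ofi_names is hypothetical, and the sketch assumes the log format shown above. It prints whatever Open MPI logged on that line, which is not guaranteed to be a provider name such as efa.

```shell
#!/bin/bash
# ompi_ofi_names FILE
# Print the distinct values logged after "mtl:ofi:provider:" in an
# Open MPI job output file, one per line.
# Hypothetical helper; assumes the verbose log format shown above.
ompi_ofi_names() {
    sed -n 's/.*mtl:ofi:provider: *//p' "$1" | sort -u
}
```

Running `ompi_ofi_names hello-world-job_2.out` on the output above would print the single name that both ranks logged.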

Below are the Open MPI job details:

[ec2-user@ip]$ which mpirun

/opt/amazon/openmpi/bin/mpirun

[ec2-user@ip]$ cat openmpi_job

#!/bin/bash

#SBATCH --job-name=hello-world-job

#SBATCH --ntasks=2 --nodes=2

#SBATCH --output=%x_%j.out

mpirun ./mpi_hello_world

[ec2-user@ip]$ export OMPI_MCA_mtl_base_verbose=100

[ec2-user@ip]$ sbatch openmpi_job
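
In the steps above the verbosity variable is exported in the login shell before sbatch is called. A variation, sketched here as an assumption rather than as the workshop's method, sets the variable inside the batch script so the setting travels with the job file. The function name write_openmpi_job is hypothetical; the script it writes uses only the options and commands from the job script above.

```shell
#!/bin/bash
# write_openmpi_job PATH
# Write a Slurm batch script equivalent to the one above, with the MTL
# verbosity set inside the script instead of in the submitting shell.
# Hypothetical helper for illustration only.
write_openmpi_job() {
    cat > "$1" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello-world-job
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --output=%x_%j.out

# Same setting as "export OMPI_MCA_mtl_base_verbose=100" in the login shell.
export OMPI_MCA_mtl_base_verbose=100

mpirun ./mpi_hello_world
EOF
}
```

Usage would be `write_openmpi_job openmpi_job && sbatch openmpi_job`.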

Any suggestions as to why the Open MPI job output doesn't indicate that EFA is in use?

2 Answers
Hello, thank you for your post. Before you run the mpirun command, please make sure you have added the EFA library directory to the library search path. Depending on which operating system you are using, you may use one of the following commands [1].

Amazon Linux, Amazon Linux 2, RHEL, and CentOS

$ export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH

Ubuntu 18.04/20.04

$ export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
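
The two commands differ only in the directory name, so a script that has to run on either family of distributions could pick whichever directory exists. This is a minimal sketch under that assumption; efa_lib_dir is a hypothetical helper name, and the default root /opt/amazon/efa is taken from the commands above.

```shell
#!/bin/bash
# efa_lib_dir [ROOT]
# Print the first EFA library directory that exists under ROOT
# (default /opt/amazon/efa): lib64 first, then lib.
# Returns non-zero if neither directory exists.
efa_lib_dir() {
    local root="${1:-/opt/amazon/efa}"
    local dir
    for dir in "$root/lib64" "$root/lib"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}
```

It could then be used as `dir=$(efa_lib_dir) && export LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"`, which leaves LD_LIBRARY_PATH untouched when no EFA directory is found.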

References:
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl-base.html#nccl-start-base-tests

AWS
Support Engineer
Answered 2 years ago
  • Thanks for your reply, @AWS_SamM.

    I am using Amazon Linux 2. The output is still the same, with no indication of EFA, after setting LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH:

    [ec2-user@ip code]$ export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH

    [ec2-user@ip]$ export OMPI_MCA_mtl_base_verbose=100

    [ec2-user@ip]$ sbatch submit_job_openmpi

    Submitted batch job 5

    [ec2-user@ip]$ cat hello-world-job_5.out

    [libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: registering framework mtl components

    [libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: found loaded component ofi

    [libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: component ofi register function successful

    [libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: opening mtl components

    [libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: found loaded component ofi

    [libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: component ofi open function successful

Thanks for your reply, @AWS_SamM.

The output is still the same, with no indication of EFA, after setting LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH:

[ec2-user@ip code]$ export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH

[ec2-user@ip]$ export OMPI_MCA_mtl_base_verbose=100

[ec2-user@ip]$ sbatch submit_job_openmpi

Submitted batch job 5

[ec2-user@ip]$ cat hello-world-job_5.out

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: registering framework mtl components

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: found loaded component ofi

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: component ofi register function successful

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: opening mtl components

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: found loaded component ofi

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: component ofi open function successful

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_register: registering framework mtl components

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_register: found loaded component ofi

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_register: component ofi register function successful

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_open: opening mtl components

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_open: found loaded component ofi

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_open: component ofi open function successful

[libhe-dy-c5n18xlarge-2:11589] mca:base:select: Auto-selecting mtl components

[libhe-dy-c5n18xlarge-2:11589] mca:base:select:( mtl) Querying component [ofi]

[libhe-dy-c5n18xlarge-2:11589] mca:base:select:( mtl) Query of component [ofi] set priority to 25

[libhe-dy-c5n18xlarge-2:11589] mca:base:select:( mtl) Selected component [ofi]

[libhe-dy-c5n18xlarge-2:11589] select: initializing mtl component ofi

[libhe-dy-c5n18xlarge-2:11589] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm

[libhe-dy-c5n18xlarge-1:11601] mca:base:select: Auto-selecting mtl components

[libhe-dy-c5n18xlarge-1:11601] mca:base:select:( mtl) Querying component [ofi]

[libhe-dy-c5n18xlarge-1:11601] mca:base:select:( mtl) Query of component [ofi] set priority to 25

[libhe-dy-c5n18xlarge-1:11601] mca:base:select:( mtl) Selected component [ofi]

[libhe-dy-c5n18xlarge-1:11601] select: initializing mtl component ofi

[libhe-dy-c5n18xlarge-1:11601] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm

[libhe-dy-c5n18xlarge-2:11589] select: init returned success

[libhe-dy-c5n18xlarge-2:11589] select: component ofi selected

[libhe-dy-c5n18xlarge-1:11601] select: init returned success

[libhe-dy-c5n18xlarge-1:11601] select: component ofi selected

Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors

Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors

[libhe-dy-c5n18xlarge-2:11589] mca: base: close: component ofi closed

[libhe-dy-c5n18xlarge-2:11589] mca: base: close: unloading component ofi

[libhe-dy-c5n18xlarge-1:11601] mca: base: close: component ofi closed

[libhe-dy-c5n18xlarge-1:11601] mca: base: close: unloading component ofi
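
Since the log prints the same name, rdmap0s6-rdm, before and after the LD_LIBRARY_PATH change, one further cross-check is to ask libfabric which domains its efa provider lists on a compute node and see whether that name is among them. This is a sketch under assumptions: it presumes the fi_info utility from the EFA installation is on the PATH of the compute node, and domain_listed is a hypothetical helper that only scans "domain:" lines in fi_info-style text read from standard input.

```shell
#!/bin/bash
# domain_listed NAME
# Read fi_info-style text on stdin and succeed if it contains a
# "domain: NAME" line. Hypothetical helper for comparing the name in the
# Open MPI log with the domains libfabric lists for a provider.
domain_listed() {
    local name="$1"
    awk -v want="$name" '
        $1 == "domain:" && $2 == want { found = 1 }
        END { exit !found }
    '
}
```

On a compute node this might be run as `fi_info -p efa -t FI_EP_RDM | domain_listed rdmap0s6-rdm && echo "name is listed under the efa provider"`. If the command prints nothing, the name in the log was not among the domains reported for efa on that node.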

Answered 2 years ago
