How to make sure EFA is set up correctly


I followed the guide at https://www.hpcworkshops.com/07-efa/01-create-efa-cluster.html to create an HPC cluster and ran the MPI hello world application (git clone https://github.com/mpitutorial/mpitutorial). To make sure EFA is set up correctly, I then followed the steps in https://www.youtube.com/watch?v=Wq8EMMXsvyo&t=9s to verify EFA with OpenMPI and IntelMPI.

The IntelMPI output looks fine; it indicates that EFA is running:

[ec2-user@ip]$ cat hello-world-job_1.out

Loading intelmpi version 2021.4.0

[0] MPI startup(): Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)

[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.

[0] MPI startup(): library kind: release

[0] MPI startup(): libfabric version: 1.13.0-impi

[0] MPI startup(): libfabric provider: efa

[0] MPI startup(): File "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa_100.dat" not found

[0] MPI startup(): Load tuning file: "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa.dat"

[0] MPI startup(): Rank Pid Node name Pin cpu

[0] MPI startup(): 0 10425 libhe-dy-c5n18xlarge-1 {0}

[0] MPI startup(): 1 10404 libhe-dy-c5n18xlarge-2 {0}

Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors

Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors

However, the output of the OpenMPI job doesn't indicate that EFA is running:

[ec2-user@ip]$ cat hello-world-job_2.out

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: registering framework mtl components

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: found loaded component ofi

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: component ofi register function successful

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: opening mtl components

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: found loaded component ofi

[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: component ofi open function successful

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: registering framework mtl components

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: found loaded component ofi

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: component ofi register function successful

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: opening mtl components

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: found loaded component ofi

[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: component ofi open function successful

[libhe-dy-c5n18xlarge-1:11319] mca:base:select: Auto-selecting mtl components

[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Querying component [ofi]

[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Query of component [ofi] set priority to 25

[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Selected component [ofi]

[libhe-dy-c5n18xlarge-1:11319] select: initializing mtl component ofi

[libhe-dy-c5n18xlarge-1:11319] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm

[libhe-dy-c5n18xlarge-2:11203] mca:base:select: Auto-selecting mtl components

[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Querying component [ofi]

[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Query of component [ofi] set priority to 25

[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Selected component [ofi]

[libhe-dy-c5n18xlarge-2:11203] select: initializing mtl component ofi

[libhe-dy-c5n18xlarge-2:11203] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm

[libhe-dy-c5n18xlarge-1:11319] select: init returned success

[libhe-dy-c5n18xlarge-1:11319] select: component ofi selected

[libhe-dy-c5n18xlarge-2:11203] select: init returned success

[libhe-dy-c5n18xlarge-2:11203] select: component ofi selected

Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors

Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors

[libhe-dy-c5n18xlarge-1:11319] mca: base: close: component ofi closed

[libhe-dy-c5n18xlarge-1:11319] mca: base: close: unloading component ofi

[libhe-dy-c5n18xlarge-2:11203] mca: base: close: component ofi closed

[libhe-dy-c5n18xlarge-2:11203] mca: base: close: unloading component ofi

Below are the OpenMPI job details:

[ec2-user@ip]$ which mpirun

/opt/amazon/openmpi/bin/mpirun

[ec2-user@ip]$ cat openmpi_job

#!/bin/bash

#SBATCH --job-name=hello-world-job

#SBATCH --ntasks=2 --nodes=2

#SBATCH --output=%x_%j.out

mpirun ./mpi_hello_world

[ec2-user@ip]$ export OMPI_MCA_mtl_base_verbose=100

[ec2-user@ip]$ sbatch openmpi_job

Any suggestions as to why the OpenMPI job output doesn't indicate that EFA is running?
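
To compare the two job outputs side by side, the lines that name the libfabric provider can be pulled out of each log. This is a minimal sketch that assumes the log formats shown above; the function name `show_fabric_lines` is hypothetical and not part of any MPI or EFA tooling.

```shell
#!/bin/bash
# Print the lines of an MPI job log that name the libfabric provider.
# Matches the Intel MPI form ("libfabric provider: efa") and the
# Open MPI MTL form ("mtl:ofi:provider: <name>") seen in this thread.
show_fabric_lines() {
    local logfile="$1"
    grep -E 'libfabric provider:|mtl:ofi:provider:' "$logfile"
}

# Usage:
#   show_fabric_lines hello-world-job_1.out
#   show_fabric_lines hello-world-job_2.out
```

For the logs above, this would surface `libfabric provider: efa` from the IntelMPI run and `mtl:ofi:provider: rdmap0s6-rdm` from the OpenMPI run, which are the two lines worth comparing.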

Asked 2 years ago, 661 views
2 answers

Hello, thank you for your post. Before you run the mpirun command, please make sure you have added the EFA library directory to LD_LIBRARY_PATH. Depending on which operating system you are using, you can use one of the following commands [1].

Amazon Linux, Amazon Linux 2, RHEL, and CentOS

$ export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH

Ubuntu 18.04/20.04

$ export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
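
The two cases above can be folded into one snippet that picks whichever library directory exists on the node. This is only a sketch: the helper name `efa_lib_dir` and its optional prefix argument are hypothetical, and it assumes the EFA software is installed under /opt/amazon/efa as in the commands above.

```shell
#!/bin/bash
# Print the EFA library directory found under the given install prefix
# (default: /opt/amazon/efa). lib64 is checked first (Amazon Linux, RHEL,
# CentOS), then lib (Ubuntu). Returns non-zero if neither exists.
efa_lib_dir() {
    local prefix="${1:-/opt/amazon/efa}"
    local dir
    for dir in "$prefix/lib64" "$prefix/lib"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Prepend the directory to LD_LIBRARY_PATH only when one was found.
if dir=$(efa_lib_dir); then
    export LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
fi
```

Placing these lines in the batch script, before mpirun, keeps the setting with the job instead of relying on the environment of the submitting shell.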

References:
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl-base.html#nccl-start-base-tests

AWS
Support Engineer
Answered 2 years ago
  • Thanks for your reply, @AWS_SamM.

    I am using Amazon Linux 2. The output is still the same, with no indication of EFA, after setting LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH

    [ec2-user@ip code]$ export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH

    [ec2-user@ip]$ export OMPI_MCA_mtl_base_verbose=100

    [ec2-user@ip]$ sbatch submit_job_openmpi

    Submitted batch job 5



Thanks for your reply, @AWS_SamM.

The output is still the same, with no indication of EFA, after setting LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH

[ec2-user@ip code]$ export LD_LIBRARY_PATH=/opt/amazon/efa/lib64:$LD_LIBRARY_PATH

[ec2-user@ip]$ export OMPI_MCA_mtl_base_verbose=100

[ec2-user@ip]$ sbatch submit_job_openmpi

Submitted batch job 5

[ec2-user@ip]$ cat hello-world-job_5.out

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: registering framework mtl components

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: found loaded component ofi

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_register: component ofi register function successful

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: opening mtl components

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: found loaded component ofi

[libhe-dy-c5n18xlarge-2:11589] mca: base: components_open: component ofi open function successful

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_register: registering framework mtl components

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_register: found loaded component ofi

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_register: component ofi register function successful

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_open: opening mtl components

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_open: found loaded component ofi

[libhe-dy-c5n18xlarge-1:11601] mca: base: components_open: component ofi open function successful

[libhe-dy-c5n18xlarge-2:11589] mca:base:select: Auto-selecting mtl components

[libhe-dy-c5n18xlarge-2:11589] mca:base:select:( mtl) Querying component [ofi]

[libhe-dy-c5n18xlarge-2:11589] mca:base:select:( mtl) Query of component [ofi] set priority to 25

[libhe-dy-c5n18xlarge-2:11589] mca:base:select:( mtl) Selected component [ofi]

[libhe-dy-c5n18xlarge-2:11589] select: initializing mtl component ofi

[libhe-dy-c5n18xlarge-2:11589] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm

[libhe-dy-c5n18xlarge-1:11601] mca:base:select: Auto-selecting mtl components

[libhe-dy-c5n18xlarge-1:11601] mca:base:select:( mtl) Querying component [ofi]

[libhe-dy-c5n18xlarge-1:11601] mca:base:select:( mtl) Query of component [ofi] set priority to 25

[libhe-dy-c5n18xlarge-1:11601] mca:base:select:( mtl) Selected component [ofi]

[libhe-dy-c5n18xlarge-1:11601] select: initializing mtl component ofi

[libhe-dy-c5n18xlarge-1:11601] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm

[libhe-dy-c5n18xlarge-2:11589] select: init returned success

[libhe-dy-c5n18xlarge-2:11589] select: component ofi selected

[libhe-dy-c5n18xlarge-1:11601] select: init returned success

[libhe-dy-c5n18xlarge-1:11601] select: component ofi selected

Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors

Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors

[libhe-dy-c5n18xlarge-2:11589] mca: base: close: component ofi closed

[libhe-dy-c5n18xlarge-2:11589] mca: base: close: unloading component ofi

[libhe-dy-c5n18xlarge-1:11601] mca: base: close: component ofi closed

[libhe-dy-c5n18xlarge-1:11601] mca: base: close: unloading component ofi

Answered 2 years ago
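
One way to interpret the `mtl:ofi:provider: rdmap0s6-rdm` line is to compare it with what libfabric reports for the EFA provider on a compute node. The sketch below assumes the `fi_info` utility from the EFA installation is available and that its output contains `domain:` lines; the function name `efa_domains` is hypothetical. If the name from the OpenMPI log appears among the EFA domains, that would suggest OpenMPI is running over EFA even though the word "efa" never appears in the log. The thread itself does not confirm this reading.

```shell
#!/bin/bash
# Read `fi_info -p efa` style output on stdin and print each domain name once.
efa_domains() {
    awk '$1 == "domain:" { print $2 }' | sort -u
}

# Usage on a compute node (the fi_info location under /opt/amazon/efa/bin
# is an assumption about the installation):
#   fi_info -p efa | efa_domains
#   fi_info -p efa | efa_domains | grep -Fxq 'rdmap0s6-rdm' && echo "log names an EFA domain"
```

Because the check only matches names, it does not measure whether traffic actually flows over EFA; it narrows down what the OpenMPI log line refers to.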
