
Questions tagged with High Performance Compute


How to make sure EFA is set up correctly

I followed the guide https://www.hpcworkshops.com/07-efa/01-create-efa-cluster.html to create an HPC cluster and ran the MPI hello world application (git clone https://github.com/mpitutorial/mpitutorial). To make sure EFA is set up correctly, I then followed the steps in https://www.youtube.com/watch?v=Wq8EMMXsvyo&t=9s to verify EFA with Open MPI and Intel MPI.

**The output of the Intel MPI job looks fine; it indicates EFA is running:**

```
[ec2-user@ip]$ cat hello-world-job_1.out
Loading intelmpi version 2021.4.0
[0] MPI startup(): Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.0-impi
[0] MPI startup(): libfabric provider: efa
[0] MPI startup(): File "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa_100.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa.dat"
[0] MPI startup(): Rank    Pid      Node name                Pin cpu
[0] MPI startup(): 0       10425    libhe-dy-c5n18xlarge-1   {0}
[0] MPI startup(): 1       10404    libhe-dy-c5n18xlarge-2   {0}
Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors
Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors
```

**However, the output of the Open MPI job doesn't indicate EFA is running:**

```
[ec2-user@ip]$ cat hello-world-job_2.out
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: registering framework mtl components
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: found loaded component ofi
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: component ofi register function successful
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: opening mtl components
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: found loaded component ofi
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: component ofi open function successful
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: registering framework mtl components
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: found loaded component ofi
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: component ofi register function successful
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: opening mtl components
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: found loaded component ofi
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: component ofi open function successful
[libhe-dy-c5n18xlarge-1:11319] mca:base:select: Auto-selecting mtl components
[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Querying component [ofi]
[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Selected component [ofi]
[libhe-dy-c5n18xlarge-1:11319] select: initializing mtl component ofi
[libhe-dy-c5n18xlarge-1:11319] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm
[libhe-dy-c5n18xlarge-2:11203] mca:base:select: Auto-selecting mtl components
[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Querying component [ofi]
[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Selected component [ofi]
[libhe-dy-c5n18xlarge-2:11203] select: initializing mtl component ofi
[libhe-dy-c5n18xlarge-2:11203] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm
[libhe-dy-c5n18xlarge-1:11319] select: init returned success
[libhe-dy-c5n18xlarge-1:11319] select: component ofi selected
[libhe-dy-c5n18xlarge-2:11203] select: init returned success
[libhe-dy-c5n18xlarge-2:11203] select: component ofi selected
Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors
Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors
[libhe-dy-c5n18xlarge-1:11319] mca: base: close: component ofi closed
[libhe-dy-c5n18xlarge-1:11319] mca: base: close: unloading component ofi
[libhe-dy-c5n18xlarge-2:11203] mca: base: close: component ofi closed
[libhe-dy-c5n18xlarge-2:11203] mca: base: close: unloading component ofi
```

**Below are the Open MPI job details:**

```
[ec2-user@ip]$ which mpirun
/opt/amazon/openmpi/bin/mpirun
[ec2-user@ip]$ cat openmpi_job
#!/bin/bash
#SBATCH --job-name=hello-world-job
#SBATCH --ntasks=2 --nodes=2
#SBATCH --output=%x_%j.out
mpirun ./mpi_hello_world
[ec2-user@ip]$ export OMPI_MCA_mtl_base_verbose=100
[ec2-user@ip]$ sbatch openmpi_job
```

**Any suggestions why the Open MPI job output doesn't indicate EFA is running?**
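A minimal cross-check, assuming the default EFA installer layout (`/opt/amazon/efa`) and a standard Slurm setup; the paths and node counts below are assumptions, not values taken from the post:

```bash
# Ask libfabric on a compute node whether the efa provider is available
# (srun runs the command on a compute node rather than the head node).
srun -N 1 /opt/amazon/efa/bin/fi_info -p efa -t FI_EP_RDM

# Variant of the batch script that sets the MTL verbosity on the mpirun
# line itself, so it applies to the ranks regardless of what the
# submission shell exported before sbatch.
cat > openmpi_job <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello-world-job
#SBATCH --ntasks=2 --nodes=2
#SBATCH --output=%x_%j.out
mpirun --mca mtl_base_verbose 100 ./mpi_hello_world
EOF
sbatch openmpi_job
```

If `fi_info` lists the `efa` provider, then the `rdmap0s6-rdm` name in the Open MPI output is, as far as I can tell, the libfabric domain/endpoint name that the EFA provider exposes, rather than a sign that EFA is missing.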
2 answers · 0 votes · 52 views · asked 4 months ago

ECS Capacity Provider Auto-Scaler Instance Selection

Hello, I am working with AWS ECS capacity providers to scale out instances for jobs we run. Those jobs vary widely in the amount of memory needed per ECS task; the memory requirements are set at the task and container level.

We have a capacity provider connected to an EC2 Auto Scaling group (ASG). The ASG uses attribute-based instance type selection, so we specify instance attributes rather than explicit types. We gave it a large range for memory and CPU, and it shows hundreds of possible instances.

When we run a small job (1 GB of memory), it scales up an `m5.large` and an `m6i.large` instance and the job runs. This is great because our task runs, but the instances it selected are much larger than our needs. We then let the ASG scale back down to 0. When we run a large job (16 GB), it begins scaling up, but it starts the same instance types as before. Those instance types have 8 GB of memory, while our task needs double that on a single instance.

For the small job I would have expected the capacity provider to scale up only one instance that was closer in size to the memory needs of the job (1 GB). For the larger job I would have expected it to scale up only one instance with more than 16 GB of memory to accommodate the job.

Questions:

* Is there a way to get capacity providers and Auto Scaling groups to be more responsive to the resource needs of the pending tasks?
* Are there any configs I might have wrong?
* Am I misunderstanding something? Are there any resources you would point me towards?
* Is there a better approach to accomplish what I want with ECS?
* Is the behavior I outlined actually to be expected?

Thank you
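One way to sanity-check the attribute-based selection, sketched with placeholder numbers (the vCPU and memory values below are assumptions, not the ASG's actual settings): preview which instance types a given requirement range matches, then tighten `MemoryMiB.Min` toward the largest task size.

```bash
# Preview the instance types matched by an attribute range
# (placeholder values; substitute the ASG's real requirements).
aws ec2 get-instance-types-from-instance-requirements \
  --architecture-types x86_64 \
  --virtualization-types hvm \
  --instance-requirements '{"VCpuCount":{"Min":2,"Max":16},"MemoryMiB":{"Min":16384,"Max":65536}}'
```

Narrowing the range only changes which types the ASG is allowed to launch; whether the capacity provider then picks a size that fits a specific pending task is a separate question, which is what the post is asking about.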
1 answer · 0 votes · 89 views · asked 4 months ago