Questions tagged with High Performance Compute

Get current instance features from within said instance

I've been working on some code that would benefit from some awareness of the platform on which it's running. When it runs on bare metal, several options are available (lshw, hwloc and so on). In EC2 instances, this task is not so straightforward, as they run on virtualization (excluding bare metal instances, evidently). Running 'lshw', for instance, lists the hardware, which does not necessarily correspond to the available resources. As an example, running lshw on a t2.micro instance, which has 1 default core available, gives the actual model of the CPU on which it is running, an Intel Xeon with 12 cores.

I understand that I am able to fetch [instance metadata](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-instance-metadata.html), find which instance type the code is running on and use the AWS CLI and/or EC2 API to get [the description of the instance](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/ec2-api.pdf). The issue with that workaround is that it presupposes that the current instance has either the AWS CLI configured with proper credentials or that the user credentials are available as environment variables to the system, which may or may not be true.

I've been looking for a more general solution that could work, at least, on the most popular Linux distros, such as querying the system about the resources that are actually available (CPU cores, threads, memory, cache and accelerators), but have so far failed to find a suitable one. Is this possible? Or, in these circumstances, is such a query not a possibility?
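As a starting point for the kind of query described above, here is a minimal sketch that reads what the guest kernel itself reports, with no AWS credentials involved. It assumes a Linux guest that exposes `/proc/meminfo` and `/sys/devices/system/cpu`, and it assumes that on a virtualized instance these reflect the vCPUs and memory assigned to the guest rather than the host; both are worth verifying on the instance types you care about. The function names are hypothetical, and accelerators are not covered.

```python
import os
import re
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value as reported}."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"([^:]+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def visible_resources():
    """Collect CPU, memory and cache figures as seen from inside the guest."""
    # CPUs this process is allowed to run on; fall back to the plain count
    # on platforms without sched_getaffinity.
    try:
        usable_cpus = len(os.sched_getaffinity(0))
    except AttributeError:
        usable_cpus = os.cpu_count()

    meminfo_path = Path("/proc/meminfo")
    meminfo = parse_meminfo(meminfo_path.read_text()) if meminfo_path.exists() else {}

    # Cache hierarchy of cpu0 as exposed by sysfs (may be absent on some guests).
    caches = {}
    for index in sorted(Path("/sys/devices/system/cpu/cpu0/cache").glob("index*")):
        try:
            level = (index / "level").read_text().strip()
            kind = (index / "type").read_text().strip()
            size = (index / "size").read_text().strip()
        except OSError:
            continue
        caches[f"L{level} {kind}"] = size

    return {
        "usable_cpus": usable_cpus,
        "mem_total_kib": meminfo.get("MemTotal"),
        "caches": caches,
    }


if __name__ == "__main__":
    print(visible_resources())
```

Reading from `/proc` and `/sys` keeps the sketch independent of any particular distro package; tools such as hwloc could be layered on top where they are installed.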
1 answer, 0 votes, 31 views, asked 5 months ago

HOWTO make sure EFA is set up correctly

I followed the guide https://www.hpcworkshops.com/07-efa/01-create-efa-cluster.html to create an HPC cluster and ran the MPI hello world application (git clone https://github.com/mpitutorial/mpitutorial). To make sure EFA is set up correctly, I then followed the steps in https://www.youtube.com/watch?v=Wq8EMMXsvyo&t=9s to verify EFA with OpenMPI and IntelMPI.

**The output of IntelMPI looks fine, it indicates EFA is running:**

[ec2-user@ip]$ cat hello-world-job_1.out
Loading intelmpi version 2021.4.0
[0] MPI startup(): Intel(R) MPI Library, Version 2021.4 Build 20210831 (id: 758087adf)
[0] MPI startup(): Copyright (C) 2003-2021 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.13.0-impi
**[0] MPI startup(): libfabric provider: efa**
[0] MPI startup(): File "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa_100.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/mpi/2021.4.0/etc/tuning_skx_shm-ofi_efa.dat"
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 10425 libhe-dy-c5n18xlarge-1 {0}
[0] MPI startup(): 1 10404 libhe-dy-c5n18xlarge-2 {0}
Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors
Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors

**However, the output of the OpenMPI job doesn't indicate EFA is running:**

[ec2-user@ip]$ cat hello-world-job_2.out
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: registering framework mtl components
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: found loaded component ofi
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_register: component ofi register function successful
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: opening mtl components
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: found loaded component ofi
[libhe-dy-c5n18xlarge-1:11319] mca: base: components_open: component ofi open function successful
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: registering framework mtl components
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: found loaded component ofi
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_register: component ofi register function successful
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: opening mtl components
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: found loaded component ofi
[libhe-dy-c5n18xlarge-2:11203] mca: base: components_open: component ofi open function successful
[libhe-dy-c5n18xlarge-1:11319] mca:base:select: Auto-selecting mtl components
[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Querying component [ofi]
[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[libhe-dy-c5n18xlarge-1:11319] mca:base:select:( mtl) Selected component [ofi]
[libhe-dy-c5n18xlarge-1:11319] select: initializing mtl component ofi
[libhe-dy-c5n18xlarge-1:11319] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm
[libhe-dy-c5n18xlarge-2:11203] mca:base:select: Auto-selecting mtl components
[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Querying component [ofi]
[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Query of component [ofi] set priority to 25
[libhe-dy-c5n18xlarge-2:11203] mca:base:select:( mtl) Selected component [ofi]
[libhe-dy-c5n18xlarge-2:11203] select: initializing mtl component ofi
[libhe-dy-c5n18xlarge-2:11203] mtl_ofi_component.c:366: mtl:ofi:provider: rdmap0s6-rdm
[libhe-dy-c5n18xlarge-1:11319] select: init returned success
[libhe-dy-c5n18xlarge-1:11319] select: component ofi selected
[libhe-dy-c5n18xlarge-2:11203] select: init returned success
[libhe-dy-c5n18xlarge-2:11203] select: component ofi selected
Hello world from processor libhe-dy-c5n18xlarge-1, rank 0 out of 2 processors
Hello world from processor libhe-dy-c5n18xlarge-2, rank 1 out of 2 processors
[libhe-dy-c5n18xlarge-1:11319] mca: base: close: component ofi closed
[libhe-dy-c5n18xlarge-1:11319] mca: base: close: unloading component ofi
[libhe-dy-c5n18xlarge-2:11203] mca: base: close: component ofi closed
[libhe-dy-c5n18xlarge-2:11203] mca: base: close: unloading component ofi

**Below are the OpenMPI job details:**

[ec2-user@ip]$ which mpirun
/opt/amazon/openmpi/bin/mpirun
[ec2-user@ip]$ cat openmpi_job
#!/bin/bash
#SBATCH --job-name=hello-world-job
#SBATCH --ntasks=2 --nodes=2
#SBATCH --output=%x_%j.out
mpirun ./mpi_hello_world
[ec2-user@ip]$ export OMPI_MCA_mtl_base_verbose=100
[ec2-user@ip]$ sbatch openmpi_job

**Any suggestions why the OpenMPI job output doesn't indicate that EFA is running?**
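To compare the two job outputs side by side, a small sketch like the following pulls out whatever each MPI library reports as its libfabric provider. It only matches the two line formats that appear in the logs above (`libfabric provider:` from IntelMPI and `mtl:ofi:provider:` from OpenMPI); the function name is hypothetical. Note that the OpenMPI value, `rdmap0s6-rdm`, looks like a libfabric domain name rather than a provider name, so one possible check (an assumption, not something confirmed in this thread) is whether the same name shows up as a domain in the output of `fi_info -p efa` on the compute node.

```python
import re

PROVIDER_PATTERNS = (
    # IntelMPI startup banner, e.g. "[0] MPI startup(): libfabric provider: efa"
    re.compile(r"libfabric provider:\s*(\S+)"),
    # OpenMPI mtl verbose output, e.g. "mtl:ofi:provider: rdmap0s6-rdm"
    re.compile(r"mtl:ofi:provider:\s*(\S+)"),
)


def find_providers(log_text):
    """Return the set of provider or domain names reported in an MPI job log."""
    found = set()
    for pattern in PROVIDER_PATTERNS:
        found.update(pattern.findall(log_text))
    return found


if __name__ == "__main__":
    # File names taken from the job outputs shown in the question.
    for name in ("hello-world-job_1.out", "hello-world-job_2.out"):
        try:
            with open(name) as handle:
                print(name, sorted(find_providers(handle.read())))
        except FileNotFoundError:
            print(name, "not found")
```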
2 answers, 0 votes, 79 views, asked 6 months ago

ECS Capacity Provider Auto-Scaler Instance Selection

Hello, I am working with AWS ECS capacity providers to scale out instances for jobs we run. Those jobs vary widely in the amount of memory needed per ECS task. The memory requirements are set at the task and container level.

We have a capacity provider that is connected to an EC2 Auto Scaling group (ASG). The ASG is configured so that we specify instance attributes rather than specific instance types. We gave it a large range for memory and CPU, and it shows hundreds of possible instance types.

When we run a small job (1GB of memory) it scales up an `m5.large` and an `m6i.large` instance and the job runs. This is great because our task runs, but the instances it selected are much larger than we need. We then let the ASG scale back down to 0. We then run a large job (16GB) and it begins scaling up, but it starts the same instance types as before. Those instance types have 8GB of memory, while our task needs double that on a single instance.

For the small job, I would have expected the capacity provider to scale up only 1 instance whose memory was closer to the needs of the job (1GB). For the large job, I would have expected it to scale up only 1 instance with more than 16GB of memory to accommodate the job (16GB).

Questions:

* Is there a way to get capacity providers and Auto Scaling groups to be more responsive to the resource needs of the pending tasks?
* Are there any configs I might have wrong?
* Am I understanding something incorrectly? Are there any resources you would point me towards?
* Is there a better approach to accomplish what I want with ECS?
* Is the behavior I outlined actually to be expected?

Thank you
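To make the expectation above concrete, here is a minimal sketch of the selection rule the question has in mind: pick the smallest candidate whose memory covers the pending task. This is not how ECS capacity providers or Auto Scaling groups are documented to behave; it only illustrates the expected outcome. The function name, the `headroom_gib` parameter (standing in for memory not available to tasks) and the candidate table are hypothetical; only the 8 GiB figure for `m5.large` comes from the question, and the other sizes should be checked against the EC2 documentation.

```python
# Memory per instance type in GiB. Only m5.large's 8 GiB is taken from the
# question; the remaining entries are illustrative assumptions.
CANDIDATES = {
    "t3.small": 2,
    "m5.large": 8,
    "m5.xlarge": 16,
    "m5.2xlarge": 32,
}


def smallest_fit(task_memory_gib, candidates, headroom_gib=0.5):
    """Return the candidate with the least memory that still fits the task.

    headroom_gib is a hypothetical allowance for memory the instance cannot
    hand to tasks. Returns None when nothing is large enough.
    """
    needed = task_memory_gib + headroom_gib
    fitting = [(memory, name) for name, memory in candidates.items() if memory >= needed]
    return min(fitting)[1] if fitting else None


if __name__ == "__main__":
    for task_gib in (1, 16):
        print(task_gib, "GiB task ->", smallest_fit(task_gib, CANDIDATES))
```

Under these assumptions the 16GB task skips the 16 GiB candidate because of the headroom, which matches the question's wording of an instance with "more than 16GB".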
1 answer, 0 votes, 123 views, asked 6 months ago