Hi @mfolusiak,
Thanks for the information. Based on your submit_args, the job submission command used -l ncpus=2 to specify the number of vcpus. If you replace that resource argument with -l nodes=1:ppn=2, it will resolve the overload issue and allocate jobs to different instances according to each instance's vcpu capacity.
nodes - specifies the number of separate nodes that should be allocated
ppn - how many processes to allocate for each node
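As a sketch, the change to the resource request would look like the following (other submit_args stay unchanged; the trailing `...` stands in for the rest of the command shown in the qstat output below):

```shell
# Before: a flat vcpu count, which the scheduler may pack onto one node:
#   qsub ... -l ncpus=2 -l walltime=48:00:00 ...
# After: one node with two processes per node, so placement follows
# per-node slot capacity:
args='-l nodes=1:ppn=2 -l walltime=48:00:00'
echo "qsub $args ..."
```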
~Yulei
Edited by: yulei-AWS on Feb 12, 2021 4:02 PM
Hi @mfolusiak,
Did you use Open MPI to execute your job in the job script? If so, this is expected behavior called oversubscription; you can check the details at https://www.open-mpi.org/faq/?category=running#oversubscribing. Also, did you specify the --hostfile option in your job script? If so, check whether the hostfile defines more than 8 slots. Only 8 slots are available on a single c5.4xlarge instance with hyperthreading disabled, so a hostfile declaring more than that would cause oversubscription, and performance can degrade severely because Open MPI runs its processes in aggressive mode when it believes there are enough slots.
If it's not the case mentioned above, please provide job script/hostfile/job submission command, thank you.
~Yulei
Edited by: AWS-yuleiwan on Feb 11, 2021 11:19 AM
Hi @AWS-yuleiwan,
I am not using MPI; I am using OpenMP, but with the same number of threads as I reserve for the job.
See the detailed qstat report below. The job submission arguments are in submit_args,
I believe. As you can see, a Python script is launched there that in turn launches an executable using the same number of threads as the number of cpus requested for the job.
[code]
Job Id: 518.ip-172-31-24-41.eu-central-1.compute.internal
Job_Name = 012310
Job_Owner = flacscloud@ip-172-31-24-41.eu-central-1.compute.internal
resources_used.cput = 21:59:50
resources_used.energy_used = 0
resources_used.mem = 422380kb
resources_used.vmem = 3728048kb
resources_used.walltime = 23:12:53
job_state = R
queue = batch
server = ip-172-31-24-41.eu-central-1.compute.internal
Checkpoint = u
ctime = Tue Feb 9 20:03:36 2021
exec_host = ip-172-31-68-184/4
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Tue Feb 9 20:03:36 2021
Output_Path = ip-172-31-24-41.eu-central-1.compute.internal:/shared/flacsc
loud/users/chris/D04F899F-43FC-419B-B8A7-15A8D3176A6F/Auriga/D26-01231
0/012310.o518
Priority = 0
qtime = Tue Feb 9 20:03:36 2021
Rerunable = True
Resource_List.ncpus = 2
Resource_List.walltime = 48:00:00
session_id = 112846
euser = flacscloud
egroup = flacscloud
queue_type = E
comment = Job started on Tue Feb 09 at 20:03
etime = Tue Feb 9 20:03:36 2021
submit_args = -N 012310 -d /shared/flacscloud/users/chris/D04F899F-43FC-41
9B-B8A7-15A8D3176A6F/Auriga/D26-012310/ -q batch -l ncpus=2 -l walltim
e=48:00:00 -F "/shared/flacscloud/run.py 517" /install/sw/flacs/20.2/F
LACS-CFD_20.2/bin/run_python
start_time = Tue Feb 9 20:03:36 2021
Walltime.Remaining = 89191
start_count = 1
fault_tolerant = False
job_radix = 0
submit_host = ip-172-31-24-41.eu-central-1.compute.internal
init_work_dir = /shared/flacscloud/users/chris/D04F899F-43FC-419B-B8A7-15A
8D3176A6F/Auriga/D26-012310
job_arguments = "/shared/flacscloud/run.py 517"
request_version = 1
[/code]
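Since the job is OpenMP-only, one way to keep the thread count tied to the reservation is to derive it inside the job script from the environment Torque exports. A minimal sketch, assuming PBS_NP (the allocated processor count set by Torque) is available; the fallback of 1 for runs outside the scheduler is an assumption of this sketch:

```shell
#!/bin/sh
# Match OpenMP threads to the Torque allocation.
# PBS_NP is exported by Torque with the number of allocated processors;
# fall back to 1 when running outside the scheduler (sketch assumption).
export OMP_NUM_THREADS="${PBS_NP:-1}"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# The actual job would then run, e.g.:
#   /install/sw/flacs/20.2/FLACS-CFD_20.2/bin/run_python /shared/flacscloud/run.py 517
```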