How can I configure ParallelCluster with Slurm to allow multiple jobs on nodes with multiple GPUs?


I'm trying to run many jobs that each use 1 GPU on ParallelCluster with the Slurm scheduler. I'd like to use instances with 8 GPUs, such as p3.16xlarge, and have Slurm schedule 8 jobs on each node. I submit each job like this so that it requests only 1 GPU:

srun -p p3-16xlarge --gpus-per-node=1 myapp

The partition p3-16xlarge has nodes of type p3.16xlarge. This command appears to work, since Slurm sets CUDA_VISIBLE_DEVICES=0 to indicate that only 1 GPU should be used, and sinfo shows the instance in the mixed state while the job runs, so not all of its CPUs are allocated. The problem is that if I run the above command a second time to start a second job, Slurm boots up a new instance rather than assigning the job to the running instance that still has available CPUs and GPUs.
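
For reference, this is roughly how I'm checking the assignment (the partition name matches my config; the comments describe what I observe):

# confirm the job only sees one GPU
srun -p p3-16xlarge --gpus-per-node=1 bash -c 'echo $CUDA_VISIBLE_DEVICES'   # prints 0
# confirm the node is only partially allocated while the job runs
sinfo -p p3-16xlarge                                                         # node state shows "mix"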

On other Slurm clusters I've been able to get this to work as intended, with multiple jobs assigned to a node that has multiple GPUs. With ParallelCluster I can get Slurm to assign multiple jobs to the same instance on nodes without GPUs. So it seems that the way ParallelCluster configures nodes with GPUs prevents those nodes from accepting more than one job. I've read through the config files and suspect this could be related to gres.conf, since the generated config doesn't contain the Count attribute on the GPU entry. Hoping an expert can point me in the right direction!
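
For context, my understanding is that a gres.conf entry with an explicit device count would look something like the lines below (a sketch only; the GPU type and device paths are illustrative, not taken from my cluster, and I'm assuming ParallelCluster already sets the slurm.conf options shown):

# gres.conf: declare 8 GPUs with an explicit Count (illustrative values)
Name=gpu Type=v100 File=/dev/nvidia[0-7] Count=8

# slurm.conf: options that per-GPU scheduling relies on
GresTypes=gpu
SelectType=select/cons_tres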

areid
asked 9 months ago · 634 views
1 Answer

Hi! By default, ParallelCluster should allow multiple jobs to run on a node with multiple GPUs. Based on my testing, I couldn't reproduce the issue with g3.16xlarge instances, which have 4 GPUs. When I submitted two jobs, each requiring one GPU, Slurm correctly assigned both jobs to the same node, with each job using one GPU. Could you check the CPU and GPU allocation of your jobs by running scontrol show jobs --details within 5 minutes of the job finishing? Here's an example:

  • Submit 2 jobs, each requiring 1 GPU (the same srun command is submitted twice):
[ec2-user@ip-192-168-60-235 ~]$ srun -p queue-1 --gpus-per-node=1 sleep 120
  • Both jobs are running on the same node, each using 1 GPU:
[ec2-user@ip-192-168-60-235 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3   queue-1    sleep ec2-user  R       1:09      1 queue-1-dy-test-resume-1
                 4   queue-1    sleep ec2-user  R       0:09      1 queue-1-dy-test-resume-1
  • Run scontrol show jobs --details within 5 minutes of the jobs finishing.
  • As shown in the output below:
  • Job 3 is using Nodes=queue-1-dy-test-resume-1 CPU_IDs=0, GRES=gpu:m60:1(IDX:0)
  • Job 4 is using Nodes=queue-1-dy-test-resume-1 but on a different CPU (CPU_IDs=1) and a different GPU (GRES=gpu:m60:1(IDX:1))
[ec2-user@ip-192-168-60-235 ~]$ scontrol show jobs --details
JobId=3 JobName=sleep
   UserId=ec2-user(1000) GroupId=ec2-user(1000) MCS_label=N/A
   Priority=4294901757 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:01:15 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2023-08-07T23:02:38 EligibleTime=2023-08-07T23:02:38
   AccrueTime=Unknown
   StartTime=2023-08-07T23:02:38 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-07T23:02:38 Scheduler=Main
   Partition=queue-1 AllocNode:Sid=ip-192-168-60-235:9602
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=queue-1-dy-test-resume-1
   BatchHost=queue-1-dy-test-resume-1
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=474726M,node=1,billing=1
   AllocTRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:m60:1
     Nodes=queue-1-dy-test-resume-1 CPU_IDs=0 Mem=0 GRES=gpu:m60:1(IDX:0)
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=sleep
   WorkDir=/home/ec2-user
   Power=
   TresPerNode=gres:gpu:1

JobId=4 JobName=sleep
   UserId=ec2-user(1000) GroupId=ec2-user(1000) MCS_label=N/A
   Priority=4294901756 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:15 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2023-08-07T23:03:38 EligibleTime=2023-08-07T23:03:38
   AccrueTime=Unknown
   StartTime=2023-08-07T23:03:38 EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-07T23:03:38 Scheduler=Main
   Partition=queue-1 AllocNode:Sid=ip-192-168-60-235:10855
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=queue-1-dy-test-resume-1
   BatchHost=queue-1-dy-test-resume-1
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=474726M,node=1,billing=1
   AllocTRES=cpu=1,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   JOB_GRES=gpu:m60:1
     Nodes=queue-1-dy-test-resume-1 CPU_IDs=1 Mem=0 GRES=gpu:m60:1(IDX:1)
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=sleep
   WorkDir=/home/ec2-user
   Power=
   TresPerNode=gres:gpu:1
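
If in your case a second job still brings up a new instance, it may also help to compare the node's advertised GRES with the gres.conf entry on your cluster, for example (the node name is a placeholder, and the gres.conf path assumes the default /opt/slurm install location used by ParallelCluster):

# check what GRES the running compute node advertises and has allocated
scontrol show node <your-gpu-node> | grep -iE 'gres|cfgtres|alloctres'
# check whether the GPU entry lists all of the node's devices
cat /opt/slurm/etc/gres.conf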

Let me know if you find any useful information in the scontrol show jobs --details output.

Thanks

answered 9 months ago
