Hello,
I want to run a simulation on an AWS ParallelCluster (alinux2) using 2 hpc6a.48xlarge instances (192 vCPUs in total). I created the cluster and submitted the job with Slurm. The problem is that the job stays waiting in the queue and never runs (I left it for more than 1 day). I tried running the same job on another instance type, with the same number of vCPUs, and it worked perfectly, so the issue seems specific to this instance type (hpc6a.48xlarge).
I also tried using only 1 hpc6a.48xlarge instance (96 vCPUs), but it did not work either. I have copied the squeue output at the end of this message. It shows some 'BeginTime' reasons, although I have not scheduled my job to start later.
What may be the reason for this issue?
I am creating the cluster on a new company account. Could the issue be related to the account's usage history? I ask because I have already configured the same cluster on a personal account (with significantly more usage than the company account) and there the job runs almost immediately.
I would appreciate any advice on resolving this issue.
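In case it is useful, this is a minimal sketch of the commands for inspecting a stuck job and the scheduler-side logs (assuming job ID 2, as in the output below, and the default ParallelCluster 3 log paths on the head node):

```shell
# Detailed state and pending reason for the stuck job (job ID 2 below)
scontrol show job 2

# Current queue state with the reason column
squeue

# Scheduler-side logs on the head node (default ParallelCluster 3 paths)
sudo tail -n 50 /var/log/parallelcluster/clustermgtd
sudo grep -i error /var/log/parallelcluster/slurm_resume.log
```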
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (None)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (None)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (None)
These are the last lines of the clustermgtd log file:
2022-04-01 02:01:44,064 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2022-04-01 02:01:44,065 - [slurm_plugin.common:read_json] - INFO - Unable to read file '/opt/slurm/etc/pcluster/run_instances_overrides.json'. Using default: {}
2022-04-01 02:01:44,066 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2022-04-01 02:01:44,072 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2022-04-01 02:01:44,072 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2022-04-01 02:01:49,148 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy dynamic nodes: (x1) ['compute-dy-defaultcompute-9(compute-dy-defaultcompute-9)']
And these are the last lines of slurm_resume.log:
2022-04-01 02:03:41,449 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2022-04-01 02:02:49.260630+00:00
2022-04-01 02:03:41,449 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: compute-dy-defaultcompute-10
2022-04-01 02:03:41,513 - [slurm_plugin.instance_manager:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['compute-dy-defaultcompute-10']
2022-04-01 02:03:42,293 - [slurm_plugin.instance_manager:_launch_ec2_instances] - ERROR - Failed RunInstances request: d40261ce-e840-4e92-857b-a86c2820c73b
2022-04-01 02:03:42,293 - [slurm_plugin.instance_manager:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x1) ['compute-dy-defaultcompute-10']: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
2022-04-01 02:03:42,294 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x0) []
2022-04-01 02:03:42,294 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to down: (x1) ['compute-dy-defaultcompute-10']
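I notice the slurm_resume log reports a VcpuLimitExceeded error with "your current vCPU limit of 0" for the instance bucket that hpc6a.48xlarge belongs to. A sketch of how the relevant EC2 On-Demand vCPU quota can be checked with the AWS CLI (assuming the CLI is configured; the region below is a placeholder for the cluster's region):

```shell
# List EC2 On-Demand vCPU quotas whose name mentions HPC instances
# (region is a placeholder; use the cluster's region)
aws service-quotas list-service-quotas \
    --service-code ec2 \
    --region us-east-2 \
    --query "Quotas[?contains(QuotaName, 'HPC')].[QuotaName,Value]" \
    --output table

# The failure can also be confirmed directly from the resume log
grep VcpuLimitExceeded /var/log/parallelcluster/slurm_resume.log
```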