Job on AWS ParallelCluster using hpc6a.48xlarge not running

0

Hello,

I want to run a simulation on an AWS ParallelCluster cluster (alinux2) using 2 hpc6a.48xlarge instances (192 CPUs). I created the cluster and submitted the job with Slurm. The problem is that the job stays pending in the queue and never runs (I left it for more than a day). I ran the same job on another instance type with the same total number of CPUs and it worked perfectly, so the issue seems specific to hpc6a.48xlarge. I also tried using only 1 hpc6a.48xlarge instance (96 CPUs), but that did not work either. I copy the squeue output at the end of this message; it shows a 'BeginTime' reason, although I have not scheduled my job to start later. What could be causing this?

I am creating the cluster on a new company account. Could the issue be related to the account's usage? I ask because I have already configured the same cluster on a personal account (with significantly more usage than the company account) and there the job runs almost immediately.

I would appreciate any advice on resolving this issue.
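For context, a submission requesting 2 hpc6a.48xlarge nodes typically looks something like the sketch below; the partition and job names are taken from the squeue output that follows, while the module setup and the OpenFOAM solver call are placeholders rather than the actual script used here.

    #!/bin/bash
    #SBATCH --job-name=foam-64          # job name as it appears in squeue below
    #SBATCH --partition=compute         # partition name as it appears in squeue below
    #SBATCH --nodes=2                   # 2 x hpc6a.48xlarge
    #SBATCH --ntasks=192                # 96 physical cores per node (SMT is disabled on hpc6a)
    #SBATCH --exclusive

    module load openmpi                  # placeholder environment setup
    mpirun -np 192 simpleFoam -parallel  # placeholder OpenFOAM solver invocation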

[ec2-user@ip- OpenFOAM]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      2   compute  foam-64 ec2-user PD       0:00      1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      2   compute  foam-64 ec2-user PD       0:00      1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      2   compute  foam-64 ec2-user PD       0:00      1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      2   compute  foam-64 ec2-user PD       0:00      1 (None)

[ec2-user@ip- OpenFOAM]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      2   compute  foam-64 ec2-user PD       0:00      1 (None)

[ec2-user@ip- OpenFOAM]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      2   compute  foam-64 ec2-user PD       0:00      1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      2   compute  foam-64 ec2-user PD       0:00      1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      2   compute  foam-64 ec2-user PD       0:00      1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      2   compute  foam-64 ec2-user PD       0:00      1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      2   compute  foam-64 ec2-user PD       0:00      1 (None)
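Slurm itself can report more detail on why a job stays pending than the one-line squeue summary. A few standard checks, run on the head node (the job ID 2 is taken from the output above), might look like this:

    # Full record for the pending job, including its Reason and requested resources
    scontrol show job 2

    # Expected start time and pending reason for the job
    squeue -j 2 --start

    # State of the compute nodes that ParallelCluster is trying to bring up
    sinfo -N -l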

asked 2 years ago · 646 views
4 Answers
1

Can you take a look at the clustermgtd and slurm_resume.log files under /var/log/parallelcluster/ for details on the launch of the compute nodes?
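On the head node, something along these lines will surface the relevant entries (the paths are the standard ParallelCluster log locations; the grep pattern is just a suggestion):

    # Last entries of the node-management and node-launch logs
    sudo tail -n 50 /var/log/parallelcluster/clustermgtd
    sudo tail -n 50 /var/log/parallelcluster/slurm_resume.log

    # Or search both logs for launch failures directly
    sudo grep -i error /var/log/parallelcluster/clustermgtd /var/log/parallelcluster/slurm_resume.log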

AWS
answered 2 years ago
  • These are the last lines of the clustermgtd file:

    2022-04-01 02:01:44,064 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
    2022-04-01 02:01:44,065 - [slurm_plugin.common:read_json] - INFO - Unable to read file '/opt/slurm/etc/pcluster/run_instances_overrides.json'. Using default: {}
    2022-04-01 02:01:44,066 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
    2022-04-01 02:01:44,072 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
    2022-04-01 02:01:44,072 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
    2022-04-01 02:01:49,148 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
    2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
    2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
    2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
    2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy dynamic nodes: (x1) ['compute-dy-defaultcompute-9(compute-dy-defaultcompute-9)']

  • And these are the last lines of slurm_resume.log:

    2022-04-01 02:03:41,449 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2022-04-01 02:02:49.260630+00:00
    2022-04-01 02:03:41,449 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: compute-dy-defaultcompute-10
    2022-04-01 02:03:41,513 - [slurm_plugin.instance_manager:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['compute-dy-defaultcompute-10']
    2022-04-01 02:03:42,293 - [slurm_plugin.instance_manager:_launch_ec2_instances] - ERROR - Failed RunInstances request: d40261ce-e840-4e92-857b-a86c2820c73b
    2022-04-01 02:03:42,293 - [slurm_plugin.instance_manager:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x1) ['compute-dy-defaultcompute-10']: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
    2022-04-01 02:03:42,294 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x0) []
    2022-04-01 02:03:42,294 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to down: (x1) ['compute-dy-defaultcompute-10']
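The VcpuLimitExceeded error above is consistent with a vCPU quota of 0 for the HPC instance family in the new account. One way to confirm the current value is the Service Quotas CLI; the 'HPC' name filter below is an assumption about how the relevant quota is labeled, so adjust it if nothing matches:

    # List EC2 vCPU quotas whose name mentions HPC; hpc6a instances count against
    # their own On-Demand HPC quota, separate from the Standard instance quota.
    aws service-quotas list-service-quotas \
        --service-code ec2 \
        --query "Quotas[?contains(QuotaName, 'HPC')].[QuotaName,Value,QuotaCode]" \
        --output table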

1

There is a chance that your new account doesn't have the requisite quota for hpc6a. Please open a support case asking for a limit increase, or reach out to your account manager or solutions architect for help.
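If the quota does turn out to be 0, the increase can also be requested from the CLI instead of the console. The quota code below is a placeholder; replace it with the QuotaCode returned by the check above, and size the desired value to the planned job (2 x 96 vCPUs here):

    # Request a higher vCPU limit for the HPC instance family (placeholder quota code)
    aws service-quotas request-service-quota-increase \
        --service-code ec2 \
        --quota-code L-XXXXXXXX \
        --desired-value 192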

AWS
answered 2 years ago
0

Thank you for all your comments. I have requested a limit increase to 514 vCPUs (my current limit is 240 vCPUs). Hopefully that will solve the issue.

answered 2 years ago
0

There is a separate quota specifically for hpc6a instances; you probably need to increase that one as well.

AWS
answered 2 years ago
