Job on ParallelCluster using hpc6a.48xlarge not running


Hello,

I want to run a simulation on a cluster built with AWS ParallelCluster (alinux2) using 2 hpc6a.48xlarge instances (192 CPUs). I created the cluster and submitted the job with Slurm. The problem is that the job stays pending in the queue and never runs (I left it for more than a day). I tried running the same job with another instance type and the same number of CPUs, and it worked perfectly, so the issue seems specific to hpc6a.48xlarge. I also tried using only 1 hpc6a.48xlarge instance (96 CPUs), but that did not work either.

I copy the squeue output at the end of this message; it shows a 'BeginTime' reason, although I have not scheduled the job to start at a later time. What may be the reason for this issue? I am creating the cluster on a new company account. Could the issue be related to the account's usage? I ask this because I have already configured the same cluster on a personal account (with significantly more usage than the company account) and there the job runs almost immediately.
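For reference, the job is submitted with a script roughly like the following (the solver command and exact task counts are simplified placeholders; hpc6a.48xlarge has 96 physical cores per node):

    #!/bin/bash
    #SBATCH --job-name=foam-64
    #SBATCH --partition=compute
    #SBATCH --nodes=2
    #SBATCH --ntasks=192   # 2 x hpc6a.48xlarge, 96 cores each

    # <solver> stands in for the actual OpenFOAM solver binary
    mpirun <solver> -parallel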

I would appreciate any advice on resolving this issue.

[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION    NAME     USER ST  TIME NODES NODELIST(REASON)
    2   compute foam-64 ec2-user PD  0:00     1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION    NAME     USER ST  TIME NODES NODELIST(REASON)
    2   compute foam-64 ec2-user PD  0:00     1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION    NAME     USER ST  TIME NODES NODELIST(REASON)
    2   compute foam-64 ec2-user PD  0:00     1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION    NAME     USER ST  TIME NODES NODELIST(REASON)
    2   compute foam-64 ec2-user PD  0:00     1 (None)

[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION    NAME     USER ST  TIME NODES NODELIST(REASON)
    2   compute foam-64 ec2-user PD  0:00     1 (None)

[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION    NAME     USER ST  TIME NODES NODELIST(REASON)
    2   compute foam-64 ec2-user PD  0:00     1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION    NAME     USER ST  TIME NODES NODELIST(REASON)
    2   compute foam-64 ec2-user PD  0:00     1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION    NAME     USER ST  TIME NODES NODELIST(REASON)
    2   compute foam-64 ec2-user PD  0:00     1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION    NAME     USER ST  TIME NODES NODELIST(REASON)
    2   compute foam-64 ec2-user PD  0:00     1 (BeginTime)

[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION    NAME     USER ST  TIME NODES NODELIST(REASON)
    2   compute foam-64 ec2-user PD  0:00     1 (None)

asked 2 years ago · 638 views
4 Answers

Can you take a look at clustermgtd and slurm_resume.log under /var/log/parallelcluster/ for details of the launch of the compute nodes?
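For example, from the head node (these are the default ParallelCluster log locations; exact file names can vary slightly between versions):

    sudo tail -n 50 /var/log/parallelcluster/clustermgtd
    sudo tail -n 50 /var/log/parallelcluster/slurm_resume.log
    sudo grep -i error /var/log/parallelcluster/slurm_resume.log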

AWS
answered 2 years ago
  • These are the last lines of the clustermgtd file:

    2022-04-01 02:01:44,064 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
    2022-04-01 02:01:44,065 - [slurm_plugin.common:read_json] - INFO - Unable to read file '/opt/slurm/etc/pcluster/run_instances_overrides.json'. Using default: {}
    2022-04-01 02:01:44,066 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
    2022-04-01 02:01:44,072 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
    2022-04-01 02:01:44,072 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
    2022-04-01 02:01:49,148 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
    2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
    2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
    2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
    2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy dynamic nodes: (x1) ['compute-dy-defaultcompute-9(compute-dy-defaultcompute-9)']

  • And these are the last lines of slurm_resume.log:

    2022-04-01 02:03:41,449 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2022-04-01 02:02:49.260630+00:00
    2022-04-01 02:03:41,449 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: compute-dy-defaultcompute-10
    2022-04-01 02:03:41,513 - [slurm_plugin.instance_manager:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['compute-dy-defaultcompute-10']
    2022-04-01 02:03:42,293 - [slurm_plugin.instance_manager:_launch_ec2_instances] - ERROR - Failed RunInstances request: d40261ce-e840-4e92-857b-a86c2820c73b
    2022-04-01 02:03:42,293 - [slurm_plugin.instance_manager:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x1) ['compute-dy-defaultcompute-10']: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
    2022-04-01 02:03:42,294 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x0) []
    2022-04-01 02:03:42,294 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to down: (x1) ['compute-dy-defaultcompute-10']


There is a chance that your new account doesn't have the requisite vCPU quota for hpc6a; the VcpuLimitExceeded error in your slurm_resume.log (a current vCPU limit of 0 for that instance bucket) points to exactly that. Please open a support case asking for a limit increase, or reach out to your account manager or solutions architect for help.
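If you prefer the CLI, the increase can also be requested with something along these lines (replace <hpc-quota-code> with the quota code for On-Demand HPC instances in your region, which you can look up with aws service-quotas list-service-quotas):

    aws service-quotas request-service-quota-increase \
        --service-code ec2 \
        --quota-code <hpc-quota-code> \
        --desired-value 192   # 2 x hpc6a.48xlarge at 96 vCPUs each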

AWS
answered 2 years ago

Thank you for all your comments. I have requested a new limit increase to 514 vCPUs (I had 240 vCPUs). Hopefully, that will solve the issue.

answered 2 years ago

There is a specific vCPU quota for hpc6a instances, separate from the standard On-Demand instance quota. You probably need to increase that one too.
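You can check the current value of that quota with something like:

    aws service-quotas list-service-quotas --service-code ec2 \
        --query "Quotas[?contains(QuotaName, 'HPC')].[QuotaName,QuotaCode,Value]" \
        --output table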

AWS
answered 2 years ago
