Hello,
I want to run a simulation on an AWS ParallelCluster (alinux2) using 2 hpc6a.48xlarge instances (192 vCPUs in total). I created the cluster and submitted the job with Slurm. The problem is that the job stays waiting in the queue and never runs (I left it for more than 1 day). I tried running the same job on another instance type, with the same number of vCPUs, and it worked perfectly, so the issue seems specific to this instance type (hpc6a.48xlarge).
I also tried using only 1 hpc6a.48xlarge instance (96 vCPUs), but it did not work either. I have copied the squeue output at the end of this message. It shows some 'BeginTime' reasons, although I have not scheduled my job to start later.
What may be the reason for this issue?
I am creating the cluster on a new company account. Could the issue be related to the account's usage history? I ask because I have already configured the same cluster on a personal account (with significantly more usage than the company account) and there the job runs almost immediately.
I would appreciate any advice on resolving this issue.
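In case it is useful, this is a minimal sketch of the commands for inspecting a stuck job and the scheduler-side logs (assuming job ID 2, as in the output below, and the default ParallelCluster 3 log paths on the head node):

```shell
# Detailed state and pending reason for the stuck job (job ID 2 below)
scontrol show job 2

# Current queue state with the reason column
squeue

# Scheduler-side logs on the head node (default ParallelCluster 3 paths)
sudo tail -n 50 /var/log/parallelcluster/clustermgtd
sudo grep -i error /var/log/parallelcluster/slurm_resume.log
```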
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (None)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (None)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (BeginTime)
[ec2-user@ip- OpenFOAM]$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2 compute foam-64 ec2-user PD 0:00 1 (None)
These are the last lines of the clustermgtd log file:
2022-04-01 02:01:44,064 - [slurm_plugin.clustermgtd:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_clustermgtd.conf
2022-04-01 02:01:44,065 - [slurm_plugin.common:read_json] - INFO - Unable to read file '/opt/slurm/etc/pcluster/run_instances_overrides.json'. Using default: {}
2022-04-01 02:01:44,066 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Managing cluster...
2022-04-01 02:01:44,072 - [slurm_plugin.clustermgtd:_manage_compute_fleet_status_transitions] - INFO - Current compute fleet status: RUNNING
2022-04-01 02:01:44,072 - [slurm_plugin.clustermgtd:manage_cluster] - INFO - Retrieving nodes info from the scheduler
2022-04-01 02:01:49,148 - [slurm_plugin.clustermgtd:_get_ec2_instances] - INFO - Retrieving list of EC2 instances associated with the cluster
2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_perform_health_check_actions] - INFO - Performing instance health check actions
2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Performing node maintenance actions
2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x0) []
2022-04-01 02:01:49,211 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy dynamic nodes: (x1) ['compute-dy-defaultcompute-9(compute-dy-defaultcompute-9)']
And these are the last lines of slurm_resume.log:
2022-04-01 02:03:41,449 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2022-04-01 02:02:49.260630+00:00
2022-04-01 02:03:41,449 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: compute-dy-defaultcompute-10
2022-04-01 02:03:41,513 - [slurm_plugin.instance_manager:add_instances_for_nodes] - INFO - Launching instances for slurm nodes (x1) ['compute-dy-defaultcompute-10']
2022-04-01 02:03:42,293 - [slurm_plugin.instance_manager:_launch_ec2_instances] - ERROR - Failed RunInstances request: d40261ce-e840-4e92-857b-a86c2820c73b
2022-04-01 02:03:42,293 - [slurm_plugin.instance_manager:add_instances_for_nodes] - ERROR - Encountered exception when launching instances for nodes (x1) ['compute-dy-defaultcompute-10']: An error occurred (VcpuLimitExceeded) when calling the RunInstances operation: You have requested more vCPU capacity than your current vCPU limit of 0 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
2022-04-01 02:03:42,294 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x0) []
2022-04-01 02:03:42,294 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to down: (x1) ['compute-dy-defaultcompute-10']
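I notice the slurm_resume log reports a VcpuLimitExceeded error with "your current vCPU limit of 0" for the instance bucket that hpc6a.48xlarge belongs to. A sketch of how the relevant EC2 On-Demand vCPU quota can be checked with the AWS CLI (assuming the CLI is configured; the region below is a placeholder for the cluster's region):

```shell
# List EC2 On-Demand vCPU quotas whose name mentions HPC instances
# (region is a placeholder; use the cluster's region)
aws service-quotas list-service-quotas \
    --service-code ec2 \
    --region us-east-2 \
    --query "Quotas[?contains(QuotaName, 'HPC')].[QuotaName,Value]" \
    --output table

# The failure can also be confirmed directly from the resume log
grep VcpuLimitExceeded /var/log/parallelcluster/slurm_resume.log
```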