ParallelCluster nodes failed


When running an AWS ParallelCluster cluster with a partition of c6g.medium on-demand instances, 19 of the nodes failed during a run and never powered up again.

My sinfo returns:

PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
c6gm-ondemand    up   infinite     19  idle% c6gm-ondemand-dy-c6gmedium-[32-50]
c6gm-ondemand    up   infinite     31  alloc c6gm-ondemand-dy-c6gmedium-[1-31]

sacct contains the following entries:

12033        2022_6_39+ c6gm-onde+                     1  NODE_FAIL      0:0
12034        2022_6_40+ c6gm-onde+                     1  NODE_FAIL      0:0
12037        2022_6_43+ c6gm-onde+                     1  NODE_FAIL      0:0
12039        2022_6_45+ c6gm-onde+                     1  NODE_FAIL      0:0
12040        2022_6_46+ c6gm-onde+                     1  NODE_FAIL      0:0

Does anyone know how I can figure out what caused these nodes to fail and why they were never booted up again? The other 31 on-demand nodes have been running similar tasks to the 19 failed nodes without problems. Also, is there any way to restart the 19 failed nodes? I would really like to run 50 nodes in parallel, not 31.

EDIT: my squeue contains hundreds more PENDING jobs to be run on nodes in this partition, so I'm a bit confused as to why the idle% nodes aren't being powered up again.
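
In case the node state itself is useful, these are roughly the commands I can run to inspect the failed nodes (the node name is just one entry from the NODELIST above):

# Show any reason Slurm recorded for nodes in a down/drained/fail state
sinfo -R

# Detailed state of one of the failed nodes
scontrol show node c6gm-ondemand-dy-c6gmedium-32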

JohnB
asked a year ago
1 Answer
Accepted Answer

It looks like you're only using 31 nodes and the other 19 aren't in use. What does your workload look like (i.e., a single job that needs 50 nodes vs. 50 jobs that each need one node)? ParallelCluster only provisions the instances needed to run the jobs in the queue, to leverage the elasticity of the cloud and ensure you're not paying for idle resources.
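
To illustrate the difference (run.sh is a placeholder for your job script), the two workload shapes would be submitted roughly like this, and ParallelCluster scales the partition according to what is actually pending:

# One job that needs all 50 nodes at once
sbatch --nodes=50 run.sh

# 50 independent single-node jobs, submitted as a job array
sbatch --array=1-50 --nodes=1 run.sh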

If you go to CloudWatch, there should be a log group named /aws/parallelcluster/<StackName>, where StackName is the name of your ParallelCluster stack. There should be logs in there for each node that was provisioned and then failed. If the nodes failed before being provisioned, you'll want to look at the logs for the head node; there may be something useful in the slurm_resume.log entry. Failing before provisioning can also indicate an error such as exceeding your instance quota (which you can request an increase for here: https://aws.amazon.com/premiumsupport/knowledge-center/ec2-instance-limit/).
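
If you prefer the CLI over the console, you can pull those logs with something along these lines (the log group and stream names below are placeholders; copy the real stream name from the output of the first command):

# List the log streams in the cluster's log group, most recent first
aws logs describe-log-streams \
    --log-group-name /aws/parallelcluster/<StackName> \
    --order-by LastEventTime --descending

# Fetch recent events from the head node's slurm_resume stream
aws logs get-log-events \
    --log-group-name /aws/parallelcluster/<StackName> \
    --log-stream-name <head-node-stream-name> \
    --limit 100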

AWS
answered a year ago
EXPERT
reviewed 10 months ago
  • Thanks for the answer. I have a queue full of hundreds of pending jobs in this partition, each of which would run on one node, but that doesn't prompt ParallelCluster to restart the failed nodes.

  • Gotcha. I think the idle% designation means those nodes are idle and in the process of powering down (https://slurm.schedmd.com/elastic_computing.html). Do the EC2 instances that would back those nodes exist? Did they launch and then terminate? Were you able to glean any useful information from the CloudWatch logs? Something like the sketch below can help check the instance side.
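
    Roughly something like this (the tag key assumes ParallelCluster 3.x and <ClusterName> is a placeholder for your cluster name; recently terminated instances remain visible for a short while):

    # List the cluster's instances with their current state and last state-transition reason
    aws ec2 describe-instances \
        --filters "Name=tag:parallelcluster:cluster-name,Values=<ClusterName>" \
        --query "Reservations[].Instances[].{Id:InstanceId,State:State.Name,Reason:StateTransitionReason}" \
        --output table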

  • Pointing to the CloudWatch logs was spot on, thanks. When I first checked the status of my job queue, I believed 50 nodes had initially been started, because I saw only 31 nodes computing and sacct told me that 19 nodes had failed for some reason. From the "node failure" message in Slurm I deduced that the nodes must have been running at some point, which was a mistake: after digging into the CloudWatch logs, I found the following message in a file called ip-SOMEIP.i-SOMETHING.slurm_resume:

    You have requested more vCPU capacity than your current vCPU limit of 32 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.

    I guess that settles the question: I simply requested too many vCPUs. Thanks a lot for all the pointers.
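
    For anyone hitting the same limit: the message refers to the On-Demand vCPU quota, and an increase can also be requested from the CLI, roughly like this (the 'Standard' filter and the desired value are assumptions; check which quota actually covers your instance family):

    # Find the quota code for the relevant On-Demand instance family
    aws service-quotas list-service-quotas --service-code ec2 \
        --query "Quotas[?contains(QuotaName,'Standard')].[QuotaCode,QuotaName,Value]" \
        --output table

    # Request an increase (example: enough vCPUs for 50 c6g.medium nodes, 1 vCPU each)
    aws service-quotas request-service-quota-increase \
        --service-code ec2 --quota-code <QuotaCode> --desired-value 64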
