It looks like you're only using 31 nodes and the other 19 aren't in use. What does your workload look like (e.g. one job that needs 50 nodes, or 50 jobs that each need 1 node)? ParallelCluster only provisions the instances needed to run the jobs in the queue, to leverage the elasticity of the cloud and ensure you're not paying for idle resources.
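If it helps to describe the workload, here is a minimal sketch of the standard Slurm commands for reporting queue and node state; the partition name queue1 is just a placeholder for yours:

```
# Show node counts and states for the partition
# (state suffixes: ~ = powered down, % = powering down)
sinfo --partition=queue1

# List pending jobs with the scheduler's reason for holding them
squeue --states=PENDING --format="%.10i %.20j %.10T %r"

# Summarize today's jobs, including any that ended in NODE_FAIL
sacct --format=JobID,JobName,State,NNodes,ExitCode
```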
If you go to CloudWatch, there should be a log group named /aws/parallelcluster/<StackName>, where StackName is the name of your ParallelCluster stack. There should be logs for each node in there if they were provisioned and then failed. If they failed before provisioning, you'll want to look at the head node's logs; there may be something useful in the slurm_resume.log entry. Failing before provisioning may also indicate an error such as exceeding your instance quota (which you can request to increase here: https://aws.amazon.com/premiumsupport/knowledge-center/ec2-instance-limit/)
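As a rough sketch, those same logs can be pulled from the command line with the AWS CLI, assuming the log group is /aws/parallelcluster/<StackName> as above (substitute your actual stack name and the stream name you find):

```
# List the log streams in the cluster's log group (one per node/daemon), newest activity first
aws logs describe-log-streams \
  --log-group-name /aws/parallelcluster/<StackName> \
  --order-by LastEventTime --descending

# Dump a specific stream, e.g. the head node's slurm_resume log
# (stream names look like <hostname>.<instance-id>.slurm_resume)
aws logs get-log-events \
  --log-group-name /aws/parallelcluster/<StackName> \
  --log-stream-name <stream-name-from-previous-command>
```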
Thanks for the answer. I have a queue full of hundreds of pending jobs in this partition, each of which would run on one node, but that doesn't prompt ParallelCluster to restart the failed nodes.
Gotcha. I think the idle% designation means the partition is idle and powering down (https://slurm.schedmd.com/elastic_computing.html). Do the EC2 instances that would be in that partition exist? Did they launch and then terminate? Were you able to glean any useful information from the CloudWatch logs?

Pointing to the CloudWatch logs was spot on, thanks. When I first checked the status of my job queue, I believed 50 nodes had been started initially, because I saw only 31 nodes computing and sacct told me that 19 nodes had failed for some reason. From the "node failure" message in Slurm I deduced that at some point those nodes must have been running, which was a mistake: after digging in the CloudWatch logs, I found the following message in a file called ip-SOMEIP.i-SOMETHING.slurm_resume:

You have requested more vCPU capacity than your current vCPU limit of 32 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
I guess that should settle the question then --- I simply requested too many vCPUs. Thanks a lot for all the pointers.
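For anyone hitting the same wall, here is a hedged sketch of checking and raising that vCPU limit with the Service Quotas CLI; the 'On-Demand' name filter and the desired value are only placeholders, so confirm the exact quota name and code in your own account before requesting an increase:

```
# List the EC2 On-Demand vCPU quotas to find the one covering your instance family
aws service-quotas list-service-quotas \
  --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'On-Demand')].[QuotaName,QuotaCode,Value]" \
  --output table

# Request an increase to the vCPU count you need (substitute the quota code found above)
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code <QUOTA_CODE> \
  --desired-value 200
```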