The issue you're experiencing with your Slurm job being stuck in "CF" (CONFIGURING) state is likely related to how AWS ParallelCluster provisions dynamic compute nodes.
When you submit a job and the status shows "CF", this typically means that the compute nodes are in the process of powering up to handle your job. Since you're using a configuration with MinCount: 0 for your compute resources, ParallelCluster is dynamically provisioning the nodes only when jobs are submitted.
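For reference, this behavior comes from the queue definition in the cluster configuration. A minimal sketch of the relevant section (the queue name, compute resource name, subnet ID, and MaxCount below are illustrative, not taken from your setup) looks like this:

```yaml
# Illustrative ParallelCluster 3.x scheduling section (names and IDs are placeholders)
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0   # private subnet for compute nodes
      ComputeResources:
        - Name: hpc7a
          InstanceType: hpc7a.48xlarge
          MinCount: 0   # nodes exist only while jobs need them
          MaxCount: 4   # illustrative upper bound
```

With MinCount: 0, every job submission has to wait for EC2 launch plus node bootstrap before Slurm can move the job out of CF.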
There are a few potential causes for your job being stuck in this state:
- Slurm daemon communication issues: After updating a cluster or changing its size, Slurm's internal communication hierarchy might not be properly updated. Slurm organizes its daemons in a tree hierarchy that's only updated when the daemons start.
- Network configuration issues: Since you're using a private subnet for compute nodes and a public subnet for the head node, there might be networking issues preventing proper communication between them.
- Instance availability: The hpc7a/hpc6a.48xlarge instances might not be immediately available in the requested Availability Zone.
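To narrow down which of these applies, check what Slurm itself reports for the job and the dynamic nodes. These are standard Slurm commands run on the head node; `<jobid>` and `<nodename>` are placeholders:

```bash
# Why is the job still pending/configuring? (%r prints Slurm's reason)
squeue -j <jobid> -o "%.10i %.2t %.20r %N"

# State of every compute node, including power-up/drain reasons
sinfo -N -l

# Detailed state and reason for one dynamic node
scontrol show node <nodename> | grep -iE "State|Reason"
```

A node that never leaves a powering-up state, or a reason string mentioning a launch or bootstrap failure, points at provisioning or networking rather than Slurm scheduling.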
To troubleshoot this issue:
- Try restarting the Slurm daemons using the scontrol reboot command, which will restart the compute nodes.
- Check the cluster logs for any error messages related to node provisioning (see the example commands after this list).
- Verify that your VPC networking configuration allows proper communication between the public subnet (head node) and private subnet (compute nodes).
- Consider temporarily reducing the instance size or trying a different instance type to see if availability is the issue.
- Ensure that your IAM roles have the necessary permissions for node provisioning.
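As a sketch of the first two steps: the head-node logs below are the usual ParallelCluster 3.x locations for node-provisioning errors (paths may differ in your release), and `<nodename>` is a placeholder:

```bash
# On the head node: look for bootstrap/provisioning errors for dynamic nodes
# (log paths reflect typical ParallelCluster 3.x locations; adjust for your version)
sudo tail -n 100 /var/log/parallelcluster/clustermgtd
sudo tail -n 100 /var/log/parallelcluster/slurm_resume.log
sudo tail -n 100 /var/log/slurmctld.log

# Power-cycle a stuck dynamic node instead of waiting it out
sudo scontrol reboot <nodename>
```

If clustermgtd or slurm_resume.log shows the nodes launching but never registering with slurmctld, that usually indicates the compute subnet cannot reach the head node's Slurm ports or required AWS endpoints.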
If a job is stuck in "CF" state, it generally means the dynamic nodes are still in the process of powering up, and you may need to wait longer or address underlying configuration issues.
Sources
- Trying to run a job - AWS ParallelCluster
- Running jobs in a multiple queue mode cluster - AWS ParallelCluster
- AWS ParallelCluster processes - AWS ParallelCluster
To provide more context: I created two identical templates, one with HPC6A instances and another with HPC7A. The former allows me to put everything (head node and compute nodes) on the same public subnet, while the latter does not. You can't use public IPs with HPC7a, so I have to use two different subnets.

I ran my job with the same number of cores on the HPC6a cluster (so it uses the same quota in the same region), and that works as expected. When I switch to the HPC7a template, which uses the two subnets, I can see the HPC7a instances spun up in the EC2 console, and they pass the status checks, but my job is never dispatched to the compute nodes. I think there is an issue between the private and the public subnet... Any suggestions?

For the record, I can access the compute nodes (private subnet) from the head node (public subnet), so it is not an issue of inter-node communication. Using HPC7A is far more cost-effective than HPC6A, so I must use HPC7A; it's a no-brainer.