Slurm job status on a cluster deployed with ParallelCluster v3 does not update to "R" (running)


Dear community, I've been struggling for the last five days to resolve a problem with my CFD HPC solution using ParallelCluster. I've used it several times in the past, so I'm comfortable with the basic details. This time I'm using a new instance type, hpc7a (us-east-2b, Ohio), which has to sit in a private subnet because that instance type does not allow public IP addresses. My head node is in a public subnet. I can deploy my CFD cluster without a problem, and I can access the head node and even the compute nodes, but when I submit a job it stays at "CF" in the ST column every time I check with squeue. I've tried different OSes and smaller cluster sizes, but nothing changed. Hopefully I can get some feedback from you all.

When I submit the job, it does not start and just sits in that state for more than 10 minutes.
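
For reference, this is roughly what I see when I check the queue (the job ID, job name, and node count below are illustrative, not copied from my terminal):

$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      2   compute   cfdrun   ubuntu CF       0:00      4 compute-dy-hpc7a-[1-4]

The ST column just stays at CF instead of moving to R.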

This is the configuration file:

HeadNode:
  InstanceType: c5.4xlarge
  Networking:
    SubnetId: subnet-ID <- Public Subnet
  Ssh:
    KeyName: KEYNAME
  LocalStorage:
    RootVolume:
      VolumeType: gp3
  Iam:
    S3Access:
      - BucketName: NAME OF BUCKET
        EnableWriteAccess: true
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - Policy: arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess
  Dcv:
    Enabled: true
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      CapacityType: ONDEMAND
      ComputeResources:
        - Name: hpc7a
          Instances:
            - InstanceType: hpc6a.48xlarge
          MinCount: 0
          MaxCount: 10
          Efa:
            Enabled: true
      Networking:
        SubnetIds:
          - subnet-ID <- Private Subnet
        PlacementGroup:
          Enabled: true
      ComputeSettings:
        LocalStorage:
          RootVolume:
            VolumeType: gp3
      Iam:
        S3Access:
          - BucketName: NAME OF BUCKET
            EnableWriteAccess: true
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess
#  SlurmSettings:
#    EnableMemoryBasedScheduling: true
Region: us-east-2
Image:
  Os: ubuntu2404
SharedStorage:
  - Name: FsxLustre1
    StorageType: FsxLustre
    MountDir: /fsx
    FsxLustreSettings:
      StorageCapacity: 2400
      DeploymentType: PERSISTENT_2
      DataCompressionType: LZ4
      PerUnitStorageThroughput: 250
asked 15 days ago, 30 views
2 Answers

The issue you're experiencing with your Slurm job being stuck in "CF" (CONFIGURING) state is likely related to how AWS ParallelCluster provisions dynamic compute nodes.

When you submit a job and the status shows "CF", this typically means that the compute nodes are in the process of powering up to handle your job. Since you're using a configuration with MinCount: 0 for your compute resources, ParallelCluster is dynamically provisioning the nodes only when jobs are submitted.
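
You can watch this power-up happen from the head node. With ParallelCluster's dynamic nodes, sinfo marks powered-down nodes with a "~" suffix and powering-up nodes with a "#" suffix; the node names below are inferred from your queue and compute-resource names (compute / hpc7a) and the output is only an illustration:

# Run these on the head node while the job is stuck
squeue
sinfo
scontrol show job <jobid>    # shows JobState and Reason for the stuck job

# Roughly what a normal power-up looks like in sinfo:
# PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
# compute      up   infinite     10  idle~ compute-dy-hpc7a-[1-10]   <- powered down, waiting for jobs
# compute      up   infinite      1 alloc# compute-dy-hpc7a-1        <- powering up for your job
# compute      up   infinite      1  alloc compute-dy-hpc7a-1        <- booted, job should go to R

If a node reaches "alloc#" but later drops back to "idle~" or is marked down, that usually points to the compute node's bootstrap failing rather than to the scheduler itself.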

There are a few potential causes for your job being stuck in this state:

  1. Slurm daemon communication issues: After updating a cluster or changing its size, Slurm's internal communication hierarchy might not be properly updated. Slurm organizes its daemons in a tree hierarchy that's only updated when the daemons start.

  2. Network configuration issues: Since you're using a private subnet for compute nodes and a public subnet for the head node, there might be networking issues preventing proper communication between them.

  3. Instance availability: The hpc7a/hpc6a.48xlarge instances might not be immediately available in the requested Availability Zone.
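
For point 3, you can quickly confirm whether the instance type is even offered in the Availability Zone of your private subnet (the instance size below is an assumption; substitute the exact hpc7a size you requested):

aws ec2 describe-instance-type-offerings \
    --location-type availability-zone \
    --filters Name=instance-type,Values=hpc7a.96xlarge \
    --region us-east-2
# If the AZ of your compute subnet is not listed under "Location",
# the fleet can't launch there and the job will stay stuck.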

To troubleshoot this issue:

  1. Try rebooting the affected compute nodes with the scontrol reboot command, which also restarts their Slurm daemons.

  2. Check the cluster logs for any error messages related to node provisioning (a few useful commands are sketched after this list).

  3. Verify that your VPC networking configuration allows proper communication between the public subnet (head node) and private subnet (compute nodes).

  4. Consider temporarily reducing the instance size or trying a different instance type to see if availability is the issue.

  5. Ensure that your IAM roles have the necessary permissions for node provisioning.
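
For steps 2 and 3, a minimal set of checks could look like this (the log paths are the standard ParallelCluster locations on the head node; the node name is inferred from your queue and compute-resource names):

# Step 2: node-provisioning and scheduler logs on the head node
sudo tail -n 50 /var/log/parallelcluster/clustermgtd
sudo tail -n 50 /var/log/parallelcluster/slurm_resume.log
sudo tail -n 50 /var/log/slurmctld.log

# Why Slurm gave up on a node, if it did
scontrol show node compute-dy-hpc7a-1 | grep -i -E 'state|reason'

# Step 3: on a compute node (you mentioned you can reach them), check that bootstrap finished
sudo tail -n 50 /var/log/cloud-init-output.log

Errors about reaching AWS endpoints in cloud-init-output.log usually mean the private subnet has no outbound path (NAT gateway or VPC endpoints).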

If a job is stuck in "CF" state, it generally means the dynamic nodes are still in the process of powering up, and you may need to wait longer or address underlying configuration issues.
Sources
Trying to run a job - AWS ParallelCluster
Running jobs in a multiple queue mode cluster - AWS ParallelCluster
AWS ParallelCluster processes - AWS ParallelCluster

answered 15 days ago
EXPERT
reviewed 13 days ago

To provide more context: I created two identical templates, one with hpc6a instances and another with hpc7a. The former lets me put everything (head node and compute nodes) on the same public subnet, while the latter does not: you can't use public IPs with hpc7a, so I have to use two different subnets. I ran my job with the same number of cores on the hpc6a template (so it uses the same quota in the same region), and that works as expected. When I switch to the hpc7a template, which uses the two subnets, I can see the hpc7a instances spin up in my EC2 console and pass their status checks, but my job is never dispatched to the compute nodes. I think there is an issue between the private and the public subnet... Any suggestions? For the record, I can access the compute nodes (private subnet) from the head node (public subnet), so it does not look like a basic inter-node connectivity issue. Using hpc7a is far more cost-effective than hpc6a, so I need to stick with hpc7a; it's a no-brainer.
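
One thing I still need to rule out is whether the private subnet actually gives the compute nodes an outbound path (for example through a NAT gateway) so they can finish bootstrapping, since they pass the EC2 status checks either way. A rough check would be something like this (the subnet ID below is just a placeholder for my private subnet):

# Does the private subnet route 0.0.0.0/0 anywhere (e.g. a NAT gateway)?
aws ec2 describe-route-tables \
    --filters Name=association.subnet-id,Values=subnet-0123456789abcdef0 \
    --region us-east-2 \
    --query 'RouteTables[].Routes[]'
# Note: if nothing comes back, the subnet has no explicit route table
# association and falls back to the VPC's main route table.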

answered 14 days ago
