Jobs submitted to worker nodes stay in pending after rebooting head node


Hello,

I have set up a small cluster using Parallel Cluster where I have head and worker nodes in the same public network.

I used it yesterday, and it worked well. I submitted jobs using Slurm to on-demand and spot instances without any issues. When these jobs finished, I stopped the head node instance. After restarting it, every newly submitted job remains in a pending state. When I check their status using squeue, I see reasons like "ReqNodeNotAvail" and "UnavailableNode." I have attempted to submit jobs to different types of on-demand instances, but the issue persists.
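For anyone who wants to reproduce the check, something like the following should show the reason Slurm reports for the pending jobs and for the compute nodes it considers unavailable (a sketch; the node name below is a placeholder, not from my cluster):

    # Show pending jobs and the reason Slurm gives for each
    squeue --states=PD --format="%.10i %.12P %.20j %.8T %R"

    # List nodes that are down or drained, with the recorded reason
    sinfo -R

    # Inspect one compute node's state in detail (placeholder node name)
    scontrol show node queue1-dy-compute-1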

Has anything changed in the configuration after the reboot? If so, what is the best way to stop an instance if I am planning to use it in the future?

asked 10 months ago · 306 views
2 Answers

https://stackoverflow.com/questions/50661730/slurm-how-to-restart-failed-worker-job

You can use --requeue

#SBATCH --requeue                   ### On failure, requeue for another try

--requeue specifies that the batch job should be eligible for requeuing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher-priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the --no-requeue option. The JobRequeue configuration parameter controls the default behavior on the cluster.
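As a minimal sketch of where the directive goes in a batch script (the job name, partition, and workload below are placeholders, not taken from the original question):

    #!/bin/bash
    #SBATCH --job-name=example-job      # placeholder job name
    #SBATCH --partition=queue1          # placeholder partition/queue name
    #SBATCH --requeue                   # on node failure or preemption, requeue for another try

    # Placeholder workload
    srun hostname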

EXPERT
answered 10 months ago
  • Thank you very much for your answer. I did not know about the --requeue option. Unfortunately, it did not fix the problem, as the jobs remain in the pending state once again. I created a new cluster using the same ParallelCluster config file, and it works perfectly fine: all jobs are directed to the specified worker nodes without any issues. Therefore, the problem only arises when I stop the head node instance and subsequently restart it. Perhaps it is not intended to be used like that, and I cannot simply connect and disconnect it whenever I need to utilize it.

  • Maybe it is a bug; you could raise a ticket directly with AWS Support.


Hi @cortezero

Please refer to the official documentation; in particular, be sure to stop the compute fleet before stopping the head node.

[...] you can manually stop and start an instance that doesn't have instance stores. For this case and for other cases of instances without ephemeral volumes, continue to Stop and start a cluster's head node. If your instance has ephemeral drives and it has been stopped, the data in the instance store is lost. You can determine if the instance type used for the head node has instance stores from the table found in Instance store volumes.

Source: https://docs.aws.amazon.com/parallelcluster/latest/ug/instance-updates-ami-patch-v3.html#instance-updates-headnode-v3
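In practice, the stop/start sequence with the ParallelCluster v3 CLI looks roughly like this (a sketch; the cluster name and instance ID are placeholders you would replace with your own):

    # 1. Stop the compute fleet before touching the head node
    pcluster update-compute-fleet --cluster-name mycluster --status STOP_REQUESTED

    # 2. Stop the head node EC2 instance (placeholder instance ID)
    aws ec2 stop-instances --instance-ids i-0123456789abcdef0

    # 3. Later: start the head node again, then restart the compute fleet
    aws ec2 start-instances --instance-ids i-0123456789abcdef0
    pcluster update-compute-fleet --cluster-name mycluster --status START_REQUESTED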

AWS
answered 10 months ago
