Jobs submitted to worker nodes stay in pending after rebooting head node

Hello,

I have set up a small cluster using AWS ParallelCluster, with the head and worker nodes in the same public subnet.

I used it yesterday, and it worked well. I submitted jobs with Slurm to on-demand and spot instances without any issues. When those jobs finished, I stopped the head node instance. After restarting it, every newly submitted job remains in a pending state. When I check their status with squeue, I see reasons like "ReqNodeNotAvail" and "UnavailableNode." I have tried submitting jobs to different on-demand instance types, but the issue persists.

Has anything changed in the configuration after the reboot? If so, what is the best way to stop an instance if I am planning to use it in the future?

Asked 10 months ago · 313 views

2 Answers

https://stackoverflow.com/questions/50661730/slurm-how-to-restart-failed-worker-job

You can use --requeue

#SBATCH --requeue                   ### On failure, requeue for another try

--requeue specifies that the batch job should be eligible for requeuing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher-priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the --no-requeue option. The JobRequeue configuration parameter controls the default behavior on the cluster.
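
For reference, a minimal batch script using this option might look like the sketch below; the job name, resource requests, and workload command are placeholders, not taken from the original question.

#!/bin/bash
#SBATCH --job-name=requeue-demo     # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --requeue                   # on node failure or preemption, put the job back in the queue

# Replace with the actual workload; srun launches it under Slurm's control
srun hostname

You would submit it with sbatch requeue-demo.sh and watch its state with squeue -u $USER.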

Expert
Answered 10 months ago
  • Thank you very much for your answer. I did not know about the --requeue option. Unfortunately, it did not fix the problem; the jobs remain in the pending state. I created a new cluster using the same ParallelCluster config file, and it works perfectly fine: all jobs are directed to the specified worker nodes without any issues. So the problem only arises when I stop the head node instance and later restart it. Perhaps it is not intended to be used like that, and I cannot simply stop and start it whenever I need it.

  • Maybe it is a bug; you could raise a ticket directly with AWS Support.

Hi @cortezero

Please refer to the official documentation; in particular, be sure to stop the compute fleet before stopping the head node.

[...] you can manually stop and start an instance that doesn't have instance stores. For this case, and for other cases of instances without ephemeral volumes, continue to Stop and start a cluster's head node. If your instance has ephemeral drives and it's been stopped, the data in the instance store is lost. You can determine whether the instance type used for the head node has instance stores from the table found in Instance store volumes.

Source: https://docs.aws.amazon.com/parallelcluster/latest/ug/instance-updates-ami-patch-v3.html#instance-updates-headnode-v3
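
As an illustration, assuming a ParallelCluster 3.x CLI and a cluster named my-cluster (the cluster name and instance ID below are placeholders), the stop/start sequence described there looks roughly like this:

# Stop the compute fleet first so Slurm's view of the nodes stays consistent
pcluster update-compute-fleet --cluster-name my-cluster --status STOP_REQUESTED

# Then stop the head node EC2 instance (placeholder instance ID)
aws ec2 stop-instances --instance-ids i-0123456789abcdef0

# Later: start the head node again, then restart the compute fleet
aws ec2 start-instances --instance-ids i-0123456789abcdef0
pcluster update-compute-fleet --cluster-name my-cluster --status START_REQUESTED

Once the fleet reports RUNNING (pcluster describe-compute-fleet --cluster-name my-cluster), newly submitted jobs should be schedulable again.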

AWS
Answered 10 months ago
