https://stackoverflow.com/questions/50661730/slurm-how-to-restart-failed-worker-job
You can use --requeue
#SBATCH --requeue ### On failure, requeue for another try
--requeue Specifies that the batch job should be eligible for requeuing. The job may be requeued explicitly by a system administrator, after node failure, or upon preemption by a higher priority job. When a job is requeued, the batch script is initiated from its beginning. Also see the --no-requeue option. The JobRequeue configuration parameter controls the default behavior on the cluster.
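For context, a minimal sketch of a batch script using this option could look like the following (the job name, partition, and workload are placeholders, not taken from the original question):

#!/bin/bash
#SBATCH --job-name=example-job    # hypothetical job name
#SBATCH --partition=compute       # hypothetical partition/queue name
#SBATCH --requeue                 # on node failure or preemption, requeue the job
#SBATCH --output=%x-%j.out        # stdout goes to <jobname>-<jobid>.out

srun hostname                     # placeholder workload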
Hi @cortezero
please refer to the official documentation; in particular, be sure to stop the compute fleet before stopping the head node.
[...] you can manually stop and start an instance that doesn't have instance stores. For this case and for other cases of instances without ephemeral volumes, continue to Stop and start a cluster's head node. If your instance has ephemeral drives and it's been stopped, the data in the instance store is lost. You can determine whether the instance type used for the head node has instance stores from the table found in Instance store volumes.
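If it helps, the stop/start sequence with the ParallelCluster 3.x CLI would roughly be the following (the cluster name and region are placeholders; on ParallelCluster 2.x the equivalent commands are pcluster stop / pcluster start):

# stop the compute fleet before stopping the head node
pcluster update-compute-fleet --cluster-name my-cluster --status STOP_REQUESTED --region us-east-1
# stop the head node EC2 instance (console, or e.g. aws ec2 stop-instances --instance-ids <head-node-id>)
# after starting the head node again, restart the compute fleet
pcluster update-compute-fleet --cluster-name my-cluster --status START_REQUESTED --region us-east-1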
Thank you very much for your answer. I did not know about the --requeue option. Unfortunately, it did not fix the problem: the jobs remain in the pending state. I created a new cluster using the same ParallelCluster config file and it works perfectly fine; all jobs are dispatched to the specified worker nodes without any issues. So the problem only arises after I stop the head node instance and subsequently restart it. Perhaps it is not intended to be used like that, and I cannot simply stop and start it whenever I need to use it.
Maybe it is a bug; you could raise a ticket directly with AWS Support.