Cluster created with ParallelCluster will not run jobs

UPDATE: I answered this question myself. I re-created the AMI, this time manually (following these docs: https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.update-cluster-v3.html#modify-an-aws-parallelcluster-ami ), and it worked. This is odd, because the documentation cautions against the manual approach, yet it worked better than building the AMI with pcluster.

I can't delete the question, so I'm leaving it here for the record.

I created a Slurm cluster using AWS ParallelCluster (the pcluster tool). Creation works fine and I can SSH to the head node, but when I submit jobs they do not run. Using srun:

$ srun echo hello world
srun: error: Node failure on queue1-dy-t2micro-1
srun: Force Terminated job 1
srun: error: Job allocation 1 has been revoked

Using sbatch:

$ sbatch t.sh
Submitted batch job 2
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    2    queue1     t.sh   ubuntu CF       0:02      1 queue1-dy-t2micro-2

The output above suggests the job is about to start on host queue1-dy-t2micro-2, but that host never comes up, or at least does not stay up. After a short while, I see this:

$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    2    queue1     t.sh   ubuntu PD       0:00      1 (BeginTime)

After that, the job never runs. Does anyone know what is going on? I did use a custom AMI, which I also built with pcluster, but I am not sure that is the issue, because the head node uses the same AMI and comes up just fine.

asked 2 years ago, 702 views
1 Answer
Hi! pcluster build-image should work as well. It would be interesting to understand why it did not, because it seems that the compute nodes were not able to bootstrap correctly.

The cluster's logs are preserved in CloudWatch Logs for 2 weeks, so you could check the cfn-init and cloud-init* logs of one of the failed compute instances (more info in the "Integration with Amazon CloudWatch Logs" doc).
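If it helps, here is a rough Python sketch of how those logs could be pulled with boto3 instead of browsing the console. It is only an illustration under a few assumptions: that the log stream names follow a "<hostname>.<instance-id>.<log-identifier>" pattern, that boto3 is installed with credentials that can read CloudWatch Logs, and that you pass in your own log group name and region. The function names bootstrap_streams and print_bootstrap_logs are made up for this example.

```python
import re

# Assumption: stream names look like "<hostname>.<instance-id>.<log-identifier>",
# e.g. "ip-10-0-0-12.i-0123456789abcdef0.cloud-init-output".
BOOTSTRAP_LOG = re.compile(r"\.(i-[0-9a-f]+)\.(cfn-init|cloud-init[\w-]*)$")


def bootstrap_streams(stream_names):
    """Group cfn-init and cloud-init* log stream names by EC2 instance ID."""
    by_instance = {}
    for name in stream_names:
        match = BOOTSTRAP_LOG.search(name)
        if match:
            by_instance.setdefault(match.group(1), []).append(name)
    return by_instance


def print_bootstrap_logs(log_group, region):
    """Print bootstrap log events for every instance found in the log group."""
    import boto3  # imported lazily so bootstrap_streams works without boto3

    logs = boto3.client("logs", region_name=region)

    # Collect all stream names in the cluster's log group.
    names = []
    paginator = logs.get_paginator("describe_log_streams")
    for page in paginator.paginate(logGroupName=log_group):
        names.extend(stream["logStreamName"] for stream in page["logStreams"])

    for instance_id, streams in sorted(bootstrap_streams(names).items()):
        for stream in streams:
            print(f"===== {instance_id}: {stream} =====")
            # Only the first page of events is read here;
            # follow nextForwardToken if a stream is longer.
            events = logs.get_log_events(
                logGroupName=log_group, logStreamName=stream, startFromHead=True
            )
            for event in events["events"]:
                print(event["message"].rstrip())
```

You would call print_bootstrap_logs with the name of the cluster's log group and its region. The output also includes the head node's bootstrap logs, which can be a useful baseline to compare against a failed compute node.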

The next time you face an issue like this and the cluster is still running, you can investigate the root cause of the bootstrap failure by following the troubleshooting guide and checking the logs of one of the failing instances.

Anyway, glad to know that you've been able to find a working solution.

Enrico

AWS
answered 2 years ago
