Cluster created with ParallelCluster will not run jobs

Question

UPDATE: I answered this question myself. I re-created the AMI, this time manually (following these docs: https://docs.aws.amazon.com/parallelcluster/latest/ug/pcluster.update-cluster-v3.html#modify-an-aws-parallelcluster-ami ), and it worked.
Odd, because the documentation cautions against this, but it worked better than creating the AMI with `pcluster`.

Can't delete the question so here it is for the record.

I created a Slurm cluster using AWS ParallelCluster (the `pcluster` tool). Creation works fine and I can SSH to the head node, but when I submit jobs they do not run. Using `srun`:

```
$ srun echo hello world
srun: error: Node failure on queue1-dy-t2micro-1
srun: Force Terminated job 1
srun: error: Job allocation 1 has been revoked
```

Using `sbatch`:

```
$ sbatch t.sh
Submitted batch job 2
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    2    queue1     t.sh   ubuntu CF       0:02      1 queue1-dy-t2micro-2
```

Above, it looks like the job is going to start on host `queue1-dy-t2micro-2`, but that host never comes up, or at least does not stay up, and after a little while I see this:

```
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    2    queue1     t.sh   ubuntu PD       0:00      1 (BeginTime)
```

After that, the job never runs.
Does anyone know what is going on? I did use a custom AMI, which I also built with `pcluster`, but I am not sure that is the issue, because the head node comes up just fine and it uses the same AMI.

Answer

Hi!
`pcluster build-image` should work as well. It would be interesting to understand why it did not, because it seems that the compute nodes were not able to bootstrap correctly.

The cluster logs are preserved in CloudWatch Logs for 2 weeks, so you could check the `cfn-init` and `cloud-init*` log files of one of the failed compute instances (more info in the [Integration with Amazon CloudWatch Logs](https://docs.aws.amazon.com/parallelcluster/latest/ug/cloudwatch-logs-v3.html) doc).
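
As a rough sketch of what that check could look like from the AWS CLI: the helper below searches a cluster's log group for lines matching a pattern. The function name `search_cluster_logs` is just for illustration, and the log group name is an assumption you would need to look up first (for example with `aws logs describe-log-groups --log-group-name-prefix /aws/parallelcluster/`).

```shell
# Hypothetical helper: search a cluster's CloudWatch log group for matching lines.
#   $1 = log group name (assumed to be looked up beforehand, not hard-coded here)
#   $2 = filter pattern (defaults to ERROR)
search_cluster_logs() {
    aws logs filter-log-events \
        --log-group-name "$1" \
        --filter-pattern "${2:-ERROR}" \
        --query 'events[].[logStreamName,message]' \
        --output text
}
```

Calling `search_cluster_logs <your-log-group>` prints the stream name next to each matching message, which should make it easier to spot which compute instance and which log (`cfn-init`, `cloud-init`, and so on) reported the failure.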

The next time you face issues and still have a running cluster, you can investigate the root cause of the bootstrap failure by following the [troubleshooting guide](https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3.html#troubleshooting-v3-scaling-issues) and checking the logs of one of the failing instances.
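
For a first look from the head node, something along these lines may help. It is only a sketch: the function name `inspect_failed_node` is made up for illustration, and the log paths are the usual ParallelCluster 3 locations, which I am assuming apply to your version.

```shell
# Hypothetical helper, run on the head node.
#   $1 = Slurm node name, e.g. queue1-dy-t2micro-1
inspect_failed_node() {
    # List nodes that are down or drained, with the reason Slurm recorded
    sinfo -R
    # Show the full state of the node in question
    scontrol show node "$1"
    # Recent output from the ParallelCluster daemons that launch and manage
    # compute instances (paths assumed, see the note above)
    sudo tail -n 50 /var/log/parallelcluster/slurm_resume.log
    sudo tail -n 50 /var/log/parallelcluster/clustermgtd
}
```

For example, `inspect_failed_node queue1-dy-t2micro-1` would show why Slurm marked that node as failed, together with the most recent instance launch and health-check messages.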

Anyway, glad to know that you were able to find a working solution.
Enrico
