Unknown slowdown in ParallelCluster
I've noticed that the time it takes to complete the jobs in my task array varies dramatically. The speed of the first jobs seems perfectly acceptable, but then something goes wrong and later jobs slow down considerably. Any idea what could be causing this?
I'm using the Slurm scheduler 20.11.8 and AWS ParallelCluster 3.0.2.
Below are 2 examples showing the variation in time per job. Each plot shows the time (in seconds) each job/task takes, with one dot per job. (I couldn't see how to attach an image, so I'm providing links.)
example 1: 800 jobs [https://ibb.co/KrrwhXn](https://ibb.co/KrrwhXn)
You can see that the first ~400 tasks complete in roughly 400 seconds per job, and then jobs 400 to 750 take about 6000 seconds.
example 2: 300 jobs [https://ibb.co/4RdTpzg](https://ibb.co/4RdTpzg)
You can see that the first 50 jobs run slower than jobs 50-150, and then jobs 150-200 are slowest.
In both cases I'm running 50 nodes at a time. The duration of a job seems to be related to how many jobs its instance has already run: the speed of the tasks often changes considerably at each multiple of 50. When I change the number of nodes running at a time, I still observe this pattern. Each job involves essentially the same amount of "work" (within 5%), so it's *not* the case, for example, that jobs 150-200 in example 2 are "harder" than the other jobs. In fact, the two examples above are the exact same jobs (in example 2 I just ran the first 300 of the 800).
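In case it helps, per-task elapsed times for plots like these can be pulled from Slurm's accounting data. This is just a sketch; `<jobid>` is a placeholder for the array job's ID:

```
# Elapsed time, node, and state for every task in the array job.
# <jobid> is a placeholder; -P produces pipe-separated output that's easy to plot.
sacct -j <jobid> --format=JobID,Elapsed,NodeList,State -P
```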
What I've tried:
1. I've used different instance types, but I observe this slowdown across all of them.
2. I've used different numbers of nodes, but whether I use 20, 40, or 50, I observe this slowdown.
3. I've monitored CPU and memory usage on both the head node and the compute-fleet nodes, and it seems reasonable: when I use `top` to monitor, the highest-usage process is generally using less than 1% of memory and 1% of CPU (see the sketch after this list for how the nodes can be sampled).
4. I've explored these logs on the **head** node, but I haven't found anything that's clearly wrong:
* /var/log/cfn-init.log
* /var/log/chef-client.log
* /var/log/parallelcluster/slurm_resume.log
* /var/log/parallelcluster/slurm_suspend.log
* /var/log/parallelcluster/clustermgtd
* /var/log/slurmctld.log
5. I've explored these logs on the **compute** nodes, but I haven't found anything that's clearly wrong:
* /var/log/cloud-init-output.log
* /var/log/parallelcluster/computemgtd
* /var/log/slurmd.log
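For item 3, this is roughly how usage on the compute nodes can be sampled from the head node. It's just a sketch: the partition name comes from my config, and it assumes the compute nodes accept non-interactive ssh from the head node:

```
#!/bin/sh
# Take a one-shot `top` snapshot on every allocated node in queue1.
for node in $(sinfo -N -h -p queue1 -t alloc -o "%N" | sort -u); do
    echo "=== $node ==="
    ssh "$node" "top -b -n 1 | head -n 15"
done
```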
Here's my configuration file:
```
Region: us-east-1
Image:
  Os: alinux2
HeadNode:
  CustomActions:
    OnNodeConfigured:
      Script: s3://my-bucket/head.sh
  InstanceType: t2.medium
  Networking:
    SubnetId: [snip]
  Ssh:
    KeyName: [snip]
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: queue1
      ComputeResources:
        - Name: t2medium
          InstanceType: t2.medium
          MinCount: 0
          MaxCount: 101
      Networking:
        SubnetIds:
          - subnet-[snip]
      CustomActions:
        OnNodeConfigured:
          Script: s3://my-bucket/node.sh
```
I'm limiting the number of array tasks running at a time to 50 (and therefore the number of nodes in use) with the `%50` throttle in the job script:
```
#!/bin/sh
#SBATCH --partition queue1
#SBATCH --array=1-800%50
#SBATCH --nice=100
```