I am performing load/performance tests on my HPC cluster. I am using AWS ParallelCluster 3.5.1 with Slurm; the head node is a c6i.2xlarge and the queue of interest consists of 1024 c6i.large nodes. I have two EBS volumes mounted: all results are written on the fly to one of them, and the other one holds the installation directory:
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme1n1     50G   19G   29G  40% /install
/dev/nvme2n1    6.0T  4.3T  1.5T  75% /shared
I am queueing 550 identical single-threaded jobs, each of which is expected to write less than 1 MB over the course of its 20-minute run. The jobs are submitted all at once; the nodes start as expected and perform the computations.
Not long after the nodes start, the docker and gunicorn services on the head node go down, followed by my ssh session. Just before the ssh connection closes, I get this error in the console:
Message from syslogd@ip-172-31-30-213 at May ...
kernel:[1645279.869683] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/u16:11:1053789]
The head node becomes unresponsive and stays that way even after the compute fleet finishes the computations and shuts down. Only restarting the head node via the EC2 console gets ssh on the head node back up and running.
Further inspection of the logs shows that kern.log is flooded with messages related to nfsd:
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.741509] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.741946] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.745717] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.746760] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.747181] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.747713] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.749130] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.749582] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.750561] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:43 ip-172-31-30-213 kernel: [1035692.671422] rpc-srv/tcp: nfsd: got error -32 when sending 240 bytes - shutting down socket
May 24 12:08:44 ip-172-31-30-213 kernel: [1035693.868366] rpc-srv/tcp: nfsd: got error -32 when sending 240 bytes - shutting down socket
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.539966] rpc-srv/tcp: nfsd: got error -32 when sending 240 bytes - shutting down socket
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.740959] net_ratelimit: 4056 callbacks suppressed
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.740963] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.741555] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.742267] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.742885] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.743879] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.744676] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.746571] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.747328] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.749657] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.751490] nfsd: too many open connections, consider increasing the number of threads
550 jobs does not seem like a massive load to me, and I am surprised that the head node becomes unstable so easily. Is there any configuration change you would recommend making to my cluster to improve its scalability?
Best,
Michal
Thank you. Indeed, it seems that a c6i.2xlarge head node gets 16 nfsd threads by default, with larger and smaller instance types getting proportionally more or fewer. I quadrupled the number of threads and the problem disappeared. Would it be an idea to add monitoring of nfsd to CloudWatch?
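
As a stopgap until something official exists, here is a minimal sketch of what such monitoring could look like: a small Python script that reads the current nfsd thread count from /proc/fs/nfsd/threads and publishes it as a custom CloudWatch metric via boto3. The namespace and metric name are made up for illustration, and it assumes the head node's instance role allows cloudwatch:PutMetricData. (For reference, the current value can also be checked by hand with cat /proc/fs/nfsd/threads and changed at runtime with rpc.nfsd <N>.)

#!/usr/bin/env python3
# Sketch: publish the head node's nfsd thread count as a custom CloudWatch metric.
# Assumes boto3 is installed and the instance role allows cloudwatch:PutMetricData.
import boto3

def nfsd_thread_count():
    # /proc/fs/nfsd/threads holds the number of nfsd threads currently running
    with open("/proc/fs/nfsd/threads") as f:
        return int(f.read().strip())

def publish(count):
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="Custom/HeadNode",      # hypothetical namespace
        MetricData=[{
            "MetricName": "NfsdThreads",  # hypothetical metric name
            "Value": count,
            "Unit": "Count",
        }],
    )

if __name__ == "__main__":
    publish(nfsd_thread_count())

Running it from cron every minute on the head node would be enough to see the configured thread count in CloudWatch and to alarm if the value ever drops or the script starts failing while the node is under load.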