ParallelCluster: NFS performance and cluster scalability - head node unresponsive


I am performing load/performance tests on my HPC cluster. I am using ParallelCluster 3.5.1 with Slurm; the head node is a c6i.2xlarge and the queue of interest consists of 1024 c6i.large nodes. I have two EBS volumes mounted: all results are written on the fly to one of them, and the other one is the installation directory:

/dev/nvme1n1      50G   19G   29G  40% /install
/dev/nvme2n1     6.0T  4.3T  1.5T  75% /shared
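Both volumes are exported from the head node to the compute nodes over NFS, which is the default for EBS shared storage. For reference, the relevant parts of my cluster configuration look roughly like the sketch below (region, OS, subnet IDs and resource names are placeholders, not my literal config):

Region: eu-west-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: c6i.2xlarge
  Networking:
    SubnetId: subnet-xxxxxxxx
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: c6i-large
          InstanceType: c6i.large
          MinCount: 0
          MaxCount: 1024        # up to 1024 c6i.large compute nodes
      Networking:
        SubnetIds:
          - subnet-xxxxxxxx
SharedStorage:
  - MountDir: /install          # installation directory, ~50 GB
    Name: install
    StorageType: Ebs
    EbsSettings:
      Size: 50
  - MountDir: /shared           # results written here on the fly, ~6 TB
    Name: shared
    StorageType: Ebs
    EbsSettings:
      Size: 6144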

I am queueing 550 identical single-threaded jobs, each of which is expected to write less than 1 MB over the course of about 20 minutes. The jobs are submitted all at once; the nodes start as expected and perform their computations. Not long after the nodes start, the docker and gunicorn services on the head node go down, followed by the ssh session. Just before the ssh connection closes I get this error in the console:

Message from syslogd@ip-172-31-30-213 at May ...
kernel:[1645279.869683] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/u16:11:1053789]

The head node becomes unresponsive and stays that way even after the compute fleet completes its computations and shuts down. Only restarting the head node via the EC2 console gets ssh back up and running. Further inspection of the logs shows that kern.log is flooded with messages related to nfsd.

May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.741509] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.741946] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.745717] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.746760] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.747181] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.747713] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.749130] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.749582] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.750561] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:43 ip-172-31-30-213 kernel: [1035692.671422] rpc-srv/tcp: nfsd: got error -32 when sending 240 bytes - shutting down socket
May 24 12:08:44 ip-172-31-30-213 kernel: [1035693.868366] rpc-srv/tcp: nfsd: got error -32 when sending 240 bytes - shutting down socket
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.539966] rpc-srv/tcp: nfsd: got error -32 when sending 240 bytes - shutting down socket
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.740959] net_ratelimit: 4056 callbacks suppressed
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.740963] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.741555] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.742267] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.742885] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.743879] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.744676] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.746571] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.747328] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.749657] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.751490] nfsd: too many open connections, consider increasing the number of threads

550 jobs doesn't seem like a massive load to me, and I am surprised the head node becomes unstable so easily. Are there any configuration changes you would recommend for my cluster to improve its scalability? Best, Michal

1 Answer
Accepted Answer

The first problem is related to the number of NFS server (nfsd) threads.

Please read this article: https://access.redhat.com/solutions/2216

The number of nfsd threads defaults to 8, but a workload like yours may need more. You can change it by editing /etc/sysconfig/nfs and setting RPCNFSDCOUNT to a higher value, then restarting the NFS server. The soft lockup error you're seeing suggests that a process on your head node was monopolizing a CPU and preventing other tasks from running; this could be a consequence of the overloaded NFS server, so addressing the NFS issue might alleviate it as well.
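As a rough sketch of what that looks like on the head node (64 below is just an example value; pick a count that matches your fleet size, and the service name can differ by distribution):

# Check how many nfsd threads are currently configured
cat /proc/fs/nfsd/threads

# Raise the count for the running NFS server (takes effect immediately)
sudo rpc.nfsd 64

# Make it persistent across reboots: set RPCNFSDCOUNT=64 in /etc/sysconfig/nfs,
# then restart the server (the unit may be called nfs or nfs-server)
sudo systemctl restart nfs-server

Keep in mind that a manual change like this is lost if the head node is replaced, so you may also want to apply it from a custom bootstrap (OnNodeConfigured) script.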

EXPERT
answered a year ago
  • Thank you. Indeed, it seems that a c6i.2xlarge has 16 threads set by default; larger and smaller head nodes get proportionally more or fewer. I quadrupled the number and the problem disappeared. Would it be an idea to add monitoring of nfsd to CloudWatch, maybe something like the sketch below?
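What I had in mind is something as simple as a periodic custom metric, e.g. counting established NFS connections on the head node and pushing the value from cron; the namespace and metric name below are made up, and the head node's instance role would need cloudwatch:PutMetricData permission:

# Count established client connections to the NFS port (2049) and publish the
# value as a custom CloudWatch metric (run e.g. once a minute from cron)
aws cloudwatch put-metric-data \
  --namespace "ParallelCluster/HeadNode" \
  --metric-name NfsEstablishedConnections \
  --value "$(ss -tnH state established '( sport = :2049 )' | wc -l)"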
