ParallelCluster: NFS performance and cluster scalability - head node unresponsive

0

I am performing load/performance tests on my HPC cluster. I am using ParallelCluster 3.5.1 with Slurm; the head node is a c6i.2xlarge and the queue of interest consists of 1024 c6i.large nodes. I have two EBS volumes mounted: all results are written on-the-fly to one of them, and the other is the installation directory:

/dev/nvme1n1      50G   19G   29G  40% /install
/dev/nvme2n1     6.0T  4.3T  1.5T  75% /shared

I am queueing 550 identical single-threaded jobs, each of which is expected to write less than 1 MB over the course of 20 minutes. The jobs are submitted all at once, and the nodes start as expected and perform the computations. Not long after the nodes start, the docker and gunicorn services on the head node go down, followed by the ssh session. Just before the ssh connection closes I get this error in the console:

Message from syslogd@ip-172-31-30-213 at May ...
kernel:[1645279.869683] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [kworker/u16:11:1053789]

The head node remains unresponsive even after the compute fleet completes its computations and shuts down. Only restarting the head node via the EC2 console brings ssh back up on the head node. Further inspection of the logs shows that kern.log is flooded with messages related to nfsd.

May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.741509] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.741946] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.745717] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.746760] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.747181] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.747713] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.749130] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.749582] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:40 ip-172-31-30-213 kernel: [1035689.750561] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:43 ip-172-31-30-213 kernel: [1035692.671422] rpc-srv/tcp: nfsd: got error -32 when sending 240 bytes - shutting down socket
May 24 12:08:44 ip-172-31-30-213 kernel: [1035693.868366] rpc-srv/tcp: nfsd: got error -32 when sending 240 bytes - shutting down socket
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.539966] rpc-srv/tcp: nfsd: got error -32 when sending 240 bytes - shutting down socket
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.740959] net_ratelimit: 4056 callbacks suppressed
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.740963] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.741555] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.742267] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.742885] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.743879] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.744676] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.746571] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.747328] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.749657] nfsd: too many open connections, consider increasing the number of threads
May 24 12:08:45 ip-172-31-30-213 kernel: [1035694.751490] nfsd: too many open connections, consider increasing the number of threads

550 jobs does not seem like a massive load to me, and I am surprised the head node becomes unstable so easily. Is there any configuration change you would recommend I make to my cluster to improve its scalability? Best, Michal

Asked a year ago · 386 views
1 Answer
1
Accepted Answer

The first problem is related to the number of NFS server threads.

https://access.redhat.com/solutions/2216 (Please read this article)

The number of nfsd threads defaults to 8, but you may need more for your workload. This can be changed by editing the /etc/sysconfig/nfs file and setting RPCNFSDCOUNT to a higher value. The soft lockup error you're seeing suggests that a process on your head node was monopolizing the CPU and preventing other tasks from running. This could be related to the NFS server being overloaded, so addressing the NFS issue might alleviate it as well.
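As a rough sketch of what that could look like on the head node (this assumes an Amazon Linux 2 / RHEL-style setup where /etc/sysconfig/nfs is honored by the nfs-server unit; the file location, the default, and a sensible thread count may differ in your environment):

# Sketch only: raise the nfsd thread count on the head node.
THREADS=64   # pick a value appropriate for your compute fleet size

# Persist the setting (replace the line if present, append it otherwise)
if grep -q '^RPCNFSDCOUNT=' /etc/sysconfig/nfs; then
    sudo sed -i "s/^RPCNFSDCOUNT=.*/RPCNFSDCOUNT=${THREADS}/" /etc/sysconfig/nfs
else
    echo "RPCNFSDCOUNT=${THREADS}" | sudo tee -a /etc/sysconfig/nfs
fi

# Apply immediately without disturbing existing client mounts
sudo rpc.nfsd ${THREADS}

# Confirm the running thread count
cat /proc/fs/nfsd/threads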

Expert · Answered a year ago
  • Thank you. Indeed, it seems that the c6i.2xlarge has 16 threads set by default; larger and smaller nodes have proportionally more or fewer. I quadrupled the number and the problem disappeared. Would it be an idea to add monitoring of nfsd to CloudWatch?
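One way to do that, as a minimal sketch (the namespace and metric names below are made up; it assumes the AWS CLI is installed on the head node, a default region is configured, and the instance role allows cloudwatch:PutMetricData), is to publish the nfsd thread count as a custom metric from cron:

#!/bin/bash
# Sketch: publish the current nfsd thread count to CloudWatch.
# Run from cron on the head node, e.g. once per minute.
THREADS=$(cat /proc/fs/nfsd/threads)

aws cloudwatch put-metric-data \
    --namespace "ParallelCluster/HeadNode" \
    --metric-name "NfsdThreads" \
    --unit Count \
    --value "${THREADS}"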
