EC2 Instance Reachability Check Failure with nfsd4 kernel panic

I have an EC2 g4dn.xlarge instance (head node) running Ubuntu 20.04, deployed via AWS ParallelCluster with NFS. Every now and then a spike in write operations is recorded and the instance becomes unreachable. The kernel version is 5.15.0-1057-aws.

/var/log/syslog shows the following:

 slurmctld[954]: slurmctld: Warning: Note very large processing time from _slurmctld_background: usec=4317161 began=13:41:39.841
 kernel: [15347.337359] INFO: task kworker/u8:2:30243 blocked for more than 120 seconds.
 kernel: [15347.338702]       Tainted: G           OE     5.15.0-1057-aws #63~20.04.1-Ubuntu
 kernel: [15347.340079] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 kernel: [15347.341585] task:kworker/u8:2    state:D stack:    0 pid:30243 ppid:     2 flags:0x00004000
 kernel: [15347.341592] Workqueue: nfsd4 laundromat_main [nfsd]
 kernel: [15347.341629] Call Trace:
 kernel: [15347.341631]  <TASK>
 kernel: [15347.341633]  __schedule+0x2cd/0x890
 kernel: [15347.341639]  ? __smp_call_single_queue+0x59/0x90
 kernel: [15347.341644]  ? usleep_range_state+0x90/0x90
 kernel: [15347.341646]  schedule+0x69/0x110
 kernel: [15347.341650]  schedule_timeout+0x208/0x2d0
 kernel: [15347.341652]  ? try_to_wake_up+0x240/0x600
 kernel: [15347.341656]  ? fprop_fraction_percpu+0x34/0x80
 kernel: [15347.341661]  ? usleep_range_state+0x90/0x90
 kernel: [15347.341663]  __wait_for_common+0xb2/0x160
 kernel: [15347.341667]  wait_for_completion+0x24/0x30
 kernel: [15347.341671]  call_usermodehelper_exec+0x14c/0x180
 kernel: [15347.341675]  call_usermodehelper+0x93/0xc0
 kernel: [15347.341679]  nfsd4_umh_cltrack_upcall+0x8b/0x100 [nfsd]
 kernel: [15347.341707]  nfsd4_umh_cltrack_remove+0xba/0xf0 [nfsd]
 kernel: [15347.341733]  nfsd4_client_record_remove+0x44/0x50 [nfsd]
 kernel: [15347.341759]  expire_client+0x1b/0x30 [nfsd]
 kernel: [15347.341783]  nfs4_laundromat+0x223/0x720 [nfsd]
 kernel: [15347.341809]  laundromat_main+0x1a/0x40 [nfsd]
 kernel: [15347.341833]  process_one_work+0x22b/0x3d0
 kernel: [15347.341837]  worker_thread+0x4d/0x3f0
 kernel: [15347.341840]  ? process_one_work+0x3d0/0x3d0
 kernel: [15347.341843]  kthread+0x12a/0x150
 kernel: [15347.341847]  ? set_kthread_struct+0x50/0x50
 kernel: [15347.341850]  ret_from_fork+0x22/0x30
 kernel: [15347.341856]  </TASK>
asked 2 years ago · 638 views
2 Answers
I am not sure what you are looking for from the AWS community here. You are running an open-source OS and NFS.

First, check CPU and memory usage on the instance at the time of the failure and look for the cause. Beyond that, I would suggest taking an EBS snapshot (for possible recovery), then applying OS and software updates (sudo apt update && sudo apt dist-upgrade).
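To line up CPU/memory metrics with the failure, it helps to know exactly when each hung-task report was logged. Below is a minimal Python sketch that extracts those reports from syslog lines, assuming they follow the format of the excerpt in the question. The names `HungTask` and `find_hung_tasks` are made up for this illustration, and the regular expressions are only as general as that one sample.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class HungTask(NamedTuple):
    """One 'blocked for more than N seconds' report (hypothetical helper type)."""
    uptime: float            # kernel timestamp in seconds since boot
    task: str                # task name, e.g. "kworker/u8:2"
    pid: int
    seconds: int             # hung-task threshold reported by the kernel
    workqueue: Optional[str]  # "Workqueue:" line that follows, if any


# Patterns modeled on the syslog excerpt in the question (an assumption,
# not a complete grammar for kernel log messages).
BLOCKED_RE = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] INFO: task (?P<task>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<seconds>\d+) seconds\."
)
WORKQUEUE_RE = re.compile(r"kernel: \[\s*\d+\.\d+\] Workqueue: (?P<wq>.+)$")


def find_hung_tasks(lines: Iterable[str]) -> List[HungTask]:
    """Collect hung-task reports and attach the workqueue named after each one."""
    reports: List[HungTask] = []
    for raw in lines:
        line = raw.rstrip("\n")
        blocked = BLOCKED_RE.search(line)
        if blocked:
            reports.append(
                HungTask(
                    uptime=float(blocked["uptime"]),
                    task=blocked["task"],
                    pid=int(blocked["pid"]),
                    seconds=int(blocked["seconds"]),
                    workqueue=None,
                )
            )
            continue
        wq = WORKQUEUE_RE.search(line)
        # Attach the first Workqueue line seen after the most recent report.
        if wq and reports and reports[-1].workqueue is None:
            reports[-1] = reports[-1]._replace(workqueue=wq["wq"])
    return reports
```

To run it against the real log, open the file and pass the handle, for example `find_hung_tasks(open("/var/log/syslog", errors="replace"))`. For the excerpt in the question it should report task `kworker/u8:2`, pid 30243, on the `nfsd4 laundromat_main [nfsd]` workqueue.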

Next option would be to seek support from Canonical (makers of Ubuntu).

Hope this helps!

AWS
EXPERT
answered 2 years ago
If you recently upgraded the kernel and are experiencing issues, you can try reverting to a stable kernel version using the "rescue instance" method. Before doing so, take a snapshot of your instance as a backup. The steps are described in this AWS Knowledge Center article:

https://repost.aws/knowledge-center/revert-stable-kernel-ec2-reboot

answered 2 years ago