Can't get FSx for Lustre client working on Rocky Linux 8.7

Hello,

I'm trying to get the FSx for Lustre client working on Rocky Linux 8.7. I've read this page:

https://docs.aws.amazon.com/fsx/latest/LustreGuide/install-lustre-client.html

...where it also states that my kernel 4.18.0-425 should be compatible. So I followed the guide under the section:

CentOS, Rocky Linux, and Red Hat / To install the Lustre client on CentOS and Red Hat 8.2 and newer or on Rocky Linux 8.4 and newer

I can get through all the steps successfully, including actually installing the client, but when I try to mount the FSx volume via systemd like this:

sudo systemctl enable mnt-fsx.mount
sudo systemctl start mnt-fsx.mount

It results in a soft lockup:

Message from syslogd@flame01 at Jul 17 07:15:46 ...
kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [pmdalinux:27713]

Then the instance becomes unresponsive, and I can't connect to it even after a forced reboot.

My mount unit file looks like this:

#!/bin/bash

[Unit]
Description=Mount FSx file system
Requires=network-online.target
After=network-online.target

[Mount]
What=fs-xxxxxxxxxxxx.fsx.<region>.amazonaws.com:/mountname
Where=/mnt/fsx
Type=lustre
Options=defaults,_netdev

[Install]
WantedBy=multi-user.target

(Yep, confirmed that DNSName and LustreMountName are correct in the CloudFormation JSON before creating the stack.)

I can get all of this working by following the equivalent steps on my Ubuntu instance, so I suspect I'm doing something wrong because I'm unfamiliar with Rocky. Appreciate any guidance here from the community!

Thanks very much, Ean

asked 9 months ago, 554 views
3 Answers
Accepted Answer

OK, with AWS support's help, it turned out I had to update the kernel. After that, everything worked.
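
A minimal sketch of how the kernel state could be checked before and after such an update, assuming an RPM-based system such as Rocky Linux 8. The helper name `newest_version` is hypothetical, and the update command in the closing comment is an assumption, since the answer does not say which command or kernel version was used.

```shell
#!/bin/sh
# Hypothetical helper: print the highest version string read from stdin,
# using GNU sort's version ordering.
newest_version() {
    sort -V | tail -n 1
}

# Compare the running kernel with the newest installed kernel package.
# The rpm query is skipped on systems without rpm or without a kernel package.
if rpm -q kernel >/dev/null 2>&1; then
    installed="$(rpm -q kernel --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' | newest_version)"
    echo "running kernel:   $(uname -r)"
    echo "newest installed: $installed"
fi

# Assumed update step on Rocky/RHEL 8 (not run here):
#   sudo dnf -y update kernel && sudo reboot
```

If the two lines differ after an update, the instance has not yet been rebooted into the new kernel.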

answered 7 months ago

You're running Rocky Linux 8.7 - according to https://docs.aws.amazon.com/fsx/latest/LustreGuide/prerequisites.html FSx for Lustre is only supported up to 8.6 (whether it's RHEL, CentOS or Rocky):

CentOS and Red Hat Enterprise Linux 7.5 through 7.9 and 8.2 through 8.6, Rocky Linux 8.4 through 8.6

Could you try it with an older version of the AMI and see if it makes a difference?
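
A minimal sketch of checking an instance against the range quoted above, assuming `/etc/os-release` is present. The helper name `version_in_range` is hypothetical, and the bounds 8.4 and 8.6 are taken only from the prerequisites quote in this answer, which the asker notes conflicts with the client installation page.

```shell
#!/bin/sh
# Hypothetical helper: succeed if version $1 lies within [$2, $3],
# compared with GNU sort's version ordering.
version_in_range() {
    [ "$(printf '%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ] &&
    [ "$(printf '%s\n' "$1" "$3" | sort -V | tail -n 1)" = "$3" ]
}

# Read the distribution version from os-release, where it exists.
if [ -r /etc/os-release ]; then
    . /etc/os-release
    # 8.4 through 8.6 is the Rocky Linux range from the quoted prerequisites page.
    if version_in_range "${VERSION_ID:-0}" 8.4 8.6; then
        echo "${ID:-unknown} ${VERSION_ID:-?}: inside the quoted range"
    else
        echo "${ID:-unknown} ${VERSION_ID:-?}: outside the quoted range"
    fi
fi
```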

EXPERT
Steve_M
answered 9 months ago
  • Hi RWC, oh, nice spot. That conflicts with the compatibility matrix in the first link I sent which says it should be compatible with 8.7. Hmm. Ok, will take some work to re-do my environment with 8.6 but will give it a go. Thanks for the tip.

  • No worries. And if you get the same outcome on Rocky 8.6 (and my hunch is that you could well do) then give it a try on RHEL 8.6 - as Rocky claims to be 100% bug-for-bug compatible with RHEL you should get the same outcome, whether good or bad.

    If you get the same problem with RHEL, the fact you're paying for it (even just a few cents) gives you the option of logging a support call through AWS Premium Support, who in turn will engage Red Hat on your behalf (see https://aws.amazon.com/partners/redhat/faqs/#Support). Between the two of them you should get a resolution.

  • Hey Steve,

    My contact at AWS says 8.7 is compatible. It's working for him, but still not for me. I took your suggestion and tried a RHEL 8.7 AMI and that's also not working, but in a slightly different way:

    [ec2-user@ip-172-31-42-103 ~]$ sudo mount -t lustre -o noatime,flock fs-00b4d8fbf28ff3fa7.fsx.ap-southeast-1.amazonaws.com@tcp:/b3kplbmv /mnt/fsx
    mount.lustre: mount fs-00b4d8fbf28ff3fa7.fsx.ap-southeast-1.amazonaws.com@tcp:/b3kplbmv at /mnt/fsx failed: No such device
    Are the lustre modules loaded?
    Check /etc/modprobe.conf and /proc/filesystems

    Then I tried:

    lsmod | grep lustre

    That returns nothing. Then I tried:

    [ec2-user@ip-172-31-42-103 ~]$ sudo modprobe lustre
    modprobe: ERROR: could not insert 'lustre': Device or resource busy

    And now I'm a bit out of my depth.

    I couldn't see how to lodge a support ticket with AWS at that link even though it does seem like support should be included with RHEL. Hmm.
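
A minimal sketch of what could be inspected when `modprobe lustre` fails like this. It assumes the standard `lsmod` and `modinfo` tools, and the helper name `vermagic_kernel` is hypothetical. A difference between the module's build kernel and the running kernel is only a hint: on RHEL-family systems a kmod package may still be loadable on a newer kernel.

```shell
#!/bin/sh
# Hypothetical helper: the first field of a vermagic string is the kernel
# release the module was built for.
vermagic_kernel() {
    printf '%s\n' "$1" | awk '{ print $1 }'
}

# Is any Lustre-related module already loaded?
lsmod 2>/dev/null | grep -E '^(lustre|lnet|ksocklnd|ptlrpc|obdclass)' ||
    echo "no Lustre modules loaded"

# Which kernel was the installed module built for, if it is installed at all?
if vm="$(modinfo -F vermagic lustre 2>/dev/null)"; then
    echo "module built for: $(vermagic_kernel "$vm")"
    echo "running kernel:   $(uname -r)"
fi

# The kernel log usually says why an insert failed (not run here):
#   sudo dmesg | tail -n 30
```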

Hi,

A 'soft lockup' is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds without giving other tasks a chance to run. The watchdog daemon will send a non-maskable interrupt (NMI) to all CPUs in the system, which in turn print the stack traces of their currently running tasks.
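
A minimal sketch of where the 20-second figure comes from, assuming the stock Linux lockup detector, in which the soft lockup threshold is twice the `kernel.watchdog_thresh` sysctl (10 by default). The helper name `softlockup_threshold` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: soft lockup threshold in seconds for a given
# kernel.watchdog_thresh value (the threshold is twice that value).
softlockup_threshold() {
    echo $(( 2 * $1 ))
}

# Read the live setting where the sysctl is exposed.
if [ -r /proc/sys/kernel/watchdog_thresh ]; then
    thresh="$(cat /proc/sys/kernel/watchdog_thresh)"
    echo "soft lockup is reported after $(softlockup_threshold "$thresh")s"
fi
```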

Given the end of your error message, the lockup seems to be reported for a PMDA process (pmdalinux). See https://manpages.ubuntu.com/manpages/focal/man1/pmdaproc.1.html

You may want to try to stop this process. But there is no guarantee: PMDA may just be the detector of the lockup, not its cause.
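
A minimal sketch of pulling the process name and PID out of the message quoted in the question. The helper name `lockup_process` is hypothetical. The `systemctl` line in the closing comment assumes pmdalinux is run by the Performance Co-Pilot `pmcd` service, which the thread does not confirm for this instance.

```shell
#!/bin/sh
# Hypothetical helper: print "<process> <pid>" from a soft lockup message.
lockup_process() {
    printf '%s\n' "$1" |
        sed -nE 's/.*soft lockup - CPU#[0-9]+ stuck for [0-9]+s! \[([^]:]+):([0-9]+)\].*/\1 \2/p'
}

msg='kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [pmdalinux:27713]'
lockup_process "$msg"   # prints: pmdalinux 27713

# Assumed way to stop the PCP services that run pmdalinux (not run here):
#   sudo systemctl stop pmcd pmlogger
```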

You may also want to open an AWS Support ticket.

Best,

Didier

AWS
EXPERT
answered 9 months ago
  • Hi Didier, thanks for that explanation. I indeed learned the same about "soft lockups" googling for an answer. Reasonably certain it's caused by the FSx client install. Alas, I can't open an AWS Support ticket since I'm not on a support plan.

  • Hi RWC, ok. Maybe the Rocky Linux community can help?

  • Have posted over on the Rocky support forum. Thanks for the idea Didier
