AWS EC2 Instance SAP Hana XFS File System Corruption

Our AWS EC2 instance running SAP HANA is currently inaccessible. Analysis of the system logs shows multiple critical errors during the boot process, primarily file system corruption and service startup failures.

Identified Issues

  1. XFS File System Corruption
  • Description: The logs show a Metadata CRC error and an I/O error in the XFS file system.

  • Impact: Such errors typically prevent the operating system from accessing vital file system data, leading to boot failures or system instability.

  2. Failed Service Startups
  • Failed Services: Critical failures in starting services like 'Setup Virtual Console'.

  • Dependency Failures: Issues in 'dracut ask for additional cmdline parameters' and 'Reload Configuration from the Real Root'.

  • Impact: Failure in starting these services impedes the initialization process, leading to an inability to access the instance.

  3. Mounting Failure of /sysroot
  • Description: The system failed to mount /sysroot, a crucial step in the boot process.

  • Impact: This failure is a critical blocker for the boot process, rendering the system unusable.

  4. Correctable Errors Collector Initialization
  • Description: RAS Correctable Errors collector was initialized.

  • Potential Indication: This could indicate underlying stability issues, potentially at the hardware level.

  5. Keylock Active Warning
  • Description: A warning regarding keylock being active.

  • Relevance: While not directly related to the access issue, it indicates potential configuration or input device issues.

OS: NAME="SLES"

VERSION="15-SP2"

VERSION_ID="15.2"

PRETTY_NAME="SUSE Linux Enterprise Server 15 SP2"

ID="sles"

ID_LIKE="suse"

ANSI_COLOR="0;32"

CPE_NAME="cpe:/o:suse:sles:15:sp2"

Linux 5.3.18-22-default #1 SMP Wed Jun 3 12:16:43 UTC 2020 (720aeba)

These are the AWS logs:

https://drive.google.com/file/d/1GSyzCNAKh4SPJ27hi7vVcFmKm_YWyyJf/view?usp=sharing

Any help would be appreciated.

asked 5 months ago, 331 views
2 Answers
The logs in your Google Drive link show your problem. As you say, it's because /sysroot won't mount:

[    4.259487] SGI XFS with ACLs, security attributes, no debug enabled
[    4.269949] XFS (nvme0n1p3): Mounting V5 Filesystem
[    4.325759] XFS (nvme0n1p3): Starting recovery (logdev: internal)
[    4.346613] XFS (nvme0n1p3): Metadata CRC error detected at xfs_agf_read_verify+0xc7/0xf0 [xfs], xfs_agf block 0x13f47e1 
[    4.349599] XFS (nvme0n1p3): Unmount and run xfs_repair
[    4.351075] XFS (nvme0n1p3): First 128 bytes of corrupted metadata buffer:
[    4.352982] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    4.355218] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    4.357446] 00000020: 00 00 00 00 00 00 00 00 50 7e ed 07 81 7f 00 00  ........P~......
[    4.359677] 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    4.361880] 00000040: 00 00 00 00 00 00 00 00 28 2a 01 28 77 7f 00 00  ........(*.(w...
[    4.364117] 00000050: 88 be f1 6b 7c 7f 00 00 00 00 00 00 00 04 20 06  ...k|......... .
[    4.366373] 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    4.368600] 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[    4.370862] XFS (nvme0n1p3): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x13f47e1 len 1 error 74
[FAILED] Failed to mount /sysroot.
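
If you have several saved console logs to go through, a small filter can pull out just the XFS lines that matter. This is a minimal sketch, assuming the log has been saved to a local text file; the function names `xfs_errors` and `xfs_error_devices` are made up for illustration, and the patterns only cover the message types seen in this thread.

```shell
# Print the XFS error lines from a saved console log.
# The patterns match only the messages quoted in this thread (an assumption,
# not a complete list of XFS error strings).
xfs_errors() {
    grep -E 'XFS \([^)]+\): (Metadata (CRC error|corruption) detected|metadata I/O error|Unmount and run xfs_repair)' "$1"
}

# Print the unique device names those error lines refer to.
xfs_error_devices() {
    xfs_errors "$1" | sed -E 's/.*XFS \(([^)]+)\):.*/\1/' | sort -u
}
```

For example, `xfs_error_devices console.log` (file name is a placeholder) would list each affected partition once.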

When did this happen, and do you have any backups or snapshots from before that? If you haven't got a backup then take one now before trying anything else, just in case it makes a bad situation worse.

The above doesn't look good, and there may be nothing you can salvage from this. If this were a physical server or a VM on a hypervisor like VMware, you could get in on the console and start troubleshooting by booting from a CD, USB stick or ISO image. As you're on EC2, you'll have to skip straight to method 3 in https://repost.aws/knowledge-center/ec2-linux-emergency-mode (go as far as step 7). That method involves stopping your problem instance and detaching its root volume, then creating a new instance and attaching the problem root volume as the second disk.
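
The snapshot and the volume move described above can be sketched with the AWS CLI roughly as follows. This is not the knowledge-center procedure verbatim. The function name `move_root_volume`, the `run` helper, the snapshot description and the `/dev/sdf` device name are all assumptions for illustration, and the rescue instance must be in the same Availability Zone as the volume. By default the sketch only prints the commands; it executes them only when `DRY_RUN=0`.

```shell
# Print each command; execute it only when DRY_RUN=0 (default is print-only).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Usage: move_root_volume <broken-instance-id> <root-volume-id> <rescue-instance-id>
# All three IDs are supplied by you; nothing here is looked up automatically.
move_root_volume() {
    broken_instance="$1"
    volume="$2"
    rescue_instance="$3"

    # Take a snapshot first, so a failed repair cannot make things worse.
    run aws ec2 create-snapshot --volume-id "$volume" --description "pre-repair snapshot" &&
    run aws ec2 stop-instances --instance-ids "$broken_instance" &&
    run aws ec2 wait instance-stopped --instance-ids "$broken_instance" &&
    run aws ec2 detach-volume --volume-id "$volume" &&
    run aws ec2 wait volume-available --volume-ids "$volume" &&
    # Attach as a secondary disk; /dev/sdf is an assumed free device name.
    run aws ec2 attach-volume --volume-id "$volume" --instance-id "$rescue_instance" --device /dev/sdf
}
```

Run it once in the default print-only mode to check the IDs before setting `DRY_RUN=0`.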

Boot this rescue instance and then run xfs_repair against the slice with the corrupted filesystem (don't try to mount the filesystem before running it, not that you will be able to anyway).
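
On the rescue instance, the repair step might look something like the sketch below. The device path is an assumption: on a Nitro instance the attached volume usually appears as a second NVMe disk, so check the `lsblk` output rather than trusting the example name. `xfs_repair -n` only reports problems without changing anything, which makes it a reasonable first pass. The `rescue_repair` and `run` names are made up for illustration, and nothing is executed unless `DRY_RUN=0`.

```shell
# Print each command; execute it only when DRY_RUN=0 (default is print-only).
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Usage (as root): rescue_repair /dev/nvme1n1p3
# The partition name is hypothetical; confirm it with lsblk first.
rescue_repair() {
    dev="$1"
    run lsblk -f              # identify the attached volume and its partitions
    run xfs_repair -n "$dev"  # no-modify mode: report problems only
    run xfs_repair "$dev"     # actual repair; the filesystem must not be mounted
}
```

If xfs_repair refuses to run because the log is dirty, its man page describes the `-L` option, which discards the log and can lose recent changes. Treat that as a last resort and only use it once you have a snapshot.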

There's no guarantee that xfs_repair will work. See this doc from Red Hat (I know you're running SUSE, but XFS is still XFS), which shows a very similar set of error messages to yours that xfs_repair was unable to fix. The full document needs an account, but you can see enough in the free view to get the gist of it: https://access.redhat.com/solutions/5950661

A google search of "xfs_trans_read_buf_map" "xfs_agf_read_verify" (exactly that, including the quotes) will tell you a lot more about this.

Steve_M
answered 5 months ago
  • We have tried the following fixes:
      • Unmounting volumes and running the xfs_repair utility
      • Replacing corrupted instances with a snapshot or backup of the volume
    Although we can apply these fixes with ease, we would still like a permanent solution to our problem. This issue has occurred multiple times in the past, and we have resolved it each time with these two methods (running xfs_repair and restoring the instance from an older backup or snapshot).

  • Thanks for the update, I had assumed your system was dead in the water.

    Short of making sure that the kernel and other key patches are kept up to date, I can't offer much more. This is a really low-level technical issue. If you have access to SUSE premium support, I would suggest raising a call with them.

    According to the SUSE on AWS FAQ https://links.imagerelay.com/cdn/3404/ql/d26fccf0b5234c3d8595f46af2f70703/suse-public-cloud-faq-for-amazon-web-services.pdf (linked from https://www.suse.com/partners/alliance/aws/ )

    14. How does support work for PAYG Instances of SUSE Linux Enterprise Server products on Amazon EC2?

    SUSE Linux Enterprise Server on Amazon EC2 is covered under Premium Support. Premium Support customers that contact AWS for help will work directly with AWS to resolve issues that are related to SUSE Linux Enterprise Server. Amazon and SUSE engineering teams will work together to resolve any SUSE issue that requires escalation. See the SUSE Technical Support Handbook for more info. This applied to all launch instance types, including Spot, Reserved, and On-Demand.

We have formally submitted our case to AWS regarding "AWS EC2 Instance SAP Hana XFS File System Corruption". Below is their response for your review:

I understand that you have a SUSE Linux Enterprise Server 15 SP2 instance running SAP HANA v2 and that you are encountering a Metadata corruption failure. You have mentioned this issue has occurred multiple times in the past and you have resolved it each time with the 2 methods (running xfs_repair and restoring the instance to an older backup / snapshot). You are seeking a permanent solution to the issue. Please correct me if I am mistaken.

Thank you for providing the instance ID (i-0223e819f2c1c5474). I have taken a look at the instance and can see it is currently passing both status checks and has been operational since 2023-11-08 02:43 UTC. I have viewed the underlying hardware for the instance and volumes for the past 90 days. There have been no faults or issues with the hardware. The issue is occurring in your environment.

Thank you for providing the '11-08-23 HANA Server log.txt' file. I have reviewed the logs and identified the following error messages related to the issue:

[ 5.697536] XFS (nvme0n1p3): Metadata corruption detected at xfs_agi_verify+0x3a/0x160 [xfs], xfs_agi block 0x1deebd2
[ 5.700389] XFS (nvme0n1p3): Unmount and run xfs_repair

As you mentioned, you have already run the appropriate steps in regards to the above error message to recover and repair the volume to restore the instance to operation.

I am unable to identify the root cause of why the Metadata corruption is occurring due to limited knowledge of your environment. AWS Premium Support has no visibility into the resources provisioned on your account, or the data stored on those resources. This is due to AWS’s strict data privacy and security policies that ensure that only customers have access to their data. Further information on this and the AWS Shared Model of Responsibility can be found here [1] [2].

I will suggest the following solutions you can attempt in order to have a permanent solution to your issue. It is highly recommended to make fresh backups / snapshots of your systems before performing any of the suggested solutions. It is also recommended to test any of these solutions in a test environment before moving to production.

  1. Update the SUSE Linux 15 OS to the latest Service Pack. Your instance is currently running on the SLES 15 SP2 operating system. The lifecycle for this Service Pack states that general support for this OS ended on December 31st 2021 [3]. It is a general rule to keep your systems as up to date as possible. You can attempt to update your OS to the latest Service Pack and run the server. If the issue persists, it will help us narrow down the cause.

Please note: SLES 15 SP2 cannot be directly updated to SP5. Please review the following guide [4] to determine the correct upgrade path required. SP2 will need to be updated first to SP3, then SP4 and then finally SP5 in consecutive order. Skipping service packs is not recommended. At the minimum you must upgrade to SP4 before attempting to upgrade to SP5.

  2. Ensure your current SUSE Linux OS is up to date.

If you do not want to update from SP2 to SP5, please ensure the SP2 OS is up to date and running on the latest available kernel. The latest kernel for SLES 15 SP2 is 5.3.18-150200.24.169.1 (released 06-Nov-2023).

You can run the following CLI command to view your current kernel:

    uname -r

To update your SUSE Linux kernel, you can run the following CLI command:

    sudo zypper patch

This will update to the latest available version. For more information on the Zypper package manager, you can consult the following guide [5].
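
To check whether the running kernel is behind the version quoted above, a version-aware comparison along these lines may help. This is a sketch under assumptions: the function name `kernel_is_older` is made up, it relies on GNU `sort -V`, and it assumes the `uname -r` string (for example `5.3.18-22-default`) can be compared with the package version once the `-default` flavor suffix is stripped.

```shell
# Usage: kernel_is_older <running-kernel> <target-kernel>
# Succeeds (exit 0) when the running kernel sorts before the target.
kernel_is_older() {
    current="${1%-default}"   # drop the flavor suffix reported by uname -r
    target="${2%-default}"
    [ "$current" = "$target" ] && return 1
    # sort -V orders version strings numerically, field by field (GNU coreutils).
    oldest=$(printf '%s\n%s\n' "$current" "$target" | sort -V | head -n 1)
    [ "$oldest" = "$current" ]
}
```

For example, `kernel_is_older "$(uname -r)" 5.3.18-150200.24.169.1 && echo "kernel update pending"` prints the message on the kernel from the original question.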

  3. Update the SAP System Kernel. Ensure your SAP HANA Service Pack 5 is fully updated. You can review the following official SAP documentation on updating the System Kernel [6].

I have found relevant documentation regarding XFS. While this page deals with SLES 12, the core principles still apply. Please review the following documentation [7] regarding XFS Metadata corruption errors for possible solutions.

You can also contact the SUSE team directly at the following webpage [8] for support regarding your SUSE Linux OS and services.

If you have any further questions, please let me know via a reply to this case in the support center and I will be happy to help.

==========References==========

[1] Data Privacy FAQ - https://aws.amazon.com/compliance/data-privacy-faq/

[2] AWS Shared model of responsibility - https://aws.amazon.com/compliance/shared-responsibility-model/

[3] SUSE Linux Enterprise Server 15 Lifecycle - https://www.suse.com/lifecycle/#suse-linux-enterprise-server-15

[4] SLES Upgrade Path - https://documentation.suse.com/sles/15-SP5/html/SLES-all/cha-upgrade-paths.html

[5] SUSE Zypper package manager - https://documentation.suse.com/smart/systems-management/html/concept-zypper/index.html

[6] Update the SAP System kernel - https://help.sap.com/docs/SAP_LANDSCAPE_MANAGEMENT_ENTERPRISE/e7dead4286c545808b3bd24feee7448c/1741e69d60804d5ab85a10e307d70735.html

[7] XFS metadata corruption and invalid checksum on SAP Hana servers - https://www.suse.com/support/kb/doc/?id=000019192

[8] SUSE Direct Support - https://www.suse.com/contact/

answered 5 months ago
