The logs in your Google Drive link show your problem. As you say, it's because /sysroot won't mount:
[ 4.259487] SGI XFS with ACLs, security attributes, no debug enabled
[ 4.269949] XFS (nvme0n1p3): Mounting V5 Filesystem
[ 4.325759] XFS (nvme0n1p3): Starting recovery (logdev: internal)
[ 4.346613] XFS (nvme0n1p3): Metadata CRC error detected at xfs_agf_read_verify+0xc7/0xf0 [xfs], xfs_agf block 0x13f47e1
[ 4.349599] XFS (nvme0n1p3): Unmount and run xfs_repair
[ 4.351075] XFS (nvme0n1p3): First 128 bytes of corrupted metadata buffer:
[ 4.352982] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 4.355218] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 4.357446] 00000020: 00 00 00 00 00 00 00 00 50 7e ed 07 81 7f 00 00 ........P~......
[ 4.359677] 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 4.361880] 00000040: 00 00 00 00 00 00 00 00 28 2a 01 28 77 7f 00 00 ........(*.(w...
[ 4.364117] 00000050: 88 be f1 6b 7c 7f 00 00 00 00 00 00 00 04 20 06 ...k|......... .
[ 4.366373] 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 4.368600] 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 4.370862] XFS (nvme0n1p3): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x13f47e1 len 1 error 74
[FAILED] Failed to mount /sysroot.
When did this happen, and do you have any backups or snapshots from before that? If you haven't got a backup then take one now before trying anything else, just in case it makes a bad situation worse.
The above doesn't look good, and there may be nothing you can salvage from this. If this were a physical server, or a VM on a hypervisor like VMware, you could get on the console and start troubleshooting by booting off a CD, USB or ISO image. As you're on EC2, you'll have to skip straight to method 3 in https://repost.aws/knowledge-center/ec2-linux-emergency-mode, which involves stopping your problem instance and detaching its root volume, then creating a new instance and attaching the problem root volume as a second disk (go as far as step 7).
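The stop/detach/attach sequence from that re:Post article can be sketched with the AWS CLI. This is only an illustration: the volume ID and rescue instance ID below are placeholders you'd substitute with your own (the instance ID is the one mentioned in the case), and the console can do all of this too.

```shell
# Placeholders -- substitute your own values:
INSTANCE_ID="i-0223e819f2c1c5474"   # the problem instance (from the case)
VOLUME_ID="vol-XXXXXXXXXXXXXXXXX"   # hypothetical: its root EBS volume
RESCUE_ID="i-YYYYYYYYYYYYYYYYY"     # hypothetical: the new rescue instance

# Stop the problem instance and wait until it is fully stopped
aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID"

# Detach the root volume, then attach it to the rescue instance as a second disk
aws ec2 detach-volume --volume-id "$VOLUME_ID"
aws ec2 wait volume-available --volume-ids "$VOLUME_ID"
aws ec2 attach-volume --volume-id "$VOLUME_ID" \
  --instance-id "$RESCUE_ID" --device /dev/sdf
```

Stopping (not rebooting) matters here: the volume can only be detached from a stopped instance.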
Boot this rescue instance, then run xfs_repair against the slice with the corrupted filesystem (don't try to mount the filesystem before running it; not that you'd be able to anyway).
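On the rescue instance the repair step looks something like the following. The device name is an assumption: on Nitro-based instances a volume attached as /dev/sdf typically surfaces as /dev/nvme1n1, so the corrupted partition from your logs (partition 3) would be nvme1n1p3 -- confirm with lsblk before touching anything.

```shell
# Assumed device name -- verify with lsblk first
DEV=/dev/nvme1n1p3

lsblk -f                 # confirm which device really is the attached volume
sudo xfs_repair "$DEV"   # do NOT mount the filesystem before this
# If xfs_repair refuses to run because of a dirty log, -L zeroes the log,
# which can discard recently written metadata -- last resort only:
# sudo xfs_repair -L "$DEV"
```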
There's no guarantee that xfs_repair will work. See this Red Hat doc (I know you're running SUSE, but XFS is still XFS), which shows a very similar set of error messages to yours that xfs_repair was unable to fix (the full document needs an account, but you can see enough in the free view to get the gist of it): https://access.redhat.com/solutions/5950661
A Google search of "xfs_trans_read_buf_map" "xfs_agf_read_verify" (exactly that, including the quotes) will tell you a lot more about this.
We have formally submitted our case to AWS regarding "AWS EC2 Instance SAP Hana XFS File System Corruption". Below is their response for your review:
I understand that you have a SUSE Linux Enterprise Server 15 SP2 instance running SAP HANA v2 and that you are encountering a Metadata corruption failure. You have mentioned this issue has occurred multiple times in the past and you have resolved it each time with the 2 methods (running xfs_repair and restoring the instance to an older backup / snapshot). You are seeking a permanent solution to the issue. Please correct me if I am mistaken.
Thank you for providing the instance ID (i-0223e819f2c1c5474). I have taken a look at the instance and can see it is currently passing both status checks and has been operational since 2023-11-08 02:43 UTC. I have reviewed the underlying hardware for the instance and volumes for the past 90 days. There have been no faults or issues with the hardware. The issue is occurring in your environment.
Thank you for providing the '11-08-23 HANA Server log.txt' file. I have reviewed the logs and identified the following error messages related to the issue:
[ 5.697536] XFS (nvme0n1p3): Metadata corruption detected at xfs_agi_verify+0x3a/0x160 [xfs], xfs_agi block 0x1deebd2
[ 5.700389] XFS (nvme0n1p3): Unmount and run xfs_repair
As you mentioned, you have already run the appropriate steps in regards to the above error message to recover and repair the volume to restore the instance to operation.
I am unable to identify the root cause of why the Metadata corruption is occurring due to limited knowledge of your environment. AWS Premium Support has no visibility into the resources provisioned on your account, or the data stored on those resources. This is due to AWS’s strict data privacy and security policies that ensure that only customers have access to their data. Further information on this and the AWS Shared Model of Responsibility can be found here [1] [2].
I will suggest the following solutions you can attempt in order to have a permanent solution to your issue. It is highly recommended to make fresh backups / snapshots of your systems before performing any of the suggested solutions. It is also recommended to test any of these solutions in a test environment before moving to production.
- Update the SUSE Linux 15 OS to the latest Service Pack. Your instance is currently running the SLES 15 SP2 operating system. The lifecycle for this Service Pack states that general support for this OS ended on December 31st, 2021 [3]. It is a general rule to keep your systems as up to date as possible. You can attempt to update your OS to the latest Service Pack and run the server. If the issue persists, this will help us narrow down the cause.
Please note: SLES 15 SP2 cannot be directly updated to SP5. Please review the following guide [4] to determine the correct upgrade path required. SP2 will need to be updated first to SP3, then SP4 and then finally SP5 in consecutive order. Skipping service packs is not recommended. At the minimum you must upgrade to SP4 before attempting to upgrade to SP5.
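The SP-by-SP path above can be sketched as follows. This is a hedged outline, not a tested procedure: the zypper-migration-plugin package name is taken from SUSE's documentation and should be verified against your registration setup, and you should snapshot the instance before each hop.

```shell
# Take a fresh snapshot first; then bring the current SP fully up to date
sudo zypper patch

# The migration plugin drives SP-to-SP upgrades (package name per SUSE docs;
# verify it is available for your subscription / registration setup)
sudo zypper install zypper-migration-plugin

# Service packs must be applied consecutively: SP2 -> SP3 -> SP4 -> SP5
UPGRADE_PATH="SP3 SP4 SP5"
for sp in $UPGRADE_PATH; do
  echo "next target: $sp"   # run `sudo zypper migration`, select $sp,
                            # reboot, verify SAP HANA, then continue
done
```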
- Ensure your current SUSE Linux OS is up to date.
If you do not want to update from SP2 to SP5, please ensure the SP2 OS is up to date and running the latest available kernel. The latest kernel for SLES 15 SP2 is 5.3.18-150200.24.169.1 (released 06-Nov-2023).
You can run the following CLI command to view your current kernel:
uname -r
To update your SUSE Linux kernel, you can run the following CLI command:
sudo zypper patch
This will update to the latest available version. For more information on the Zypper package manager, you can consult the following guide [5].
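Put together, a quick check looks like this; comparing the running kernel against the newest installed one catches the common case where patches were applied but the instance was never rebooted onto the new kernel.

```shell
RUNNING_KERNEL="$(uname -r)"              # currently booted kernel
echo "running: $RUNNING_KERNEL"

# Newest kernel-default package installed (most recent first)
rpm -q --last kernel-default | head -n 1

sudo zypper patch                         # pull remaining SP2 maintenance updates
# Reboot afterwards so the newest kernel is the one actually running
```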
- Update the SAP System Kernel. Ensure your SAP HANA Service Pack 5 is fully updated. You can review the following official documentation for SAP on updating the System Kernel. [6].
I have found relevant documentation regarding XFS. While this page deals with SLES 12, the core principles still apply. Please review the following documentation [7] regarding XFS metadata corruption errors for possible solutions.
You can also contact the SUSE team directly at the following webpage [8] for support regarding your SUSE Linux OS and services.
If you have any further questions, please let me know via a reply to this case in the support center and I will be happy to help.
==========References==========
[1] Data Privacy FAQ - https://aws.amazon.com/compliance/data-privacy-faq/
[2] AWS Shared model of responsibility - https://aws.amazon.com/compliance/shared-responsibility-model/
[3] SUSE Linux Enterprise Server 15 Lifecycle - https://www.suse.com/lifecycle/#suse-linux-enterprise-server-15
[4] SLES Upgrade Path - https://documentation.suse.com/sles/15-SP5/html/SLES-all/cha-upgrade-paths.html
[5] SUSE Zypper package manager - https://documentation.suse.com/smart/systems-management/html/concept-zypper/index.html
[6] Update the SAP System kernel - https://help.sap.com/docs/SAP_LANDSCAPE_MANAGEMENT_ENTERPRISE/e7dead4286c545808b3bd24feee7448c/1741e69d60804d5ab85a10e307d70735.html
[7] XFS metadata corruption and invalid checksum on SAP Hana servers - https://www.suse.com/support/kb/doc/?id=000019192
[8] SUSE Direct Support - https://www.suse.com/contact/
We have tried the following fixes:
- Unmounting volumes and running the xfs_repair utility
- Replacing corrupted instances with a snapshot or backup of the volume
Although we are able to apply these fixes with ease, we'd still like a permanent solution to our problem. This issue has occurred multiple times in the past, and we have resolved it each time with those two methods (running xfs_repair and restoring the instance from an older backup / snapshot).
Thanks for the update, I had assumed your system was dead in the water.
Short of making sure that the kernel and other key patches are kept up to date, I can't offer much more. This is a really low-level technical issue; if you have access to SUSE premium support, I would suggest raising a call with them.
According to the SUSE on AWS FAQ https://links.imagerelay.com/cdn/3404/ql/d26fccf0b5234c3d8595f46af2f70703/suse-public-cloud-faq-for-amazon-web-services.pdf (linked from https://www.suse.com/partners/alliance/aws/ )