How do I troubleshoot an EC2 Linux instance that failed the instance status check due to operating system issues?


My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance failed the instance status check because of operating system issues. Now it doesn't boot successfully.

Short description

Your EC2 Linux instance might fail the instance status check for the following reasons:

  • You updated the kernel and the new kernel didn't boot.
  • The file system entries in /etc/fstab are incorrect or the file system is corrupted.
  • There are incorrect network configurations on the instance.

Resolution

Important: Some of the following procedures require you to stop the instance. When you stop an instance, you lose data that's stored in instance store volumes. Save a backup of the data before you stop the instance. Unlike Amazon Elastic Block Store (Amazon EBS)-backed volumes, instance store volumes are ephemeral and don't support data persistence. For more information, see Stop and start Amazon EC2 instances.

The public IPv4 address that Amazon EC2 automatically assigns to the instance changes after a stop and start. To retain a public IPv4 address that doesn't change when the instance is stopped, use an Elastic IP address.

Note: The following methods use examples that are based on Amazon Linux 2. However, these concepts apply to Linux distributions in general. If you have a Linux distribution other than Amazon Linux 2, then the commands, paths, and outputs might vary.

Use the EC2 serial console for Linux instances

If you turned on the EC2 serial console for Linux instances, then you can use it to troubleshoot supported Nitro-based instance types and bare metal instances. The serial console helps you troubleshoot boot issues and network and SSH configuration issues. The serial console can connect to your instance without a working network connection. To access the serial console, use the Amazon EC2 console or the AWS Command Line Interface (AWS CLI).

The first time that you use the EC2 serial console, review the prerequisites and configure access before you connect.

If your instance is unreachable and you didn't configure access to the serial console, then see the Run the EC2Rescue for Linux tool section. Or, see Use a rescue instance. To configure the EC2 serial console for Linux instances, see Configure access to the EC2 serial console.

Note: If you receive errors when you run AWS CLI commands, see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.
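As a sketch of the AWS CLI path, the following shows how a serial console connection is established once the prerequisites are configured. The instance ID, Region, and key path are placeholder values, not values from this article:

```shell
# Hypothetical values -- replace with your own instance ID, Region, and key.
INSTANCE_ID="i-0123456789abcdef0"
REGION="us-east-1"

# 1) Push a one-time SSH public key to the serial console service.
#    The key is valid for 60 seconds, so connect promptly afterward:
# aws ec2-instance-connect send-serial-console-ssh-public-key \
#     --instance-id "$INSTANCE_ID" \
#     --serial-port 0 \
#     --ssh-public-key file://~/.ssh/id_rsa.pub \
#     --region "$REGION"

# 2) Connect over SSH. The SSH user name is the instance ID plus ".port0",
#    and the endpoint is Region-specific:
echo "ssh ${INSTANCE_ID}.port0@serial-console.ec2-instance-connect.${REGION}.aws"
```

The second step only prints the connection command here; on a workstation with access configured, you would run it directly.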

Run the EC2Rescue for Linux tool

EC2Rescue for Linux automatically diagnoses and troubleshoots operating systems on unreachable instances. For more information, see How do I use EC2Rescue for Linux to troubleshoot operating system-level issues?

Use a rescue instance to manually correct errors

  1. Launch a new EC2 instance in your virtual private cloud (VPC). Use the same Amazon Machine Image (AMI) and the same Availability Zone as the impaired instance. The new instance becomes your rescue instance.

    Or, use an existing instance. The existing instance must use the same AMI and be in the same Availability Zone as your impaired instance.

  2. Stop the impaired instance.

  3. Detach the Amazon EBS root volume (/dev/xvda or /dev/sda1) from your impaired instance. Note the device name of your root volume.

  4. Attach the volume as a secondary device (/dev/sdf) to the rescue instance.

  5. Connect to your rescue instance through SSH.

  6. Create a mount point directory (/rescue) for the new volume attached to the rescue instance:

    $ sudo mkdir /rescue
  7. Mount the volume at the new directory:

    $ sudo mount /dev/xvdf1 /rescue

    If you receive an error, such as "Wrong Fs type or UUID duplicate, Superblock is missing or badblock found," see Why can't I mount my Amazon EBS volume? A duplicate UUID is common when the rescue instance was launched from the same AMI as the impaired instance. For XFS file systems, you can work around the UUID conflict by mounting the volume with the nouuid option:

    $ sudo mount -o nouuid /dev/xvdf1 /rescue

    Note: The device (/dev/xvdf1) might be attached to the rescue instance with a different device name. To determine the correct device name, run the lsblk command to view your available disk devices along with their mount points.

  8. If you haven't already done so, retrieve the system log of the instance to verify the error. The next steps depend on the error message listed in the system log.

    The following is a list of common errors that cause instance status check failure. For more errors, see Troubleshoot system log errors for Linux-based instances.
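To identify which device name the secondary volume received on the rescue instance, you can filter the lsblk output for partitions that have no mount point. The following is a sketch: the heredoc stands in for real output, so on an actual rescue instance you would pipe `lsblk -o NAME,SIZE,MOUNTPOINT` into the awk filter instead:

```shell
# Sample lsblk output (illustrative only); xvda1 is the rescue instance's
# own mounted root, xvdf1 is the attached, not-yet-mounted partition.
lsblk_output=$(cat <<'EOF'
NAME      SIZE MOUNTPOINT
xvda        8G
xvda1       8G /
xvdf        8G
xvdf1       8G
EOF
)

# Print partitions (names ending in a digit) that have no mount point:
# such a partition is the candidate to mount at /rescue.
echo "$lsblk_output" | awk 'NR > 1 && NF == 2 && $1 ~ /[0-9]$/ {print "/dev/" $1}'
```

With the sample data above, this prints /dev/xvdf1.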

Troubleshoot "Kernel panic"

If a Kernel Panic error message is in the system log, then the kernel might not have the vmlinuz or initramfs files. The vmlinuz and initramfs files are necessary to boot successfully.

  1. Run the following commands:

    cd /rescue/boot
    ls -l
  2. Check the output to verify that there are vmlinuz and initramfs files that correspond to the kernel version that you want to boot.

    The following example output is for an Amazon Linux 2 instance with kernel version 4.14.165-131.185.amzn2.x86_64. To boot successfully, the /boot directory must contain the files initramfs-4.14.165-131.185.amzn2.x86_64.img and vmlinuz-4.14.165-131.185.amzn2.x86_64.

    uname -r
    4.14.165-131.185.amzn2.x86_64
    
    cd /boot; ls -l
    total 39960
    -rw-r--r-- 1 root root      119960 Jan 15 14:34 config-4.14.165-131.185.amzn2.x86_64
    drwxr-xr-x 3 root root     17 Feb 12 04:06 efi
    drwx------ 5 root root       79 Feb 12 04:08 grub2
    -rw------- 1 root root 31336757 Feb 12 04:08 initramfs-4.14.165-131.185.amzn2.x86_64.img
    -rw-r--r-- 1 root root    669087 Feb 12 04:08 initrd-plymouth.img
    -rw-r--r-- 1 root root    235041 Jan 15 14:34 symvers-4.14.165-131.185.amzn2.x86_64.gz
    -rw------- 1 root root   2823838 Jan 15 14:34 System.map-4.14.165-131.185.amzn2.x86_64
    -rwxr-xr-x 1 root root   5718992 Jan 15 14:34 vmlinuz-4.14.165-131.185.amzn2.x86_64
  3. If the initramfs and the vmlinuz files aren't present, then boot the instance with a previous kernel that has both of the files. For instructions, see How do I revert to a known stable kernel after an update prevents my Amazon EC2 instance from rebooting successfully?

  4. Run the umount command to unmount the secondary device from your rescue instance:

    $ sudo umount /rescue

    If the unmount operation doesn't succeed, then you might have to stop or reboot the rescue instance for a clean unmount.

  5. Detach the secondary volume (/dev/sdf) from the rescue instance. Then, attach it to the original instance as /dev/xvda (root volume).

  6. Start the instance, and then verify if the instance is responsive.

For more information on how to resolve kernel panic errors, see Why do I see a "Kernel panic" error after I upgrade the kernel or reboot my EC2 Linux instance?
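The check in steps 1 and 2 can be scripted. The following sketch simulates a healthy /boot directory in a temporary location; on a real rescue instance you would set BOOT_DIR=/rescue/boot, use the kernel version the impaired instance is trying to boot, and skip the touch line:

```shell
# Assumed values for this sketch; replace with your own.
BOOT_DIR=$(mktemp -d)
KERNEL="4.14.165-131.185.amzn2.x86_64"

# Simulate a healthy /boot directory (omit this on a real rescue instance).
touch "$BOOT_DIR/vmlinuz-$KERNEL" "$BOOT_DIR/initramfs-$KERNEL.img"

# Both files must exist for the kernel to boot successfully.
for f in "vmlinuz-$KERNEL" "initramfs-$KERNEL.img"; do
  if [ -e "$BOOT_DIR/$f" ]; then
    echo "OK: $f"
  else
    echo "MISSING: $f"   # if this prints, boot with a previous kernel
  fi
done
```

If either file is reported missing, follow step 3 and revert to a previous kernel that has both files.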

Troubleshoot "Failed to mount" or "Dependency failed"

Errors such as "Failed to mount" or "Dependency failed" in your system log indicate that the /etc/fstab file has incorrect mount point entries.

  1. Verify that the mount point entries in the /etc/fstab file are correct. To correct the /etc/fstab file entries, see Why is my EC2 Linux instance going into emergency mode when I try to boot it?

  2. It's a best practice to run the fsck or xfs_repair tool to correct inconsistencies in the file system.

    Note: Create a backup of your file system before you run the fsck or xfs_repair tool.

    Run the umount command to unmount your mount point before you run the fsck or xfs_repair tool:

    $ sudo umount /rescue

    Run the fsck or xfs_repair tool, based on your file system.

    For ext4 file systems, run the following command:

    $ sudo fsck /dev/sdf
    fsck from util-linux 2.30.2
    e2fsck 1.42.9 (28-Dec-2013)
    /dev/sdf: clean, 11/6553600 files, 459544/26214400 blocks

    For XFS file systems, run the following command:

    $ sudo xfs_repair /dev/sdf
    Phase 1 - find and verify superblock...
    Phase 2 - using internal log
            - zero log...
            - scan filesystem freespace and inode maps...
            - found root inode chunk
    Phase 3 - for each AG...
            - scan and clear agi unlinked lists...
            - process known inodes and perform inode discovery...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
            - process newly discovered inodes...
    Phase 4 - check for duplicate blocks...
            - setting up duplicate extent list...
            - check for inodes claiming duplicate blocks...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
    Phase 5 - rebuild AG headers and trees...
            - reset superblock...
    Phase 6 - check inode connectivity...
            - resetting contents of realtime bitmap and summary inodes
            - traversing filesystem ...
            - traversal finished ...
            - moving disconnected inodes to lost+found ...
    Phase 7 - verify and correct link counts...
    done
  3. Detach the secondary volume (/dev/sdf) from the rescue instance. Then, attach it to the original instance as /dev/xvda (root volume).

  4. Start the instance, and then verify if the instance is responsive.
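One common /etc/fstab problem is a secondary volume entry without the nofail mount option: if that volume is missing at boot, the instance drops into emergency mode. The following sketch flags such entries; the heredoc stands in for the mounted file at /rescue/etc/fstab, and the UUIDs are placeholders:

```shell
# Illustrative fstab contents -- on a rescue instance, read
# /rescue/etc/fstab instead of this heredoc.
fstab=$(cat <<'EOF'
UUID=aaaa-bbbb  /      xfs  defaults,noatime        1 1
UUID=cccc-dddd  /data  xfs  defaults,noatime        0 2
UUID=eeee-ffff  /logs  xfs  defaults,noatime,nofail 0 2
EOF
)

# Warn about non-root entries whose options ($4) lack "nofail".
echo "$fstab" | awk '$1 !~ /^#/ && $2 != "/" && $4 !~ /nofail/ {
  print "WARN: " $2 " has no nofail option"
}'
```

With the sample data, only /data is flagged; adding nofail to its options lets the instance boot even if that volume is absent.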

Troubleshoot "interface eth0: failed"

Verify that the ifcfg-eth0 file has the correct network entries. The network configuration file that corresponds to the primary interface, eth0, is located at /etc/sysconfig/network-scripts/ifcfg-eth0. If the device name of your primary interface isn't eth0, then the file name starts with ifcfg and is followed by the device name. The file is in the /etc/sysconfig/network-scripts directory on the instance.

  1. Run the cat command to view the network configuration file for the primary interface, eth0.

    The following are the correct entries for the network configuration file located in /etc/sysconfig/network-scripts/ifcfg-eth0.

    Note: If needed, replace eth0 in the following command with the name of your primary interface.

    $ sudo cat /etc/sysconfig/network-scripts/ifcfg-eth0
    DEVICE=eth0
    BOOTPROTO=dhcp
    ONBOOT=yes
    TYPE=Ethernet
    USERCTL=yes
    PEERDNS=yes
    DHCPV6C=yes
    DHCPV6C_OPTIONS=-nw
    PERSISTENT_DHCLIENT=yes
    RES_OPTIONS="timeout:2 attempts:5"
    DHCP_ARP_CHECK=no
  2. Verify that ONBOOT is set to yes. If ONBOOT isn't set to yes, then eth0 (or your primary network interface) isn't configured to come up at boot.

    To change the ONBOOT value, first open the file in an editor. This example uses the vi editor:

    $ sudo vi /etc/sysconfig/network-scripts/ifcfg-eth0

    Press i to enter insert mode.

    Move the cursor to the ONBOOT entry, and then change the value to yes.

    To save the file and exit, press Esc, type :wq!, and then press Enter.

  3. Run the umount command to unmount the secondary device from your rescue instance:

    $ sudo umount /rescue

    If the unmount operation isn't successful, then you might have to stop or reboot the rescue instance to initiate a clean unmount.

  4. Detach the secondary volume (/dev/sdf) from the rescue instance. Then attach it to the original instance as /dev/xvda (root volume).

  5. Start the instance and then verify if the instance is responsive.
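Instead of editing the file interactively in vi, you can make the same ONBOOT change non-interactively with sed. This is a sketch against a temporary copy of the file; on the rescue instance, CFG would be /rescue/etc/sysconfig/network-scripts/ifcfg-eth0 (adjust the file name for your primary interface):

```shell
# Create a temporary stand-in for ifcfg-eth0 with ONBOOT misconfigured.
CFG=$(mktemp)
printf 'DEVICE=eth0\nBOOTPROTO=dhcp\nONBOOT=no\n' > "$CFG"

# Force ONBOOT=yes so the interface comes up at boot.
sed -i 's/^ONBOOT=.*/ONBOOT=yes/' "$CFG"

# Confirm the change.
grep '^ONBOOT=' "$CFG"
```

Note that GNU sed's -i flag is assumed here, which is the default on Amazon Linux 2 and most Linux distributions.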

Related information

Why is my EC2 Linux instance unreachable and failing its status checks?

Troubleshoot instances with failed status checks

Why is my Linux instance not booting after I changed its type to a Nitro-based instance type?

AWS OFFICIAL
Updated 3 months ago