My EC2 Linux instance failed the instance status check due to operating system issues. How do I troubleshoot this?


My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance failed the instance status check due to operating system issues. Now it doesn't boot successfully. How can I resolve this?

Short description

Your EC2 Linux instance might fail the instance status check for the following reasons:

  • You updated the kernel and the new kernel didn’t boot.
  • The file system entries in /etc/fstab are incorrect or the file system is corrupted.
  • There are incorrect network configurations on the instance.

Resolution

There are three methods for troubleshooting operating system (OS) issues.

Important:

Methods 2 and 3 require a stop and start of the instance. Be aware of the following:

  • If your instance is instance store-backed or has instance store volumes containing data, the data is lost when you stop the instance. For more information, see Determine the root device type of your instance.
  • If your instance is part of an Amazon EC2 Auto Scaling group, stopping the instance might terminate the instance. If you launched the instance with Amazon EMR, AWS CloudFormation, or AWS Elastic Beanstalk, then your instance might be part of an AWS Auto Scaling group. Instance termination in this scenario depends on the instance scale-in protection settings for your Auto Scaling group. If your instance is part of an Auto Scaling group, then temporarily remove the instance from the Auto Scaling group before starting the resolution steps.
  • Stopping and starting the instance changes the public IP address of your instance. It's a best practice to use an Elastic IP address instead of a public IP address when routing external traffic to your instance. If you are using Amazon Route 53, you might have to update the Route 53 DNS records when the public IP changes.
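
The first two checks in the preceding list can be made from the AWS CLI before you stop the instance. The following is a minimal sketch rather than an official procedure. It assumes that the AWS CLI is installed and configured for the correct Region, and the function names (root_device_type, auto_scaling_group_of, pre_stop_check) are hypothetical helpers introduced only for illustration.

```shell
#!/bin/sh
# Hypothetical helpers. The aws subcommands are real AWS CLI commands, but
# verify the output against your own account before relying on it.

# Prints "ebs" or "instance-store" for the instance ID in $1.
root_device_type() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].RootDeviceType' \
    --output text
}

# Prints the Auto Scaling group name for the instance ID in $1.
# Assumption: prints an empty line or "None" when the instance is in no group.
auto_scaling_group_of() {
  aws autoscaling describe-auto-scaling-instances \
    --instance-ids "$1" \
    --query 'AutoScalingInstances[0].AutoScalingGroupName' \
    --output text
}

# Returns 0 only if the root device is EBS-backed and no group is reported.
# This does not detect additional instance store volumes that hold data.
pre_stop_check() {
  rtype=$(root_device_type "$1") || return 2
  group=$(auto_scaling_group_of "$1") || return 2
  if [ "$rtype" != "ebs" ]; then
    echo "WARNING: root device type is '$rtype'; instance store data is lost on stop"
    return 1
  fi
  case "$group" in
    ""|None) ;;
    *) echo "WARNING: instance belongs to Auto Scaling group '$group'"
       return 1 ;;
  esac
  echo "OK: EBS-backed and not in an Auto Scaling group"
}
```

For example, run pre_stop_check i-0123456789abcdef0 (a placeholder instance ID) and stop the instance only after you review any warning that it prints.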

Method 1: Use the EC2 Serial Console

If you enabled EC2 Serial Console for Linux, then you can use it to troubleshoot supported Nitro-based instance types. The serial console helps you troubleshoot boot issues, network configuration, and SSH configuration issues. The serial console connects to your instance without the need for a working network connection. You can access the serial console using the Amazon EC2 console or the AWS Command Line Interface (AWS CLI).

Before using the serial console, grant access to it at the account level. Then, create AWS Identity and Access Management (IAM) policies granting access to your IAM users. Also, every instance using the serial console must include at least one password-based user. If your instance is unreachable and you haven’t configured access to the serial console, follow the instructions in Method 2. For information on configuring the EC2 Serial Console for Linux, see Configure access to the EC2 Serial Console.

Note: If you receive errors when running AWS CLI commands, make sure that you’re using the most recent version of the AWS CLI.
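
The following sketch shows one way to turn on account-level access and open a serial console session from the AWS CLI. It is an illustration under stated assumptions: the function names are hypothetical, the SSH key pair is assumed to exist with its public key at the private key path plus ".pub", and the endpoint format is the one documented for most commercial Regions, so confirm the endpoint for your Region before use.

```shell
#!/bin/sh
# Hypothetical helpers; instance IDs, Regions, and key paths are placeholders.

# Turns on serial console access for the account in Region $1,
# then prints the resulting access status.
enable_serial_console() {
  aws ec2 enable-serial-console-access --region "$1" &&
  aws ec2 get-serial-console-access-status --region "$1"
}

# Prints the SSH target for serial port 0 of instance $1 in Region $2.
# Assumption: the Region uses the common endpoint naming pattern.
serial_console_endpoint() {
  printf '%s.port0@serial-console.ec2-instance-connect.%s.aws\n' "$1" "$2"
}

# Pushes the public key, then opens the serial console over SSH.
# $1 = instance ID, $2 = Region, $3 = path to the private key.
connect_serial_console() {
  aws ec2-instance-connect send-serial-console-ssh-public-key \
    --instance-id "$1" \
    --serial-port 0 \
    --ssh-public-key "file://$3.pub" \
    --region "$2" &&
  ssh -i "$3" "$(serial_console_endpoint "$1" "$2")"
}
```

After the session opens, you still need the password of a password-based OS user to log in, as described earlier in this method.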

Method 2: Run the EC2Rescue for Linux tool

EC2Rescue for Linux automates diagnosing and troubleshooting operating system issues on unreachable instances. For more information, see How do I use EC2Rescue for Linux to troubleshoot operating system-level issues?

Method 3: Manually correct errors using a rescue instance

1.    Launch a new EC2 instance in your virtual private cloud (VPC) using the same Amazon Machine Image (AMI). Launch the new instance in the same Availability Zone as the impaired instance. The new instance becomes your rescue instance.

Or, use an existing instance that you can access, if it uses the same AMI and is in the same Availability Zone as your impaired instance.

2.    Stop the impaired instance.

3.    Detach the Amazon Elastic Block Store (Amazon EBS) root volume (/dev/xvda or /dev/sda1) from your impaired instance. Note the device name (/dev/xvda or /dev/sda1) of your root volume.

4.    Attach the volume as a secondary device (/dev/sdf) to the rescue instance.

5.    Connect to your rescue instance using SSH.

6.    Create a mount point directory (/rescue) for the new volume attached to the rescue instance:

$ sudo mkdir /rescue

7.    Mount the volume at the directory you created in step 6:

$ sudo mount /dev/xvdf1 /rescue

Note: The device (/dev/xvdf1) might be attached to the rescue instance with a different device name. Use the lsblk command to view your available disk devices along with their mount points to determine the correct device names.

8.    If you haven't already done so, retrieve the system log of the instance to verify what error is occurring. The next steps depend on the error message listed in the system log. The following is a list of common errors that can cause instance status check failure. For additional errors, see Troubleshooting system log errors for Linux-based instances.
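
Steps 2 through 4 and step 8 can also be performed with the AWS CLI. The following is a rough sketch, assuming a configured AWS CLI. The function names are hypothetical, and all IDs passed to them are placeholders that you supply. The classify_console_log helper only searches for the error strings named in this article and is not a complete diagnosis.

```shell
#!/bin/sh
# Hypothetical helpers built on real AWS CLI subcommands.

# Stops instance $1 and waits until it reaches the stopped state.
stop_and_wait() {
  aws ec2 stop-instances --instance-ids "$1" &&
  aws ec2 wait instance-stopped --instance-ids "$1"
}

# Detaches volume $1 from its current instance, then attaches it to
# instance $2 under device name $3 (for example, /dev/sdf).
move_volume() {
  aws ec2 detach-volume --volume-id "$1" &&
  aws ec2 wait volume-available --volume-ids "$1" &&
  aws ec2 attach-volume --volume-id "$1" --instance-id "$2" --device "$3" &&
  aws ec2 wait volume-in-use --volume-ids "$1"
}

# Prints the system log (console output) of instance $1 as text.
fetch_console_log() {
  aws ec2 get-console-output --instance-id "$1" --query 'Output' --output text
}

# Reads a system log on standard input and prints each distinct error
# string from this article that appears in it, one per line.
classify_console_log() {
  grep -E -i -o \
    'Kernel panic|Failed to mount|Dependency failed|Bringing up interface eth0' |
    sort -u
}
```

A possible sequence is stop_and_wait for the impaired instance, move_volume to the rescue instance with /dev/sdf, and then fetch_console_log piped into classify_console_log to decide which of the following sections applies. The same move_volume helper can return the volume to the original instance at the end of each section.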

Kernel Panic

If a Kernel Panic error message is in the system log, then the kernel might not have the vmlinuz or initramfs files. The vmlinuz and initramfs files are necessary to boot successfully.

1.    Run the following commands:

cd /rescue/boot
ls -l

2.    Check the output to verify that there are vmlinuz and initramfs files corresponding to the kernel version you intend to boot.

The following example output is from an Amazon Linux 2 instance running kernel version 4.14.165-131.185.amzn2.x86_64. The /boot directory contains both initramfs-4.14.165-131.185.amzn2.x86_64.img and vmlinuz-4.14.165-131.185.amzn2.x86_64, so the instance can boot with this kernel.

uname -r
4.14.165-131.185.amzn2.x86_64

cd /boot; ls -l
total 39960
-rw-r--r-- 1 root root      119960 Jan 15 14:34 config-4.14.165-131.185.amzn2.x86_64
drwxr-xr-x 3 root root     17 Feb 12 04:06 efi
drwx------ 5 root root       79 Feb 12 04:08 grub2
-rw------- 1 root root 31336757 Feb 12 04:08 initramfs-4.14.165-131.185.amzn2.x86_64.img
-rw-r--r-- 1 root root    669087 Feb 12 04:08 initrd-plymouth.img
-rw-r--r-- 1 root root    235041 Jan 15 14:34 symvers-4.14.165-131.185.amzn2.x86_64.gz
-rw------- 1 root root   2823838 Jan 15 14:34 System.map-4.14.165-131.185.amzn2.x86_64
-rwxr-xr-x 1 root root   5718992 Jan 15 14:34 vmlinuz-4.14.165-131.185.amzn2.x86_64

3.    If the initramfs file, the vmlinuz file, or both are missing, try booting the instance using a previous kernel that has both of these files. For instructions on how to boot your instance using a previous kernel, see How do I revert to a known stable kernel after an update prevents my Amazon EC2 instance from rebooting successfully?

4.    Run the umount command to unmount the secondary device from your rescue instance:

$ sudo umount /rescue

If the unmount operation isn't successful, then you might have to stop or reboot the rescue instance to enable a clean unmount.

5.    Detach the secondary volume (/dev/sdf) from the rescue instance, and then attach it to the original instance as the root volume. Use the root device name that you noted when you detached the volume (/dev/xvda or /dev/sda1).

6.    Start the instance, and then verify that the instance is responsive.
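
The file check in step 2 can be scripted. The following sketch assumes the Amazon Linux naming convention shown in the example output (vmlinuz-VERSION paired with initramfs-VERSION.img). Other distributions name these files differently, so adjust the pattern as needed. The function name check_boot_files is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper. For every vmlinuz-VERSION file in the boot directory
# ($1, default /rescue/boot), reports whether initramfs-VERSION.img exists.
# Returns 1 if any kernel lacks its initramfs or if no kernel is found.
check_boot_files() {
  boot_dir=${1:-/rescue/boot}
  status=0
  found=0
  for kernel in "$boot_dir"/vmlinuz-*; do
    # Skip the literal pattern when the glob matches nothing.
    [ -e "$kernel" ] || continue
    found=1
    version=${kernel##*/vmlinuz-}
    if [ -e "$boot_dir/initramfs-$version.img" ]; then
      echo "OK $version"
    else
      echo "MISSING initramfs for $version"
      status=1
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "MISSING no vmlinuz files in $boot_dir"
    status=1
  fi
  return "$status"
}
```

Run check_boot_files /rescue/boot on the rescue instance while the volume is still mounted. Any MISSING line points to a kernel version that cannot boot until its files are restored or a previous kernel is selected.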

For additional information on resolving kernel panic errors, see I'm receiving a "Kernel panic" error after I've upgraded the kernel or tried to reboot my EC2 Linux instance. How can I fix this?

Failed to mount or Dependency failed

If you see errors such as "Failed to mount" or "Dependency failed" in your system log, the /etc/fstab file might have incorrect mount point entries.

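1.    Verify that the mount point entries in the /etc/fstab file are correct. Because the root volume of the impaired instance is mounted at /rescue, the file to check on the rescue instance is /rescue/etc/fstab. For information on correcting the /etc/fstab file entries, see the Auto mount failures because of incorrect entries in the /etc/fstab file section of Why is my EC2 instance not booting and going into emergency mode?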

2.    It's a best practice to run the fsck or xfs_repair tool to correct any file system errors. If there are inconsistencies in the file system, the fsck or xfs_repair tool corrects them.

Note: Create a backup of your file system before running the fsck or xfs_repair tool.

Run the umount command to unmount your mount point before running the fsck or xfs_repair tool:

$ sudo umount /rescue

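Run the fsck or xfs_repair tool, depending on your file system. In the following commands, replace /dev/sdf with the device name that the lsblk command reports for the attached volume (for example, /dev/xvdf1).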

For ext4 file systems:

$ sudo fsck /dev/sdf
fsck from util-linux 2.30.2
e2fsck 1.42.9 (28-Dec-2013)
/dev/sdf: clean, 11/6553600 files, 459544/26214400 blocks

For XFS file systems:

$ sudo xfs_repair /dev/sdf
xfs_repair /dev/xvdf
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

3.    Detach the secondary volume (/dev/sdf) from the rescue instance, and then attach it to the original instance as the root volume. Use the root device name that you noted when you detached the volume (/dev/xvda or /dev/sda1).

4.    Start the instance, and then verify that the instance is responsive.
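
The following sketch illustrates two parts of the preceding steps: listing the active entries in the impaired volume's fstab file for review, and choosing between fsck and xfs_repair based on the file system type. The function names are hypothetical, the default path /rescue/etc/fstab assumes the mount point used in this article, and repair_command only prints the command so that you can review it before you run it.

```shell
#!/bin/sh
# Hypothetical helpers for reviewing fstab and picking a repair tool.

# Prints device, mount point, type, and options for each active entry in
# the fstab file ($1, default /rescue/etc/fstab). Comments and blank lines
# are skipped, and lines with fewer than four fields are flagged.
list_fstab_entries() {
  awk '
    /^[[:space:]]*#/ || NF == 0 { next }
    NF < 4 { print "MALFORMED (fewer than 4 fields): " $0; next }
    { print $1, $2, $3, $4 }
  ' "${1:-/rescue/etc/fstab}"
}

# Prints the file system type of an unmounted device such as /dev/xvdf1.
detect_fstype() {
  lsblk -n -o FSTYPE "$1"
}

# Prints the repair command for file system type $1 on device $2.
# Returns 1 for types that this article does not cover.
repair_command() {
  case "$1" in
    ext2|ext3|ext4) echo "sudo fsck $2" ;;
    xfs)            echo "sudo xfs_repair $2" ;;
    *)              echo "unsupported file system type: $1" >&2
                    return 1 ;;
  esac
}
```

For example, repair_command "$(detect_fstype /dev/xvdf1)" /dev/xvdf1 prints the command that matches the volume. Run it only after you unmount the volume and create a backup.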

Bringing up interface eth0: failed

If you see the error "Bringing up interface eth0: failed", verify that the ifcfg-eth0 file has the correct network entries. The network configuration file corresponding to the primary interface, eth0, is located at /etc/sysconfig/network-scripts/ifcfg-eth0. If the device name of your primary interface isn't eth0, then there is a file that begins with ifcfg and is followed by the name of your device in the directory /etc/sysconfig/network-scripts on the instance.

1.    Run the cat command to view the network configuration file for the primary interface, eth0. Because the root volume of the impaired instance is mounted at /rescue, read the file under that mount point.

The following example shows the correct entries for the network configuration file ifcfg-eth0.

Note: Replace eth0 in the following command with the name of your primary interface, if different.

$ sudo cat /rescue/etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes
TYPE=Ethernet
USERCTL=yes
PEERDNS=yes
DHCPV6C=yes
DHCPV6C_OPTIONS=-nw
PERSISTENT_DHCLIENT=yes
RES_OPTIONS="timeout:2 attempts:5"
DHCP_ARP_CHECK=no

2.    Verify that ONBOOT is set to yes, as shown in the preceding example. If ONBOOT isn't set to yes, then eth0 (or your primary network interface) isn't configured to come up at boot.

To change the ONBOOT value:

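Open the file in an editor. The following example uses the vi editor. The path starts with /rescue because the root volume of the impaired instance is mounted there.

$ sudo vi /rescue/etc/sysconfig/network-scripts/ifcfg-eth0

Press i to enter insert mode.

Move the cursor to the ONBOOT entry, and then change the value to yes.

Press Esc, type :wq!, and then press Enter to save the file and exit.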

3.    Run the umount command to unmount the secondary device from your rescue instance:

$ sudo umount /rescue

If the unmount operation isn't successful, then you might have to stop or reboot the rescue instance to enable a clean unmount.

4.    Detach the secondary volume (/dev/sdf) from the rescue instance, and then attach it to the original instance as the root volume. Use the root device name that you noted when you detached the volume (/dev/xvda or /dev/sda1).

5.    Start the instance, and then verify that the instance is responsive.
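
As an alternative to editing the file by hand, the ONBOOT check and change can be scripted. The following is a minimal sketch with hypothetical function names. It assumes an unquoted ONBOOT value, as in the example file, and that you run it with permission to write to the file (for example, as root) while the volume is still mounted at /rescue.

```shell
#!/bin/sh
# Hypothetical helpers for the ONBOOT setting in an ifcfg file.

# Prints the value of the last ONBOOT entry in file $1 (empty if none).
onboot_value() {
  sed -n 's/^ONBOOT=//p' "$1" | tail -n 1
}

# Sets ONBOOT=yes in file $1, appending the entry if it is absent.
# The edit goes through a temporary file and is then written back in place,
# so the original file keeps its owner and permissions.
set_onboot_yes() {
  file=$1
  tmp=$(mktemp) || return 1
  if grep -q '^ONBOOT=' "$file"; then
    sed 's/^ONBOOT=.*/ONBOOT=yes/' "$file" > "$tmp"
  else
    { cat "$file"; echo 'ONBOOT=yes'; } > "$tmp"
  fi
  cat "$tmp" > "$file" && rm -f "$tmp"
}
```

For example, onboot_value /rescue/etc/sysconfig/network-scripts/ifcfg-eth0 prints the current setting, and set_onboot_yes with the same path changes it. Run these before you unmount the volume in step 3.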


Related information

Why is my EC2 Linux instance unreachable and failing one or both of its status checks?

Troubleshoot instances with failed status checks

Why is my Linux instance not booting after I changed its type to a Nitro-based instance type?

AWS OFFICIAL
Updated 2 years ago
1 Comment

Thanks for the very detailed and well structured article.

answered 2 months ago