How do I revert to a known stable kernel after an update prevents my Amazon EC2 instance from rebooting successfully?


An update prevented my Amazon Elastic Compute Cloud (Amazon EC2) instance from rebooting successfully, and I want to revert to a stable kernel.

Short description

If you made a kernel update to your Amazon EC2 Linux instance but the kernel is now corrupt, then the instance can't reboot. You can't use SSH to connect to the impaired instance.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshoot AWS CLI errors. Also, make sure that you use the most recent AWS CLI version.

Access the instance's root volume

There are two methods to access the root volume. Choose the one that best fits your use case.

Use the EC2 serial console

If you turned on the EC2 serial console for Linux, then use it to troubleshoot Nitro-based instance types. The serial console helps you troubleshoot boot issues, network configuration, and SSH configuration issues. The serial console connects to your instance without a working network connection. To access the serial console, use the Amazon EC2 console or the AWS CLI.

Before you use the serial console, grant it access at the account level. Then, create AWS Identity and Access Management (IAM) policies that grant access to your IAM users. Every instance that uses the serial console must include at least one password-based user. For more information, see Configure access to the EC2 Serial Console.

If your instance is unreachable and you didn't configure access to the serial console, then use the instructions in the following section.
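As a rough sketch of the account-level step, the following shell function checks whether serial console access is turned on in the current Region and turns it on if it isn't. The function name is hypothetical, and the sketch assumes that the AWS CLI is configured with credentials that allow the ec2:GetSerialConsoleAccessStatus and ec2:EnableSerialConsoleAccess actions.

```shell
# Hypothetical helper: turn on account-level EC2 serial console access
# for the current Region if it is not already on.
ensure_serial_console_access() {
  enabled=$(aws ec2 get-serial-console-access-status \
    --query SerialConsoleAccessEnabled --output text)
  case "$enabled" in
    True|true)
      echo "Serial console access is already on." ;;
    *)
      aws ec2 enable-serial-console-access >/dev/null &&
        echo "Serial console access turned on." ;;
  esac
}
```

This covers only the account-level setting. The IAM policies and the password-based user on the instance still need to be set up separately.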

Use a rescue instance

Create a temporary rescue instance, then remount your Amazon Elastic Block Store (Amazon EBS) volume on the rescue instance. From the rescue instance, configure your GNU GRUB to use the previous kernel.

Important: Don't perform this procedure on an instance store-backed instance. Because the recovery procedure requires a stop and start of your instance, any data on that instance is lost. For more information, see Determine the root device type of your instance.
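One way to check the root device type from the command line is sketched below. The function names are hypothetical, and the sketch assumes a configured AWS CLI with permission to call describe-instances.

```shell
# Hypothetical helper: print the root device type of an instance.
# The AWS CLI prints "ebs" or "instance-store".
root_device_type() {
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[].Instances[].RootDeviceType' --output text
}

# Return success only when the instance is EBS-backed, which is
# the case where the rescue procedure is safe to use.
is_ebs_backed() {
  [ "$(root_device_type "$1")" = "ebs" ]
}
```

For example, `is_ebs_backed i-0123456789abcdef0 && echo "OK to continue"` prints the message only for an EBS-backed instance. The instance ID shown is a placeholder.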

  1. Create an Amazon EBS snapshot of the root volume. For more information, see Create Amazon EBS snapshots.

  2. Open the Amazon EC2 console.
    Note: Be sure that you're in the correct Region.

  3. From the navigation pane, select Instances, and then choose the impaired instance.

  4. Choose Instance State, Stop instance, and then choose Stop.

  5. In the Storage tab, under Block devices, select the Volume ID for /dev/sda1 or /dev/xvda.
    Note: The root device differs by AMI, but /dev/xvda or /dev/sda1 are reserved for the root device. For example, Amazon Linux 1, Amazon Linux 2, and Amazon Linux 2023 use /dev/xvda. Other distributions, such as Ubuntu 14, 16, 18, CentOS 7, and RHEL 7.5, use /dev/sda1.

  6. Choose Actions, Detach Volume, and then select Yes, Detach.
    Note: To help identify the EBS volume in later steps, tag the volume before you detach it.

  7. Launch a rescue EC2 instance in the same Availability Zone as the detached root volume.
    Note: Depending on the product code, you might be required to launch an EC2 instance of the same OS type. For example, if the impaired EC2 instance is a paid RHEL AMI, then you must launch an AMI with the same product code. For more information, see Get the product code for your instance.

  8. After the rescue instance launches, choose Volumes from the navigation pane. Then, choose the detached root volume of the impaired instance.

  9. Choose Actions, Attach Volume.

  10. Choose the rescue instance ID (id-#####), and then set an unused device name. This example uses /dev/sdf.

  11. Use SSH to connect to the rescue instance.

  12. To view your available disk devices, run the lsblk command. The following is an example of the output:

    NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    xvda    202:0    0   15G  0 disk
    └─xvda1 202:1    0   15G  0 part /
    xvdf    202:80   0   15G  0 disk
    └─xvdf1 202:81   0   15G  0 part

    Note: Nitro-based instances expose EBS volumes as NVMe block devices. The output that the lsblk command generates on Nitro-based instances shows the disk names as nvme[0-26]n1. For more information, see Amazon EBS and NVMe. The following is an example of the lsblk command output on a Nitro-based instance:

    NAME           MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
    nvme0n1        259:0    0    8G  0 disk
    ├─nvme0n1p1    259:1    0    8G  0 part /
    └─nvme0n1p128  259:2    0    1M  0 part
    nvme1n1        259:3    0  100G  0 disk
    └─nvme1n1p1    259:4    0  100G  0 part

  13. To become the root user, run the following command:

    sudo -i

  14. Mount the root partition of the attached volume to /mnt. In the following example, /dev/xvdf1 is the root partition of the attached volume. On a Nitro-based instance, the partition might be /dev/nvme1n1p1 instead. For more information, see Make an Amazon EBS volume available for use. Replace /dev/xvdf1 with the correct root partition for your volume.

    mount -o nouuid /dev/xvdf1 /mnt

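    Note: The nouuid option applies only to XFS file systems. If the root partition uses another file system, such as ext4, then omit the option. If /mnt doesn't exist on your configuration, then create a mount directory. Then, mount the root partition of the attached volume to the new directory.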

    mkdir /mnt
    mount -o nouuid /dev/xvdf1 /mnt

    Then access the data of the impaired instance through the mount directory.

  15. Bind mount /dev, /run, /proc, and /sys of the rescue instance to the same paths on the newly mounted volume:

    for m in dev proc run sys; do mount -o bind {,/mnt}/$m; done

    Run the chroot command to change into the mount directory.

    Note: If you have a separate /boot partition, then mount it to /mnt/boot before you run the following command.

    chroot /mnt
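The mount in step 14 depends on the file system of the root partition. The nouuid option is specific to XFS, where it lets you mount a volume whose file system UUID duplicates one that's already mounted (common when the rescue instance comes from the same AMI). Other file systems reject the option. The following is a minimal sketch of a helper that picks the option from the detected file system type. Both function names are hypothetical, and the sketch assumes that lsblk is available on the rescue instance.

```shell
# Hypothetical helper: print the mount options to use for a file system type.
# nouuid is an XFS-only option, so other types get the default options.
mount_opts_for_fstype() {
  case "$1" in
    xfs) echo "nouuid" ;;
    *)   echo "defaults" ;;
  esac
}

# Hypothetical helper: detect the file system on a partition, then mount it.
# Run as root. Example: mount_rescue_root /dev/xvdf1 /mnt
mount_rescue_root() {
  fstype=$(lsblk -no FSTYPE "$1")
  mkdir -p "$2"
  mount -o "$(mount_opts_for_fstype "$fstype")" "$1" "$2"
}
```

Separating the option lookup from the mount keeps the decision easy to check by hand before you touch the volume.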

Update the default kernel in the GRUB bootloader

The current corrupt kernel is in position 0 (zero) in the list. The last stable kernel is in position 1. To replace the corrupt kernel with the stable kernel, use one of the following procedures, based on your distribution.

GRUB1 (Legacy GRUB) for Red Hat 6 and Amazon Linux 1

Use the sed command to replace the corrupt kernel with the stable kernel in the /boot/grub/grub.conf file:

sed -i '/^default/ s/0/1/' /boot/grub/grub.conf
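To see what the expression does before you edit the real file, you can run the same sed expression without the -i option on sample input. The following sketch uses made-up grub.conf content, and the function name is hypothetical.

```shell
# Hypothetical helper: apply the same substitution to standard input
# instead of editing /boot/grub/grub.conf in place.
preview_grub1_default() {
  sed '/^default/ s/0/1/'
}

# Sample content only; a real grub.conf has more lines.
printf 'default=0\ntimeout=0\nhiddenmenu\n' | preview_grub1_default
# Prints:
# default=1
# timeout=0
# hiddenmenu
```

The /^default/ address limits the substitution to the line that starts with default, so the timeout line keeps its 0.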

GRUB2 for Ubuntu 14.04 LTS, 16.04, and 18.04

Complete the following steps:

  1. Replace the corrupt GRUB_DEFAULT=0 default menu entry with the stable GRUB_DEFAULT=saved value in the /etc/default/grub file:

    sed -i 's/GRUB_DEFAULT=0/GRUB_DEFAULT=saved/g' /etc/default/grub

  2. To make sure that GRUB recognizes the change, run the update-grub command:

    update-grub

  3. Run the grub-set-default command so that the stable kernel loads at the next reboot. The following example sets the default to the menu entry at position 1:

    grub-set-default 1
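Before you rely on position 1, it can help to confirm what that position holds. The following is a minimal sketch that lists the top-level menu entries of a generated grub.cfg file with their indexes. The function name is hypothetical, and the default path assumes the Ubuntu location /boot/grub/grub.cfg.

```shell
# Hypothetical helper: print top-level GRUB2 menu entries with their indexes.
# Top-level entries start at column 0; entries nested in a submenu are indented
# and are skipped here.
list_grub_entries() {
  awk -F"'" '/^(menuentry|submenu) /{ printf "%d: %s\n", n++, $2 }' \
    "${1:-/boot/grub/grub.cfg}"
}
```

On some Ubuntu images, position 1 is a submenu such as "Advanced options for Ubuntu" rather than a kernel. In that case, GRUB2 accepts a nested form such as `grub-set-default "1>2"`, where the second number is the position inside the submenu. Treat that value as an illustration and check it against your own menu.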

GRUB2 for RHEL 7 and Amazon Linux 2

Complete the following steps:

  1. Replace the corrupt GRUB_DEFAULT=0 default menu entry with the stable GRUB_DEFAULT=saved value in the /etc/default/grub file:

    sed -i 's/GRUB_DEFAULT=0/GRUB_DEFAULT=saved/g' /etc/default/grub

  2. Update GRUB to regenerate the /boot/grub2/grub.cfg file:

    grub2-mkconfig -o /boot/grub2/grub.cfg

  3. To make sure that the stable kernel loads at the next reboot, run the grub2-set-default command. The following example sets the default to the menu entry at position 1:

    grub2-set-default 1
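To confirm that the change was recorded, you can read the saved entry back from the GRUB environment block. The following sketch assumes that grub2-editenv is available inside the chroot. The function name is hypothetical, and it only filters text from standard input.

```shell
# Hypothetical helper: extract the saved default entry from
# "grub2-editenv list" output that is read from standard input.
saved_entry() {
  sed -n 's/^saved_entry=//p'
}

# Inside the chroot on RHEL 7 or Amazon Linux 2 (not run here):
#   grub2-editenv list | saved_entry
```

If the command prints 1, then the setting from the previous step is in place.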

GRUB2 for RHEL 8, CentOS 8, and Amazon Linux 2023

GRUB2 uses blscfg files and entries in /boot/loader for the boot configuration, instead of the previous grub.cfg format. It's a best practice to use the grubby tool to manage the blscfg files and retrieve information from /boot/loader/entries/. If the blscfg files are missing or corrupted, then grubby doesn't show any results. You must regenerate the files to recover functionality. The kernel indexes depend on the .conf files under /boot/loader/entries and on the kernel versions. Indexing keeps the latest kernel at the lowest index. For more information, see How do I recover my Red Hat 8 or CentOS 8 instance that fails to boot because of issues with the GRUB2 BLS configuration file?

Complete the following steps:

  1. To see the current default kernel, run the grubby --default-kernel command:

    grubby --default-kernel

  2. To see all available kernels and their indexes, run the grubby --info=ALL command:

    grubby --info=ALL

    The following is an example output from the --info=ALL command:

    [root@ip-172-31-29-221 /]# grubby --info=ALL
    index=0
    kernel="/boot/vmlinuz-4.18.0-305.el8.x86_64"
    args="ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau nvme_core.io_timeout=4294967295 crashkernel=auto $tuned_params"
    root="UUID=d35fe619-1d06-4ace-9fe3-169baad3e421"
    initrd="/boot/initramfs-4.18.0-305.el8.x86_64.img $tuned_initrd"
    title="Red Hat Enterprise Linux (4.18.0-305.el8.x86_64) 8.4 (Ootpa)"
    id="0c75beb2b6ca4d78b335e92f0002b619-4.18.0-305.el8.x86_64"
    index=1
    kernel="/boot/vmlinuz-0-rescue-0c75beb2b6ca4d78b335e92f0002b619"
    args="ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau nvme_core.io_timeout=4294967295 crashkernel=auto"
    root="UUID=d35fe619-1d06-4ace-9fe3-169baad3e421"
    initrd="/boot/initramfs-0-rescue-0c75beb2b6ca4d78b335e92f0002b619.img"
    title="Red Hat Enterprise Linux (0-rescue-0c75beb2b6ca4d78b335e92f0002b619) 8.4 (Ootpa)"
    id="0c75beb2b6ca4d78b335e92f0002b619-0-rescue"
    index=2
    kernel="/boot/vmlinuz-4.18.0-305.3.1.el8_4.x86_64"
    args="ro console=ttyS0,115200n8 console=tty0 net.ifnames=0 rd.blacklist=nouveau nvme_core.io_timeout=4294967295 crashkernel=auto $tuned_params"
    root="UUID=d35fe619-1d06-4ace-9fe3-169baad3e421"
    initrd="/boot/initramfs-4.18.0-305.3.1.el8_4.x86_64.img $tuned_initrd"
    title="Red Hat Enterprise Linux (4.18.0-305.3.1.el8_4.x86_64) 8.4 (Ootpa)"
    id="ec2fa869f66b627b3c98f33dfa6bc44d-4.18.0-305.3.1.el8_4.x86_64"

    Note the path of the kernel that you want to set as the default for your instance. In this example, the path for the kernel at index 2 is /boot/vmlinuz-4.18.0-305.3.1.el8_4.x86_64.

  3. To change the default kernel of the instance, run the grubby --set-default command:

    grubby --set-default=/boot/vmlinuz-4.18.0-305.3.1.el8_4.x86_64

    Note: Replace 4.18.0-305.3.1.el8_4.x86_64 with your kernel's version number.

  4. To verify that the previous command was successful, run the grubby --default-kernel command:

    grubby --default-kernel

    If you access the instance through the EC2 Serial Console, then the stable kernel now loads and you can reboot the instance. If you use a rescue instance, then complete the steps in the following section.
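The --info=ALL output in step 2 is long, so the index and kernel path pairs can be hard to pick out. The following is a minimal sketch that condenses that output to one line per boot entry. The function name is hypothetical, and it only filters text that it reads from standard input.

```shell
# Hypothetical helper: condense "grubby --info=ALL" output (on standard input)
# to one "index kernel-path" line per boot entry.
list_kernels() {
  awk -F= '
    /^index=/  { idx = $2 }
    /^kernel=/ { gsub(/"/, "", $2); print idx, $2 }
  '
}

# Inside the chroot (not run here):
#   grubby --info=ALL | list_kernels
```

For the sample output shown in step 2, the line for index 2 reads `2 /boot/vmlinuz-4.18.0-305.3.1.el8_4.x86_64`, which is the path passed to grubby --set-default in step 3.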

Unmount volumes, detach the root volume from the rescue instance, and then attach the volume to the impaired instance

If you used a rescue instance to access the root volume, then complete the following steps:

  1. Run the following commands to exit from chroot, and then unmount /dev, /run, /proc, /sys, and the /mnt directory itself:

    exit
    umount /mnt/{dev,proc,run,sys,}

  2. From the Amazon EC2 console, choose Instances. Then, choose the rescue instance.

  3. Choose Instance State, Stop instance, and then select Yes, Stop.

  4. Detach the impaired instance's root volume from the rescue instance.

  5. Attach the root volume that you detached to the impaired instance as the root volume (/dev/xvda or /dev/sda1). Then, start the instance.

Note: The root device differs by AMI. The names /dev/xvda or /dev/sda1 are reserved for the root device. For example, Amazon Linux 1, Amazon Linux 2, and Amazon Linux 2023 use /dev/xvda. Other distributions, such as Ubuntu 14, 16, 18, CentOS 7, and RHEL 7.5, use /dev/sda1.

The stable kernel now loads, and your instance reboots.
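If you prefer the AWS CLI to the console for steps 2 through 5 above, the following is a rough sketch of the same sequence. The function names are hypothetical, all IDs are values that you supply, and the sketch assumes a configured AWS CLI with permissions for the EC2 calls that it uses.

```shell
# Hypothetical helper: move an EBS volume to another instance.
# Arguments: volume ID, target instance ID, device name on the target.
move_volume() {
  aws ec2 detach-volume --volume-id "$1" &&
    aws ec2 wait volume-available --volume-ids "$1" &&
    aws ec2 attach-volume --volume-id "$1" --instance-id "$2" --device "$3"
}

# Hypothetical helper: stop the rescue instance, return the root volume
# to the impaired instance, and then start the impaired instance.
# Arguments: volume ID, rescue instance ID, impaired instance ID, root device name.
restore_root_volume() {
  aws ec2 stop-instances --instance-ids "$2" &&
    aws ec2 wait instance-stopped --instance-ids "$2" &&
    move_volume "$1" "$3" "$4" &&
    aws ec2 start-instances --instance-ids "$3"
}
```

A call might look like `restore_root_volume vol-0abc i-0rescue i-0impaired /dev/xvda`, where every ID is a placeholder. Use the root device name that matches your AMI, as described in the preceding note.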

AWS OFFICIAL
Updated 4 months ago
2 Comments

Stuck at step #14, my system throws an error when I try to use the mount -o nouuid option. I can, however, just do a normal mount and that seems to work fine ("mount /dev/nvme1n1p1 /mnt"), but I'm not sure if there are other implications with NOT using the -o nouuid option.

Also, if I push ahead, I get an error trying to "chroot /mnt": chroot: failed to run command ‘/bin/bash’: No such file or directory

Scott
replied 4 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied 4 months ago