How do I recover an EC2 Linux instance that’s failing to boot because of disk errors?

My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance that launched from a custom Amazon Machine Image (AMI) has a disk error.

Short description

The following misconfigurations in the fstab file cause disk errors for Amazon EC2 instances that are created from a custom AMI:

  • Incorrect device ID
  • Incorrect path name
  • Incorrect or duplicate UUID

To resolve these errors, you must access the instance's operating system. If your instance is inaccessible, then use one of the following ways to access it:

  • The EC2 serial console
  • A rescue instance to manually correct errors

Resolution

Use the EC2 serial console

If you turned on the EC2 serial console for Linux instances, then you can use it to troubleshoot supported Nitro-based instance types and bare metal instances. The serial console helps you troubleshoot boot issues and network and SSH configuration issues. The serial console connects to your instance without needing a working network connection. To access the serial console, use the Amazon EC2 console or the AWS Command Line Interface (AWS CLI).

If you're using the EC2 serial console for the first time, then review the prerequisites and configure access before you try to connect. If your instance is unreachable and you didn't configure access to the serial console, then follow the instructions in the Use a rescue instance to manually correct errors section. For information on configuring the EC2 serial console, see Configure access to the EC2 serial console.
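As a rough sketch of the AWS CLI route, the commented commands below check and turn on account-level serial console access, push a temporary public key, and open an SSH session to serial port 0. The instance ID, Region, and key file names are placeholders. The helper function serial_console_target is hypothetical, and it assumes the commercial-Region endpoint naming, so verify the endpoint for your Region before you rely on it.

```shell
# Hypothetical helper: print the SSH target for serial port 0 of an instance.
# $1 = instance ID, $2 = Region.
# Assumption: commercial-Region endpoint naming; other partitions can differ.
serial_console_target() {
    echo "$1.port0@serial-console.ec2-instance-connect.$2.aws"
}

# Usage (all IDs and file names are placeholders):
#   aws ec2 get-serial-console-access-status
#   aws ec2 enable-serial-console-access
#   aws ec2-instance-connect send-serial-console-ssh-public-key \
#       --instance-id i-0123456789abcdef0 --serial-port 0 \
#       --ssh-public-key file://my_key.pub
#   ssh -i my_key "$(serial_console_target i-0123456789abcdef0 us-east-1)"
```

The pushed key is valid only for a short time, so run the ssh command right after you send the key.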

Use a rescue instance to manually correct errors

Warning: The following procedure requires stopping the instance. Data that's stored in instance store volumes is lost when the instance is stopped. Make sure that you save a backup of the data before stopping the instance. Unlike Amazon Elastic Block Store (Amazon EBS)-backed volumes, instance store volumes are ephemeral and don't support data persistence.

The public IPv4 address that Amazon EC2 automatically assigned to the instance on launch or start changes after the stop and start. To retain a public IPv4 address that doesn't change when the instance is stopped, use an Elastic IP address.

For more information, see What happens when you stop an instance.

1.    Use the same AMI as the instance with the disk errors to launch a new EC2 instance in your virtual private cloud (VPC). Launch the new instance in the same Availability Zone as the impaired instance. The new instance becomes your rescue instance.

Or, use an existing instance that uses the same AMI and is in the same Availability Zone as your impaired instance.

2.    Stop the impaired instance.

3.    Detach the Amazon EBS root volume (/dev/xvda or /dev/sda1) from your impaired instance. Note the device name (/dev/xvda or /dev/sda1) of your root volume.

4.    Attach the volume as a secondary device (/dev/sdf) to the rescue instance.

5.    Use SSH to connect to your rescue instance.

6.    Create a mount point directory (/rescue) for the new volume that's attached to the rescue instance:

$ sudo mkdir /rescue

Then, mount the volume at the directory:

$ sudo mount /dev/xvdf1 /rescue

Note: The device (/dev/xvdf1) might be attached to the rescue instance with a different device name. To determine the correct device names, use the lsblk command to view available disk devices along with their mount points.
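To narrow down which device to mount, one option is to filter the lsblk output for partitions that have no mount point. The following is a minimal sketch. The function name unmounted_partitions is hypothetical, and it assumes that the volume has a partition table (a file system written directly to the whole disk isn't listed). If the file system is XFS and the rescue instance was launched from the same AMI, then both root file systems can carry the same UUID. In that case, the XFS nouuid mount option is one way to mount the volume.

```shell
# Hypothetical helper: read `lsblk -rno NAME,TYPE,MOUNTPOINT` output on stdin
# and print the partitions that are not mounted anywhere.
# A mounted partition has three fields; an unmounted one has only two.
unmounted_partitions() {
    awk '$2 == "part" && NF == 2 { print "/dev/" $1 }'
}

# Usage on the rescue instance (device names are examples):
#   lsblk -rno NAME,TYPE,MOUNTPOINT | unmounted_partitions
#   sudo mount /dev/xvdf1 /rescue
#   sudo mount -o nouuid /dev/xvdf1 /rescue   # XFS with a duplicate UUID
```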

7.    Troubleshoot the volume:

Run the following command to check the fstab entry:

$ cat /rescue/etc/fstab

The following example fstab file has entries for two volumes:

UUID=47834bf7-764e-42f9-9507-11a3e70b99de / xfs defaults,noatime 1 1
UUID=1b75bcf4-ee55-428e-8d68-88dca398da01 /test xfs defaults,nofail 0 2

In the fstab file, check the following configurations:

  • Make sure that the entries don't identify volumes by device ID (for example, /dev/xvdf1). Device IDs can change between instances and reboots, which causes mount failures. It's a best practice to use UUIDs instead. If device IDs are in the file, then replace them with each volume's UUID. Run the following command to find the correct UUID:

    $ lsblk -f

  • Make sure that the path for the mount point is correct and exists in the volume.

  • Verify that there are no duplicate UUIDs. Also, verify that additional volumes on the failing instance don't use the same UUIDs. To check the UUIDs of additional volumes, use the preceding steps to attach them to the rescue instance. If there are duplicate UUIDs, then you can't mount volumes, and you receive an error similar to the following one:

    mount: /rescue: wrong fs type, bad option, bad superblock on /dev/xvdg1, missing codepage or helper program, or other error

    Run the following command to check the UUID of the attached volumes:

    $ lsblk -f

    If there are duplicate UUIDs, then run the following command to change the UUIDs:

    For XFS file systems:

    $ sudo xfs_admin -U <unique UUID> /dev/xvdb1

    For EXT4 file systems:

    $ sudo tune2fs -U random /dev/xvdb1

    Also, add the nofail option to your fstab entries, such as in the following example:

    UUID=aebf131c-6957-451e-8d34-ec978d9581ae /data xfs defaults,nofail 0 2

    Note: The nofail mount option allows the instance to boot even when mount errors occur, such as when you boot the instance without this specific volume attached. For example, you might try to boot the instance after moving the volume to another instance. On Debian derivatives, including Ubuntu versions earlier than 16.04, you must also add the nobootwait mount option.
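The checks above can be partly automated. The following is a minimal sketch, not an official tool. All four function names are hypothetical, and the sketch assumes that the impaired root volume is mounted at /rescue and that the fstab fields are separated by white space. It inspects only the fstab file, so you still need lsblk -f to compare UUIDs across the attached volumes.

```shell
# Entries that identify the volume by device name instead of UUID or LABEL.
fstab_device_entries() {
    awk '$1 ~ /^\/dev\// { print $1, $2 }' "$1"
}

# UUID= specifiers that appear on more than one line of the file.
fstab_duplicate_uuids() {
    awk '$1 ~ /^UUID=/ { n[$1]++ }
         END { for (u in n) if (n[u] > 1) print u }' "$1"
}

# Non-root entries whose option list does not include nofail.
fstab_missing_nofail() {
    awk 'NF >= 4 && $1 !~ /^#/ && $2 != "/" &&
         ("," $4 ",") !~ /,nofail,/ { print $1, $2 }' "$1"
}

# Mount points that do not exist as directories under the rescued root.
# $1 = fstab path, $2 = directory where the rescued root volume is mounted.
fstab_missing_mountpoints() {
    awk 'NF >= 2 && $1 !~ /^#/ && $2 ~ /^\// { print $2 }' "$1" |
    while read -r mp; do
        [ -d "$2$mp" ] || echo "$mp"
    done
}

# Usage on the rescue instance:
#   fstab_device_entries /rescue/etc/fstab
#   fstab_duplicate_uuids /rescue/etc/fstab
#   fstab_missing_nofail /rescue/etc/fstab
#   fstab_missing_mountpoints /rescue/etc/fstab /rescue
```

Each function prints nothing when its check passes, so any output points at a line to review by hand.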

8.    After correcting the errors, save the file, and then run the umount command to unmount the mounted volume.

$ sudo umount /rescue

9.    Detach the volume from the rescue instance.

10.    Attach the volume to the original instance as the root volume. Use the device name that you noted in step 3 (/dev/xvda or /dev/sda1). Then, start the instance to confirm that it boots successfully.
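If you prefer the AWS CLI to the console for the stop, detach, and attach steps, then the volume moves can be sketched as follows. The helper names run, stop_instance, and move_volume are hypothetical, and all IDs in the usage comments are placeholders. Set DRY_RUN=1 to print the commands instead of running them.

```shell
# Echo the command when DRY_RUN is set; otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# stop_instance INSTANCE_ID: stop the instance and wait until it is stopped.
stop_instance() {
    run aws ec2 stop-instances --instance-ids "$1" &&
    run aws ec2 wait instance-stopped --instance-ids "$1"
}

# move_volume VOLUME_ID TARGET_INSTANCE_ID DEVICE_NAME:
# detach the volume, wait until it is available, then attach it to the target.
move_volume() {
    run aws ec2 detach-volume --volume-id "$1" &&
    run aws ec2 wait volume-available --volume-ids "$1" &&
    run aws ec2 attach-volume --volume-id "$1" --instance-id "$2" --device "$3"
}

# Usage (placeholder IDs; unmount the volume on the rescue instance
# before you move it back):
#   stop_instance i-IMPAIRED
#   move_volume vol-ROOT i-RESCUE /dev/sdf
#   move_volume vol-ROOT i-IMPAIRED /dev/xvda
#   aws ec2 start-instances --instance-ids i-IMPAIRED
```

The wait commands keep the next step from starting before the previous state change is complete.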

AWS OFFICIAL
Updated a year ago