Instance failed and won't reboot


We have a t2.large instance in us-west-1c where we updated the Amazon Linux 2023 OS with 'dnf', and now it won't restart after a reboot. It gets stuck in the 'initializing' state when we stop and start it via the console.

asked 15 days ago · 56 views
2 Answers
Accepted Answer

We ran into a similar issue recently after running dnf update on an Amazon Linux 2023 instance. After the reboot, it just stayed stuck in the “initializing” state and we couldn’t SSH in either.

A couple of things that helped us:

We used the EC2 Serial Console (make sure it’s enabled) to view the boot process and spotted some boot-level errors related to the update.

You can also try detaching the root volume, attaching it to another EC2 instance, and checking /var/log/ (especially dnf.log, messages, and boot.log) to see what went wrong after the update.

In our case, rolling back some changes manually from a rescue instance and reattaching the volume did the trick.

Also, worth checking if the kernel or grub config got modified during the update. Hope this helps — let me know if you need a quick guide on the rescue steps.
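In case it's useful, the rescue workflow described above looks roughly like this with the AWS CLI. This is just a sketch: the volume/instance IDs are placeholders, and the device and partition names depend on your instance type (on Nitro-based instances an attached EBS volume typically appears as /dev/nvme1n1).

```shell
# Stop the broken instance first, then detach its root volume
aws ec2 detach-volume --volume-id vol-0abc123

# Attach it to a healthy rescue instance in the same Availability Zone
aws ec2 attach-volume --volume-id vol-0abc123 \
    --instance-id i-0rescue456 --device /dev/sdf

# On the rescue instance: mount the volume (root FS is often partition 1)
sudo mkdir -p /mnt/rescue
sudo mount /dev/nvme1n1p1 /mnt/rescue

# Inspect the update and boot logs for errors
sudo less /mnt/rescue/var/log/dnf.log
sudo grep -iE 'error|fail' /mnt/rescue/var/log/messages | tail -n 50
```

When you're done, unmount, detach the volume, and reattach it to the original instance as /dev/xvda (or whatever its original root device name was).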

answered 15 days ago
EXPERT
reviewed 15 days ago

I understand you're facing a challenging situation with your t2.large instance running Amazon Linux 2023 in us-west-1c. This is a common issue after system updates. Here are some steps you can take to troubleshoot and potentially resolve the problem:

  1. First, try using the AWS CLI to force stop the instance. Sometimes this can be more effective than using the console. The command would be:

    aws ec2 stop-instances --instance-ids your-instance-id --force

    You can run this command from AWS CloudShell if you don't have the CLI set up locally.

  2. If the force stop doesn't work after a few minutes, you may need to take more advanced steps:

    a. Stop the instance from the EC2 console, detach the root volume, and attach it to another healthy instance.

    b. SSH into the healthy instance and mount the detached volume to inspect system logs for any errors. Look in the /var/log/ directory of the mounted volume.

    c. If you identify any problematic services, you can disable them using the systemctl disable command.

    d. If the issue is due to a kernel update, you might need to set the instance to boot from an older kernel by modifying the GRUB configuration.

  3. After making any necessary changes, detach the volume from the healthy instance and reattach it to the original instance.

  4. Start the original instance and check if the issue is resolved.

  5. If the problem persists, you might need to roll back the recent updates. With dnf you can run 'dnf history undo last' (or 'dnf history undo <transaction-id>') to revert the offending transaction; since the instance won't boot, you would do this from a rescue instance via a chroot into the mounted root volume.

  6. As a last resort, if you're unable to resolve the issue, you may need to create a new instance from the most recent AMI or snapshot of your instance before the update.
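Putting steps 2–5 together: once the broken root volume is mounted on a rescue instance (say at /mnt/rescue, an assumed mount point), a chroot session lets you roll back the dnf transaction and fall back to the previous kernel. Treat this as a sketch and verify the kernel index before changing the default:

```shell
# Bind-mount pseudo-filesystems so tools work inside the chroot
for d in /dev /proc /sys; do sudo mount --bind "$d" "/mnt/rescue$d"; done
sudo chroot /mnt/rescue /bin/bash

# Inside the chroot: review recent dnf transactions and undo the bad one
dnf history list | head
dnf history undo last        # or: dnf history undo <transaction-id>

# If a kernel update is the suspect, select the previous kernel as default
grubby --info=ALL | grep -E '^(index|title)'
grubby --set-default-index=1   # index of the previous kernel; verify first!

exit   # leave the chroot, then unmount and detach the volume
```

Amazon Linux 2023 uses grub2 with BLS entries, so `grubby` is the supported way to change the default kernel rather than editing grub.cfg by hand.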

Remember, you are not charged for instance usage while the instance is stopped, although EBS storage charges still apply. Given that this is likely a production instance, it's advisable to resolve this quickly to minimize downtime.

If these steps don't resolve your issue, you may need to reach out to AWS Support for more specialized assistance.
Sources
What do we do when an instance won't reboot or even force stop? | AWS re:Post
EC2 instance not restarting | AWS re:Post
Instance reachability check failed after yum update | AWS re:Post
One of my ec2 instance (Amazon Linux) stopped booting suddenly today! | AWS re:Post

answered 15 days ago
EXPERT
reviewed 15 days ago
  • Thank you. I eventually started a new instance and attached my data volume and the boot volume from a snapshot made last year. But I'll definitely take a look at the dnf log on the "bad boot volume" as you suggested. I hope it won't repeat since I did the same update on many other servers and only this one failed.

    It was particularly odd because I tried the Replace Root Volume feature and it seemed to work, but it later reported that it had failed. A volume was created with an ID, but after I detached the prior one and tried to attach the new one, it failed with the same sort of error. At that point I was stuck: I couldn't remove the volume (it said it didn't exist) and I couldn't find it anymore. That's when I launched a new instance and re-attached my volumes using the older snapshot of my root/boot volume.

    But I hope I can track down the cause or hope it doesn't repeat as the new instance of course wants the DNF update to be done again.
