
Help! Locked out in the middle of upgrading via dnf

0

SSH has stopped responding (even Ctrl+C does nothing), and for the last hour new SSH clients have been unable to connect, while the machine is in the middle of installing an upgrade via sudo dnf upgrade --releasever=2023.6.20241010. It stopped responding after displaying the following last few lines:

    Running transaction
      Running scriptlet: selinux-policy-targeted-38.1.45-1.amzn2023.0.1.noarch   1/1
      Preparing        :                                                         1/1
      Upgrading        : openssl-libs-1:3.0.8-1.amzn2023.0.16.x86_64             1/59
      Upgrading        : php8.2-common-8.2.23-1.amzn2023.0.1.x86_64              2/59
      Upgrading        : php8.2-pdo-8.2.23-1.amzn2023.0.1.x86_64                 3/59
      Upgrading        : php8.2-cli-8.2.23-1.amzn2023.0.1.x86_64                 4/59
      Upgrading        : php8.2-mbstring-8.2.23-1.amzn2023.0.1.x86_64            5/59
      Upgrading        : php8.2-opcache-8.2.23-1.amzn2023.0.1.x86_64             6/59
      Upgrading        : php8.2-process-8.2.23-1.amzn2023.0.1.x86_64             7/59
      Upgrading        : php8.2-sodium-8.2.23-1.amzn2023.0.1.x86_64              8/59
      Upgrading        : php8.2-xml-8.2.23-1.amzn2023.0.1.x86_64                 9/59
      Upgrading        : selinux-policy-38.1.45-1.amzn2023.0.1.noarch           10/59
      Running scriptlet: selinux-policy-38.1.45-1.amzn2023.0.1.noarch

I fear the latter scriptlet has locked me out by changing the security policy somehow. I last successfully upgraded to the latest releasever a couple of months ago and haven't had any problems doing it until today. Earlier today, I had checked which releasevers were available, i.e. 2023.5.20240916, 2023.5.20241001, 2023.6.20241010. I only tried to upgrade straight to 2023.6.20241010, not to the other two first.

EC2 Monitoring graphs show that ever since, the instance has been constantly at 100% CPU with constant small storage reads and writes, and my CPU Credit Balance is gradually going down (currently at about half of the max). I can no longer SSH in from my PC using a different window, and neither can EC2 Instance Connect (even after I authorized port 22 for the EC2 Instance Connect service IP addresses in my Region, as recommended).

I could try restarting the instance, but as it is still constantly at 100% CPU and doing reads and writes, I'm worried that doing this in the middle of upgrading packages (including removing and installing the kernel, and upgrading openssl, selinux-policy, and selinux-policy-targeted) might leave the OS in a permanently broken or locked-out state. It has never taken anywhere near this long to upgrade, so I'm concerned that the scriptlet or a later part of the installation process is bugged or stuck in an infinite loop or something like that. In total, there are 30 packages being updated:

    amazon-linux-repo-s3-2023.6.20241010-0.amzn2023.noarch.rpm
    kernel-libbpf-6.1.112-122.189.amzn2023.x86_64.rpm
    aws-cfn-bootstrap-2.0-31.amzn2023.noarch.rpm
    c-ares-1.19.1-1.amzn2023.0.1.x86_64.rpm
    iproute-6.10.0-319.amzn2023.0.1.x86_64.rpm
    kernel-livepatch-repo-s3-2023.6.20241010-0.amzn2023.noarch.rpm
    kernel-tools-6.1.112-122.189.amzn2023.x86_64.rpm
    libgcrypt-1.10.2-1.amzn2023.0.2.x86_64.rpm
    nginx-filesystem-1.24.0-1.amzn2023.0.4.noarch.rpm
    openssl-3.0.8-1.amzn2023.0.16.x86_64.rpm
    openssl-libs-3.0.8-1.amzn2023.0.16.x86_64.rpm
    amazon-ssm-agent-3.3.859.0-1.amzn2023.x86_64.rpm
    php8.2-8.2.23-1.amzn2023.0.1.x86_64.rpm
    kernel-6.1.112-122.189.amzn2023.x86_64.rpm
    php8.2-common-8.2.23-1.amzn2023.0.1.x86_64.rpm
    php8.2-cli-8.2.23-1.amzn2023.0.1.x86_64.rpm
    php8.2-fpm-8.2.23-1.amzn2023.0.1.x86_64.rpm
    php8.2-intl-8.2.23-1.amzn2023.0.1.x86_64.rpm
    php8.2-mbstring-8.2.23-1.amzn2023.0.1.x86_64.rpm
    php8.2-opcache-8.2.23-1.amzn2023.0.1.x86_64.rpm
    php8.2-mysqlnd-8.2.23-1.amzn2023.0.1.x86_64.rpm
    php8.2-sodium-8.2.23-1.amzn2023.0.1.x86_64.rpm
    php8.2-pdo-8.2.23-1.amzn2023.0.1.x86_64.rpm
    php8.2-process-8.2.23-1.amzn2023.0.1.x86_64.rpm
    python3-pip-wheel-21.3.1-2.amzn2023.0.8.noarch.rpm
    php8.2-zip-8.2.23-1.amzn2023.0.1.x86_64.rpm
    php8.2-xml-8.2.23-1.amzn2023.0.1.x86_64.rpm
    selinux-policy-38.1.45-1.amzn2023.0.1.noarch.rpm
    system-release-2023.6.20241010-0.amzn2023.noarch.rpm
    selinux-policy-targeted-38.1.45-1.amzn2023.0.1.noarch.rpm

What should I do? Is restarting the instance my only option?

asked 2 years ago, 1.2K views
4 Answers
2
Accepted Answer

Given the situation you've described, it appears that your Amazon Linux 2023 instance has encountered an issue during the upgrade process. While it's concerning that the upgrade has been running for an extended period and the system is unresponsive, there are a few things to consider before taking action.

  1. Patience: Sometimes, upgrades can take longer than expected, especially when dealing with critical system components like the kernel, OpenSSL, and SELinux policies. The high CPU usage could indicate that the system is still actively working on the upgrade.

  2. Avoid interruption if possible: Interrupting an upgrade process, especially one involving core system components, can potentially leave the system in an inconsistent state. This is why it's generally recommended to avoid forcibly stopping an upgrade in progress.

  3. Monitor for changes: Keep an eye on the CPU usage and any changes in the instance's behavior. If you see the CPU usage drop or any signs of progress, it might indicate that the upgrade is still proceeding.
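
One way to watch for that drop in CPU usage without a working shell on the instance is to query CloudWatch from another machine. The following is a minimal sketch, assuming the AWS CLI is configured locally and GNU date is available. The helper names recent_cpu and all_below, the 30-minute window, and the 5% threshold in the usage line are illustrative choices, not values from the original answer.

```shell
# all_below THRESHOLD
# Reads whitespace-separated numbers on stdin. Succeeds only if at least
# one value was read and every value is strictly below THRESHOLD.
all_below() {
  awk -v t="$1" '
    { for (i = 1; i <= NF; i++) { n++; if ($i + 0 >= t + 0) bad = 1 } }
    END { exit ((n > 0 && !bad) ? 0 : 1) }'
}

# recent_cpu INSTANCE_ID
# Prints the 5-minute average CPUUtilization datapoints for the last
# 30 minutes (unordered). Requires GNU date for the "-d" option.
recent_cpu() {
  end_time=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  start_time=$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$1" \
    --start-time "$start_time" --end-time "$end_time" \
    --period 300 --statistics Average \
    --query 'Datapoints[].Average' --output text
}

# Example usage (hypothetical instance ID):
#   recent_cpu i-0123456789abcdef0 | all_below 5 && echo "CPU has gone quiet"
```

A sustained run of low readings, rather than a single one, is a more convincing sign that the transaction has finished.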

However, given that it's been unresponsive for an extended period, you may need to consider more drastic measures:

  1. Restart as a last resort: If the system remains unresponsive for several hours with no signs of progress, a restart might be your only option. Be aware that this carries some risk of leaving the system in an inconsistent state.

  2. Before restarting: If possible, create a snapshot of the EBS volume attached to your instance. This will allow you to recover data or potentially roll back if the restart results in an unbootable system.

  3. After restarting: If the system boots successfully, immediately rerun "sudo dnf upgrade --releasever=2023.6.20241010" to see whether any packages still need to be upgraded. Note that dnf does not automatically roll back an interrupted transaction, so anything left half-finished has to be completed or cleaned up manually.

  4. Recovery options: If the system fails to boot properly after a restart, you may need to attach the volume to another instance to recover data or attempt to fix the system manually.
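
The snapshot-then-restart sequence above can be sketched with the AWS CLI as follows. This is only an illustration, assuming the CLI is configured with permission to describe instances, create snapshots, and reboot instances; the function name snapshot_then_reboot is hypothetical. A snapshot taken while the instance is still writing is crash-consistent at best, and the wait step may time out on large volumes.

```shell
# snapshot_then_reboot INSTANCE_ID
# Snapshots every EBS volume attached to the instance, waits for each
# snapshot to complete, and only then requests a reboot.
snapshot_then_reboot() {
  instance_id="$1"

  # Look up the IDs of all attached EBS volumes.
  volume_ids=$(aws ec2 describe-instances --instance-ids "$instance_id" \
    --query 'Reservations[].Instances[].BlockDeviceMappings[].Ebs.VolumeId' \
    --output text) || return 1

  for vol in $volume_ids; do
    snap=$(aws ec2 create-snapshot --volume-id "$vol" \
      --description "Pre-reboot backup of $instance_id" \
      --query SnapshotId --output text) || return 1
    echo "Waiting for $snap ($vol) to complete..."
    aws ec2 wait snapshot-completed --snapshot-ids "$snap" || return 1
  done

  # Reached only if every snapshot completed.
  aws ec2 reboot-instances --instance-ids "$instance_id"
}

# Example usage (hypothetical instance ID):
#   snapshot_then_reboot i-0123456789abcdef0
```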

For future upgrades, consider the following best practices:

  • Perform upgrades during low-traffic periods
  • Use AWS Systems Manager Patch Manager for more controlled upgrades
  • Consider using Amazon Machine Images (AMIs) for testing upgrades before applying them to production instances
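
As a rough sketch of the AMI suggestion, an image of the instance can be created before upgrading, then launched as a throwaway copy on which to rehearse the upgrade. This assumes a configured AWS CLI; the function name backup_ami and the naming scheme are hypothetical. The --no-reboot flag avoids downtime at the cost of not guaranteeing file system integrity in the image.

```shell
# backup_ami INSTANCE_ID
# Creates an AMI from the instance without rebooting it, waits until the
# image is available, and prints the new image ID.
backup_ami() {
  instance_id="$1"
  name="pre-upgrade-$instance_id-$(date -u +%Y%m%d%H%M%S)"

  ami=$(aws ec2 create-image --instance-id "$instance_id" --name "$name" \
    --no-reboot --query ImageId --output text) || return 1
  aws ec2 wait image-available --image-ids "$ami" || return 1
  echo "$ami"
}

# Example usage (hypothetical instance ID):
#   ami_id=$(backup_ami i-0123456789abcdef0)
```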

Remember, while upgrading to the latest version is important for security and features, it's crucial to balance this with system stability and uptime requirements.
Sources
Using Deterministic upgrades through versioned repositories - Amazon Linux 2023
Amazon Linux 2023, a Cloud-Optimized Linux Distribution with Long-Term Support | AWS News Blog

answered 2 years ago
EXPERT
reviewed 2 years ago
0

Update: Solved! Two hours after I started the dnf command, EC2 Monitoring graphs showed that CPU usage and reads and writes had all dropped to near zero. I still wasn't able to connect via SSH, so I waited half an hour out of caution, then successfully restarted the instance, and I'm now able to SSH in. Phew! Patience paid off. I also created a snapshot of the EBS volume beforehand, just in case.
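
After getting back in like this, one read-only sanity check is to look for packages installed in two versions at once, which is a common leftover of an interrupted transaction. The sketch below is an illustration, not something from the thread: dup_names and list_installed are hypothetical helper names, and the list of packages that are allowed to have several versions installed (kernel packages and gpg-pubkey) is an assumption that may need adjusting.

```shell
# dup_names
# Reads "name.arch" lines on stdin and prints each one that occurs more
# than once, ignoring packages that may legitimately coexist in
# several versions.
dup_names() {
  grep -Ev '^(kernel|kernel-core|kernel-modules|kernel-devel|gpg-pubkey)\.' |
    sort | uniq -d
}

# list_installed
# Prints every installed package as "name.arch", one per line.
list_installed() {
  rpm -qa --qf '%{NAME}.%{ARCH}\n'
}

# Example usage on the upgraded instance:
#   list_installed | dup_names     # no output means no duplicates found
#   cat /etc/system-release        # confirm the release that is now installed
#   sudo dnf history list          # review recent transactions
```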

answered 2 years ago
0

I've hit a similar situation on an EC2 instance running Rocky 8. I ran dnf -y upgrade and the instance got as far as:

    Upgrading        : selinux-policy-3.14.3-139.el8_10.1.noarch
    Running scriptlet: selinux-policy-3.14.3-139.el8_10.1.noarch

At this point, it has been doing this for more than an hour. It's worth noting that, in my case, I'm using a t3a.micro instance, so I don't have many resources.

Eventually, the process timed out with this error:

    /var/tmp/rpm-tmp.772Ak9: line 1: 21876 Killed semodule -nB

and then it continued with:

    Running scriptlet: selinux-policy-targeted-3.14.3-139.el8_10.1.noarch
    Upgrading        : selinux-policy-targeted-3.14.3-139.el8_10.1.noarch

So, as mentioned above, patience is the key. The kill is likely to be an OOM (out-of-memory) issue, so once I get back in, I'll create a swap device for additional memory and rerun the upgrade.
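
To confirm an OOM kill rather than guess at it, the kernel log can be searched for the OOM killer's "Killed process" messages. This is a minimal sketch: the helper name oom_victims is hypothetical, and the exact wording of the kernel message varies between kernel versions, so the pattern only relies on the "Killed process PID (name)" part.

```shell
# oom_victims
# Reads kernel log lines on stdin and prints "PID NAME" for every
# process the kernel reports as killed.
oom_victims() {
  sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Example usage once the instance is reachable again:
#   sudo dmesg | oom_victims
#   sudo journalctl -k --no-pager | oom_victims
```

If semodule shows up in the output, the kill was down to memory pressure rather than a fault in the package.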

answered a year ago
0

To add a little more to my last response: as I am just starting to write some docs for a new process, I was able to delete the instance and start again. This time, I added

    dd if=/dev/zero of=/swapfile bs=1M count=2000
    chmod 0600 /swapfile
    mkswap /swapfile
    swapon /swapfile

before I ran the dnf -y update.

This time, all of the scriptlets ran through, and everything from booting my new instance to completing a full update took less than 13 minutes. During this time, it used ~150MB of swap (although I didn't get the exact number).
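
The swap steps above can be wrapped in a small guard that only creates the swap file when the machine is short on memory. This is a sketch under stated assumptions: the function names mem_plus_swap_mb and ensure_swap are hypothetical, the 2048 MiB threshold in the usage line is an arbitrary example rather than a measured requirement, and the script must run as root. The swap file it creates does not survive a reboot unless it is also added to /etc/fstab.

```shell
# mem_plus_swap_mb [FILE]
# Prints MemTotal + SwapTotal in MiB, read from FILE (default /proc/meminfo).
mem_plus_swap_mb() {
  awk '$1 == "MemTotal:" || $1 == "SwapTotal:" { kb += $2 }
       END { print int(kb / 1024) }' "${1:-/proc/meminfo}"
}

# ensure_swap MIN_MB [SWAPFILE]
# Creates and enables a 2000 MiB swap file (the size used in the post
# above) if RAM + swap is currently below MIN_MB. Must be run as root.
ensure_swap() {
  min_mb="$1"
  swapfile="${2:-/swapfile}"
  have_mb=$(mem_plus_swap_mb)

  if [ "$have_mb" -ge "$min_mb" ]; then
    echo "RAM + swap is ${have_mb} MiB; no extra swap needed"
    return 0
  fi
  if [ -e "$swapfile" ]; then
    echo "$swapfile already exists; not overwriting it" >&2
    return 1
  fi

  dd if=/dev/zero of="$swapfile" bs=1M count=2000 &&
    chmod 0600 "$swapfile" &&
    mkswap "$swapfile" &&
    swapon "$swapfile"
}

# Example usage (as root), before running the upgrade:
#   ensure_swap 2048 && dnf -y update
```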

answered a year ago
