Given the situation you've described, it appears that your Amazon Linux 2023 instance has encountered an issue during the upgrade process. While it's concerning that the upgrade has been running for an extended period and the system is unresponsive, there are a few things to consider before taking action.
- Patience: Sometimes, upgrades can take longer than expected, especially when dealing with critical system components like the kernel, OpenSSL, and SELinux policies. The high CPU usage could indicate that the system is still actively working on the upgrade.
- Avoid interruption if possible: Interrupting an upgrade process, especially one involving core system components, can leave the system in an inconsistent state. This is why it's generally recommended to avoid forcibly stopping an upgrade in progress.
- Monitor for changes: Keep an eye on the CPU usage and any changes in the instance's behavior. If you see the CPU usage drop or any signs of progress, it might indicate that the upgrade is still proceeding.
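If SSH is down, the CPU metric can still be watched from another machine. Here is a rough sketch, assuming the AWS CLI is configured with access to CloudWatch and that GNU date is available. The helper names recent_cpu and all_below, the 5% threshold, and the instance ID are placeholders made up for this example, not anything from the original question.

```shell
#!/usr/bin/env bash
# Sketch only: function names, the threshold, and the instance ID are illustrative assumptions.

# Print the average CPUUtilization datapoints (one per line) for the last 30 minutes.
recent_cpu() {
  local instance_id="$1"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$instance_id" \
    --start-time "$(date -u -d '30 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 \
    --statistics Average \
    --query 'Datapoints[].Average' \
    --output text | tr '\t' '\n'
}

# Succeed only if at least one numeric value was read and every value is below the threshold.
all_below() {
  local threshold="$1"
  awk -v t="$threshold" '
    $1 !~ /^[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$/ { next }   # ignore blanks and non-numbers
    { n++; if ($1 + 0 >= t) busy = 1 }
    END { exit (n > 0 && !busy) ? 0 : 1 }
  '
}

# Example (placeholder instance ID):
#   recent_cpu i-0123456789abcdef0 | all_below 5 && echo "CPU has gone quiet"
```

A quiet CPU graph does not prove the upgrade finished cleanly, so treat it as one signal alongside disk read/write activity.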
However, given that it's been unresponsive for an extended period, you may need to consider more drastic measures:
- Restart as a last resort: If the system remains unresponsive for several hours with no signs of progress, a restart might be your only option. Be aware that this carries some risk of leaving the system in an inconsistent state.
- Before restarting: If possible, create a snapshot of the EBS volume attached to your instance. This will allow you to recover data or potentially roll back if the restart results in an unbootable system.
- After restarting: If the system boots successfully, immediately check the status of the upgrade using "sudo dnf upgrade --releasever=2023.6.20241010" to see if it needs to be completed or if it will roll back changes.
- Recovery options: If the system fails to boot properly after a restart, you may need to attach the volume to another instance to recover data or attempt to fix the system manually.
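The "before restarting" step can be done from any machine with the AWS CLI. This is a minimal sketch, assuming credentials that allow ec2:DescribeVolumes and ec2:CreateSnapshot. The function names, the DRY_RUN switch, and the instance ID are hypothetical and only there for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, DRY_RUN, and the IDs in the example are assumptions.

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Print the IDs of all EBS volumes attached to the given instance.
list_attached_volumes() {
  aws ec2 describe-volumes \
    --filters "Name=attachment.instance-id,Values=$1" \
    --query 'Volumes[].VolumeId' \
    --output text
}

# Start a snapshot of one volume (the snapshot completes asynchronously).
snapshot_volume() {
  local volume_id="$1"
  run aws ec2 create-snapshot \
    --volume-id "$volume_id" \
    --description "pre-restart backup of $volume_id"
}

# Example (placeholder instance ID):
#   for v in $(list_attached_volumes i-0123456789abcdef0); do snapshot_volume "$v"; done
```

Running it once with DRY_RUN=1 shows exactly which snapshot commands would be issued before anything is created.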
For future upgrades, consider the following best practices:
- Perform upgrades during low-traffic periods
- Use AWS Systems Manager Patch Manager for more controlled upgrades
- Consider using Amazon Machine Images (AMIs) for testing upgrades before applying them to production instances
Remember, while upgrading to the latest version is important for security and features, it's crucial to balance this with system stability and uptime requirements.
Sources
Using Deterministic upgrades through versioned repositories - Amazon Linux 2023
Amazon Linux 2023, a Cloud-Optimized Linux Distribution with Long-Term Support | AWS News Blog
Update: Solved! Two hours after I started the dnf command, EC2 Monitoring graphs showed that the CPU usage dropped to near zero, and read & writes dropped to near-zero. I still wasn't able to connect via SSH, so I waited half an hour out of caution, then successfully restarted the instance and I'm now able to SSH in. Phew! Patience paid off. I also created a snapshot of the EBS volume beforehand, just in case.
I've hit a similar situation on an EC2 instance running Rocky 8. I ran dnf -y upgrade and the instance got as far as:
Upgrading : selinux-policy-3.14.3-139.el8_10.1.noarch
Running scriptlet: selinux-policy-3.14.3-139.el8_10.1.noarch
At this point, it had been doing this for more than an hour. It's worth noting that, in my case, I'm using a t3a.micro instance, so I don't have many resources.
Eventually, the process timed out with this error:
/var/tmp/rpm-tmp.772Ak9: line 1: 21876 Killed semodule -nB
and then it continued with:
Running scriptlet: selinux-policy-targeted-3.14.3-139.el8_10.1.noarch
Upgrading : selinux-policy-targeted-3.14.3-139.el8_10.1.noarch
So, as was mentioned above, patience is the key. The kill is likely to be an OOM issue, so once I get back in, I'll create a swap device for additional memory and rerun.
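One way to confirm the OOM suspicion once you can log in again is to search the kernel log. This is a small sketch: the helper name find_oom_kills is made up here, and the patterns are an assumption, since the exact wording of OOM messages varies between kernel versions.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and the match patterns are assumptions.

# Read kernel log text on stdin and print only lines that look like OOM-killer activity.
find_oom_kills() {
  grep -Ei 'out of memory|oom-kill|oom_reaper'
}

# Examples:
#   sudo dmesg -T | find_oom_kills
#   sudo journalctl -k --no-pager | find_oom_kills
#   free -m    # shows how much RAM and swap the instance currently has
```

If semodule shows up in the matching lines, that supports the theory that the scriptlet ran out of memory rather than hanging.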
To add a little more to my last response: as I am just starting to write some docs for a new process, I was able to delete the instance and start again. This time, I added
dd if=/dev/zero of=/swapfile bs=1M count=2000
chmod 0600 /swapfile
mkswap /swapfile
swapon /swapfile
before I ran the dnf -y update.
This time, all the scriptlets ran through and, from booting my new instance to performing a full update, everything completed inside of 13 minutes. During this time, it used ~150MB of swap (although I didn't get the exact number).
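For anyone wanting to reuse the swap steps above, here they are wrapped in a function so that a failed step stops the rest. This is only a sketch: the names make_swap and run and the DRY_RUN switch are assumptions added for illustration, and the real commands need root.

```shell
#!/usr/bin/env bash
# Sketch only: make_swap, run, and DRY_RUN are illustrative; defaults mirror the commands above.

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create and enable a swap file. Arguments: path (default /swapfile), size in MiB (default 2000).
make_swap() {
  local swapfile="${1:-/swapfile}"
  local size_mb="${2:-2000}"
  run dd if=/dev/zero of="$swapfile" bs=1M count="$size_mb" &&
    run chmod 0600 "$swapfile" &&
    run mkswap "$swapfile" &&
    run swapon "$swapfile"
}

# Examples:
#   DRY_RUN=1 make_swap            # print the four commands without touching the disk
#   sudo bash -c "$(declare -f run make_swap); make_swap /swapfile 2000"
#   swapon --show                  # confirm the swap file is active
```

Swap enabled this way does not survive a reboot unless an entry is also added to /etc/fstab, which is fine if it is only needed for the duration of the upgrade.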
