EC2 instance intermittently locking up (network OUT drops to zero)


Hi,

We have been running a sizeable t3 instance without major issues for years. About a month ago we "upgraded" from CentOS 7 to AlmaLinux 8 via the cPanel script. After a week, the instance started locking up, usually once a day at varying times: CPU and RAM usage rise at first (staying below the maximum), then network OUT traffic drops to zero and stays there until the instance is rebooted. We doubled the EC2 instance size, with no change in behaviour. cPanel has said there are no known issues with the upgrade script.

We have never touched the AWS network settings. However:

Feb  5 06:05:00 {server} network[1139]: ERROR     : [/etc/sysconfig/network-scripts/ifup-eth] Device eth0 does not seem to be present, delaying initialization.

No network device seems to be configured on the server itself, but presumably AWS takes care of this behind the scenes (otherwise it would never be visible online?). There is, however, an eth0 configuration in the network-scripts folder:

DEVICE="eth0"
BOOTPROTO="dhcp"
ONBOOT="yes"
TYPE="Ethernet"
USERCTL="yes"
PEERDNS="yes"
IPV6INIT="no"
PERSISTENT_DHCLIENT="1"

It's not clear whether the network issues are the cause or a symptom. Any pointers on possible causes would be gratefully received. This has been going on for more than 3 weeks and is causing us a huge headache.

Many thanks in advance

asked 3 months ago, 369 views
1 Answer

I came across your issue regarding the EC2 instance experiencing network drops and system lock-ups following an upgrade from CentOS 7 to AlmaLinux 8. It's a challenging situation, but there are a few steps you can take to potentially resolve this:

  • Network Interface Naming Convention: The upgrade to AlmaLinux 8 might have changed the naming convention for network interfaces from eth0 to something more hardware-specific. Running ip link will show you the current names, which you can then update in your network configuration to match.
  • Network Management Compatibility: AlmaLinux 8 prefers using NetworkManager, a shift from the older network scripts method. Ensure that your system's network management aligns with NetworkManager's expectations, or adjust your scripts to be compatible.
  • Review Network Configuration Files: The files in /etc/sysconfig/network-scripts/ might still reference eth0, which could no longer exist. Update these files to accurately reflect the interface names shown by ip link.
  • Update Custom and Legacy Scripts: Any scripts that were tailored for CentOS 7 might not be fully compatible with AlmaLinux 8. Review and update these scripts to ensure they're in line with the new system's requirements.
  • Verify Interface Recognition: Use ls /sys/class/net/ to list recognized network interfaces. If eth0 is missing, it indicates a need to adjust your configuration or check for driver issues.
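The checks in the list above can be combined into one read-only comparison. The following is a minimal sketch, assuming the legacy ifcfg files live in /etc/sysconfig/network-scripts/ and that the kernel lists its interfaces under /sys/class/net/; the function names are hypothetical and nothing on the system is modified.

```python
import os
import re
from pathlib import Path

# Matches a DEVICE= line in an ifcfg file, with or without quotes.
DEVICE_RE = re.compile(r"^\s*DEVICE\s*=\s*[\"']?([^\"'\s]+)", re.MULTILINE)


def device_from_ifcfg(text):
    """Return the DEVICE= value from ifcfg file contents, or None if absent."""
    match = DEVICE_RE.search(text)
    return match.group(1) if match else None


def configured_devices(scripts_dir="/etc/sysconfig/network-scripts"):
    """Map each ifcfg-* file name to the device name it declares."""
    devices = {}
    for path in sorted(Path(scripts_dir).glob("ifcfg-*")):
        device = device_from_ifcfg(path.read_text())
        if device:
            devices[path.name] = device
    return devices


def present_interfaces(sys_net="/sys/class/net"):
    """List the interface names the kernel currently exposes."""
    return sorted(os.listdir(sys_net)) if os.path.isdir(sys_net) else []


def missing_devices(configured, present):
    """Return configured device names that do not exist as interfaces."""
    return sorted({dev for dev in configured.values() if dev not in present})


def report():
    """Print configured devices, present interfaces, and any mismatch."""
    configured = configured_devices()
    present = present_interfaces()
    print("configured:", configured)
    print("present:   ", present)
    print("missing:   ", missing_devices(configured, present))
```

Calling report() on an instance where the config declares eth0 but the kernel only exposes ens5 and lo would list eth0 as missing, which matches the "Device eth0 does not seem to be present" log line in the question. It only reports the mismatch; deciding how to rename or replace the config is a separate step.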

These resources may help you:

  1. https://unix.stackexchange.com/questions/134483/why-is-my-ethernet-interface-called-enp0s10-instead-of-eth0
  2. https://wiki.crowncloud.net/?How_to_disable_NetworkManager_in_AlmaLinux_8#Installing+network-scripts
EXPERT
answered 2 months ago
EXPERT
reviewed a month ago
  • Thanks for the beautifully formatted and thorough reply. I'm not an expert on networking, but if it were a network interface issue, wouldn't the behaviour be binary, i.e. the network would either work or not work? Instead, it works fine and then is suddenly completely blocked.

    We seem to have traced the problem to some sort of AWS throttling feature. In particular, AutoSSL with a large number of domain names seems to trigger a complete block by AWS of all traffic to/from the EC2 instance once a certain threshold is passed. Only a reboot fixes it. If AutoSSL is disabled, everything continues to work indefinitely. A few other people on the internet seem to have also hit this shadow limit under certain circumstances (not necessarily using AutoSSL but with similar traffic profiles).

    I did also work through your suggestions above. There does appear to be a difference: both eth0 and ens5 are mentioned. I'm very nervous about changing these without understanding the implications of getting them wrong. It appears NetworkManager is not running on the instance, and the Interface Recognition check (ls /sys/class/net/) shows ens5.
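One way to test the throttling theory above is to look at the allowance counters that the ENA network driver exposes through ethtool -S. The following is a minimal sketch, assuming the instance uses the ENA driver in a version recent enough to report the *_allowance_exceeded statistics, that ethtool is installed, and that the interface is ens5 as reported in the comment; the function names are hypothetical.

```python
import subprocess

# ENA reports counters such as bw_out_allowance_exceeded,
# pps_allowance_exceeded and conntrack_allowance_exceeded.
ALLOWANCE_SUFFIX = "_allowance_exceeded"


def allowance_counters(ethtool_output):
    """Extract the *_allowance_exceeded counters from `ethtool -S` text."""
    counters = {}
    for line in ethtool_output.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and name.endswith(ALLOWANCE_SUFFIX) and value.isdigit():
            counters[name] = int(value)
    return counters


def exceeded(counters):
    """Keep only the counters that are above zero."""
    return {name: count for name, count in counters.items() if count > 0}


def read_ethtool_stats(interface="ens5"):
    """Run `ethtool -S <interface>` and return its output as text."""
    result = subprocess.run(
        ["ethtool", "-S", interface],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Running exceeded(allowance_counters(read_ethtool_stats())) before and after an AutoSSL run would show whether any counter grows. A rising conntrack_allowance_exceeded or pps_allowance_exceeded value would be consistent with an instance-level network limit, while counters that stay at zero would point back towards the eth0/ens5 configuration mismatch.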
