c7a.medium Ubuntu LTS EC2 machines lose network connection shortly after a spike in CPU or memory load

We often see a c7a.medium machine lose its network connection shortly after a spike in CPU or memory load. We have not been able to reproduce it reliably, so we are unsure whether the CPU spike or the memory pressure causes the problem. The network connection to the machine is lost completely, and only a forced restart recovers the instance. We know the machine keeps running after the connection drops, because journalctl still has logs for that timeframe.

Interestingly, this sometimes happens during an automatic snapd refresh, but we have also seen it after unattended upgrades or when a flood of network traffic hits the machine. Usually, we also see snapd time out (the systemd watchdog kills it) and fail to restart. We also found reports of systemd-networkd failing to renew DHCP leases, but by that point the machine had already been unreachable for some time.

asked 15 days ago, 36 views
1 Answer

The issue you're experiencing with your c7a.medium Ubuntu LTS EC2 instances losing network connectivity after CPU or memory spikes is a complex problem that could have multiple causes. Based on the information provided, here are some potential explanations and troubleshooting steps:

  1. Resource Exhaustion: The CPU or memory spikes might be causing resource exhaustion, leading to network connectivity issues. This could be particularly problematic if critical system processes are being starved of resources.

  2. DHCP Issues: The reports of systemd-networkd failing to renew DHCP leases suggest that there might be a problem with the DHCP client on your instances. This could be exacerbated by high system load.

  3. Kernel or Driver Issues: There might be underlying kernel or network driver issues that manifest under high load conditions.

  4. snapd and Unattended Upgrades: The correlation with snapd refreshes and unattended upgrades suggests that these processes might be triggering the issue, possibly due to resource contention or conflicts with network-related services.

To troubleshoot and potentially resolve this issue:

  1. Monitor Resource Usage: Use tools like CloudWatch to closely monitor CPU, memory, and network usage. Set up alarms to alert you when these metrics spike.

  2. Check System Logs: Examine logs such as /var/log/syslog, /var/log/dmesg, and journalctl output for any error messages or warnings that occur just before or during the network outages.

  3. Update the System: Ensure that your Ubuntu LTS version and all packages are up to date. There might be known issues that have been resolved in newer versions.

  4. Adjust Network Settings: Consider tweaking network-related settings, such as increasing the DHCP lease renewal time or adjusting network buffer sizes.

  5. Modify snapd and Unattended Upgrades: You might want to adjust the scheduling of snapd refreshes and unattended upgrades to times of lower system load, or temporarily disable them to see if it resolves the issue.

  6. Instance Type Consideration: If the problem persists, you might want to consider testing with a different instance type to see if the issue is specific to c7a.medium instances.

  7. AWS Support: If the problem continues after trying these steps, it may be worth engaging AWS support, especially if you suspect it might be related to the underlying EC2 infrastructure.
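The monitoring step above can also be covered from inside the instance. The default EC2 metrics in CloudWatch do not include memory usage, and the question notes that the machine keeps logging after the network drops, so a local record may still capture the moments before an outage. Below is a minimal sketch, assuming a Linux guest with /proc/meminfo and /proc/loadavg; the function names, the output format, and the 10-second interval are illustrative choices, not part of any AWS or Ubuntu tooling.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of field name -> integer value."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def parse_loadavg(text):
    """Return the 1-minute load average from /proc/loadavg content."""
    return float(text.split()[0])


def format_sample(meminfo_text, loadavg_text, now=None):
    """Build one timestamped (UTC) log line from the two /proc snapshots."""
    mem = parse_meminfo(meminfo_text)
    load1 = parse_loadavg(loadavg_text)
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    return (
        f"{stamp} load1={load1:.2f} "
        f"mem_available_kb={mem.get('MemAvailable', -1)} "
        f"swap_free_kb={mem.get('SwapFree', -1)}"
    )


def read_text(path):
    with open(path) as handle:
        return handle.read()


def run(interval_seconds=10):
    """Print one sample per interval until interrupted (hypothetical entry point)."""
    while True:
        print(format_sample(read_text("/proc/meminfo"),
                            read_text("/proc/loadavg")), flush=True)
        time.sleep(interval_seconds)
```

Calling `run()` from a small service whose output goes to the journal would put these samples next to the snapd and systemd-networkd messages, which may help separate a CPU spike from memory pressure as the trigger.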

Remember to document all your troubleshooting steps and their outcomes. This will be valuable information if you need to escalate the issue to AWS support or if you encounter similar problems in the future.
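To make the log review repeatable across incidents, the journal lines from around an outage can be filtered for a fixed set of symptoms. The following is a rough sketch only: the pattern list is an assumption chosen to match the symptoms described in the question (watchdog kills, DHCP renewal failures, memory pressure), and it should be adjusted to whatever actually appears in your logs.

```python
import re

# Assumed watch list; these patterns are illustrative, not exhaustive.
PATTERNS = {
    "oom": re.compile(r"out of memory|oom-kill", re.IGNORECASE),
    "watchdog": re.compile(r"watchdog", re.IGNORECASE),
    "dhcp": re.compile(r"dhcp", re.IGNORECASE),
    "hung_task": re.compile(r"blocked for more than", re.IGNORECASE),
}


def classify(line):
    """Return the labels of all patterns that match a single log line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]


def triage(lines):
    """Return (labels, line) pairs for every line that matches a pattern."""
    hits = []
    for line in lines:
        labels = classify(line)
        if labels:
            hits.append((labels, line.rstrip("\n")))
    return hits
```

One way to use it is to pipe the previous boot's journal (`journalctl -b -1 --no-pager`) into a script that calls `triage(sys.stdin)` and prints the hits, then keep that output with your notes for each incident.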
Sources
My E2 instance network connection and stability | AWS re:Post
CPU spiked and eth0: Failed | AWS re:Post
EC2 Instance Reachability Check Failure | AWS re:Post

answered 15 days ago
