
Potential ENA regression on m7g (Ubuntu 22.04, kernel 6.8) in us-east-1/use1-az1 — requires reboot to recover


We are observing repeated ENA/network instability on the following configuration:

  • Instance type: m7g.4xlarge
  • Region/AZ: us-east-1 / use1-az1
  • AMI: Ubuntu 22.04 arm64
  • Kernel: 6.8.0-1044-aws

Symptoms under 6.8:

  • ENA link flaps followed by DHCP renew failure.
  • Loss of IPv4 and default route.
  • Instance becomes unreachable externally.
  • In some cases, load average spikes (>200) with hundreds of D-state tasks.
  • kworker/R-ena activity visible in kernel logs during failure.
  • Only reboot or stop/start recovers the instance.
  • Disabling GRO/GSO/TSO did not resolve the issue.
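
To put a number on the D-state symptom while an incident is in progress, something like the following could be run from a serial console session or a periodic job. It is only a minimal sketch that assumes the standard Linux /proc layout; the function names `parse_state` and `count_d_state` are made up for illustration, and it counts processes listed in /proc rather than every individual thread.

```python
import os


def parse_state(stat_line):
    """Return the one-letter task state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split after the last ')' instead of on whitespace.
    """
    return stat_line[stat_line.rfind(")") + 1:].split()[0]


def count_d_state(proc_root="/proc"):
    """Count processes in uninterruptible sleep (state 'D')."""
    count = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                if parse_state(f.read()) == "D":
                    count += 1
        except (OSError, IndexError):
            # The process exited between listdir() and open(), or the
            # line could not be parsed; skip it.
            continue
    return count
```

Calling `count_d_state()` at intervals and logging the result next to the load average would show whether the D-state pile-up starts before or after the link flap.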

After downgrading to kernel 5.15.0-aws, the same workload has been stable for a week.

This strongly suggests a regression specific to the 6.8 AWS kernel interacting with ENA on Graviton in this AZ.

We do not currently have a paid support plan, but this appears to be a potential kernel/ENA issue rather than an application configuration problem.

Could AWS confirm:

  • Whether this is a known issue with 6.8 on m7g?
  • Whether newer 6.8 builds contain ENA-related fixes?
  • Whether this is AZ/host-pool specific in use1-az1?
  • Whether ENA firmware differences could explain the behavior?

If this warrants escalation to the ENA/Nitro team, we would appreciate guidance.

Thank you.

asked a month ago, 39 views
1 Answer

Based on the symptoms you describe (ENA link flaps, DHCP failures, loss of connectivity, high load averages with D-state tasks, and kworker/ENA activity), this does appear to be an ENA driver or kernel compatibility issue rather than a typical configuration problem.

The ENA device uses a keep-alive mechanism to monitor device health. When keep-alive messages aren't received, the driver initiates a reset procedure to recover from failures. During this reset, there can be brief traffic loss, though TCP connections should typically recover. The reset process involves the driver logging statistics, discarding incomplete packets, and reinitializing the device. However, if the underlying issue isn't resolved, these resets may not successfully restore connectivity.
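
The keep-alive and reset behavior described above can be pictured with a toy model. This is not the ENA driver's code: the class name, the 5-second timeout, and the timestamps are all assumptions chosen only to illustrate the logic.

```python
class KeepAliveWatchdog:
    """Toy model of a keep-alive watchdog; not the ENA driver's code."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s   # assumed value, chosen by the caller
        self.last_keep_alive = now
        self.resets = 0

    def on_keep_alive(self, now):
        """Record that the device posted a keep-alive event."""
        self.last_keep_alive = now

    def check(self, now):
        """Return True and count a reset if keep-alives have stopped."""
        if now - self.last_keep_alive > self.timeout_s:
            self.resets += 1
            # A real reset reinitializes the device; this model only
            # restarts the timer.
            self.last_keep_alive = now
            return True
        return False


wd = KeepAliveWatchdog(timeout_s=5.0, now=0.0)
wd.on_keep_alive(1.0)
healthy = wd.check(4.0)   # 3 s since the last keep-alive: no reset
stalled = wd.check(7.0)   # 6 s since the last keep-alive: reset
```

In this model, a device that never resumes sending keep-alives makes `check` fire again after every timeout, which corresponds to the repeated, unsuccessful resets mentioned above.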

For instances built on the AWS Nitro System (which includes m7g instances), specific ENA driver versions are recommended. The ENA Linux kernel driver version 2.2.9g or later is recommended for Nitro v4 instance types and required for Nitro v5 or later. Some Linux distributions may require minimum kernel versions to prevent suboptimal performance of ENA driver features on Nitro v4 and newer instance types. While Amazon Linux 2023 and Bottlerocket support ENA features for Nitro v4 and newer by default, other distributions' timing for incorporating ENA driver features can vary.
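
One way to check a running instance against the 2.2.9g recommendation is to read the driver details that `ethtool -i` prints. The sketch below is hedged: the helper names are hypothetical, the letter suffix is ignored when comparing, and a kernel-bundled driver may report a string in the version field that is not in this format, in which case the helper returns `None` rather than guessing.

```python
import re
import subprocess


def parse_ethtool_info(text):
    """Turn `ethtool -i` output ('key: value' lines) into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def version_key(version):
    """Return the numeric part of a version such as '2.2.9g' as a tuple.

    Returns None when the string does not look like N.N.N[letter].
    The optional letter suffix is ignored (an assumption of this sketch).
    """
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)[A-Za-z]?", version)
    if m is None:
        return None
    return tuple(int(part) for part in m.groups())


def meets_minimum(version, minimum="2.2.9g"):
    """True or False if comparable, None if the format is unrecognized."""
    key = version_key(version)
    return None if key is None else key >= version_key(minimum)


def driver_info(interface):
    """Run `ethtool -i <interface>` and parse the result."""
    out = subprocess.run(
        ["ethtool", "-i", interface],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ethtool_info(out)
```

For example, `meets_minimum(driver_info("ens5").get("version", ""))` would give a quick answer, where `ens5` is only a placeholder for the instance's actual interface name.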

The fact that downgrading to kernel 5.15.0 resolved your issue strongly suggests a regression in the 6.8 kernel's ENA implementation or its interaction with the Graviton-based Nitro hardware. This could be related to how the 6.8 kernel version handles ENA driver features, device resets, or keep-alive mechanisms.

Because this looks like a kernel or driver problem rather than a configuration problem, and may be specific to an instance type, kernel version, or Availability Zone, escalating it to AWS is appropriate. Without a paid support plan you cannot open a technical support case; the Basic plan covers only account, billing, and service quota cases, so a quota case is unlikely to reach the ENA team. More practical channels are the AWS forums, the Ubuntu kernel bug tracker (for the 6.8.0-1044-aws kernel), and the issue tracker of the amzn-drivers repository on GitHub, which hosts the ENA Linux driver. It is also worth searching those trackers for existing reports involving this kernel and ENA on Graviton instances.
Sources
Troubleshoot the ENA kernel driver on Linux - Amazon Elastic Compute Cloud
Instances built on the AWS Nitro System - Amazon EC2

answered a month ago
EXPERT
reviewed 25 days ago
