EKS worker node failed EC2 status checks and was terminated by ASG

One of my EKS worker nodes (i-0f9) suddenly failed EC2 status checks and was terminated by the Auto Scaling Group (ASG), which then launched a replacement instance. From kubectl describe node (captured after the incident), I see that all node conditions flipped to Unknown at the same time:

    "conditions": [
      {
        "type": "MemoryPressure",
        "status": "Unknown",
        "lastHeartbeatTime": "2026-03-29T05:21:43Z",
        "lastTransitionTime": "2026-03-29T05:27:19Z",
        "reason": "NodeStatusUnknown",
        "message": "Kubelet stopped posting node status."
      },
      // same for DiskPressure, PIDPressure, and Ready
    ]

This happened around 1:27 PM +08 (05:27 UTC). The EC2 status check failure occurred shortly after, at about 1:36 PM, and the ASG then terminated the instance and replaced it with a new one.

I have CloudWatch Container Insights enabled, but when I query the logs in the /aws/ekscluster/pods and /aws/ekscluster/cluster log groups using the instance ID, I don't see any obvious smoking gun (no clear OOMKilled mentioning kubelet/containerd, no "disk full", no kernel panic, etc.) in the minutes leading up to 05:27 UTC.

My questions:

  1. What is the most likely root cause when the kubelet suddenly stops posting node status, leading to EC2 status check failure and ASG termination?
  2. Since the instance is already terminated, how can I effectively find out what happened on that specific node?
  3. What prevention measures do you recommend to avoid this in the future? (e.g., better resource requests/limits, larger root volume, kubelet flags, custom health checks, etc.)
2 Answers
To provide a more granular RCA, you should distinguish between the two types of EC2 status failures, as they point to different ownership layers:

1. System vs. Instance Status Checks

  • System Status Check Failed: This usually indicates an issue with the AWS hardware or hypervisor. If this check failed, the node's termination was likely unavoidable and due to an infrastructure fault. (In my experience this is rare; I have never run into it in any of my projects.)

  • Instance Status Check Failed: This points to OS-level issues (kernel panic, OOM, or a network stack hang). If the kubelet stopped posting status before this check failed, it suggests the OS was still running but the kubelet was starved of resources.

  • https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html
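
The two status-check metrics stay in CloudWatch after the instance is gone, so the distinction above can still be checked after the fact. A minimal sketch, assuming the per-minute values are fetched with boto3's `get_metric_statistics`; the helper names `status_check_queries` and `classify` and the instance ID used in the usage note are placeholders, not anything from the original post:

```python
from datetime import datetime, timedelta, timezone

STATUS_METRICS = ("StatusCheckFailed_System", "StatusCheckFailed_Instance")


def status_check_queries(instance_id, incident_utc, minutes=30):
    """Return one get_metric_statistics kwargs dict per status-check metric."""
    window = timedelta(minutes=minutes)
    return [
        {
            "Namespace": "AWS/EC2",
            "MetricName": name,
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "StartTime": incident_utc - window,
            "EndTime": incident_utc + window,
            "Period": 60,
            "Statistics": ["Maximum"],
        }
        for name in STATUS_METRICS
    ]


def classify(system_values, instance_values):
    """Name the failing layer from the per-minute Maximum values (0 or 1)."""
    system_failed = any(v >= 1 for v in system_values)
    instance_failed = any(v >= 1 for v in instance_values)
    if system_failed and instance_failed:
        return "both"
    if system_failed:
        return "system"
    if instance_failed:
        return "instance"
    return "none"
```

Each dict can be passed as `boto3.client("cloudwatch").get_metric_statistics(**query)`, and the values handed to `classify` would be `[p["Maximum"] for p in response["Datapoints"]]`. A result of "instance" points at the OS layer, "system" or "both" at the AWS side.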

2. Resource Starvation:

If CloudWatch doesn't show an OOMKilled event, the kubelet might still have been starved of CPU or memory, for example by CPU steal on burstable instance types or by memory pressure that stalls the node without an OOM kill being logged. Without explicit reservations, a runaway Pod can starve the kubelet process itself, preventing it from sending its heartbeat to the control plane.

3. Advanced "Forensic" Steps

Since the instance is terminated, check these specific AWS artifacts:

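  • EC2 Console Output and Screenshot: A console screenshot can only be taken while the instance is still running, so it is not available once the instance is terminated. The instance console output (get-console-output) may remain retrievable for a short time after termination, and it is where a kernel panic or kernel oops that happened too fast to reach CloudWatch Logs would show up. For future incidents, capture it before the ASG terminates the node.
  • EBS Burst Balance: If the root volume is gp2, check whether the BurstBalance metric hit zero. gp3 volumes have no burst bucket, so for those check whether the volume was pinned at its provisioned IOPS or throughput (for example, a growing VolumeQueueLength). An I/O stall can make the kubelet unresponsive because it can no longer write to local disk.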
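
EBS metrics also outlive the volume, so the burst-balance check can be done after the fact. A rough sketch, assuming a gp2 root volume and datapoints fetched with boto3's `get_metric_statistics` using `Statistics=["Minimum"]`; the helper names and the 5% threshold are illustrative choices, not AWS defaults:

```python
from datetime import datetime, timedelta, timezone


def local_to_utc(local_text, utc_offset_hours):
    """Convert a 'YYYY-MM-DD HH:MM' local timestamp to an aware UTC datetime."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local_time = datetime.strptime(local_text, "%Y-%m-%d %H:%M")
    return local_time.replace(tzinfo=local_zone).astimezone(timezone.utc)


def ebs_query(volume_id, metric_name, start_utc, end_utc):
    """Build get_metric_statistics kwargs for one EBS volume metric."""
    return {
        "Namespace": "AWS/EBS",
        "MetricName": metric_name,  # e.g. "BurstBalance" or "VolumeQueueLength"
        "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
        "StartTime": start_utc,
        "EndTime": end_utc,
        "Period": 300,
        "Statistics": ["Minimum"],
    }


def low_burst_balance(datapoints, threshold_percent=5.0):
    """Return timestamps (oldest first) where BurstBalance was at or below the threshold."""
    hits = [p["Timestamp"] for p in datapoints if p["Minimum"] <= threshold_percent]
    return sorted(hits)
```

For the incident in the question, `local_to_utc("2026-03-29 13:27", 8)` gives the 05:27 UTC mark to centre the query window on. If `low_burst_balance` returns timestamps shortly before that, an I/O stall becomes a much stronger suspect.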

4. Hardening the Node:

To prevent this in the future, implement Node Allocatable constraints (kubeReserved, systemReserved, and hard eviction thresholds in the kubelet configuration). This ensures the OS and kubelet always have a "lifeboat" of resources.
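
To make the arithmetic behind Node Allocatable concrete, here is a small sketch for memory only. The reservation sizes are made-up illustrative numbers, not recommendations, and the helper names are hypothetical; `kubeReserved`, `systemReserved`, and `evictionHard` are the KubeletConfiguration fields the values would go into. Check what your AMI already sets before overriding anything.

```python
import json


def allocatable_mib(capacity_mib, kube_reserved_mib, system_reserved_mib, eviction_hard_mib):
    """Memory left for Pods: capacity minus both reservations and the hard eviction threshold."""
    result = capacity_mib - kube_reserved_mib - system_reserved_mib - eviction_hard_mib
    if result <= 0:
        raise ValueError("reservations exceed node capacity")
    return result


def kubelet_config_fragment(kube_reserved_mib, system_reserved_mib, eviction_hard_mib):
    """Build the matching KubeletConfiguration fields as a plain dict."""
    return {
        "kubeReserved": {"memory": f"{kube_reserved_mib}Mi"},
        "systemReserved": {"memory": f"{system_reserved_mib}Mi"},
        "evictionHard": {"memory.available": f"{eviction_hard_mib}Mi"},
    }


# Illustrative values for a hypothetical 16 GiB node.
pod_memory = allocatable_mib(16384, 1024, 512, 100)
fragment_json = json.dumps(kubelet_config_fragment(1024, 512, 100), indent=2)
```

With these example numbers the scheduler would see 14748 MiB as allocatable, and the remaining memory stays out of reach of Pods so the kubelet and OS daemons can keep running under load.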

PS: One rare but possible cause for "Unknown" status in EKS is Certificate Expiration or Clock Skew on the worker node, which prevents the Kubelet from authenticating with the API server.

EXPERT
answered 2 months ago
When a kubelet suddenly stops posting node status followed by EC2 status check failures and ASG termination, here's what you need to know:

Most Likely Root Causes:

The most common causes for this pattern include:

  • Instance hardware degradation or failure - The underlying EC2 instance experienced a hardware issue that caused both the kubelet to stop responding and the EC2 status checks to fail
  • Kernel panic or system-level crash - A critical system failure that prevented the kubelet process from continuing to report status
  • Network connectivity issues - Loss of network connectivity preventing the kubelet from communicating with the control plane
  • Resource exhaustion - While you didn't see obvious signs, severe CPU starvation or memory pressure could prevent the kubelet from functioning properly
  • EC2 scheduled maintenance or retirement - The instance may have been flagged for retirement due to underlying hardware issues

Post-Incident Investigation (After Termination):

Since the instance is already terminated, your investigation options are limited but not impossible:

  • Check EC2 instance history - Use the AWS CLI or console to look for any scheduled events, retirement notices, or system status check details that were recorded before termination
  • Review Auto Scaling activity history - The ASG maintains a record of why instances were terminated, including the specific health check failure reason
  • Examine CloudWatch metrics - Look at CPU, memory, disk I/O, and network metrics for the terminated instance leading up to the failure time
  • Check VPC Flow Logs - If enabled, these can reveal network connectivity issues
  • Review any captured container logs - Even though you didn't find a smoking gun, look for patterns of degradation (increasing latencies, timeouts, or errors) in the minutes before the incident
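
As one way to work through the ASG activity history, the sketch below filters scaling activities for a single instance. It assumes the list comes from the `Activities` key of boto3's `describe_scaling_activities` response; the helper names are hypothetical, and the exact wording of `Description` and `Cause` is whatever AWS returns for your group.

```python
from datetime import datetime, timezone


def activities_for_instance(activities, instance_id):
    """Return the scaling activities that mention the instance, oldest first."""
    matches = [a for a in activities if instance_id in a.get("Description", "")]
    return sorted(matches, key=lambda a: a["StartTime"])


def summarize(activity):
    """One line per activity: start time, what happened, and the recorded cause."""
    started = activity["StartTime"].isoformat()
    description = activity.get("Description", "")
    cause = activity.get("Cause", "")
    return f"{started} {description} | {cause}"
```

The `Cause` field is the part to read closely, since it records which health check took the instance out of service. The activities could be fetched with `boto3.client("autoscaling").describe_scaling_activities(AutoScalingGroupName=...)`, using your own group name.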

Prevention Measures:

To reduce the likelihood of similar incidents:

  • Reconsider how instance health is judged - Termination protection will not stop the ASG from replacing an unhealthy instance, so consider implementing custom health checks that are more nuanced
  • Implement proactive monitoring - Set up CloudWatch alarms for kubelet health metrics, node conditions, and EC2 status check failures to catch issues earlier
  • Configure pod disruption budgets - Ensure critical workloads can tolerate node failures gracefully
  • Use diverse instance types - Spread your node groups across multiple instance types to reduce the impact of instance-family-specific issues
  • Enable detailed monitoring - Ensure you have comprehensive logging and metrics collection so future incidents provide more forensic data
  • Consider managed node groups - EKS managed node groups provide better integration with AWS health events and automated recovery
  • Implement node problem detector - Deploy this Kubernetes add-on to surface node-level issues before they cause complete failures
  • Regular cluster updates - Keep your EKS cluster, kubelet, and container runtime versions current to benefit from stability improvements
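
The proactive-monitoring item can be sketched as a CloudWatch alarm on the combined `StatusCheckFailed` metric. This is a minimal example under stated assumptions: the function name, the alarm name format, the two-minute evaluation window, and the SNS topic ARN are all placeholders, and the resulting dict is meant to be passed to boto3's `put_metric_alarm`.

```python
def status_check_alarm(instance_id, sns_topic_arn, evaluation_minutes=2):
    """Build put_metric_alarm kwargs that fire when status checks fail for N minutes in a row."""
    return {
        "AlarmName": f"ec2-status-check-failed-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": evaluation_minutes,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }
```

Usage would look like `boto3.client("cloudwatch").put_metric_alarm(**status_check_alarm(instance_id, topic_arn))`. Because instance IDs change every time the ASG replaces a node, an alarm like this has to be created per instance, for example at launch. It only shortens detection time; it does not keep the node alive.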

The pattern you observed (all node conditions transitioning to Unknown at the same time, followed by an EC2 status check failure) is consistent with an instance-level failure, but it does not rule out a workload-induced one: the control plane marks every condition Unknown whenever the kubelet stops reporting, whatever the cause. The ASG identified the unhealthy instance and replaced it, which is the expected behavior for maintaining cluster health.
Sources
View health check status and the reason for health check failures - Amazon EC2 Auto Scaling
Incident response and forensics - Amazon EKS

answered 2 months ago
EXPERT
reviewed 2 months ago
  • If proactive monitoring with alarms is enabled and an EC2 status check failure is detected early, what is the best practice for responding? How can the incident be avoided once it has been detected?