To provide a more granular RCA, you should distinguish between the two types of EC2 status failures, as they point to different ownership layers:
1. System vs. Instance Status Checks
- System Status Check Failed: This usually indicates a problem with the underlying AWS hardware or hypervisor. If this check failed, the node's termination was likely unavoidable and caused by an infrastructure fault. (To be honest, I have never run into this cause in any of my projects.)
- Instance Status Check Failed: This points to OS-level issues (kernel panic, OOM, or a hung network stack). If the Kubelet stopped posting status before this check failed, it suggests the OS was still running but the Kubelet was "starved" of resources.
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html
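As a rough illustration of the ownership split above, the sketch below maps the two check results to a likely owner. The function name and the return labels are made up for this example; the two inputs correspond to whether the CloudWatch metrics StatusCheckFailed_System and StatusCheckFailed_Instance reported a failure for the node.

```python
def classify_status_failure(system_failed: bool, instance_failed: bool) -> str:
    """Map EC2 status check results to the layer that most likely owns the fault.

    The labels are hypothetical names for this sketch, not AWS terminology.
    """
    if system_failed:
        # A system check failure points at host hardware or the hypervisor,
        # even if the instance check failed as a consequence.
        return "aws-infrastructure"
    if instance_failed:
        # Only the instance check failed: look at the guest OS
        # (kernel panic, OOM, hung network stack).
        return "guest-os"
    return "inconclusive"


print(classify_status_failure(system_failed=False, instance_failed=True))
```

Treating a system failure as the dominant signal is a design choice of this sketch: when the host is impaired, the instance check frequently fails too, so the instance result alone says little in that case.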
2. Resource Starvation:
If CloudWatch doesn't show an OOM kill, the Kubelet may have been sidelined by CPU steal or memory fragmentation. Without explicit resource reservations, a runaway Pod can starve the Kubelet process itself, so it can no longer send its "heartbeat" (the node status update) to the control plane.
3. Advanced "Forensic" Steps
Since the instance is terminated, check these specific AWS artifacts:
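- EC2 Console Screenshot and Console Output: The screenshot shows the last frame of the VGA output, but it can only be captured while the instance still exists, so for future incidents capture it before the ASG terminates the node. The buffered console output may still be retrievable for a short time after termination. These are often the only way to see a kernel panic or kernel oops that occurred too fast to be streamed to CloudWatch Logs.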
- EBS Burst Balance: If you are using gp2 volumes, check whether the BurstBalance metric hit zero. gp3 volumes have no burst credits, so for those check whether the volume was pinned at its provisioned IOPS or throughput. An I/O hang can make the Kubelet stop responding because it can no longer write to its local disk.
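To judge whether an empty burst bucket is even plausible for a given node, it helps to estimate how long a gp2 volume can sustain its burst rate. The sketch below uses the gp2 figures from the EBS documentation as I understand them (a 5.4 million credit bucket, a 3,000 IOPS burst ceiling, a baseline of 3 IOPS per GiB with a floor of 100); treat the constants as assumptions to verify, and the function names as illustrative.

```python
from typing import Optional

GP2_BUCKET_CREDITS = 5_400_000  # assumed size of a full gp2 I/O credit bucket
GP2_BURST_IOPS = 3_000          # assumed gp2 burst ceiling
GP2_MAX_IOPS = 16_000           # assumed gp2 per-volume maximum


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS: 3 per GiB, floored at 100 and capped at the volume maximum."""
    return min(max(100, 3 * size_gib), GP2_MAX_IOPS)


def gp2_burst_seconds(size_gib: int, demand_iops: int = GP2_BURST_IOPS) -> Optional[float]:
    """Seconds until a full bucket drains at the given demand.

    Returns None when demand does not exceed the baseline, in which case
    the bucket never drains.
    """
    baseline = gp2_baseline_iops(size_gib)
    demand = min(demand_iops, GP2_BURST_IOPS)
    if demand <= baseline:
        return None
    return GP2_BUCKET_CREDITS / (demand - baseline)


print(gp2_burst_seconds(100))    # 100 GiB volume under sustained burst load
print(gp2_burst_seconds(1000))   # baseline already at the burst ceiling
```

Under these assumptions, a 100 GiB root volume that is hammered at the full burst rate would run out of credits in roughly half an hour, which is short enough to line up with a node that degrades shortly after a noisy workload starts.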
4. Hardening the Node:
To prevent this in the future, implement Node Allocatable constraints. This ensures the OS and Kubelet always have a "lifeboat" of resources:
- --kube-reserved: Reserve CPU/RAM specifically for Kubernetes agents.
- --system-reserved: Reserve resources for OS background processes (systemd, journald).
- Eviction Thresholds: Set --eviction-hard (e.g., memory.available<500Mi) so the Kubelet kills a Pod before the entire node freezes.
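The effect of these three settings can be sketched with the Node Allocatable arithmetic: what Pods may use is the node capacity minus both reservations and the hard eviction threshold. The numbers below (a 16 GiB node, 1 GiB kube-reserved, 512 MiB system-reserved) are hypothetical values chosen for illustration, not recommendations.

```python
def allocatable_mib(capacity_mib: int, kube_reserved_mib: int,
                    system_reserved_mib: int, eviction_hard_mib: int) -> int:
    """Memory left for Pods after the reservations and eviction threshold."""
    allocatable = (capacity_mib - kube_reserved_mib
                   - system_reserved_mib - eviction_hard_mib)
    if allocatable <= 0:
        raise ValueError("reservations exceed node capacity")
    return allocatable


# Hypothetical node: 16 GiB RAM, 1 GiB kube-reserved,
# 512 MiB system-reserved, memory.available<500Mi eviction threshold.
print(allocatable_mib(16384, 1024, 512, 500))
```

The roughly 2 GiB withheld from Pods in this example is the "lifeboat": even if every Pod consumes its full share, the Kubelet and system daemons still have memory to keep posting node status.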
PS: One rare but possible cause for "Unknown" status in EKS is Certificate Expiration or Clock Skew on the worker node, which prevents the Kubelet from authenticating with the API server.
When a kubelet suddenly stops posting node status followed by EC2 status check failures and ASG termination, here's what you need to know:
Most Likely Root Causes:
The most common causes for this pattern include:
- Instance hardware degradation or failure - The underlying EC2 instance experienced a hardware issue that caused both the kubelet to stop responding and the EC2 status checks to fail
- Kernel panic or system-level crash - A critical system failure that prevented the kubelet process from continuing to report status
- Network connectivity issues - Loss of network connectivity preventing the kubelet from communicating with the control plane
- Resource exhaustion - While you didn't see obvious signs, severe CPU starvation or memory pressure could prevent the kubelet from functioning properly
- EC2 scheduled maintenance or retirement - The instance may have been flagged for retirement due to underlying hardware issues
Post-Incident Investigation (After Termination):
Since the instance is already terminated, your investigation options are limited but not impossible:
- Check EC2 instance history - Use the AWS CLI or console to look for any scheduled events, retirement notices, or system status check details that were recorded before termination
- Review Auto Scaling activity history - The ASG maintains a record of why instances were terminated, including the specific health check failure reason
- Examine CloudWatch metrics - Look at CPU, memory, disk I/O, and network metrics for the terminated instance leading up to the failure time
- Check VPC Flow Logs - If enabled, these can reveal network connectivity issues
- Review any captured container logs - Even though you didn't find a smoking gun, look for patterns of degradation (increasing latencies, timeouts, or errors) in the minutes before the incident
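As a small illustration of the Auto Scaling step above, the sketch below pulls the recorded cause for one instance out of a list of scaling activity records. It assumes records shaped like the entries returned by the Auto Scaling DescribeScalingActivities API (with Description and Cause fields); the sample records, instance IDs, and cause texts are invented for the example.

```python
from typing import Iterable, List, Mapping


def termination_causes(activities: Iterable[Mapping[str, str]],
                       instance_id: str) -> List[str]:
    """Return the Cause text of every termination activity for one instance."""
    causes = []
    for activity in activities:
        description = activity.get("Description", "")
        if description.startswith("Terminating") and instance_id in description:
            causes.append(activity.get("Cause", ""))
    return causes


# Hypothetical records, trimmed to the two fields used above.
sample_activities = [
    {"Description": "Launching a new EC2 instance: i-0aaa1111bbbb2222c",
     "Cause": "An instance was started in response to a capacity difference."},
    {"Description": "Terminating EC2 instance: i-0ddd3333eeee4444f",
     "Cause": "An instance was taken out of service in response to an EC2 health check."},
]

print(termination_causes(sample_activities, "i-0ddd3333eeee4444f"))
```

In a real investigation the records would come from the API or from `aws autoscaling describe-scaling-activities`, and the Cause text tells you which health check type triggered the replacement.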
Prevention Measures:
To reduce the likelihood of similar incidents:
- Don't rely on termination protection - It won't prevent the ASG from replacing unhealthy instances; consider implementing custom health checks that are more nuanced instead
- Implement proactive monitoring - Set up CloudWatch alarms for kubelet health metrics, node conditions, and EC2 status check failures to catch issues earlier
- Configure pod disruption budgets - Ensure critical workloads can tolerate node failures gracefully
- Use diverse instance types - Spread your node groups across multiple instance types to reduce the impact of instance-family-specific issues
- Enable detailed monitoring - Ensure you have comprehensive logging and metrics collection so future incidents provide more forensic data
- Consider managed node groups - EKS managed node groups provide better integration with AWS health events and automated recovery
- Implement node problem detector - Deploy this Kubernetes add-on to surface node-level issues before they cause complete failures
- Regular cluster updates - Keep your EKS cluster, kubelet, and container runtime versions current to benefit from stability improvements
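For the proactive monitoring item, a minimal sketch of a status check alarm definition is shown below. The keys follow the parameters of CloudWatch's PutMetricAlarm API; the alarm name pattern, the two-period evaluation window, and the SNS topic ARN are assumptions made for the example, and the function itself is hypothetical.

```python
from typing import Dict, Iterable


def status_check_alarm(instance_id: str, action_arns: Iterable[str]) -> Dict[str, object]:
    """Build PutMetricAlarm parameters for an EC2 status check alarm."""
    return {
        "AlarmName": f"ec2-status-check-failed-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed",  # covers both system and instance checks
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,               # seconds; assumed evaluation granularity
        "EvaluationPeriods": 2,     # assumed: two failing periods in a row
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": list(action_arns),
    }


# Hypothetical instance ID and SNS topic ARN.
params = status_check_alarm(
    "i-0ddd3333eeee4444f",
    ["arn:aws:sns:us-east-1:111122223333:node-alerts"],
)
print(params["AlarmName"])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. An early alarm like this mainly buys time to cordon and drain the node and to capture console output before the ASG replaces it.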
The pattern you observed—all node conditions transitioning to Unknown simultaneously followed by EC2 status check failure—strongly suggests an instance-level failure rather than a workload-induced problem. The ASG correctly identified the unhealthy instance and replaced it, which is the expected behavior for maintaining cluster health.
Sources
View health check status and the reason for health check failures - Amazon EC2 Auto Scaling
Incident response and forensics - Amazon EKS

If proactive monitoring with alarms is enabled and EC2 status check failures can be detected earlier, what is the best practice? How can the incident be avoided once it is detected early?