"Unknown" node status on EKS


Has anyone else been encountering "Unknown" statuses for nodes in EKS recently? We've had 3 occurrences in the last few days, and in each case pods on those nodes have become "stuck". Draining the node, restarting deployments/statefulsets, and deleting the pods have no effect. The only solution has been to manually delete the node, which allows those pods to be rescheduled onto another healthy node.
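For reference, the manual recovery we've been running looks roughly like the following (node, statefulset, and pod names are placeholders):

$ kubectl drain ip-10-x-x-x.ec2.internal --ignore-daemonsets --delete-emptydir-data
$ kubectl rollout restart statefulset my-statefulset
$ kubectl delete pod stuck-pod-name
$ kubectl delete node ip-10-x-x-x.ec2.internal

Only the last command actually gets the workloads moving again; the drain, restart, and pod deletion have no visible effect while the node is stuck in "Unknown".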

One theory is that the pods on the bad nodes show as "Running" because that is the last status reported before the node became unresponsive.
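If that theory holds, the node conditions should all report an Unknown status and the lastHeartbeatTime should stop advancing, which can be checked with something like the following (the node name is a placeholder):

$ kubectl get node ip-10-x-x-x.ec2.internal -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.lastHeartbeatTime}{"\n"}{end}'
$ kubectl get pods -A -o wide --field-selector spec.nodeName=ip-10-x-x-x.ec2.internal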

In any case, this seems to fundamentally break EKS's ability to manage compute resources and requires manual intervention each time to get the cluster back into a healthy state.

  • Get information on the worker node by running the following command:

    $ kubectl describe node node-name

    Example output:

    KubeletNotReady  PLEG is not healthy: pleg was last seen active xx

    The most common reasons for PLEG being unhealthy are the following:

    • The kubelet can't communicate with the Docker daemon because the daemon is busy or dead. For example, the Docker daemon on your EKS worker node might be broken.
    • An out-of-memory (OOM) or CPU utilization issue at the instance level caused PLEG to become unhealthy.
    • If the worker node has a large number of pods, the kubelet and Docker daemon might experience higher workloads, causing PLEG-related errors. Higher workloads might also result if liveness or readiness probes run too frequently.

    Check the kubelet logs

    You can check the kubelet logs on the instance to identify why PLEG is unhealthy.

    1. Use SSH to connect to the instance and run the following command:

       $ journalctl -u kubelet > kubelet.log

       If you're using the Amazon EKS-optimized AMI and SSH isn't enabled, then you can connect using SSM. For more information, see Connect to your Linux instance using Session Manager.

    2. Check for PLEG-related events posted by the kubelet in these logs.

    Example:

    28317 kubelet.go:1873] "Skipping pod synchronization" err="PLEG is not healthy: pleg was last seen active 4h5m43.659028227s ago; threshold is 3m0s"
    28317 kubelet.go:1873] "Skipping pod synchronization" err="PLEG is not healthy: pleg was last seen active 4h5m48.659213288s ago; threshold is 3m0s"
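    If the logs don't show PLEG entries, it can also help to check the container runtime and resource pressure directly on the instance. A rough sketch, assuming a Docker-based worker AMI (newer EKS-optimized AMIs use containerd, so substitute that unit name):

    $ sudo systemctl status docker kubelet        # is either daemon dead or restarting?
    $ sudo journalctl -u docker --since "2 hours ago" | tail -n 50
    $ free -m                                     # memory pressure / OOM risk
    $ top -b -n 1 | head -n 20                    # CPU saturation

    An unresponsive runtime daemon, or an instance that is OOM-killing processes, would line up with the PLEG timeouts above.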

  • Thanks for the tips, Hemant. We didn't see any PLEG-related events or logs. Just a vague "Kubelet stopped posting node status." message in all of the node conditions, and we weren't able to connect via SSH for some reason the last time we tried. The SSH connectivity problem may be unrelated to the node status, though.
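    For the SSH problem, we may try Session Manager next time, along the lines of the following (the instance ID is a placeholder):

    $ aws ssm start-session --target i-0123456789abcdef0

    That should work on the EKS-optimized AMI as long as the node's instance profile permits SSM and the SSM agent is still responsive.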

Sam K
asked a year ago · 155 views
No Answers
