How to debug why EKS nodes are pegging CPU at 100% and going 'Not Ready'?


Configuration and Problem Statement

Our org has an EKS deployment running v1.25 with 4 nodes. Every so often, a couple of the nodes go Not Ready at the same time. Describing the nodes shows that the kubelet stopped posting status, and CloudWatch logs show no disk, memory, or PID pressure in the last status update before each node goes down.
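
For reference, this is roughly how the node conditions can be dumped when a node drops out, beyond `kubectl describe node` (a minimal sketch using the official Python Kubernetes client; the kubeconfig/context details are placeholders, not specific to our cluster):

    # Sketch: list each node's conditions to see which ones flip when a node goes Not Ready.
    # Assumes the `kubernetes` Python client is installed and a local kubeconfig points at the cluster.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() when run from a pod
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        print(f"=== {node.metadata.name} ===")
        for cond in node.status.conditions:
            # Ready flips to "Unknown" once the kubelet stops posting status;
            # MemoryPressure / DiskPressure / PIDPressure have stayed "False" for us.
            print(f"{cond.type:16} {cond.status:8} "
                  f"last heartbeat: {cond.last_heartbeat_time}  reason: {cond.reason}")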

What We've Observed

When I look in the EC2 console at the instances that go Not Ready, I see that CPU suddenly spikes from ~30% to pegged at 100%, but there was no commensurate increase in application load that would explain it, nor did the pods on those nodes report any errors or faults. Right up until the nodes go Not Ready, the system appears to be running normally with nothing extraordinary happening. We cannot log into the affected nodes because there aren't enough spare CPU cycles to support a login (we can't get past the initial banner). This seems to be happening more often since upgrading from 1.24 to 1.25, but that is anecdotal at this point since we only upgraded a couple of weeks ago.
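
For what it's worth, the same CPU spike can be pulled out of CloudWatch programmatically rather than eyeballed in the EC2 console (a minimal boto3 sketch; the instance ID, region, and time window below are placeholders, not values from our cluster):

    # Sketch: fetch CPUUtilization for a suspect instance around the time it went Not Ready.
    # Assumes boto3 is installed and AWS credentials are configured.
    from datetime import datetime, timedelta, timezone
    import boto3

    cw = boto3.client("cloudwatch", region_name="us-east-1")  # placeholder region

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=2)

    resp = cw.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder instance
        StartTime=start,
        EndTime=end,
        Period=300,  # 5-minute granularity; detailed monitoring would allow 60s
        Statistics=["Average", "Maximum"],
    )

    for point in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
        print(point["Timestamp"], f"avg={point['Average']:.1f}%", f"max={point['Maximum']:.1f}%")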

Question

How can we diagnose what is causing these nodes to suddenly spike in CPU usage and become unresponsive?
