How to debug why EKS nodes are pegging CPU at 100% and going to 'Not Ready'?

Configuration and Problem Statement

Our org has an EKS deployment running v1.25 with 4 nodes. Every so often, a couple of the nodes go to Not Ready at the same time. Describing the nodes shows that the kubelet stopped posting status, and the CloudWatch logs show no disk, memory, or PID pressure in the last status update before the node goes down.
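
For anyone who wants to reproduce the check, here is a minimal sketch of how the node conditions described above could be collected in one pass. It assumes `kubectl` is configured for the cluster; the function names `summarize_conditions` and `fetch_nodes` are hypothetical helpers for illustration, not part of any tool. It only reads the standard `status.conditions` fields that `kubectl get nodes -o json` returns.

```python
import json
import subprocess


def summarize_conditions(node_list):
    """Return one row per node: name, Ready status, reason, heartbeat, active pressures."""
    rows = []
    for node in node_list.get("items", []):
        # Index the node's conditions by type (Ready, MemoryPressure, DiskPressure, ...).
        conditions = {
            c["type"]: c for c in node.get("status", {}).get("conditions", [])
        }
        ready = conditions.get("Ready", {})
        # A pressure condition is only active when its status is exactly "True".
        pressures = sorted(
            ctype
            for ctype, cond in conditions.items()
            if ctype.endswith("Pressure") and cond.get("status") == "True"
        )
        rows.append(
            {
                "name": node["metadata"]["name"],
                "ready": ready.get("status", "Unknown"),
                "reason": ready.get("reason", ""),
                "last_heartbeat": ready.get("lastHeartbeatTime", ""),
                "pressures": pressures,
            }
        )
    return rows


def fetch_nodes():
    """Assumption: kubectl is on PATH and pointed at the affected cluster."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `summarize_conditions(fetch_nodes())` and filtering for rows where `ready` is not `"True"` would list the affected nodes together with the time of their last heartbeat, which can then be lined up against the CPU graph in the EC2 console.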

What We've Observed

When I look in the EC2 console at the instances that go Not Ready, I see that CPU suddenly spikes from ~30% to pegged at 100%, but we did not have a commensurate increase in application load that would cause that, nor did the pods on those nodes report any errors or faults. At the time these nodes go Not Ready, the system appears to be running normally with nothing extraordinary happening. We cannot log into the affected nodes because there aren't enough spare CPU cycles to support a login (we can't get past the initial banner). This seems to be happening more often since upgrading from 1.24 to 1.25, but that is only anecdotal at this point since we upgraded just a couple of weeks ago.

Question

How can we diagnose what is causing these nodes to suddenly spike in CPU usage and become unresponsive?

No answers
