How to debug why EKS nodes peg CPU at 100% and go 'Not Ready'?


Configuration and Problem Statement

Our org runs an EKS cluster on v1.25 with 4 nodes. Every so often, two or more nodes go Not Ready at the same time. Describing the nodes shows that the kubelet stopped posting status, and CloudWatch logs show no disk, memory, or PID pressure in the last status update before each node goes down.
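For reference, this is roughly how we're checking node conditions after a node flips (the node name below is a placeholder, and the heredoc is an illustrative sample of the output we see, not real cluster data):

```shell
# In practice we run (node name is a placeholder):
#   kubectl describe node ip-10-0-1-23.ec2.internal > node.txt
# Illustrative sample of the section we look at -- once the kubelet stops
# posting status, every condition goes Unknown with NodeStatusUnknown:
cat <<'EOF' > node.txt
Conditions:
  Type             Status   Reason
  MemoryPressure   Unknown  NodeStatusUnknown
  DiskPressure     Unknown  NodeStatusUnknown
  PIDPressure      Unknown  NodeStatusUnknown
  Ready            Unknown  NodeStatusUnknown
EOF
# Pull out just the conditions table:
grep -A5 '^Conditions:' node.txt
```

The notable part is that the final *successful* heartbeat (visible in CloudWatch) still showed all pressure conditions False, so eviction thresholds were never crossed before the node dropped off.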

What We've Observed

In the EC2 console, the instances that go Not Ready show CPU suddenly spiking from ~30% to a sustained 100%, yet there was no commensurate increase in application load, and the pods on those nodes reported no errors or faults. Right up to the moment a node goes Not Ready, everything appears to be running normally. We cannot log into the affected nodes: there aren't enough spare CPU cycles to complete a login (SSH hangs after the initial banner). This seems to be happening more often since upgrading from 1.24 to 1.25, but that is anecdotal at this point, since we only upgraded a couple of weeks ago.
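Since we can't SSH in, the one thing we can still pull is the instance's console output via the AWS CLI and grep it for kernel-level causes of a CPU peg. The instance ID below is a placeholder and the heredoc is an illustrative sample line, not a capture from our nodes:

```shell
# Fetch the latest serial console output without logging in
# (instance ID is a placeholder):
#   aws ec2 get-console-output --instance-id i-0123456789abcdef0 \
#     --latest --output text --query Output > console.txt
# Illustrative sample of the kind of line we're hunting for:
cat <<'EOF' > console.txt
[12345.678901] watchdog: BUG: soft lockup - CPU#2 stuck for 23s! [containerd:4242]
EOF
# Common kernel messages that explain a pegged, unresponsive node:
grep -E 'soft lockup|hung task|Out of memory|oom-kill' console.txt
```

A hit on any of these patterns would at least distinguish a kernel-side stall from a runaway userspace process, which is the fork in the road we're stuck at.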

Question

How can we diagnose what is causing these nodes to suddenly spike to 100% CPU and become unresponsive?
