Kubernetes pods are being evicted and their node is crashing all of a sudden, and we're not sure why


Hi everyone, we've been using EKS for about a year now, and last week we started seeing pods evicted from some of our nodes -- to the point where no pods were available to service our clients -- and we can't figure out why.

This happens every other day or so, and takes us down for anywhere from 2 to 15 minutes.

If you look at the screenshot below you'll notice a lot of failed 'api' pods, and the logs show messages like this:

"status": {
    "phase": "Failed",
    "message": "The node was low on resource: memory. Container api was using 459860Ki, which exceeds its request of 256Mi. ",
    "reason": "Evicted",
    "startTime": "2022-07-26T12:31:37Z"

To the best of our knowledge the code in those pods hasn't changed recently, so we're not sure why memory is suddenly running low; nor are we sure why the pods stick around after they've failed / been evicted, as you can also see in the screenshot.

We're really most interested in keeping our service up when something like this happens, and we're wondering what we can do to prevent every pod from being down for the count while the node is maxed out and unable to create new ones. Any suggestions on how we can diagnose this further, and/or lessen the chances of being completely out of luck when it does happen, would be greatly appreciated.

(Our AMI release on the node is 1.21.2-20210830)

Thanks!

[Screenshot: pod list showing many failed/evicted 'api' pods]

larryq
asked 2 years ago · 1293 views
1 Answer

Hello,

The kubelet is evicting your pods to reclaim memory on the node and prevent memory starvation for the other processes running there.

The node-pressure eviction process selects this particular pod because its memory usage at that moment exceeds the memory request specified in the container spec.

According to the Kubernetes node-pressure eviction documentation, the kubelet uses the following criteria to determine the pod eviction order:

  1. Whether the pod's resource usage exceeds its requests
  2. Pod priority
  3. The pod's resource usage relative to its requests
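
To see whether that is what is happening here, you can compare each container's live memory usage against its configured request. A quick sketch, assuming metrics-server is installed and the workload runs in the `default` namespace (adjust names to match your cluster):

```
# Live memory usage per container (requires metrics-server)
kubectl top pod --containers -n default

# Requests and limits as configured in the pod spec, for comparison
kubectl get pods -n default \
  -o custom-columns='NAME:.metadata.name,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'
```

If usage regularly sits well above the request (as in your eviction message, ~450Mi against a 256Mi request), those pods will be near the top of the eviction order whenever the node comes under memory pressure.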

To remediate this problem, you could raise the memory request in the pod spec so that the pod's typical memory utilization stays below its request value.
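
For example, here is a minimal sketch of a Deployment with a larger memory request. The name `api`, the image, the replica count, and the 512Mi/768Mi values are placeholders; size them from the usage you actually observe:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api                               # placeholder workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: your-registry/api:latest # placeholder image
          resources:
            requests:
              memory: "512Mi"             # above the ~450Mi observed in the eviction message
              cpu: "250m"
            limits:
              memory: "768Mi"             # hard cap; the container is OOM-killed if it exceeds this
```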

You can also set memory limits on all of the containers in your cluster to stop them from consuming more memory than they should. If you are unsure about the right requests/limits, the Vertical Pod Autoscaler (VPA) can help by recommending values based on observed usage.
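
If you just want VPA's recommendations without letting it restart pods, you can run it in recommendation-only mode. A sketch, assuming the VPA components are installed in your cluster (they are not part of EKS by default) and that your workload is a Deployment named `api`:

```
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa                  # placeholder name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api                    # placeholder: the workload to analyze
  updatePolicy:
    updateMode: "Off"            # recommendations only; view them with: kubectl describe vpa api-vpa
```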

You will also want to monitor the processes running on your nodes to find out which ones are consuming excessive memory.
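
For a quick look at node-level pressure from the command line (for ongoing visibility, something like CloudWatch Container Insights or Prometheus is better suited; the node name below is a placeholder):

```
# Per-node CPU and memory usage (requires metrics-server)
kubectl top nodes

# Allocated requests/limits vs. capacity, plus conditions such as MemoryPressure
kubectl describe node <node-name>

# Recent eviction events across the cluster
kubectl get events -A --field-selector reason=Evicted
```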

I hope this is helpful. Please comment if you have any further questions. Thanks!

AWS
SUPPORT ENGINEER
answered 2 years ago
