liveness and readiness timeouts


We are currently running EKS 1.24 with amazon-k8s-cni-init:v1.12.6 and amazon-k8s-cni:v1.12.6. Several application pods are restarting continuously. On closer inspection, the pods are being terminated for the following reason:

Last State: Terminated
Reason: OOMKilled
Exit Code: 137

While this is happening, we are checking the application for memory leaks. However, we also see warning events in the namespace where the app is deployed:

Warning Unhealthy pod/<pod-name> Readiness probe failed: Get "http://<url>": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy pod/<pod-name> Liveness probe failed: Get "http://<url>": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

The current timeout is set to 1 second for both the readiness and liveness probes. What should the value ideally be set to? Are there cases where pods were killed and restarted because liveness probes timed out prematurely? Additionally, are there any known cases where high memory utilization or OOM conditions caused liveness probes to fail? (There is a chance that OOM conditions prevent requests from creating additional sockets.)

asked 3 months ago, 151 views
1 Answer

Hello hvb,

It is possible that your liveness/readiness probes are timing out before your application is fully in the Ready state (that is, before it has started responding to health checks). The timeout setting for your liveness/readiness probes has to be chosen based on your application's expected performance.

Increase the timeoutSeconds value and see whether the probes succeed. If they do, you can conclude that the probes were timing out prematurely, and then investigate why your application is unable to respond within the expected duration. If the probes still fail after you increase timeoutSeconds, the failures are likely the result of another underlying problem, which has to be dealt with separately.

You can also try increasing the initialDelaySeconds parameter to give your application enough time to start up before the probes begin.
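
As a rough illustration of the two settings mentioned above, here is a minimal sketch of a container spec fragment. The container name, image, health check path, port, and all numeric values are placeholders I am assuming for the example, not recommendations; the right numbers depend on how quickly your application starts and responds.

```yaml
# Sketch only: names, path, port, and values are hypothetical placeholders.
containers:
  - name: app                      # hypothetical container name
    image: example/app:1.0         # hypothetical image
    readinessProbe:
      httpGet:
        path: /healthz             # assumed health check endpoint
        port: 8080                 # assumed container port
      initialDelaySeconds: 15      # wait before the first readiness probe
      timeoutSeconds: 5            # raised from the 1-second setting in the question
    livenessProbe:
      httpGet:
        path: /healthz             # assumed health check endpoint
        port: 8080                 # assumed container port
      initialDelaySeconds: 30      # wait before the first liveness probe
      timeoutSeconds: 5            # raised from the 1-second setting in the question
```

If the probes pass with a larger timeoutSeconds, you can then lower the value step by step to find what the application can reliably meet.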

I hope this info is helpful to you. Please comment if you have further questions, and I will be happy to help!

AWS
answered 2 months ago
  • Thanks for the update. Regarding my other question: do you know of any known occurrences of readiness/liveness timeouts within EKS caused by the target containers running out of memory?
