Zombie pods in EKS


Hi all. We have recently started seeing more zombie pods when terminating pods. Does anyone know the root cause of a pod becoming a zombie or getting stuck in the Terminating state? This is the error we keep getting: error killing pod: failed to "KillContainer" for "zombie-pod" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: cannot stop container: 803b8598080nbdkau8i0n2526be67302a3748dbcbe3066ad0fae55707d1: container 803b8598080 PID 14597 is zombie and can not be killed. Use the --init option when creating containers to run an init inside the container that forwards signals and reaps processes"

1 Answer
Accepted Answer

Zombie pods are usually caused by containers that hold zombie processes that cannot be stopped. A zombie is a process that has already exited but whose parent has not yet collected its exit status, which is why it cannot be killed. If you have recently seen more of these than usual, I would look at what has changed in the applications or processes running in your containers. The --init option is a Docker setting that runs a small init process (based on tini) as PID 1 inside the container, with your application running as its child. That init forwards signals to your application and reaps exited processes. It is typically used when the application does not handle signals such as SIGTERM properly (SIGKILL cannot be caught or handled by an application at all). Another option is dumb-init from Yelp.
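To make the two jobs of an init process concrete (forwarding signals and reaping children), here is a minimal sketch in Python. It is not how tini or dumb-init are implemented, only an illustration of the idea. The function name run_as_init is hypothetical, and the code assumes a POSIX system with Python 3.9 or later.

```python
import os
import signal
import sys


def run_as_init(argv):
    """Start argv as a child, forward SIGTERM/SIGINT to it, reap children.

    Returns the exit code of the main child (negative if it died from a signal).
    """
    child_pid = os.posix_spawnp(argv[0], argv, os.environ)

    def forward(signum, _frame):
        # Pass the signal on to the main child instead of acting on it here.
        try:
            os.kill(child_pid, signum)
        except ProcessLookupError:
            pass  # the child has already exited

    forwarded = (signal.SIGTERM, signal.SIGINT)
    previous = {sig: signal.signal(sig, forward) for sig in forwarded}
    try:
        while True:
            # Wait for ANY child. When this process is PID 1, orphans are
            # re-parented to it, so this call also reaps them and they do
            # not linger as zombies.
            pid, status = os.waitpid(-1, 0)
            if pid == child_pid:
                return os.waitstatus_to_exitcode(status)
    finally:
        # Put the original handlers back once the main child is gone.
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

Used as a container entrypoint, a wrapper like this would be called as sys.exit(run_as_init(sys.argv[1:])). In practice a purpose-built init such as tini or dumb-init is the safer choice.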

It is always a good idea to make sure that processes, especially PID 1, handle signals properly. Most of the time this is a non-issue, but several things can leave processes in a zombie state, such as duplicated calls, improper error handling, and nested calls, especially in bash scripts.
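As a small illustration of how a zombie appears and disappears, the sketch below forks a child that exits immediately, observes its state before the parent waits on it, and then reaps it. It assumes Linux, because it reads the process state from /proc. The helper names process_state and make_zombie_then_reap are hypothetical.

```python
import os
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The state field comes right after the command name in parentheses.
    return data.rsplit(")", 1)[1].split()[0]


def make_zombie_then_reap():
    """Return (state seen before reaping, whether /proc/<pid> exists after)."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately

    # The parent has not called wait yet, so the kernel keeps the exited
    # child in the process table as a zombie (state "Z").
    deadline = time.monotonic() + 5
    state = process_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = process_state(pid)

    os.waitpid(pid, 0)  # reaping: collect the exit status
    still_exists = os.path.exists(f"/proc/{pid}")
    return state, still_exists
```

A zombie holds no memory or CPU, only a process-table entry, and sending it signals has no effect. It goes away only when its parent (or the init process that adopted it) waits on it.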

When troubleshooting this issue, the first thing I would check is that your application handles signals properly, and then decide whether you need to update the signal handling or even use a separate init process. Is your application or process creating orphaned processes (processes whose parent has exited)?
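If the application starts its own child processes, one way to avoid orphans is to stop and wait for the children before the parent exits. The following is a rough sketch under that assumption. The Supervisor class and its method names are hypothetical, not part of any library.

```python
import signal
import subprocess
import sys


class Supervisor:
    """Hypothetical parent application that owns worker subprocesses."""

    def __init__(self):
        self.workers = []
        self.stopping = False

    def start_worker(self, argv):
        proc = subprocess.Popen(argv)
        self.workers.append(proc)
        return proc

    def handle_sigterm(self, signum, frame):
        # Keep the handler tiny: set a flag and let the main loop shut down.
        self.stopping = True

    def shutdown(self, grace_seconds=5.0):
        """Stop the children first and wait for each one, then return.

        Returns the list of worker return codes.
        """
        for proc in self.workers:
            if proc.poll() is None:
                proc.terminate()  # sends SIGTERM to the child
        codes = []
        for proc in self.workers:
            try:
                codes.append(proc.wait(timeout=grace_seconds))
            except subprocess.TimeoutExpired:
                proc.kill()  # escalate to SIGKILL after the grace period
                codes.append(proc.wait())
        return codes
```

The handler would be registered with signal.signal(signal.SIGTERM, supervisor.handle_sigterm), and the main loop would call shutdown() once stopping is set. Because every child is waited on before the parent exits, none is left orphaned or unreaped.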

AWS
Answered 1 year ago
Expert
Reviewed 10 months ago
  • Thanks for your response. Yes, our application is an orphaned process, and it was going into this state because we were terminating the parent application first.
