When Kubernetes nodes start up, there can be a race condition where pods are scheduled before the kubelet gets a server certificate. This can cause issues for some workloads like GitLab runner, resulting in TLS errors for GitLab worker/executor pods. The aws-samples project "Kubelet Server Certificate Untaint" shows a solution for these kinds of workloads.
This project is a reference implementation of a Kubernetes DaemonSet that automatically removes a taint from nodes once the kubelet server certificate is available. It can be applied to an Amazon Elastic Kubernetes Service (EKS) cluster or any Kubernetes (K8s) cluster.
Background
The aws-samples project Kubelet Server Certificate Untaint solves the problem by:
- Waiting for the kubelet server certificate to exist (/var/lib/kubelet/pki/kubelet-server-current.pem)
- Removing a configurable taint from the node once the certificate is present
- Allowing safe pod scheduling only after the kubelet is fully ready
Users can apply a taint (e.g., example.com/kubelet-no-server-cert:NoSchedule via Karpenter NodePool startupTaints) to nodes at startup, and this DaemonSet will automatically remove it once the kubelet server certificate is available.
kubelet internals
After the kubelet starts up, it creates a K8s CertificateSigningRequest, which is eventually approved and signed by an EKS control plane component called eks-certificate-controller. This can take up to 60 seconds. The signed certificate is then stored in the node OS under /var/lib/kubelet/pki/kubelet-server-current.pem, which is a symbolic link to a time-stamped PEM file in the same directory. For commands like kubectl exec and kubectl logs, kube-apiserver initiates a TLS connection to the kubelet, which requires this kubelet server certificate.
In EKS Auto Mode, the Container Network Interface (CNI) runs as a systemd service, so there is no need to start an aws-node pod. containerd can notify the kubelet about the CRI and CNI Ready conditions almost immediately, so the kubelet posts Ready status within approximately 2 s of startup, at almost the same time as it creates the CertificateSigningRequest.
Pods can then be scheduled to the Ready node and start running. This increases the likelihood that a missing kubelet server certificate causes TLS issues.
Outside EKS Auto Mode, it takes about 20 to 25 s for the kubelet to become Ready. By then, the kubelet server certificate is usually already available.
Installation and building
The DaemonSet is written in Go, and the corresponding container image is available in the Amazon ECR Public Gallery.
The project provides sample YAML files for installation using kubectl apply as well as a Helm chart for GitOps usage.
A flexible Makefile provides various targets for building the binary and the container image, and for installation.