We are running an EKS cluster (k8s v1.27) with 4 nodes. One of the things we use the cluster for is hosting GitLab runners, which spawn new pods as new pipelines run. These are by far our most frequently created and destroyed pods, while the other pods on the cluster run steadily.
What we started to see is that our GitLab jobs time out after 10 minutes (our timeout setting) because a new pod cannot be scheduled and initialized on a node.
After digging further into the issue, we noticed that the aws-node pods are restarting, and while they restart no new GitLab runner pods can be assigned to a node.
We also see that the aws-node container within the aws-node pod is the one restarting; it is currently running the image amazon-k8s-cni:v1.18.0-eksbuild.1.
Along with the aws-node pod we also see ebs-csi-node restarting, and a container inside it, node-driver-registrar (csi-node-driver-registrar:v2.10.0-eks-1-29-7), goes into a CrashLoopBackOff state.
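In case the exact observations matter, this is roughly how we are watching the restarts; the label selectors are the ones the standard manifests use, and the pod names are placeholders:

# restart counts and images of the CNI and EBS CSI node pods
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
kubectl get pods -n kube-system -l app=ebs-csi-node -o wide
# logs of the previous (crashed) containers from one of the affected pods
kubectl logs -n kube-system <aws-node-pod> -c aws-node --previous
kubectl logs -n kube-system <ebs-csi-node-pod> -c node-driver-registrar --previous
# last state / restart reason
kubectl describe pod -n kube-system <aws-node-pod>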
Some posts suggest that upgrading to the CNI version we are running caused issues like this and that downgrading solved it. I also see here (https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html) that v1.18.2-eksbuild.1 is the suggested version.
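Assuming the CNI is installed as the EKS managed add-on (rather than self-managed), checking the current version and moving to the suggested one would look roughly like this; <cluster-name> is a placeholder:

# current add-on version
aws eks describe-addon --cluster-name <cluster-name> --addon-name vpc-cni --query 'addon.addonVersion'
# versions available for our k8s version
aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.27 --query 'addons[0].addonVersions[].addonVersion'
# move to the version the docs suggest, preserving existing configuration (e.g. the env vars below)
aws eks update-addon --cluster-name <cluster-name> --addon-name vpc-cni --addon-version v1.18.2-eksbuild.1 --resolve-conflicts PRESERVE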
The issue comes and goes; it possibly manifests when there is higher demand for creating new pods, but not consistently.
We are not running more pods than the maximum (110 per node), and we have the following settings for our VPC CNI:
ENABLE_PREFIX_DELEGATION = "true"
WARM_PREFIX_TARGET = "1"
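These are set as environment variables on the aws-node DaemonSet; to double-check what the running pods actually have, and whether /28 prefixes are really being attached to the nodes' ENIs, something like this should do (the instance id is a placeholder):

# confirm what the running DaemonSet actually has
kubectl describe daemonset aws-node -n kube-system | grep -E 'ENABLE_PREFIX_DELEGATION|WARM_PREFIX_TARGET'
# check that prefixes are attached to a node's ENIs
aws ec2 describe-network-interfaces --filters Name=attachment.instance-id,Values=<node-instance-id> --query 'NetworkInterfaces[].Ipv4Prefixes'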