EKS 1.21 cluster no longer scheduling pods


Issue summary: on an EKS 1.21 cluster, newly launched nodes are not becoming Ready and the existing ones are not scheduling pods.

Since yesterday evening (around 5-6 PM CET+1) we are unable to schedule pods, and new nodes never join the cluster because they are unable to schedule and run pods for aws-node, CoreDNS, etc.

Attempts to fix:

1. Scaling the node group
2. Adding a new node group

Nodes are still not becoming Ready. Apart from our normal application deployments with Flux, we have not changed any security settings or configuration of the cluster or nodes.
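For reference, the basic checks for nodes that never become Ready look roughly like this (cluster name, node group name and region below are placeholders, not our real ones):

kubectl get nodes -o wide

# Managed node group health issues often explain nodes that never join
aws eks describe-nodegroup --cluster-name my-cluster --nodegroup-name my-nodegroup \
  --region eu-north-1 --query 'nodegroup.health.issues'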

Logs:

Below is a snippet of the control plane (kube-scheduler) logs:

E0216 13:35:13.581039      10 scheduler.go:344] "Error updating pod" err="the server was unable to return a response in the time allotted, but may still be processing the request (patch pods test-pod-prasad)" pod="cluster-services/test-pod-prasad"
E0216 13:36:13.583815      10 framework.go:898] "Failed running Bind plugin" err="the server was unable to return a response in the time allotted, but may still be processing the request (post pods test-pod)" plugin="DefaultBinder" pod="cluster-services/test-pod"
I0216 13:36:13.583860      10 scheduler.go:435] "Failed to bind pod" pod="cluster-services/test-pod"
E0216 13:36:13.583913      10 factory.go:355] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": the server was unable to return a response in the time allotted, but may still be processing the request (post pods test-pod)" pod="cluster-services/test-pod"
E0216 13:36:43.597058      10 framework.go:898] "Failed running Bind plugin" err="the server was unable to return a response in the time allotted, but may still be processing the request (post pods test-pod-prasad)" plugin="DefaultBinder" pod="cluster-services/test-pod-prasad"
I0216 13:36:43.597095      10 scheduler.go:435] "Failed to bind pod" pod="cluster-services/test-pod-prasad"
E0216 13:36:43.597134      10 factory.go:355] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": the server was unable to return a response in the time allotted, but may still be processing the request (post pods test-pod-prasad)" pod="cluster-services/test-pod-prasad"
E0216 13:37:13.585621      10 scheduler.go:344] "Error updating pod" err="the server was unable to return a response in the time allotted, but may still be processing the request (patch pods test-pod)" pod="cluster-services/test-pod"
E0216 13:37:24.473084      10 framework.go:898] "Failed running Bind plugin" err="the server was unable to return a response in the time allotted, but may still be processing the request (post pods test-pod)" plugin="DefaultBinder" pod="cluster-services/test-pod"
I0216 13:37:24.473144      10 scheduler.go:435] "Failed to bind pod" pod="cluster-services/test-pod"
E0216 13:37:24.473189      10 factory.go:355] "Error scheduling pod; retrying" err="binding rejected: running Bind plugin \"DefaultBinder\": the server was unable to return a response in the time allotted, but may still be processing the request (post pods test-pod)" pod="cluster-services/test-pod"
E0216 13:37:43.605065      10 scheduler.go:344] "Error updating pod" err="the server was unable to return a response in the time allotted, but may still be processing the request 

asked a year ago · 518 views
4 Answers

Hi,

I have to admit that at first glance I was not interested at all, and for a moment I was tempted to tell you to run kubectl describe node <nodename> and look at the conditions on the nodes. But then I realized this question is loaded with some deep, interesting things. The first thing I noticed, and the one that caught my attention, was:

binding rejected: running Bind plugin \"DefaultBinder\"

And then I was thinking, what the heck is that! So I went looking for the function, because, well, why not?

https://pkg.go.dev/k8s.io/kubernetes/pkg/scheduler/framework/plugins/defaultbinder#DefaultBinder.Bind

Yes, that beauty has a function called "Bind". At first glance you get lost and think, "Damn, what the heck (again)", but looking closer you see:

state *framework.CycleState

Now, this one has a description like this:

CycleState provides a mechanism for plugins to store and retrieve arbitrary data. StateData stored by one plugin can be read, altered, or deleted by another plugin. CycleState does not provide any data protection, as all plugins are assumed to be trusted. Note: CycleState uses a sync.Map to back the storage. It's aimed to optimize for the "write once and read many times" scenarios. It is the recommended pattern used in all in-tree plugins - plugin-specific state is written once in PreFilter/PreScore and afterwards read many times in Filter/Score.

So I was thinking, this has to be the place where the plugins store the NodeInfo, or something like that, but the info must come from somewhere :) - the kube-apiserver, perhaps?

So I got curious about the second statement in the log:

the server was unable to return a response in the time allotted, but may still be processing the request

In the Kubernetes repo, at v1.21.0, in the file:

vendor/k8s.io/apimachinery/pkg/api/errors/errors.go

https://github.com/kubernetes/kubernetes/blob/v1.21.0/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L435

you can see the error message under the case for http.StatusGatewayTimeout:

433         case http.StatusGatewayTimeout:
434                 reason = metav1.StatusReasonTimeout
435                 message = "the server was unable to return a response in the time allotted, but may still be processing the request"

It is one of the cases in this function:

https://github.com/kubernetes/kubernetes/blob/v1.21.0/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L387

387 // NewGenericServerResponse returns a new error for server responses that are not in a recognizable form.
388 func NewGenericServerResponse(code int, verb string, qualifiedResource schema.GroupResource, name, serverMessage string, retryAfterSeconds int, isUnexpectedResponse bool) *StatusError {
389         reason := metav1.StatusReasonUnknown
390         message := fmt.Sprintf("the server responded with the status code %d but did not return more information", code)
391         switch code {
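As an aside, if you can reach the cluster with kubectl, you can surface those HTTP status codes and the API server's own health report directly; the endpoints below are the standard ones, and the namespace is just the one from the logs above:

kubectl get pods -n cluster-services -v=8   # verbose output shows the raw HTTP status (504 = gateway timeout)

kubectl get --raw='/readyz?verbose'         # per-check readiness report from the API server
kubectl get --raw='/livez?verbose'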

So, this means requests to the API server on the control plane are not getting a response in time. With this in mind, I remembered the section "Check the network configuration between nodes and the control plane" from

https://aws.amazon.com/premiumsupport/knowledge-center/eks-node-status-ready/

And there you can find a few things to check:

  1. Confirm that there are no network access control list (ACL) rules on your subnets blocking traffic between the Amazon EKS control plane and your worker nodes.

  2. Confirm that the security groups for your control plane and nodes comply with minimum inbound and outbound requirements. <<----- THIS ONE IS IMPORTANT!

    Check https://docs.aws.amazon.com/eks/latest/userguide/sec-group-reqs.html and add the tags if they are missing from the Security Group

  3. (Optional) If your nodes are configured to use a proxy, confirm that the proxy is allowing traffic to the API server endpoints.

  4. To verify that the node has access to the API server, run the following netcat command from inside the worker node:

$ nc -vz 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com 443
Connection to 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com 443 port [tcp/https] succeeded!

Important: Replace 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com with your API server endpoint.

  5. Check that the route tables are configured correctly to allow communication with the API server endpoint through either an internet gateway or a NAT gateway. If the cluster uses PrivateOnly networking, verify that the VPC endpoints are configured correctly (a rough CLI sketch follows this list).
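A minimal sketch of those checks with the AWS CLI; the cluster name, security group ID, VPC ID and region below are placeholders, so substitute your own:

aws eks describe-cluster --name my-cluster --region eu-north-1 \
  --query 'cluster.resourcesVpcConfig.{clusterSG:clusterSecurityGroupId,extraSGs:securityGroupIds,privateAccess:endpointPrivateAccess,publicAccess:endpointPublicAccess}'

# Inspect the inbound/outbound rules of the cluster security group returned above
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 --region eu-north-1

# For PrivateOnly clusters, confirm the required VPC endpoints exist in the cluster VPC
aws ec2 describe-vpc-endpoints --region eu-north-1 \
  --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
  --query 'VpcEndpoints[].ServiceName'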

I hope this helps you find your issue and solve it. I really hope it is just missing tags :)

answered a year ago

Thank you for your answer. There were definitely no changes to ACLs/SGs/tags: we use Terraform and track drift on a daily basis, and no changes happened.

Point 4 fails, but we were able to curl the endpoint. We have currently scaled down the cluster and kept only the control plane to investigate the issue. We are not using a proxy, but the cluster endpoint is private. Everything works fine on 2 other clusters with similar configurations (prod/stage, even with stricter security).
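For what it's worth, the nc-fails-but-curl-works contrast can be narrowed down roughly like this from a worker node (the endpoint below is the placeholder one from the answer above; dig and nc may need to be installed on the node):

ENDPOINT=9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com

# Does the private endpoint resolve to the in-VPC ENI addresses?
dig +short "$ENDPOINT"

# Raw TCP reachability vs. a full TLS request; any HTTP code back (even 401/403) proves the TLS path works
nc -vz "$ENDPOINT" 443
curl -sk -o /dev/null -w '%{http_code}\n' "https://$ENDPOINT/version"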

The issue goes beyond node <-> control plane connectivity; it affects scheduling of any pod on the cluster, and we have issues deleting namespaces even after deleting all workers and node groups.

A namespace has been stuck deleting since the issue started (last week):

kubectl get ns
E0222 10:35:02.514614 39755 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0222 10:35:02.573107 39755 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0222 10:35:02.607258 39755 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0222 10:35:02.630232 39755 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
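Those metrics.k8s.io errors come from the API aggregation layer rather than from the namespace itself; a quick way to confirm, assuming metrics-server is the backing service (as the group name suggests) and is installed under its default name:

kubectl get apiservice v1beta1.metrics.k8s.io
kubectl describe apiservice v1beta1.metrics.k8s.io   # look at the Available condition

# With no nodes left, metrics-server has nowhere to run, so the APIService stays unavailable
kubectl -n kube-system get deploy metrics-server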

answered a year ago

Hi,

In that case it could be a finalizers problem when trying to delete the namespace (there is also a check for leftover resources sketched after the steps below).

  1. Save the namespace object to a JSON file:

kubectl get namespace <terminating-namespace> -o json > tempfile.json

  2. Remove the finalizers block:

"spec": {
        "finalizers": [
            "kubernetes"
        ]
    }

After removal:

"spec" : {
    }

  3. Apply it, making sure to replace <terminating-namespace>:

kubectl replace --raw "/api/v1/namespaces/<terminating-namespace>/finalize" -f ./tempfile.json
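Before stripping the finalizer, it can be worth checking whether anything is actually left in the namespace, since that is usually what the finalizer is waiting on; a rough one-liner (namespace is a placeholder):

kubectl api-resources --verbs=list --namespaced -o name \
  | xargs -n 1 kubectl get --show-kind --ignore-not-found -n <terminating-namespace>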

good luck!

answered a year ago

The deletion of the namespaces is not the issue here; I mentioned it to show that the cluster control plane did not allow adding nodes for 2 weeks and we were not able to explain why. The issue is summarised below:

1. We had over 20 nodes that were Ready and working (managed nodes).
2. The autoscaler was crashing because it was unable to scale.
3. New nodes were not becoming Ready.
4. Node kubelet logs showed that authentication to the API was fine, but after that the CNI plugin was not ready, even though nothing changed on the network side (everything is created via Terraform and we are notified of every little drift); see the sketch after this list.
5. Pods were not getting scheduled.
6. The scheduler produced the logs I pasted previously.
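Roughly the kind of checks behind points 3 and 4 (daemonset and container names assume the default EKS add-ons; exact invocations may differ):

# From a machine with kubectl access: is the CNI daemonset scheduling and healthy?
kubectl -n kube-system get daemonset aws-node kube-proxy
kubectl -n kube-system get pods -o wide | grep -E 'aws-node|coredns|kube-proxy'
kubectl -n kube-system logs daemonset/aws-node -c aws-node --tail=50

# From the node itself: kubelet logs around the "CNI plugin not ready" messages
journalctl -u kubelet --no-pager --since "1 hour ago" | tail -n 100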

After leaving the cluster unused for 2 weeks without any nodes, everything went back to normal and I was able to add node groups and they became Ready. The only difference was that self-managed nodes seemed to fail during testing, while managed nodes with 2 different AWS AMIs (older and latest) worked fine.

So has something been fixed or changed on the control plane side? Unfortunately we deleted the cluster, because during testing we lost access by applying a ConfigMap which replaced the existing one :$ :( .

The issue remains a mystery. Two other engineers and I worked on troubleshooting it, despite the fact that I'm an ex AWS support engineer :S :|

Thanks for your input

answered a year ago
