Hi,
I gotta admit that at first glance I was not interested at all, and for a moment I was tempted to tell you to run kubectl describe node <nodename> and look at the node conditions. But then I realized this question is loaded with some deeply interesting things. The first one I noticed, and the one that caught my attention, was:
binding rejected: running Bind plugin \"DefaultBinder\"
And then I was thinking, what the heck is that? So I went and found the function, because, well, why not?
Yes, that beauty is a function called "Bind". At first glance you get lost and think, "Damn, what the heck (again)?", but looking closer you see one of its parameters:
state *framework.CycleState
Now, this one has a description like this:
CycleState provides a mechanism for plugins to store and retrieve arbitrary data. StateData stored by one plugin can be read, altered, or deleted by another plugin. CycleState does not provide any data protection, as all plugins are assumed to be trusted. Note: CycleState uses a sync.Map to back the storage. It's aimed to optimize for the "write once and read many times" scenarios. It is the recommended pattern used in all in-tree plugins - plugin-specific state is written once in PreFilter/PreScore and afterwards read many times in Filter/Score.
So I was thinking, damn, this has to be the place where the plugins store the NodeInfo, or something like that. But the info must come from somewhere :) - the kube-apiserver perhaps?
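And it does come from the kube-apiserver: the scheduler watches pods and nodes through the API and, once it has picked a node, writes the decision back by sending a Binding object to that same API. That is also why a failed bind normally shows up as an event on the pod, so from your side a quick look would be something like this (just a sketch, <pod-name> is a placeholder):
$ kubectl describe pod <pod-name> | grep -A 15 "Events:"
$ kubectl get events --sort-by=.lastTimestamp | grep -iE "bind|schedul"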
So I got curious about the second statement in the log:
the server was unable to return a response in the time allotted, but may still be processing the request
In the Kubernetes repo, at v1.21.0, in the file:
vendor/k8s.io/apimachinery/pkg/api/errors/errors.go
you can see that error message under the case for http.StatusGatewayTimeout:
433 case http.StatusGatewayTimeout:
434 reason = metav1.StatusReasonTimeout
435 message = "the server was unable to return a response in the time allotted, but may still be processing the request"
It is one of the cases in this function:
387 // NewGenericServerResponse returns a new error for server responses that are not in a recognizable form.
388 func NewGenericServerResponse(code int, verb string, qualifiedResource schema.GroupResource, name, serverMessage string, retryAfterSeconds int, isUnexpectedResponse bool) *StatusError {
389 reason := metav1.StatusReasonUnknown
390 message := fmt.Sprintf("the server responded with the status code %d but did not return more information", code)
391 switch code {
So this means something is calling the Kubernetes API on the control plane and is not getting a response in time. With this in mind, I remembered an article about "Check the network configuration between nodes and the control plane" from
https://aws.amazon.com/premiumsupport/knowledge-center/eks-node-status-ready/
And there you can find a few things to check:
- Confirm that there are no network access control list (ACL) rules on your subnets blocking traffic between the Amazon EKS control plane and your worker nodes.
- Confirm that the security groups for your control plane and nodes comply with the minimum inbound and outbound requirements. <<----- THIS ONE IS IMPORTANT! Check https://docs.aws.amazon.com/eks/latest/userguide/sec-group-reqs.html and add the tags if they are missing from the security group (see the CLI sketch after this list).
- (Optional) If your nodes are configured to use a proxy, confirm that the proxy is allowing traffic to the API server endpoints.
- To verify that the node has access to the API server, run the following netcat command from inside the worker node:
$ nc -vz 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com 443
Connection to 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com 443 port [tcp/https] succeeded!
Important: Replace 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com with your API server endpoint.
- Check that the route tables are configured correctly to allow communication with the API server endpoint through either an internet gateway or NAT gateway. If the cluster uses PrivateOnly networking, verify that the VPC endpoints are configured correctly.
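A quick way to pull the pieces for those checks from the CLI (just a sketch, <cluster-name>, <cluster-sg-id> and <vpc-id> are placeholders you have to replace):
# The cluster security group that EKS attaches to the control plane ENIs (and to managed nodes)
$ aws eks describe-cluster --name <cluster-name> --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text
# Compare its inbound/outbound rules against the minimum requirements from the doc above
$ aws ec2 describe-security-groups --group-ids <cluster-sg-id>
# For PrivateOnly networking, list the VPC endpoints in the cluster VPC and confirm the required ones (ec2, ecr.api, ecr.dkr, s3, sts, ...) are present
$ aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=<vpc-id> --query 'VpcEndpoints[].ServiceName'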
I hope this helps you find your issue and that you are able to solve it. I really hope it is just missing tags :)
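One more thing: on EKS the scheduler is managed, so if you want to keep reading these scheduler messages yourself you have to enable control plane logging and then look in the CloudWatch log group /aws/eks/<cluster-name>/cluster (a sketch, <cluster-name> is a placeholder):
$ aws eks update-cluster-config --name <cluster-name> --logging '{"clusterLogging":[{"types":["scheduler","api"],"enabled":true}]}'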
Thank you for your answer. There were definitely no changes in ACLs/SGs/tags; we use Terraform and consistently track drift on a daily basis, and no changes happened.
Point 4 (the netcat check) fails, but we were able to curl. We have currently scaled the cluster down and kept the control plane to investigate the issue. We are not using a proxy, but the cluster endpoint is private. Everything works fine on 2 other clusters with similar configurations (prod/stage, even with stricter security).
The issue is beyond node <-> control plane connectivity; it is related to scheduling any pod on the cluster, and we have issues deleting namespaces even after deleting all workers and node groups.
A namespace has been stuck deleting since the issue started (since last week):
$ kubectl get ns
E0222 10:35:02.514614 39755 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0222 10:35:02.573107 39755 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0222 10:35:02.607258 39755 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0222 10:35:02.630232 39755 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
Hi,
In that case it could be a finalizers problem when trying to delete the namespace.
- Save the namespace object to a JSON file:
kubectl get namespace <terminating-namespace> -o json > tempfile.json
- Remove the finalizers block from tempfile.json:
"spec": {
  "finalizers": [
    "kubernetes"
  ]
}
so that after removing it you are left with:
"spec": {
}
- Apply it, making sure to replace <terminating-namespace> with your namespace name:
kubectl replace --raw "/api/v1/namespaces/<terminating-namespace>/finalize" -f ./tempfile.json
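If you prefer a one-liner, this does the same thing by piping through jq:
$ kubectl get namespace <terminating-namespace> -o json | jq '.spec.finalizers = []' | kubectl replace --raw "/api/v1/namespaces/<terminating-namespace>/finalize" -f -
Also, since your kubectl output shows discovery failing for metrics.k8s.io/v1beta1, it is worth checking that APIService too; a broken aggregated API can keep a namespace stuck in Terminating because the namespace controller cannot confirm that every resource type is gone (a quick check, assuming the standard metrics-server labels):
$ kubectl get apiservice v1beta1.metrics.k8s.io
$ kubectl -n kube-system get pods -l k8s-app=metrics-server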
good luck!
The deletion of the namespaces is not the issue here; I only mentioned it to point out that the cluster control plane didn't allow adding nodes for 2 weeks and we were not able to explain why. The issue is summarised below:
1. We had over 20 nodes that were ready and working (managed nodes)
2. The autoscaler was crashing as it was unable to scale
3. New nodes were not becoming ready
4. The node kubelet logs show that authentication to the API is fine, and after that the CNI plugin is reported as not ready, even though nothing changed on the network side (everything is created via Terraform and we are notified of every little drift)
5. Pods were not getting scheduled
6. The scheduler logs show the messages I pasted previously
After leaving the cluster unused for 2 weeks without any nodes, everything went back to normal and I was able to add node groups and they were getting ready. The only difference is that self-managed nodes seemed to fail when testing, but managed nodes with 2 different AWS AMIs (older and latest) worked fine.
So has something been fixed or changed on the control plane side? Unfortunately we deleted the cluster, as during testing we lost access by applying a ConfigMap which also replaced the existing one :$ :(
The issue remains a mystery. Two other engineers and I worked on troubleshooting it, and that is with me being an ex-AWS support engineer :S :|
Thanks for your input