Upgrade of AWS EKS Node group failed with 'CNI plugin not initialized'


I'm running an AWS EKS cluster with a node group. The cluster is on version 1.25. The nodes are on AMI 1.23.9-20220926.

When updating the AMI to 1.25.16-20240514, the upgrade fails with error code "NodeCreationFailure" and the message "Couldn't proceed with upgrade process as new nodes are not joining node group my-ng".

During the update, 2 new nodes are started.

Executing

sudo tail -f /var/log/messages

on the new node shows the following errors:

May 30 06:37:26 ip-10-1-13-236 kubelet: E0530 06:37:26.793999 2963 pod_workers.go:965] "Error syncing pod, skipping" err="network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized" pod="kube-system/ebs-csi-node-XXXXX" podUID=XXXXX
May 30 06:37:26 ip-10-1-13-236 kubelet: E0530 06:37:26.806361 2963 kubelet.go:2399] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"
May 30 06:37:26 ip-10-1-13-236 kubelet: I0530 06:37:26.897099 2963 prober.go:114] "Probe failed" probeType="Readiness" pod="kube-system/aws-node-XXXXX" podUID=XXXXX containerName="aws-node" probeResult=failure output=<
May 30 06:37:26 ip-10-1-13-236 kubelet: {"level":"info","ts":"2024-05-30T06:37:26.893Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 5s"}

My Amazon VPC CNI add-on is in status Active with version v1.18.1-eksbuild.3; the Amazon EBS CSI driver is also active with version v1.31.0-eksbuild.1.
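
For reference, the add-on versions can be confirmed with the AWS CLI (the cluster name below is a placeholder):

aws eks describe-addon --cluster-name my-cluster --addon-name vpc-cni \
  --query "addon.{version:addonVersion,status:status}"
aws eks describe-addon --cluster-name my-cluster --addon-name aws-ebs-csi-driver \
  --query "addon.{version:addonVersion,status:status}"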

The newly created nodes disappear after a few minutes and the node group remains on the old AMI. The update status shows an error:

NodeCreationFailure: Couldn't proceed with upgrade process as new nodes are not joining node group my-ng.

Any help is highly appreciated.

asked a year ago · 2.6K views
3 Answers

Thanks for your reply. AWSSupport-TroubleshootEKSWorkerNode succeeds with a warning: "No secondary private IP addresses are assigned to worker node i-XYZ, ensure that the CNI plugin is running properly."

My aws-node pods have failing liveness and readiness probes. I've extended the timeout from 5s to 10s in the daemonset. That didn't fix it.

 Warning  Unhealthy  42s                kubelet            Readiness probe failed: {"level":"info","ts":"2024-05-30T10:16:56.989Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 10s"}
  Warning  Unhealthy  21s                kubelet            Readiness probe failed: {"level":"info","ts":"2024-05-30T10:17:17.088Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 10s"}
  Warning  Unhealthy  11s (x6 over 82s)  kubelet            Readiness probe failed: command "/app/grpc-health-probe -addr=:50051 -connect-timeout=10s -rpc-timeout=10s" timed out
  Warning  Unhealthy  4s (x2 over 14s)   kubelet            Liveness probe failed: command "/app/grpc-health-probe -addr=:50051 -connect-timeout=10s -rpc-timeout=10s" timed out
  Warning  Unhealthy  1s                 kubelet            Readiness probe failed: {"level":"info","ts":"2024-05-30T10:17:37.185Z","caller":"/usr/local/go/src/runtime/proc.go:267","msg":"timeout: failed to connect service \":50051\" within 10s"}
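
For reference, this is roughly how I changed the probes (a sketch; field names are those of the default aws-node daemonset, not an exact diff):

kubectl -n kube-system edit daemonset aws-node
# in the aws-node container, for both readinessProbe and livenessProbe:
#   exec.command: ... -connect-timeout=10s -rpc-timeout=10s
#   timeoutSeconds: 10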

My ipamd.log (cat /var/log/aws-routed-eni/ipamd.log) says:

{"level":"info","ts":"2024-05-30T10:22:14.629Z","caller":"aws-k8s-agent/main.go:42","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2024-05-30T10:22:14.629Z","caller":"aws-k8s-agent/main.go:53","msg":"Testing communication with server"}
{"level":"error","ts":"2024-05-30T10:22:19.629Z","caller":"wait/loop.go:53","msg":"Unable to reach API Server, Get \"https://172.20.0.1:443/version?timeout=5s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2024-05-30T10:22:24.630Z","caller":"wait/loop.go:87","msg":"Unable to reach API Server, Get \"https://172.20.0.1:443/version?timeout=5s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}

My subnets have more than 8,000 IPs available.

Checking node access to the API endpoint:

nc -vz XYZ.gr7.eu-central-1.eks.amazonaws.com 443
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection to XYZ failed: Connection timed out.
Ncat: Trying next address...
Ncat: Connection timed out.

The node has internet access; it can, for example, ping 8.8.8.8.

The API server endpoint access is set to public and the allowlist is 0.0.0.0/0. My node is in a private subnet. The subnet has a route to a NAT gateway that has a primary public IP assigned and is active.
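
To verify that route, something like this can be used (the subnet ID is a placeholder):

aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=subnet-XXXXXXXX" \
  --query "RouteTables[].Routes"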

What else could prevent my node from accessing the API server endpoint?

answered a year ago
  • Hi,

    1. Ensure the Amazon EKS cluster security group allows outbound traffic on port 443 (HTTPS) and inbound traffic on port 443 from the security group associated with your nodes (see the CLI sketch after this list).
    2. Verify that the network ACLs (NACLs) associated with the subnets where your nodes and API server reside allow the necessary traffic.
    3. Check DNS resolution for the API endpoint and verify that it resolves to the correct IP address.
    4. nc -vz <resolved_ip> 443
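
    For item 1, a minimal CLI sketch to dump the rules of the relevant security group (the group ID is a placeholder):

        aws ec2 describe-security-groups --group-ids sg-XXXXXXXX \
          --query "SecurityGroups[].[IpPermissions,IpPermissionsEgress]"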

    Hope this helps with the troubleshooting. Let me know if you have further questions; I would be happy to assist.


Hello,

I can confirm that Amazon VPC CNI v1.18.1-eksbuild.3 is compatible with EKS version 1.25: https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html

I would encourage you to follow these troubleshooting guides to identify why the readiness checks are failing for this pod:

  1. https://github.com/aws/amazon-vpc-cni-k8s/issues/1038
  2. https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md

Additionally, you can run the SSM automation AWSSupport-TroubleshootEKSWorkerNode to help you identify and troubleshoot common causes that prevent worker nodes from joining a cluster.

Important: For the automation to work, your worker nodes must have permission to access Systems Manager and have the SSM Agent running. To grant permission, attach the AmazonSSMManagedInstanceCore AWS managed policy to the IAM role that corresponds to your EC2 instance profile. This is the default configuration for EKS managed node groups created through eksctl.
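
For example, the runbook can be started from the CLI roughly like this (cluster name and instance ID are placeholders; parameter names as documented for the runbook):

aws ssm start-automation-execution \
  --document-name "AWSSupport-TroubleshootEKSWorkerNode" \
  --parameters "ClusterName=my-cluster,WorkerID=i-0123456789abcdef0"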

References:

  1. https://repost.aws/knowledge-center/resolve-eks-node-failures
  2. https://repost.aws/knowledge-center/eks-cni-plugin-troubleshooting
EXPERT
answered a year ago

Thanks for your reply.

  1. Ensure the Amazon EKS security group allows outbound traffic to port 443 (HTTPS) and inbound traffic on port 443 from the security group associated with your nodes.

I'm using the same security group for EKS and for the nodes. Outbound is open for all IPv4 traffic, all protocols, all ports, with destination 0.0.0.0/0. Inbound there is one rule for the same security group allowing all protocols and all ports. This doesn't seem to be the cause of my issue.

  2. Verify that the network ACLs (NACLs) associated with the subnets where your nodes and API server reside allow the necessary traffic. For my VPC there is one network ACL with one wide-open inbound rule, and the same for outbound. This doesn't seem to be the cause of my issue.

  3. Check DNS resolution for the API endpoint and check that it resolves to the correct IP address. nslookup for the API endpoint returns 2 IPs. This doesn't seem to be the cause of my issue.

  4. nc -vz <resolved_ip> 443

nc -vz XYZ.gr7.eu-central-1.eks.amazonaws.com 443
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connection to XYZ failed: Connection timed out.
Ncat: Trying next address...
Ncat: Connection timed out.

nc is still not successful.
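
For completeness, each resolved IP can also be tested directly (a sketch; getent is used so it works without dig installed):

for ip in $(getent ahostsv4 XYZ.gr7.eu-central-1.eks.amazonaws.com | awk '{print $1}' | sort -u); do
  nc -vz -w 5 "$ip" 443
done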

Any other ideas are highly appreciated.

answered a year ago
