Ubuntu nodes failed to join fully private cluster


I have created a fully private cluster and it is working fine (kubectl, eksctl, and aws commands all work), but there is a problem with the cluster. Amazon Linux 2 node instances join the cluster successfully, but whenever I create Ubuntu instances I get the following error message:

Instance failed to join the kubernetes cluster,(Service:null, Status Code: 0, Request ID:null)(RequestToken:c912435454-d3d1-2352-542321-4523543243, HandlerErrorCode:GeneralServiceException)
2 Answers
Accepted Answer

I have found the solution. Since I am working behind a proxy and have no outbound internet access, I need to pass the user data using overrideBootstrapCommand.

For nodegroups that have no outbound internet access, you'll need to supply --apiserver-endpoint and --b64-cluster-ca to the bootstrap script as follows:

overrideBootstrapCommand: |
  #!/bin/bash

  source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh

  # Note "--node-labels=${NODE_LABELS}" needs the above helper sourced to work, otherwise will have to be defined manually.
  /etc/eks/bootstrap.sh ${CLUSTER_NAME} --container-runtime containerd --kubelet-extra-args "--node-labels=${NODE_LABELS}" \
    --apiserver-endpoint ${API_SERVER_URL} --b64-cluster-ca ${B64_CLUSTER_CA}

Sourcing bootstrap.helper.sh sets the variables referenced in the script above (CLUSTER_NAME, NODE_LABELS, API_SERVER_URL, B64_CLUSTER_CA) automatically, so we don't need to define them ourselves.

Note the --node-labels setting. If this is not defined, the node will join the cluster, but eksctl will ultimately time out on the last step while waiting for the nodes to become Ready. It does a Kubernetes lookup for nodes that carry the label alpha.eksctl.io/nodegroup-name=<nodegroup-name>. This only applies to unmanaged nodegroups.
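As a quick sketch of that lookup (the nodegroup name "ng-ubuntu" below is a placeholder, not from this thread), this is the label you would check for once a node has joined:

```shell
# Hypothetical unmanaged nodegroup name; substitute your own.
NODEGROUP="ng-ubuntu"
# The label eksctl polls for while waiting for nodes to become Ready:
LABEL="alpha.eksctl.io/nodegroup-name=${NODEGROUP}"
echo "${LABEL}"
# On a live cluster you would then run:
#   kubectl get nodes -l "${LABEL}"
```

If that kubectl query returns no nodes, eksctl will time out exactly as described above.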

If you have deployed a NAT gateway (or any other kind of gateway), the minimum you must preserve when overriding the bootstrap command, so that eksctl doesn't fail, is the labels: eksctl relies on a specific set of labels being on the node so it can find it. In that case there is no need to provide --apiserver-endpoint and --b64-cluster-ca.
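For that case, a minimal override might look like the following sketch (assuming the same helper script is available to supply CLUSTER_NAME and NODE_LABELS):

```yaml
overrideBootstrapCommand: |
  #!/bin/bash
  source /var/lib/cloud/scripts/eksctl/bootstrap.helper.sh
  # Labels are the one thing eksctl cannot do without:
  /etc/eks/bootstrap.sh ${CLUSTER_NAME} --kubelet-extra-args "--node-labels=${NODE_LABELS}"
```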

For more details, check this reference

answered 2 months ago

Hello,

It can happen for multiple reasons. To understand what is causing the issue, try to SSH into one of the Ubuntu instances that failed to join the cluster and check the kubelet status by running "systemctl status kubelet". If kubelet is in the active state, check the kubelet logs for any errors by running "journalctl -u kubelet".

This troubleshooting article explains about various things to check when your nodes fail to join the cluster. Please check it out.

SUPPORT ENGINEER
answered 2 months ago
  • Thanks for the answer. I did SSH into the instance and ran "systemctl status", which gave the following output:

    ● ip-192-168-69-96
        State: degraded
         Jobs: 0 queued
       Failed: 1 units
        Since: Mon 2022-09-26 13:55:18 UTC; 18min ago
       CGroup: /
               ├─user.slice 
               │ └─user-1000.slice 
               │   ├─user@1000.service 
               │   │ └─init.scope 
               │   │   ├─2968 /lib/systemd/systemd --user
               │   │   └─2969 (sd-pam)
               │   └─session-1.scope 
               │     ├─2965 sshd: ubuntu [priv]
               │     ├─3054 sshd: ubuntu@pts/0
               │     ├─3055 -bash
               │     ├─3141 systemctl status
               │     └─3142 pager
               ├─init.scope 
               │ └─1 /sbin/init
               └─system.slice 
                 ├─containerd.service 
                 │ └─546 /usr/bin/containerd
                 ├─systemd-networkd.service 
                 │ └─420 /lib/systemd/systemd-networkd
                 ├─systemd-udevd.service 
                 │ └─205 /lib/systemd/systemd-udevd
                 ├─system-serial\x2dgetty.slice 
                 │ └─serial-getty@ttyS0.service 
                 │   └─537 /sbin/agetty -o -p -- \u --keep-baud 115200,38400,9600 ttyS0 vt220
                 ├─networkd-dispatcher.service 
                 │ └─529 /usr/bin/python3 /usr/bin/networkd-dispatcher --run-startup-triggers
                 ├─snap.amazon-ssm-agent.amazon-ssm-agent.service 
                 │ ├─ 797 /snap/amazon-ssm-agent/5656/amazon-ssm-agent
                 │ └─1276 /snap/amazon-s
    
  • Looks like you ran "systemctl status", which gives the status of all systemd services. Run "systemctl status kubelet" to check the kubelet status alone. If kubelet isn't started, it could be a problem with your UserData.

  • You are right that there is a problem with my UserData. Since I am working behind a proxy and have no NAT or any other kind of gateway, I need to export no_proxy in my .bashrc file: export no_proxy=cluster_API_Endpoints. Could you please let me know how I can export this no_proxy in the .bashrc file for an eksctl cluster? I have tried multiple guides but have not been successful. This is how I am trying to add no_proxy to the .bashrc file:

    overrideBootstrapCommand: |
      #!/bin/bash
      echo "export HTTPS_PROXY=http://192.168.69.96:12300" >> /home/ubuntu/.bashrc
      echo "export HTTP_PROXY=http://192.168.69.96:12300" >> /home/ubuntu/.bashrc
      echo "export NO_PROXY=localhost,127.0.0.1,07873DEF7D2181JFODSJ432JLKFDSJLK4324KLJDFS2F.gr7.eu-west-5.eks.amazonaws.com,192.168.86.39,192.168.31.21" >> /home/ubuntu/.bashrc
      set -o xtrace
      /etc/eks/bootstrap.sh test-cluster
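One caveat worth noting here (my own sketch, not something stated in the thread): variables exported in /home/ubuntu/.bashrc only reach interactive login shells, so kubelet, which runs as a systemd service, never sees them. A common way to hand proxy settings to kubelet is a systemd drop-in file; the proxy address below is the one from the comment above, and a writable temp directory is used so the sketch stays runnable, where a real node would use /etc/systemd/system/kubelet.service.d.

```shell
# Sketch, assuming kubelet runs as a systemd unit. On a real node,
# DROPIN_DIR would be /etc/systemd/system/kubelet.service.d.
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
mkdir -p "$DROPIN_DIR"
# Write the proxy environment where systemd (and thus kubelet) will read it:
cat > "$DROPIN_DIR/proxy.conf" <<'EOF'
[Service]
Environment="HTTP_PROXY=http://192.168.69.96:12300"
Environment="HTTPS_PROXY=http://192.168.69.96:12300"
Environment="NO_PROXY=localhost,127.0.0.1"
EOF
# Append your cluster API endpoint and VPC CIDR to NO_PROXY, then on the node:
#   systemctl daemon-reload && systemctl restart kubelet
```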
    
