How can I change the status of my nodes from NotReady or Unknown status to Ready status?


My Amazon Elastic Kubernetes Service (Amazon EKS) worker nodes are in NotReady or Unknown status. I want to get my worker nodes back in Ready status again.

Short description

You can't schedule pods on a node that's in NotReady or Unknown status. You can schedule pods only on a node that's in Ready status.

The following resolution addresses nodes in NotReady or Unknown status.

If your node is in the MemoryPressure, DiskPressure, or PIDPressure status, then you must manage your resources to allow additional pods to be scheduled on the node. If your node is in NetworkUnavailable status, then you must properly configure the network on the node. For more information, see Node status on the Kubernetes website.

Note: For information on managing pod evictions and resource limits, see Node-pressure eviction on the Kubernetes website.

Resolution

Check the aws-node and kube-proxy pods to see why the nodes are in NotReady status

A node in NotReady status isn't available for pods to be scheduled on.

To improve the security posture, the managed node group no longer attaches the Amazon VPC Container Network Interface (CNI) policy to the node role's Amazon Resource Name (ARN). Nodes change to NotReady status when this CNI policy is missing and no alternative, such as IAM Roles for Service Accounts (IRSA), is in place.

1.    To check whether the aws-node pod is in an error state, run the following command:

$ kubectl get pods -n kube-system -o wide

To resolve this issue, follow the guidelines to set up IAM Roles for Service Accounts (IRSA) for the aws-node DaemonSet.
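
As a quick check, you can verify that the aws-node service account has the IRSA annotation that associates it with an IAM role. The following command is one way to check this:

$ kubectl describe serviceaccount aws-node -n kube-system | grep eks.amazonaws.com/role-arn

If the annotation is missing, then the aws-node pods fall back to the credentials of the node instance role, and that role must then have the CNI policy attached.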

2.    To check the status of your aws-node and kube-proxy pods, run the following command:

$ kubectl get pods -n kube-system -o wide

3.    Check the status of the aws-node and kube-proxy pods by reviewing the output from step 2.

Note: The aws-node and kube-proxy pods are managed by a DaemonSet. This means that each node in the cluster must have one aws-node and one kube-proxy pod running on it. If no aws-node or kube-proxy pods are listed, skip to step 5. For more information, see DaemonSet on the Kubernetes website.

If your node status is normal, then your aws-node and kube-proxy pods should be in Running status. For example:

$ kubectl get pods -n kube-system -o wide
NAME                             READY   STATUS    RESTARTS   AGE        IP              NODE
aws-node-qvqr2                   1/1     Running   0          4h31m      192.168.54.115  ip-192-168-54-115.ec2.internal
kube-proxy-292b4                 1/1     Running   0          4h31m      192.168.54.115  ip-192-168-54-115.ec2.internal

If either pod is in a status other than Running, run the following command:

$ kubectl describe pod yourPodName -n kube-system

4.    To get additional information from the aws-node and kube-proxy pod logs, run the following command:

$ kubectl logs yourPodName -n kube-system

The logs and the events from the describe output can show why the pods aren't in Running status. For a node to change to Ready status, both the aws-node and kube-proxy pods must be Running on that node.

Note: The names of your pods can differ from aws-node-qvqr2 and kube-proxy-292b4 shown in the preceding example output.
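
To narrow the output to the pods that run on the affected node, you can use a field selector. Replace <node-name> with the name of the NotReady node:

$ kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<node-name>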

5.    If the aws-node and kube-proxy pods aren't listed in the output from step 2, then run the following commands:

$ kubectl describe daemonset aws-node -n kube-system
$ kubectl describe daemonset kube-proxy -n kube-system
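
You can also compare the desired and ready pod counts of both DaemonSets in a single command. If the READY count is lower than the DESIRED count, then at least one node is missing a pod:

$ kubectl get daemonset aws-node kube-proxy -n kube-system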

6.    Search the output of the commands in step 5 for a reason why the pods can't be started.

Tip: You can search the Amazon EKS control plane logs for information on why the pods can't be scheduled.
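
If control plane logging isn't turned on for your cluster, then you can enable it with the AWS CLI and review the logs in the /aws/eks/<cluster-name>/cluster log group in CloudWatch. The cluster name and Region in the following command are placeholders:

$ aws eks update-cluster-config --region <region> --name <cluster-name> \
    --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'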

7.    Confirm that the versions of aws-node and kube-proxy are compatible with the cluster version based on AWS guidelines. For example, you can run the following commands to check the pod versions:

$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2

$ kubectl get daemonset kube-proxy --namespace kube-system -o=jsonpath='{$.spec.template.spec.containers[:1].image}'

Note: To update the aws-node version, see Managing the Amazon VPC CNI plugin for Kubernetes add-on. To update the kube-proxy version, follow step 4 in Update the Kubernetes version for your Amazon EKS cluster.
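
To see which add-on versions AWS publishes for your cluster's Kubernetes version, you can query the Amazon EKS API. The Kubernetes version in the following commands is an example; the --query expression only trims the output and can be omitted:

$ aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.28 --query 'addons[].addonVersions[].addonVersion'
$ aws eks describe-addon-versions --addon-name kube-proxy --kubernetes-version 1.28 --query 'addons[].addonVersions[].addonVersion'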

In some scenarios, the node can be in Unknown status. This means that the kubelet on the node can't communicate the correct status of the node to the control plane.

To troubleshoot nodes in Unknown status, complete the steps in the following sections:

  • Check the network configuration between nodes and the control plane
  • Check the status of the kubelet
  • Check that the Amazon EC2 API endpoint is reachable

Check the network configuration between nodes and the control plane

1.    Confirm that there are no network access control list (ACL) rules on your subnets blocking traffic between the Amazon EKS control plane and your worker nodes.

2.    Confirm that the security groups for your control plane and nodes comply with minimum inbound and outbound requirements.

3.    (Optional) If your nodes are configured to use a proxy, confirm that the proxy is allowing traffic to the API server endpoints.

4.    To verify that the node has access to the API server, run the following netcat command from inside the worker node:

$ nc -vz 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com 443
Connection to 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com 443 port [tcp/https] succeeded!

Important: Replace 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com with your API server endpoint.

5.    Check that the route tables are configured correctly to allow communication with the API server endpoint through either an internet gateway or NAT gateway. If the cluster uses PrivateOnly networking, verify that the VPC endpoints are configured correctly.
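
You can review the route table that's associated with the node's subnet and, for private clusters, the VPC endpoints, with the AWS CLI. The subnet and VPC IDs are placeholders:

$ aws ec2 describe-route-tables --filters Name=association.subnet-id,Values=<subnet-id>
$ aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=<vpc-id>

Note: If the first command returns no route tables, then the subnet uses the VPC's main route table.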

Check the status of the kubelet

1.    Use SSH to connect to the affected worker node.

2.    To check the kubelet logs, run the following command:

$ journalctl -u kubelet > kubelet.log

Note: The kubelet.log file contains information on kubelet operations that can help you find the root cause of the node status issue.
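
Because the kubelet.log file can be large, you can also filter the journal for recent errors. The time window in the following command is only an example:

$ journalctl -u kubelet --since "1 hour ago" --no-pager | grep -iE "error|fail"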

If the logs don't provide information on the source of the issue, then run the following command to check the status of the kubelet on the worker node:

$ sudo systemctl status kubelet
  kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-eksclt.al2.conf
   Active: inactive (dead) since Wed 2019-12-04 08:57:33 UTC; 40s ago

If the kubelet isn't running (for example, the output shows Active: inactive (dead)), then run the following command to restart the kubelet:

$ sudo systemctl restart kubelet
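
After the restart, confirm that the kubelet is active, and then watch for the node to return to Ready status. Run the second command from a machine that has kubectl access to the cluster:

$ sudo systemctl is-active kubelet
$ kubectl get nodes --watch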

Check that the Amazon EC2 API endpoint is reachable

1.    Use SSH to connect to one of the worker nodes.

2.    To check if the Amazon Elastic Compute Cloud (Amazon EC2) API endpoint for your AWS Region is reachable, run the following command:

$ nc -vz ec2.<region>.amazonaws.com 443
Connection to ec2.us-east-1.amazonaws.com 443 port [tcp/https] succeeded!

Important: Replace <region> with the AWS Region where your worker node is located (for example, us-east-1).

Check the worker node instance profile and the ConfigMap

1.    Confirm that the worker node instance profile has the recommended policies.

2.    Confirm that the worker node instance role is in the aws-auth ConfigMap. To check the ConfigMap, run the following command:

$ kubectl get cm aws-auth -n kube-system -o yaml

The ConfigMap should have an entry for the worker node's AWS Identity and Access Management (IAM) instance role. For example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: <ARN of instance role (not instance profile)>
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
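
To compare the role ARN in the ConfigMap with the role that the worker node actually uses, and to review the attached policies, you can run the following AWS CLI commands. The instance ID, instance profile name, and role name are placeholders:

$ aws ec2 describe-instances --instance-ids <instance-id> --query 'Reservations[].Instances[].IamInstanceProfile.Arn'
$ aws iam get-instance-profile --instance-profile-name <instance-profile-name> --query 'InstanceProfile.Roles[].Arn'
$ aws iam list-attached-role-policies --role-name <node-role-name>

The attached policies should include AmazonEKSWorkerNodePolicy and AmazonEC2ContainerRegistryReadOnly, plus either AmazonEKS_CNI_Policy on the role or an equivalent IRSA configuration.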
