How do I troubleshoot unhealthy targets for Network Load Balancers in Amazon EKS?

I want to resolve unhealthy targets for Network Load Balancers in my Amazon Elastic Kubernetes Service (Amazon EKS) cluster.

Short description

The following are common reasons why the targets for your Network Load Balancer are unhealthy:

  • The health check is incorrectly configured. To resolve this issue, manually initiate the health check from a host machine that's running within the Amazon Virtual Private Cloud (Amazon VPC).
  • There's an unexpected exception from the pod. To resolve this issue, follow the troubleshooting steps in the Check if there's an unexpected exception from the pod Resolution section.
  • The Network Load Balancer's externalTrafficPolicy (from the Kubernetes website) is set to Local, and the Amazon VPC uses a custom DNS name in the DHCP options set. To resolve this issue, patch kube-proxy with the hostname override flag.

Note: You can determine whether the target group type is IP address or instance by checking whether the service annotation service.beta.kubernetes.io/aws-load-balancer-nlb-target-type exists.

Resolution

Verify if the target group is an IP address or instance

Run the following command:

kubectl get service service_name -o yaml

Note: Replace service_name with your service's name. If the service.beta.kubernetes.io/aws-load-balancer-nlb-target-type annotation isn't present, then the default target type is an instance.
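
For example, to print only that annotation, you can query it directly with jsonpath. The service name my-service is a placeholder:

kubectl get service my-service -o jsonpath='{.metadata.annotations.service\.beta\.kubernetes\.io/aws-load-balancer-nlb-target-type}'

An empty result means that the annotation isn't set and the default instance target type applies.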

Verify that the health check is correctly configured

Check which Elastic Load Balancing (ELB) annotations (from the Kubernetes website) are configured for your service:

kubectl get service service_name -o yaml

Example output:

service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
# The number of successive successful health checks required for a backend to be considered healthy for traffic. Defaults to 2, must be between 2 and 10

service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
# The number of unsuccessful health checks required for a backend to be considered unhealthy for traffic. Defaults to 6, must be between 2 and 10

service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "20"
# The approximate interval, in seconds, between health checks of an individual instance. Defaults to 10, must be between 5 and 300

service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"
# The amount of time, in seconds, during which no response means a failed health check. This value must be less than the service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval value. Defaults to 5, must be between 2 and 60

service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: TCP
# The protocol that the health check uses

service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
# The health check port. Can be an integer or traffic-port

service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/"
# The health check path. Applies when the health check protocol is HTTP or HTTPS

If the preceding annotations are incorrectly configured, then the targets can be unhealthy.
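
For reference, the following is a minimal sketch of a LoadBalancer service that sets the preceding annotations consistently (the timeout is less than the interval, and the thresholds are within range). The service name, labels, ports, and health check path are placeholders, and the annotation that provisions the Network Load Balancer itself is omitted because it depends on whether you use the in-tree provider or the AWS Load Balancer Controller:

apiVersion: v1
kind: Service
metadata:
  name: my-service                     # placeholder name
  annotations:
    # Add the annotation that provisions a Network Load Balancer for your setup
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: "HTTP"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "traffic-port"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: "/healthz"   # a path that the pod actually serves
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"       # must be less than the interval
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
spec:
  type: LoadBalancer
  selector:
    app: my-app                        # must match the pod labels
  ports:
  - port: 80
    targetPort: 8080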

Manually initiate the health check from a host machine that's running within the Amazon VPC

For instance target types, run the following curl command with NodePort:

curl -ivk node_IP:NodePort

Note: Replace node_IP with your node's IP address and NodePort with your service's node port.

For IP address target types, run the following curl command:

curl -ivk pod_IP:pod_port

Note: Replace pod_IP with your pod's IP address and pod_port with your pod's port.
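
If you don't have these values on hand, you can look them up with kubectl. The service and pod names (my-service, my-pod) are placeholders:

# NodePort and node IP addresses (instance target type)
kubectl get service my-service -o jsonpath='{.spec.ports[0].nodePort}'
kubectl get nodes -o wide

# Pod IP address and container port (IP address target type)
kubectl get pod my-pod -o wide
kubectl get pod my-pod -o jsonpath='{.spec.containers[0].ports[0].containerPort}'

The containerPort query returns a value only if the container declares its ports in the pod specification.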

Check if there's an unexpected exception from the pod

Instance target type

Check the service specification for the current health check configuration annotations (from the GitHub website):

kubectl get service service_name -o yaml

Check if there are endpoints to verify that there are pods behind the service:

kubectl get endpoints service_name -o yaml

If no endpoints exist for the service, then check that the pod labels and service labels match:

kubectl describe service service_name
kubectl describe pod pod_name
kubectl get pod --show-labels

Note: Replace pod_name with your pod's name.
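
For example, you can compare the service's selector with the pods that carry those labels. The selector key and value (app=my-app) are placeholders:

kubectl get service my-service -o jsonpath='{.spec.selector}'
kubectl get pods -l app=my-app --show-labels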

Check that the pods are in Running status and haven't restarted:

kubectl get pods -o wide

If there are restarts, then collect the pod logs to determine the cause:

kubectl logs pod_name
kubectl logs pod_name --previous
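
To see why a container restarted, you can also check its last terminated state. The pod name my-pod is a placeholder:

kubectl describe pod my-pod
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'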

Log in to a host machine in the Amazon VPC where you can communicate with the node.

Use the curl command with NodePort to check if the pods are returning the expected HTTP status code:

curl node_IP:NodePort

If the curl command doesn't return the expected HTTP status code, then the backend pods aren't returning the expected HTTP status code either.

Use the same host machine to connect to the pod's IP address and check if the pod is correctly configured:

curl pod_IP:pod_port

If the curl command doesn't return the expected HTTP status code, then the pod isn't correctly configured.

Note: If the service's externalTrafficPolicy (from the Kubernetes website) is set to Local, then only the nodes where the service's backend pods are running are seen as healthy targets.
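
To check the policy, and the health check node port that Kubernetes assigns when the policy is Local, you can query the service directly. The service name my-service is a placeholder:

kubectl get service my-service -o jsonpath='{.spec.externalTrafficPolicy}'
kubectl get service my-service -o jsonpath='{.spec.healthCheckNodePort}'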

IP address target type

Check the service specification for the current health check configuration annotations (from the GitHub website):

kubectl get service service_name -o yaml

Log in to a host machine in the Amazon VPC and use the curl command to communicate with the pod's IP address:

curl pod_IP:pod_port

If the curl command doesn't return the expected HTTP status code, then the pod isn't correctly configured.

Patch kube-proxy with the hostname override flag

Modify the command, args, and env fields in the kube-proxy DaemonSet specification as follows:

---
spec:
  template:
    spec:
      containers:
      - name: kube-proxy
        command: [ "/bin/sh" ]
        args:
        - -c
        - |
          kube-proxy --v=2 --hostname-override=$(NODE_NAME) --config=/var/lib/kube-proxy-config/config
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
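
One way to apply this change is to save the preceding snippet to a file and patch the DaemonSet in the kube-system namespace. The file name is a placeholder, and you should confirm that the --config path matches your cluster's kube-proxy setup before you apply it:

kubectl -n kube-system patch daemonset kube-proxy --patch-file kube-proxy-hostname-override.yaml

# Alternatively, edit the DaemonSet in place
kubectl -n kube-system edit daemonset kube-proxy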

For instance target types, if the externalTrafficPolicy is set to Cluster or Local, then the node security group's default ingress rule for the NodePort allows 0.0.0.0/0. Also, when the externalTrafficPolicy is set to Local, an additional health check NodePort is configured to allow the subnet CIDR IP address ranges.

To control the source IP addresses that the node security group allows for the NodePort, add loadBalancerSourceRanges to the service specification and include the ranges:

spec:
  loadBalancerSourceRanges:
  - "143.231.0.0/16"
  - "xx.yy.zz.zz/24"

Note: If .spec.loadBalancerSourceRanges isn't set, then Kubernetes allows traffic from 0.0.0.0/0 to the node security groups. If the nodes have public IP addresses, then non-Network Load Balancer traffic can also reach every instance in the modified security groups.
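
To confirm what the node security group currently allows, you can review its ingress rules with the AWS CLI. The security group ID sg-0123456789abcdef0 is a placeholder:

aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query "SecurityGroups[0].IpPermissions"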

