How do I troubleshoot DNS failures with Amazon EKS?

The applications or Pods that use CoreDNS in my Amazon Elastic Kubernetes Service (Amazon EKS) cluster fail internal or external DNS name resolution.

Short description

Pods that run inside the Amazon EKS cluster use the CoreDNS cluster IP address as the name server to query internal and external DNS records. If there are issues with the CoreDNS Pods, service configuration, or connectivity, then applications might fail DNS resolution.

The kube-dns service object abstracts the CoreDNS Pods. To troubleshoot issues with your CoreDNS Pods, verify the working status of all the kube-dns service components, such as service endpoint options and iptables rules.

Resolution

Note: In the following resolution, the CoreDNS ClusterIP value is 10.100.0.10.

To check your DNS configuration, complete the following steps:

  1. To get the ClusterIP of your CoreDNS service, run the following command:

    kubectl get service kube-dns -n kube-system
  2. To verify that the DNS endpoints are exposed and point to the CoreDNS Pods, run the following command:

    kubectl -n kube-system get endpoints kube-dns

    Example output:

    NAME       ENDPOINTS                                                        AGE
    kube-dns   192.168.2.218:53,192.168.3.117:53,192.168.2.218:53 + 1 more...   90d

    Note: If the endpoint list is empty, then check the Pod status of the CoreDNS Pods.

  3. Confirm that your security groups and network access control list (network ACL) don't block the Pods when they communicate with CoreDNS.

For more information, see Why won't my pods connect to other pods in Amazon EKS?
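The endpoint check above can be scripted. The following is a minimal sketch, not an official tool: the function names kube_dns_endpoint_ips and check_kube_dns_endpoints are hypothetical, and the sketch assumes that kubectl is configured for your cluster.

```shell
# Print the ready CoreDNS Pod IPs behind the kube-dns service, one per line.
kube_dns_endpoint_ips() {
  kubectl -n kube-system get endpoints kube-dns \
    -o jsonpath='{.subsets[*].addresses[*].ip}' | tr ' ' '\n' | sed '/^$/d'
}

# Print the endpoint IPs, or return non-zero when the list is empty.
check_kube_dns_endpoints() {
  ips=$(kube_dns_endpoint_ips)
  if [ -z "$ips" ]; then
    echo "kube-dns has no ready endpoints; check the CoreDNS Pod status" >&2
    return 1
  fi
  echo "$ips"
}

# Usage: check_kube_dns_endpoints
```

An empty result corresponds to the empty endpoint list described in the preceding note.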

Verify that the kube-proxy Pod works

To check whether the kube-proxy Pod has access to the API servers for your cluster, check the kube-proxy logs for timeout errors to the control plane. Also, check for 403 (Forbidden) errors, which indicate that kube-proxy isn't authorized to call the API server.

To get the kube-proxy logs, run the following command:

kubectl logs -n kube-system --selector 'k8s-app=kube-proxy'

Note: The kube-proxy gets the endpoints from the control plane and creates the iptables rules on each node.
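To narrow the kube-proxy logs down to likely connectivity or authorization problems, you can filter them. The following sketch is illustrative only: the function name scan_kube_proxy_logs and the search patterns are assumptions, so adjust the patterns to the messages that you see in your own logs.

```shell
# Read kube-proxy log lines on stdin and print only the lines that mention
# timeouts or authorization failures. Returns non-zero when nothing matches.
scan_kube_proxy_logs() {
  grep -iE 'timeout|timed out|forbidden|unauthorized'
}

# Usage (with a selector, kubectl logs prints only the last 10 lines of each
# Pod by default, so --tail=-1 requests the full log):
#   kubectl logs -n kube-system --selector 'k8s-app=kube-proxy' --tail=-1 \
#     | scan_kube_proxy_logs
```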

Check the CoreDNS Pod CPU usage at the time of the issue

The Amazon EKS CoreDNS add-on sets only a 170 MiB memory limit on the CoreDNS Pod. The CoreDNS Pod doesn't define a CPU limit, so the container can use all the available CPU resources on the node where it runs. If the node's CPU utilization is at 100%, then you might get DNS timeout errors in your Amazon EKS application logs. This is because the CoreDNS Pod doesn't have enough CPU resources to handle all DNS queries.

To check the current CPU and memory usage of the CoreDNS Pods, run the following command:

kubectl top pods -n kube-system -l k8s-app=kube-dns

To check the current CPU and memory usage of the Amazon EKS cluster nodes, run the following command:

kubectl top nodes
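To spot CoreDNS Pods that approach the 170 MiB memory limit, you can filter the kubectl top output. The following is a rough sketch: the function name coredns_memory_pressure and the 80% threshold are assumptions, and the sketch handles only memory values that kubectl reports in Mi.

```shell
# Read `kubectl top pods` output on stdin and print each Pod whose memory
# usage is at or above PCT percent of LIMIT_MIB.
# Arguments: LIMIT_MIB (default 170), PCT (default 80, an arbitrary threshold).
coredns_memory_pressure() {
  limit_mib=${1:-170}
  pct=${2:-80}
  awk -v limit="$limit_mib" -v pct="$pct" '
    NR > 1 && $3 ~ /Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)
      if (mem * 100 >= limit * pct) print $1, $3
    }'
}

# Usage:
#   kubectl top pods -n kube-system -l k8s-app=kube-dns | coredns_memory_pressure
```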

Connect to the application Pod to troubleshoot the DNS issue

Complete the following steps:

  1. To run commands inside your application Pods, run the following command:

    kubectl exec -it your-pod-name -- sh

    Note: Replace your-pod-name with your Pod name.
    The preceding command allows you to access a shell inside the running Pod. If the application pod doesn't have an available shell binary, then you receive an error similar to the following example:
    "OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "exec: \"sh\": executable file not found in $PATH": unknown command terminated with exit code 126"
    To resolve this issue, update the image that you use in your pod-manifest.yaml manifest file with another image. An example image is busybox on the Docker website.

  2. To verify that the kube-dns service's cluster IP address is in your Pod's /etc/resolv.conf file, run the following command in the Pod shell:

    cat /etc/resolv.conf

    The following example resolv.conf file shows a pod that's configured to point to 10.100.0.10 for DNS requests. The IP address must match the ClusterIP value of your kube-dns service:

    nameserver 10.100.0.10
    search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
    options ndots:5

    Note: You can manage your Pod's DNS configuration with the dnsPolicy field in the Pod specification. If you don't populate this field, then Amazon EKS uses the ClusterFirst DNS policy by default. For more information about the ClusterFirst DNS policy, see Pod's DNS policy on the Kubernetes website.

  3. To verify that your Pod can use the default ClusterIP value to resolve an internal domain, run the following command in the Pod shell:

    nslookup kubernetes.default 10.100.0.10

    Example output:

    Server:     10.100.0.10
    Address:    10.100.0.10#53
    Name:       kubernetes.default.svc.cluster.local
    Address:    10.100.0.1
  4. To verify that your Pod can use the default ClusterIP value to resolve an external domain, run the following command in the Pod shell:

    nslookup amazon.com 10.100.0.10

    Example output:

    Server:     10.100.0.10
    Address:    10.100.0.10#53
    Non-authoritative answer:
    Name:   amazon.com
    Address: 176.32.98.166
    Name:    amazon.com
    Address: 205.251.242.103
    Name:    amazon.com
    Address: 176.32.103.205
  5. To get the kube-dns endpoints, run the following command:

    kubectl get endpoints kube-dns -n kube-system
  6. To verify that your Pod can use the CoreDNS Pod IP address to resolve directly, run the following command in the Pod shell:

    nslookup kubernetes COREDNS_POD_IP
    nslookup amazon.com COREDNS_POD_IP

    Note: Replace COREDNS_POD_IP with one of the kube-dns endpoint IP addresses from the previous step. Repeat the commands for each endpoint IP address.
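The resolv.conf comparison in step 2 can also be checked with a small helper. This is a minimal sketch that assumes a POSIX shell with awk in the Pod; the function name resolv_has_nameserver is hypothetical.

```shell
# Succeed when the resolv.conf file lists the expected nameserver.
# Arguments: EXPECTED_IP, FILE (default /etc/resolv.conf).
resolv_has_nameserver() {
  expected_ip=$1
  file=${2:-/etc/resolv.conf}
  awk -v ip="$expected_ip" '
    $1 == "nameserver" && $2 == ip { found = 1 }
    END { exit !found }' "$file"
}

# Usage inside the Pod shell, with the ClusterIP value from this article:
#   resolv_has_nameserver 10.100.0.10 && echo "nameserver matches kube-dns"
```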

Get more detailed logs from CoreDNS Pods to debug further issues

Complete the following steps:

  1. To turn on query logging for the CoreDNS Pods, open the CoreDNS ConfigMap for editing so that you can add the log plugin. Run the following command:
    kubectl -n kube-system edit configmap coredns
    Note: For more information, see log on the CoreDNS website.
  2. In the editor that opens, add the log line to the Corefile, as shown in the following example:
    kind: ConfigMap
    apiVersion: v1
    data:
      Corefile: |
        .:53 {
            log    # Activating CoreDNS Logging
            errors
            health
            kubernetes cluster.local in-addr.arpa ip6.arpa {
              pods insecure
              upstream
              fallthrough in-addr.arpa ip6.arpa
            }
            ...
    ...
    Note: It takes several minutes to reload the CoreDNS configuration. To immediately apply the changes, restart the Pods one by one.
  3. To check whether CoreDNS logs errors or receives traffic from the application Pod, run the following command:
    kubectl logs --follow -n kube-system --selector 'k8s-app=kube-dns'
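To confirm that the log plugin is present in the Corefile after you save the ConfigMap, you can use a check similar to the following sketch. The function name corefile_has_log_plugin is hypothetical. The restart command in the comments assumes that the CoreDNS deployment is named coredns, so verify the name in your cluster first.

```shell
# Read a Corefile on stdin and succeed when a line starts with the log plugin.
corefile_has_log_plugin() {
  grep -Eq '^[[:space:]]*log([[:space:]]|$)'
}

# Usage:
#   kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}' \
#     | corefile_has_log_plugin && echo "log plugin is configured"
#
# To replace the Pods gradually instead of waiting for the reload
# (assumes the deployment name coredns):
#   kubectl -n kube-system rollout restart deployment coredns
```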

Update the ndots value

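The ndots value is the minimum number of dots that a domain name must contain for the resolver to query it as an absolute name first. If a name contains fewer dots than the ndots value and isn't fully qualified, then the resolver appends each search domain to the name and queries those combinations before it queries the name itself. Kubernetes sets ndots to 5 by default for Pods that use the ClusterFirst DNS policy. In this scenario, most external domain names, such as amazon.com, go through all the search domains before the resolver queries the absolute name.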

The following example has the /etc/resolv.conf file setting of the application Pod:

nameserver 10.100.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ec2.internal
options ndots:5

In the preceding example configuration, the Pod's resolver checks whether the queried name contains at least five dots. If the Pod makes a DNS resolution call for amazon.com, then your CoreDNS logs look similar to the following example:

[INFO] 192.168.3.71:33238 - 36534 "A IN amazon.com.default.svc.cluster.local. udp 54 false 512" NXDOMAIN qr,aa,rd 147 0.000473434s
[INFO] 192.168.3.71:57098 - 43241 "A IN amazon.com.svc.cluster.local. udp 46 false 512" NXDOMAIN qr,aa,rd 139 0.000066171s
[INFO] 192.168.3.71:51937 - 15588 "A IN amazon.com.cluster.local. udp 42 false 512" NXDOMAIN qr,aa,rd 135 0.000137489s
[INFO] 192.168.3.71:52618 - 14916 "A IN amazon.com.ec2.internal. udp 41 false 512" NXDOMAIN qr,rd,ra 41 0.001248388s
[INFO] 192.168.3.71:51298 - 65181 "A IN amazon.com. udp 28 false 512" NOERROR qr,rd,ra 106 0.001711104s

Note: NXDOMAIN means that the Pod didn't find the domain record. NOERROR means that the Pod successfully found the domain record.
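To see how many of the logged queries fail on a search domain before the final lookup succeeds, you can tally the response codes. The following sketch assumes the log line format shown in the preceding example, where the response code follows the quoted query. The function name count_rcodes is hypothetical.

```shell
# Read CoreDNS log lines on stdin and print each response code with its count.
# The response code is the first word after the closing quote of the query.
count_rcodes() {
  awk -F'"' '
    NF >= 3 { split($3, f, " "); count[f[1]]++ }
    END { for (r in count) print r, count[r] }' | sort
}

# Usage:
#   kubectl logs -n kube-system --selector 'k8s-app=kube-dns' --tail=-1 | count_rcodes
```

A high ratio of NXDOMAIN to NOERROR responses for external names suggests that the search domains generate most of the query load.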

The resolver appends each search domain to amazon.com and queries the result before it makes the final call on the absolute domain name. A domain name that ends with a dot (.) is a fully qualified domain name, and the resolver queries it as is. For each external domain name query, there might be four or five additional calls, and this extra load can overwhelm the CoreDNS Pod.

To resolve this issue, change ndots to 1 to look for only one dot. Or, append a dot at the end of the domain that you query or use. Example:

nslookup example.com.
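The effect of ndots on the query order can be illustrated with a short function. The following is a simplified model of glibc-style resolver behavior, not the resolver's actual implementation, and the function name resolver_candidates is hypothetical. It lists the names that the resolver would try, in order. A real resolver stops at the first name that resolves.

```shell
# Print the names that a glibc-style resolver would try, in order.
# Arguments: NAME NDOTS SEARCH_DOMAIN...
resolver_candidates() {
  name=$1
  ndots=$2
  shift 2
  case $name in
    # A trailing dot marks a fully qualified name: no search list is used.
    *.) echo "$name"; return ;;
  esac
  # Count the dots in the name.
  dots=$(printf '%s' "$name" | tr -cd '.' | wc -c | tr -d ' ')
  # Enough dots: try the absolute name before the search list.
  if [ "$dots" -ge "$ndots" ]; then echo "$name."; fi
  for domain in "$@"; do echo "$name.$domain."; done
  # Too few dots: try the absolute name only after the search list.
  if [ "$dots" -lt "$ndots" ]; then echo "$name."; fi
}

# Usage:
#   resolver_candidates amazon.com 5 default.svc.cluster.local \
#     svc.cluster.local cluster.local ec2.internal
```

With ndots set to 5, the absolute name amazon.com. is the last candidate. With ndots set to 1, it's the first candidate, so the lookup succeeds without any search domain queries.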

Check the AmazonProvidedDNS VPC resolver quotas

The Amazon Virtual Private Cloud (Amazon VPC) resolver accepts a maximum of 1024 packets per second for each elastic network interface. If more than one CoreDNS Pod runs on the same node, then you might reach this quota for external domain queries.

To use PodAntiAffinity rules to schedule CoreDNS Pods on separate instances, add the following options under spec.template.spec.affinity in the CoreDNS deployment:

podAntiAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - podAffinityTerm:
      labelSelector:
        matchExpressions:
        - key: k8s-app
          operator: In
          values:
          - kube-dns
      topologyKey: kubernetes.io/hostname
    weight: 100

Note: For more information about PodAntiAffinity, see Inter-pod affinity and anti-affinity on the Kubernetes website.
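To check whether CoreDNS Pods currently share a node, you can list the node name of each Pod and look for duplicates. The following is a minimal sketch, and the function name shared_coredns_nodes is hypothetical.

```shell
# Read one node name per CoreDNS Pod on stdin and print each node
# that hosts more than one CoreDNS Pod.
shared_coredns_nodes() {
  sort | uniq -d
}

# Usage:
#   kubectl -n kube-system get pods -l k8s-app=kube-dns \
#     -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' \
#     | shared_coredns_nodes
```

Empty output means that each CoreDNS Pod runs on a different node.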

Use tcpdump to capture CoreDNS packets from Amazon EKS worker nodes

To diagnose DNS resolution issues, complete the following steps to use the tcpdump tool to perform a packet capture:

  1. To locate a worker node where a CoreDNS pod is running, run the following command:

    kubectl get pod -n kube-system -l k8s-app=kube-dns -o wide
  2. Use SSH to connect to the worker node. Then, to install the tcpdump tool, run the following command:

    sudo yum install tcpdump -y
  3. To locate the CoreDNS Pod process ID on the worker node, run the following command:

    ps ax | grep coredns
  4. From the worker node, run the following command to perform a packet capture on CoreDNS Pod network traffic on UDP port 53:

    sudo nsenter -n -t PID tcpdump udp port 53

    Note: Replace PID with the CoreDNS process ID from the previous step.
  5. From a separate terminal, run the following command to get the CoreDNS service and Pod IP address:

    kubectl describe svc kube-dns -n kube-system

    Note: Record the service IP address from the IP field and the Pod IP addresses from the Endpoints field.

  6. Launch a pod to test the DNS service. The following example uses an Ubuntu container image:

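    kubectl run ubuntu --image=ubuntu -- sleep 1d
    kubectl exec -it ubuntu -- sh

    Note: If the nslookup tool isn't available in the container, then install the dnsutils package before you continue.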
  7. Run the following command to use the nslookup tool to perform a DNS query to the amazon.com domain:

    nslookup amazon.com

    To explicitly perform the same query against the CoreDNS service IP address, run the following command:

    nslookup amazon.com COREDNS_SERVICE_IP

    Note: Replace COREDNS_SERVICE_IP with your CoreDNS service IP address.
    To perform the query against each CoreDNS Pod IP address, run the following command:

    nslookup amazon.com COREDNS_POD_IP

    Note: Replace COREDNS_POD_IP with your CoreDNS Pod IP address. If you run multiple CoreDNS Pods, then perform multiple queries. This way, Amazon EKS sends at least one query to the Pod that you capture traffic from.

  8. Review the packet capture results.
    If the CoreDNS Pod experiences DNS query timeouts, and you don't see the query in the packet capture, then check your network connectivity. Check the network reachability between worker nodes.
    If you see DNS query timeouts on a Pod IP address that you didn't capture, then perform another packet capture on the related worker node.
    To save the results of a packet capture, add the -w FILE_NAME flag to the tcpdump command. The following example writes the results to the capture.pcap file:

    tcpdump -w capture.pcap udp port 53
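When you run multiple CoreDNS Pods, the queries in step 7 can be repeated against every server in one loop so that each capture sees at least one query. The following sketch runs inside the test Pod. The function name query_each_server is hypothetical, and the sketch assumes that the nslookup in your image exits with a non-zero status when a lookup fails.

```shell
# Query DOMAIN against each DNS server and report which servers answer.
# Arguments: DOMAIN SERVER_IP...
# Returns non-zero when at least one server fails to answer.
query_each_server() {
  domain=$1
  shift
  failed=0
  for server in "$@"; do
    if nslookup "$domain" "$server" >/dev/null 2>&1; then
      echo "ok   $server"
    else
      echo "FAIL $server"
      failed=1
    fi
  done
  return "$failed"
}

# Usage (replace the placeholders with the IP addresses from step 5):
#   query_each_server amazon.com COREDNS_SERVICE_IP COREDNS_POD_IP
```

A server that reports FAIL while the packet capture on its node shows no incoming query points to a network path problem rather than a CoreDNS problem.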

Related information

CoreDNS GA for Kubernetes cluster DNS on the Kubernetes website

2 Comments

If everything seems fine but you still can't find a solution for the DNS failure, try deleting your CoreDNS Pods. They restart automatically, and that might solve the issue.

worked for me

AWS
replied 2 years ago

The "Update the ndots value" section contains this incorrect statement:

Note: NXDOMAIN means that the Pod found the domain record. NOERROR means that the Pod didn't find the domain record.

NXDOMAIN means that the pod did NOT find the domain record. NOERROR means that the pod did SUCCESSFULLY resolve the domain record.

replied 7 months ago