
Intermittent UnknownHostException in a backend deployed in an EKS cluster


I have an EKS cluster that uses the CoreDNS add-on with the default resolver settings (which, to my knowledge, means it uses my region's VPC resolver IP). I only have the default VPC in the region.

My cluster uses a Classic Load Balancer. What happens is that I get an intermittent UnknownHostException for almost any domain (including my own domains hosted in Route 53 and external services that we use) in my backend app. At first I thought it might be an IPv6 issue, because my cluster does not have IPv6 enabled, but that turned out to be fine: the default VPC does not have IPv6 either, and it has both DNS hostnames and DNS resolution enabled.

Then I thought that maybe my cluster was somehow sending so many DNS queries that I was hitting the Route 53 limit. So I configured packet mirroring to see how many UDP packets were being sent every second. It was only 5-7 packets per second.

Also, when I run nslookup from outside my EKS cluster, domain name resolution works just fine.

I monitored the performance of the CoreDNS pods and there was nothing unusual.

It is also important to note that I am running the same backend in a Kubernetes cluster hosted with another cloud provider, and there it works just fine. So I believe the backend itself is fine.

It would be really helpful to know how I can debug this issue further and where the possible source of this error might be, because I am running out of ideas.

1 Answer

The issue you're facing with the intermittent "unknown host" exceptions in your EKS cluster could be due to a few reasons. Let's go through some troubleshooting steps to help you identify the root cause:

  1. Verify DNS Configuration:

    • Ensure that the VPC resolver IP is correctly configured in the CoreDNS add-on. You can check this by inspecting the ConfigMap for the CoreDNS add-on in your EKS cluster.
    • Verify that the DNS hostname and DNS resolution are enabled for your default VPC.
    • Check if there are any firewall rules or network ACLs in your VPC that might be blocking or rate-limiting DNS traffic.
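These checks can be sketched with kubectl and the AWS CLI; the VPC ID below is a placeholder, and the Corefile contents shown in the comment reflect the typical EKS default:

```shell
# Inspect the CoreDNS Corefile; the default EKS config uses
# "forward . /etc/resolv.conf", which points at the VPC resolver.
kubectl -n kube-system get configmap coredns -o yaml

# Confirm the DNS attributes on the VPC (replace vpc-0abc123 with your VPC ID).
aws ec2 describe-vpc-attribute --vpc-id vpc-0abc123 --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id vpc-0abc123 --attribute enableDnsHostnames

# Review network ACLs in the VPC for rules that could drop UDP/TCP port 53.
aws ec2 describe-network-acls --filters Name=vpc-id,Values=vpc-0abc123
```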
  2. Investigate DNS Performance:

    • Monitor the performance of the CoreDNS pods, not just the overall usage, but also look for any spikes or inconsistencies in the query response times.
    • Check the CoreDNS logs for any errors or unusual behavior.
    • Use tools like dig or nslookup from within the pod to test the DNS resolution for your domains and external services.
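For example, assuming kubectl access, tailing the CoreDNS logs and running a repeated in-cluster resolution loop might look like the sketch below (the debug image is the one suggested in the Kubernetes DNS-debugging docs; any image containing dig works):

```shell
# Tail CoreDNS logs for SERVFAIL or timeout entries. Enable the "log" plugin
# in the Corefile first if you need per-query logging.
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=200

# Launch a throwaway debug pod with DNS tools installed.
kubectl run dnsutils --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 -- sleep 3600

# Query in a loop to catch the intermittent failures.
kubectl exec -it dnsutils -- sh -c \
  'for i in $(seq 1 50); do dig +short example.com || echo "FAIL $i"; sleep 1; done'
```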
  3. Analyze Network Connectivity:

    • Ensure that the Classic Load Balancer is correctly configured and able to resolve the DNS names of the backend services.
    • Check the network ACLs and security groups associated with the Classic Load Balancer and the worker nodes to ensure that they are not blocking any necessary traffic.
    • Verify that the worker nodes can access the Route53 service and the external services you are using.
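A rough sketch of these connectivity checks, run from a worker node (the resolver IP and security-group ID are placeholders; the VPC resolver sits at the VPC base CIDR plus two):

```shell
# Query the VPC resolver directly with a short timeout to surface drops
# (replace 10.0.0.2 with your VPC's resolver IP).
dig @10.0.0.2 example.com +time=2 +tries=1

# Verify the node security group allows DNS egress
# (replace sg-0abc123 with your node security group ID).
aws ec2 describe-security-groups --group-ids sg-0abc123 \
  --query 'SecurityGroups[].IpPermissionsEgress'
```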
  4. Explore Alternative DNS Providers:

    • Consider using a different DNS provider, such as Google DNS (8.8.8.8, 8.8.4.4) or Cloudflare DNS (1.1.1.1, 1.0.0.1), and see if the issue persists.
    • This can help you identify whether the problem is specific to the VPC resolver or a broader DNS-related issue.
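One way to try this is to edit the CoreDNS ConfigMap and replace the default forward directive with explicit upstreams. This is a sketch of the change, not the exact EKS Corefile:

```shell
# In the Corefile, change:
#     forward . /etc/resolv.conf
# to something like:
#     forward . 8.8.8.8 8.8.4.4
kubectl -n kube-system edit configmap coredns

# Restart CoreDNS so the new upstreams take effect immediately
# (the default Corefile's "reload" plugin would also pick it up eventually).
kubectl -n kube-system rollout restart deployment coredns
```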
  5. Compare Cluster Configurations:

    • Carefully compare the configuration and settings of the EKS cluster that is working correctly with the one experiencing the issues.
    • Look for any differences in the VPC setup, network ACLs, security groups, or other relevant configurations that could be causing the discrepancy.
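Assuming both clusters are reachable as kubeconfig contexts (the context names below are placeholders), a simple way to diff the DNS-related objects:

```shell
# Dump comparable state from each cluster, then diff the dumps.
for ctx in working-cluster broken-cluster; do
  kubectl --context "$ctx" -n kube-system get configmap coredns -o yaml \
    > "coredns-$ctx.yaml"
  kubectl --context "$ctx" -n kube-system get deployment coredns -o yaml \
    > "deploy-$ctx.yaml"
done
diff coredns-working-cluster.yaml coredns-broken-cluster.yaml
```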
  6. Consider Cluster Debugging:

    • Use the kubectl command-line tool to inspect the state of your EKS cluster, pods, and services.
    • You can also use tools like kubectl describe and kubectl logs to gather more information about the behavior of the CoreDNS pods and the DNS resolution process.
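For example, confirming that the kube-dns Service actually has healthy CoreDNS endpoints behind it:

```shell
# The cluster DNS Service and its endpoints; an empty endpoints list
# means pods cannot reach CoreDNS at all.
kubectl -n kube-system get svc kube-dns
kubectl -n kube-system get endpoints kube-dns

# Look for restarts, failed probes, or events on the CoreDNS pods.
kubectl -n kube-system describe deployment coredns
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
```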
  7. Check for Resource Limits:

    • Ensure that there are no resource limits or quotas in place that could be causing the DNS resolution issues.
    • This could include limits on the number of DNS queries, network bandwidth, or other related resources.
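Two limits worth checking in particular, sketched below: the VPC resolver's hard cap of 1024 packets per second per network interface (drops show up in the ENA driver's allowance counters on the node), and any CPU limits on the CoreDNS deployment:

```shell
# On a worker node: non-zero linklocal_allowance_exceeded means packets to
# link-local services (including the VPC DNS resolver) were dropped for
# exceeding the per-ENI limit.
ethtool -S eth0 | grep -i allowance

# Check whether the CoreDNS containers have CPU limits that could throttle them.
kubectl -n kube-system get deployment coredns \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
```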

By following these troubleshooting steps, you should be able to gather more information and identify the root cause of the intermittent "unknown host" exceptions in your EKS cluster. If you're still unable to resolve the issue, you may want to consider reaching out to the AWS support team for further assistance.

AWS
answered a year ago
  • Thank you for your suggestions. I enabled query logging for my VPC resolver. What I observed is that, of the two nodes in my cluster, one node always points to the old IP from before I changed my DNS service provider to Route 53. My services are not reachable at that IP. Any hints on why this might be happening?
