We are experiencing intermittent connectivity problems where approximately 20% of the calls made from pods within our EKS cluster to an internal AWS Network Load Balancer fail. Notably, calls originating from the nodes themselves or routed through VPN connections succeed 100% of the time, which points to the issue being specific to pod-originated traffic.
After a thorough investigation, we have determined that no network policies or security group settings are contributing to these failures; both have been verified and can be ruled out as potential causes.
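For reference, this is roughly how we reproduce the failure rate from inside a pod; the endpoint URL below is a placeholder for our internal load balancer, and the exact numbers vary per run:

```python
# Minimal sketch of the reproduction loop we run from inside a pod.
# The NLB hostname, port, and path are placeholders, not our real values.
import requests

NLB_URL = "https://internal-nlb.example.internal:8080/healthz"  # placeholder

total, failures = 200, 0
for _ in range(total):
    try:
        requests.get(NLB_URL, timeout=3)
    except requests.exceptions.RequestException:
        failures += 1

print(f"{failures}/{total} requests failed ({failures / total:.0%})")
```

The same loop run from a node or over the VPN completes without a single failure.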
Detailed Description:
Environment: Amazon EKS
Components: EKS pods, Internal AWS Network Load Balancer, Nginx Ingress Controller
Symptoms: Calls from the pods to the load balancer fail intermittently without generating any identifiable errors within the Nginx ingress controller logs.
Observations:
The Nginx ingress controller does not log the failed attempts, suggesting the drop occurs prior to reaching the ingress.
Load balancer logs show that these requests are received but no packets are subsequently forwarded, indicating a potential routing or IP resolution issue. Our cluster has four nodes with the ingress controller deployed as a DaemonSet; the failures are not tied to any single node that the load balancer routes traffic to, but occur across all nodes at random.
Assumption:
The issue may stem from pods being assigned IPs that fail to meet the load balancer’s routing rules.
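To test this assumption, a check along these lines could compare the ingress pods' IPs with the subnets the load balancer is attached to; the subnet CIDRs, namespace, and label selector below are placeholders for our environment, not verified values:

```python
# Hypothetical check: do the ingress pod IPs fall inside the subnets
# that the internal NLB is attached to? CIDRs and selectors are placeholders.
import ipaddress
from kubernetes import client, config

NLB_SUBNET_CIDRS = ["10.0.1.0/24", "10.0.2.0/24"]  # assumed NLB subnets
subnets = [ipaddress.ip_network(cidr) for cidr in NLB_SUBNET_CIDRS]

config.load_kube_config()  # use config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()
pods = v1.list_namespaced_pod(
    namespace="ingress-nginx",                               # assumed namespace
    label_selector="app.kubernetes.io/name=ingress-nginx",   # assumed label
)

for pod in pods.items:
    ip = ipaddress.ip_address(pod.status.pod_ip)
    inside = any(ip in net for net in subnets)
    print(f"{pod.metadata.name} {ip} inside NLB subnets: {inside}")
```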
Additional Insight from Load Balancer Nodes:
Our cluster consists of four nodes with the ingress controller deployed as a DaemonSet. Importantly, there is no evidence that the intermittent failures are isolated to any single node; rather, the traffic captured in the load balancer logs suggests that the issue occurs randomly across all nodes. This random pattern complicates pinpointing a specific cause and suggests a more systemic issue within the network configuration or load-balancing logic.
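In case it helps, this is roughly how we inspect the target group from the AWS side to confirm that all four nodes are healthy and to review the target group attributes; the target group ARN is a placeholder, and we are not asserting that any particular attribute is at fault:

```python
# Rough diagnostic sketch: print target group attributes and per-target
# health so that a single unhealthy node would stand out. ARN is a placeholder.
import boto3

TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/..."  # placeholder

elbv2 = boto3.client("elbv2")

attrs = elbv2.describe_target_group_attributes(TargetGroupArn=TARGET_GROUP_ARN)
for attr in attrs["Attributes"]:
    print(attr["Key"], "=", attr["Value"])

health = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
for desc in health["TargetHealthDescriptions"]:
    target = desc["Target"]
    print(target["Id"], target.get("Port"), desc["TargetHealth"]["State"])
```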
We are aware that addressing the load balancer directly is unconventional within EKS, where using service names (e.g., https://service:8080) for internal resolution is preferred.
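For comparison, the conventional in-cluster call we would otherwise make looks roughly like this; the service and namespace names are placeholders:

```python
# Conventional in-cluster request via the service's cluster DNS name.
# "my-service" and "my-namespace" are placeholders, not our real names.
import requests

resp = requests.get("https://my-service.my-namespace.svc.cluster.local:8080", timeout=3)
print(resp.status_code)
```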