Intermittent Failures in Pod Communications with Internal AWS Network Load Balancer


We are experiencing intermittent connectivity problems where approximately 20% of calls made from pods within our EKS cluster to an internal AWS Network Load Balancer fail. Notably, calls originating from the nodes themselves, or routed through VPN connections, succeed 100% of the time, which points to the issue being specific to pod-originated traffic.

After thorough investigation, we have determined that no network policies or security group settings are contributing to these failures; both have been verified and can be ruled out as potential causes.

Detailed Description:

Environment: Amazon EKS

Components: EKS pods, internal AWS Network Load Balancer, Nginx Ingress Controller

Symptoms: Calls from pods to the load balancer fail intermittently, without generating any identifiable errors in the Nginx ingress controller logs.

Observations: The Nginx ingress controller does not log the failed attempts, suggesting the drop occurs before traffic reaches the ingress. Load balancer logs show receipt of these requests but no subsequent packet transmission, indicating a potential routing or IP resolution issue. Our cluster consists of four nodes, with the ingress controller deployed as a DaemonSet; the failures are not isolated to any single node but occur at random across all of them, based on traffic captured in the load balancer logs. This random pattern complicates pinpointing a specific cause and suggests a more systemic issue in the network configuration or load-balancing logic.

Assumption: The issue may stem from pods being assigned source IPs that fail to meet the load balancer's routing rules (see the sketch below for a quick way to check the relevant target group attribute).
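One quick check, given the assumption above, is whether client IP preservation is enabled on the NLB's target group, since that attribute governs which source IP the targets see for pod-originated traffic. A minimal sketch using boto3; the target group ARN is a placeholder you would replace with your own:

```python
import boto3

# Placeholder ARN: substitute the target group behind your internal NLB.
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:eu-west-1:123456789012:"
    "targetgroup/my-nlb-targets/0123456789abcdef"
)

elbv2 = boto3.client("elbv2")

# Read the target group attributes and report the client IP preservation flag.
attrs = elbv2.describe_target_group_attributes(TargetGroupArn=TARGET_GROUP_ARN)
for attr in attrs["Attributes"]:
    if attr["Key"] == "preserve_client_ip.enabled":
        print("preserve_client_ip.enabled =", attr["Value"])
```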

We are aware that addressing the load balancer directly is unconventional in EKS, where utilising service names (e.g., https://service:8080) for internal resolution is preferred.
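For reference, a minimal sketch of that in-cluster alternative; the service name, namespace, and path here are hypothetical placeholders:

```python
import urllib.request

# Hypothetical service name and namespace: cluster DNS resolves this to the
# Service's ClusterIP, so the call stays on the cluster network and never
# traverses the NLB at all.
url = "http://my-service.default.svc.cluster.local:8080/healthz"
with urllib.request.urlopen(url, timeout=5) as resp:
    print(resp.status, resp.read(200))
```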

asked 10 months ago · 458 views
2 Answers

I would highly recommend that you open a support case to help trace and resolve this issue.

answered 10 months ago
Accepted Answer

Just to leave a comment here in case it is useful to someone: after we set "Preserve client IP addresses" to false, all of our issues were resolved. There seems to be some load balancer magic when calling an internal endpoint from an EKS pod. Even though the caller IP addresses were in the right range, calls from pods were still either lost or denied internal access by the load balancer.
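This behaviour is consistent with the documented NLB limitation that, with client IP preservation enabled, connections can fail intermittently when traffic hairpins back through the load balancer to a target on the same instance the client runs on, which would explain pod calls failing at random across nodes. For anyone applying the same change programmatically rather than through the console, a minimal boto3 sketch follows; the target group ARN is a placeholder:

```python
import boto3

# Placeholder: substitute the ARN of the target group behind the internal NLB.
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:eu-west-1:123456789012:"
    "targetgroup/my-nlb-targets/0123456789abcdef"
)

elbv2 = boto3.client("elbv2")

# Disable client IP preservation so the NLB source-NATs incoming traffic;
# targets then see the load balancer's private IP instead of the pod IP.
elbv2.modify_target_group_attributes(
    TargetGroupArn=TARGET_GROUP_ARN,
    Attributes=[{"Key": "preserve_client_ip.enabled", "Value": "false"}],
)
```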

Thanks

answered 10 months ago
