Amazon EKS requests failing in staging environment

Hello,

I am currently running an EKS cluster on version 1.28 with a node group of 3 servers behind a load balancer. There are prod and staging namespaces. Production is working as intended, but when looking at the ingress-nginx logs for staging, all the requests return 504 timeouts. Another detail: in the target group, the instance on which the production pods are running is Healthy, while the other instances are Unhealthy for health checks on port 32272. Looking at the network interfaces of the nodes, the working one has 4 ENIs and the other ones have only 2 ENIs. What could be the underlying cause of this?

1 Answer

Based on your description of the issues you're experiencing with your Amazon EKS staging environment, here are several potential causes and steps you can take to diagnose and resolve the problem:

Health Check Configuration: The 504 timeouts and the Unhealthy status in the target group suggest there might be an issue with how your health checks are configured. Ensure that the health check settings in your load balancer match the requirements of the staging environment. Check if the health check port (32272 in your case) is correctly configured and open on all instances, and make sure the health check path returns a successful response.
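
If you want to verify this from the cluster side, the commands below are a rough sketch. They assume the ingress-nginx controller runs in the ingress-nginx namespace with a Service named ingress-nginx-controller (adjust the names to your setup). One detail worth checking: if the Service's externalTrafficPolicy is Local, only nodes that host an ingress-nginx pod respond on the NodePort health check, which can make every other node show as Unhealthy.

  # Confirm which NodePort the controller Service actually exposes
  kubectl get svc -n ingress-nginx ingress-nginx-controller -o wide

  # Check whether externalTrafficPolicy is Cluster or Local
  kubectl get svc -n ingress-nginx ingress-nginx-controller \
    -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}'

  # See on which nodes the controller pods are actually scheduled
  kubectl get pods -n ingress-nginx -o wide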

Network Interface and Security Group Configuration: The discrepancy in the number of Elastic Network Interfaces (ENIs) between the node running the production pods (4 ENIs) and the nodes for staging (2 ENIs) might indicate a networking limit or a misconfiguration. With the Amazon VPC CNI, each ENI supports a certain number of IP addresses and therefore a certain number of pods. Inspect the following (example commands after this list):

  • Security Group Rules: Ensure that the security groups attached to the ENIs allow traffic on the necessary ports, including the health check port and any other service ports used by your staging pods.
  • Subnet and VPC Settings: Verify that the subnets associated with the node groups have sufficient IP addresses available and are correctly configured in the VPC.
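
For example, the following AWS CLI calls can help compare the healthy node with an unhealthy one; the instance, security group, and subnet IDs are placeholders to substitute with your own:

  # ENIs, subnets, and security groups attached to a node
  aws ec2 describe-instances --instance-ids <instance-id> \
    --query 'Reservations[].Instances[].NetworkInterfaces[].{Eni:NetworkInterfaceId,Subnet:SubnetId,Groups:Groups[].GroupId}'

  # Inbound rules on the node security group (look for the NodePort range, including 32272)
  aws ec2 describe-security-groups --group-ids <security-group-id> \
    --query 'SecurityGroups[].IpPermissions'

  # Free IP addresses left in the node group's subnets
  aws ec2 describe-subnets --subnet-ids <subnet-id> \
    --query 'Subnets[].{Subnet:SubnetId,FreeIps:AvailableIpAddressCount}'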

Check Pod and Node Logs: Since the production environment is running fine, compare the configurations of the staging and production environments. Look at the logs for the staging pods and nodes, particularly focusing on networking and connectivity issues. Use commands like kubectl describe pods and kubectl logs for detailed information.
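
For instance (pod and node names below are placeholders), a side-by-side comparison could look like this:

  # Events, scheduling, and container status for a staging pod vs a production pod
  kubectl describe pod <staging-pod> -n staging
  kubectl describe pod <prod-pod> -n prod

  # Application logs, plus the ingress-nginx controller logs that show the 504s
  kubectl logs <staging-pod> -n staging --tail=100
  kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --tail=100

  # Node conditions and allocatable resources on a node the target group marks Unhealthy
  kubectl describe node <unhealthy-node-name>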

Resource Limits and Requests: Check if there are any resource limits or requests that are being hit in the staging environment. Sometimes, pods might not have enough CPU or memory allocated to them, which can lead to pods not starting properly and failing health checks.
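
A quick way to sanity-check this (kubectl top requires metrics-server to be installed, which is an assumption here):

  # Requests and limits configured on the staging pods
  kubectl get pods -n staging \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].resources}{"\n"}{end}'

  # Pods stuck in Pending and any scheduling failures
  kubectl get pods -n staging --field-selector=status.phase=Pending
  kubectl get events -n staging --field-selector=reason=FailedScheduling

  # Actual usage versus capacity (needs metrics-server)
  kubectl top nodes
  kubectl top pods -n staging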

Load Balancer Configuration: Double-check your load balancer's configuration, particularly around target group settings for port and protocol. Also, make sure that the load balancer's settings are consistent between production and staging if they are supposed to be mirrored.
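
To put the two environments side by side on the AWS side, something like the following works (target group names and ARNs are placeholders):

  # Port, protocol, and health check settings for both target groups
  aws elbv2 describe-target-groups --names <prod-tg> <staging-tg> \
    --query 'TargetGroups[].{Name:TargetGroupName,Port:Port,Protocol:Protocol,HCPort:HealthCheckPort,HCPath:HealthCheckPath}'

  # Which instances are registered in the staging target group and why they fail
  aws elbv2 describe-target-health --target-group-arn <staging-tg-arn> \
    --query 'TargetHealthDescriptions[].{Id:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason}'

  # Which listener forwards to which target group
  aws elbv2 describe-listeners --load-balancer-arn <load-balancer-arn>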

Update and Rollback Scenarios: If recent changes were made to the staging environment (like an update or configuration change), consider rolling back those changes to see if the issue persists. Sometimes, updates can introduce unexpected behaviors that are not immediately evident.
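
If the staging workloads are managed as Deployments (an assumption; the deployment name is a placeholder), Kubernetes keeps a rollout history you can inspect and revert:

  # Review recent revisions and roll back a suspect one
  kubectl rollout history deployment/<staging-deployment> -n staging
  kubectl rollout undo deployment/<staging-deployment> -n staging

  # The same applies to ingress-nginx itself if it was recently upgraded
  kubectl rollout history deployment/ingress-nginx-controller -n ingress-nginx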

AWS Support: If you continue to have issues and cannot isolate the cause, consider reaching out to AWS Support for more in-depth analysis, especially if there might be underlying issues with the AWS services themselves (like EKS or EC2).

By systematically checking these areas, you should be able to pinpoint the root cause of the 504 timeouts and the unhealthy status in your staging environment.
