WebSocket sessions disconnecting sporadically

0

I have a single-pod WebSocket application. My clients have noticed sporadic disconnections of their sessions, usually occurring 5-10 times per day. Upon investigating the Network Load Balancer (NLB) logs, we discovered that there were some heartbeat requests logged in the NLB, but no corresponding requests on our server for the given timestamp. This suggests that the NLB sent a reset (RST) to the client. Further investigation revealed that the NLB serving the target failed health checks on all 272 nodes, resulting in it operating in fail-open mode. After correcting the health check, we were able to fix the issue.

My question is: why did operating in fail-open mode cause sessions to disconnect sporadically, and why would fixing the health check resolve the disconnects? Has the target stopped being unhealthy? The NLB had already marked the target as unhealthy in order to operate in fail-open mode, so the target's health status should no longer matter. Shouldn't the NLB continue to route client requests to the target regardless of its health status in fail-open mode?

asked 10 months ago
527 views
2 Answers
0

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html

The Network Load Balancer (NLB) operates in fail-open mode when all registered targets in all enabled Availability Zones are unhealthy. Normally, when an Availability Zone has no healthy target, the NLB removes that zone's IP address from DNS so that requests are not routed there. In fail-open mode it instead continues to route requests to all registered targets, regardless of their health status.

For each TCP request that a client makes through an NLB, the state of the connection is tracked. If no data is sent through the connection by either the client or the target for longer than the idle timeout, the connection is closed. If a client or a target sends data after the idle timeout period elapses, it receives a TCP RST packet to indicate that the connection is no longer valid. Moreover, if a target becomes unhealthy, the load balancer sends a TCP RST for packets received on the client connections associated with the target, unless the unhealthy target triggers the load balancer to fail open.
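The idle-timeout rule above can be illustrated with a minimal Python sketch. This is a toy model, not the NLB's actual implementation: `TrackedFlow`, `on_packet`, and `heartbeat_keeps_flow_alive` are hypothetical names, and the 350-second value is an assumption based on the commonly documented default for NLB TCP flows, so check the value configured on your own listener.

```python
from dataclasses import dataclass

# Assumption: 350 seconds, the commonly documented default idle timeout
# for TCP flows through an NLB. Verify against your own configuration.
IDLE_TIMEOUT_SECONDS = 350.0


@dataclass
class TrackedFlow:
    """Hypothetical model of one tracked TCP connection."""

    last_activity: float  # time (in seconds) of the last packet in either direction

    def on_packet(self, now: float, idle_timeout: float = IDLE_TIMEOUT_SECONDS) -> str:
        """Return "FORWARD" if the flow is still tracked, "RST" if it idled out."""
        if now - self.last_activity > idle_timeout:
            # The flow was silent for longer than the idle timeout, so the
            # sender of this late packet gets a reset.
            return "RST"
        self.last_activity = now
        return "FORWARD"


def heartbeat_keeps_flow_alive(interval: float, idle_timeout: float = IDLE_TIMEOUT_SECONDS) -> bool:
    """A heartbeat only helps if it is sent more often than the idle timeout."""
    return interval < idle_timeout
```

In this model, a WebSocket heartbeat sent every 60 seconds keeps the flow tracked, while a gap longer than the timeout would produce a reset on the next packet.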

In the context of your situation, it's possible that the NLB operating in fail-open mode could be sporadically sending TCP RST packets to clients. This could occur when a target is intermittently failing and becoming unhealthy or when the idle timeout period is exceeded. These TCP RST packets would cause the client connections to close, resulting in the sporadic disconnections observed by your clients.

EXPERT
answered 10 months ago
  • Why would fixing the health check resolve the disconnect problem? Has the target stopped being unhealthy? The NLB had already marked the target as unhealthy in order to operate in fail-open mode, so the target's health status should no longer matter. Shouldn't the NLB continue to route client requests to the target regardless of its health status in fail-open mode?

  • When the Network Load Balancer (NLB) operates in fail-open mode, it indeed continues to route traffic to all registered targets, even if they are marked as unhealthy. However, the behavior of the NLB when a target is unhealthy is a bit different compared to when targets are healthy.

    As stated in the previous answer, for each TCP request that a client makes through an NLB, the state of the connection is tracked. If a target becomes unhealthy, the NLB sends a TCP RST (reset) packet for packets received on the client connections associated with that target, unless the unhealthy target triggers the load balancer to fail open. This reset would close the client connection, which could explain the sporadic disconnections your clients are experiencing.

    By fixing the health check issue and therefore making the target healthy again, the NLB would no longer send these TCP RST packets to clients, as it only does this when a target is unhealthy. So, even though the NLB would continue to route traffic to the target in fail-open mode, the fact that the target is now healthy means that the NLB won't be sending TCP RST packets that could be causing the sporadic disconnections.

  • In conclusion, fixing the health check issue would potentially solve the disconnect problem by preventing the NLB from sending TCP RST packets that cause client disconnections. The target's health status is indeed still a topic of discussion in fail-open mode because, while the NLB continues to route traffic to the target, it behaves differently depending on whether the target is healthy or not.

  • My single-pod service was never actually unhealthy; the NLB wrongly marked it as unhealthy because the health check itself was failing. The service continued to handle all requests in fail-open mode, but in this mode the NLB occasionally sent RST packets to clients, leading to intermittent disconnects. After we fixed the health check, the server's health was unchanged, but the NLB started detecting healthy nodes accurately. Considering this, if the service had really been unhealthy before we fixed the health check, one would expect it to remain unhealthy afterwards. If we apply your logic, RST packets should have been sent when the health check was functioning normally, because only then could the NLB correctly identify unhealthy nodes, which was not possible previously while all nodes were failing health checks in fail-open mode. I think that in the case of a single-pod application, the NLB's health check configuration might not have a significant impact on the health status of the application. My single pod hasn't restarted in the last 40 days. CPU utilization is less than 5% and memory usage is less than 2%.

-1

Hi, it seems that your question is answered here: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html

If a target becomes unhealthy, the load balancer sends a TCP RST for packets received on the client connections associated with the target, unless the unhealthy target triggers the load balancer to fail open.

If target groups don't have a healthy target in an enabled Availability Zone, we remove the IP address for the corresponding subnet from DNS so that requests cannot be routed to targets in that Availability Zone. If all targets fail health checks at the same time in all enabled Availability Zones, the load balancer fails open. The effect of the fail open is to allow traffic to all targets in all enabled Availability Zones, regardless of their health status.
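The routing rules quoted above can be sketched as a small Python function. This is only a simplified reading of the documentation text, not AWS code: `routable_targets` and the zone and target names are hypothetical, and the sketch assumes that in normal operation only healthy targets receive new connections.

```python
from typing import Dict, List


def routable_targets(health_by_zone: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the target IDs that may receive traffic under the quoted rules.

    health_by_zone maps an Availability Zone name to {target_id: is_healthy}.
    """
    all_targets = [t for zone in health_by_zone.values() for t in zone]
    any_healthy = any(h for zone in health_by_zone.values() for h in zone.values())

    if all_targets and not any_healthy:
        # Every target in every enabled zone fails its health check:
        # fail open and allow traffic to all of them.
        return sorted(all_targets)

    # Normal operation (assumed): a zone without a healthy target drops out
    # of DNS, and within the remaining zones only healthy targets are used.
    return sorted(
        t
        for zone in health_by_zone.values()
        for t, healthy in zone.items()
        if healthy
    )
```

For a single registered target that fails its health check, `routable_targets({"az-a": {"pod-1": False}})` returns `["pod-1"]`, which matches the asker's observation that the pod kept receiving traffic while marked unhealthy.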

Hope it helps,

Didier

AWS
EXPERT
answered 10 months ago
EXPERT
reviewed 10 months ago
  • In our case, the target was already registered as unhealthy and the NLB was in fail-open mode. So why would the NLB sporadically send RSTs to clients?
