Network Load Balancer stickiness seems to fail sometimes


We have a SignalR JavaScript client connecting to a .NET Core hub hosted in AWS. Both client and server use version 6.

Since more than one backend server may exist, there is an internet-facing Network Load Balancer forwarding traffic to the backend servers. The NLB is configured with these options:

  • Stickiness
  • Preserve client IP addresses

Most of the time, everything works great: the negotiation and the connection upgrade. Sometimes, however, something strange happens: the negotiation succeeds but the WebSocket transport fails, so the client retries with another transport (SSE). That transport also fails and, while retrying, the client hits the other host, starting the negotiation again. Finally, the connection succeeds. The whole process takes no more than 5 seconds.

This was happening to our clients connecting from outside, so we set up an isolated environment to debug the situation, with the NLB and two backend hosts. This environment is internal, so we are certain no one else is connecting and the hosts cannot be overloaded. We are also completely sure our IP address does not change while the test runs. We enabled client and server debug logs, shown below.
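
For reference, client-side logging in the SignalR JavaScript client can be enabled roughly like this (a minimal sketch, not our exact code; the hub URL is a placeholder):

import * as signalR from "@microsoft/signalr";

// Placeholder hub URL. LogLevel.Trace emits the Trace/Debug/Information
// lines captured in the client logs below.
const connection = new signalR.HubConnectionBuilder()
    .withUrl("https://example.com/hub")
    .configureLogging(signalR.LogLevel.Trace)
    .build();

await connection.start();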

This still happens sometimes, no matter which host we hit on the first attempt. We know we can configure the client to skip the negotiation (sketched below), but that would make us lose about 10% of our clients, because we would be limited to the WebSockets transport.
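
For context, skipping negotiation looks roughly like this (again with a placeholder URL); it is only valid when the transport is forced to WebSockets, which is the limitation mentioned above:

import * as signalR from "@microsoft/signalr";

// skipNegotiation removes the negotiate round-trip (and the stickiness
// requirement for it), but only works with the WebSockets transport.
const connection = new signalR.HubConnectionBuilder()
    .withUrl("https://example.com/hub", {
        skipNegotiation: true,
        transport: signalR.HttpTransportType.WebSockets,
    })
    .build();

await connection.start();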

From the logs, IP stickiness seems to be failing somehow...

What is misconfigured in our setup? How can the connection fail if just one client is connecting and its IP does not change? What else can we configure in the AWS NLB to ensure IP stickiness?

Thanks in advance!

When the connection succeeds at the first attempt

Client logs
Debug: Starting connection with transfer format 'Text'.
Debug: Sending negotiation request: https://<server>...
Debug: Selecting transport 'WebSockets'.
Trace: (WebSockets transport) Connecting.
Information: WebSocket connected to wss://<server>...
Debug: The HttpConnection connected successfully.

Server logs
[DBG] New connection r3Gv5PrBgTA2T6lijqwTrA created.
[DBG] Sending negotiation response.
[DBG] Establishing new connection.
[DBG] Socket opened using Sub-Protocol: 'null'.
[DBG] OnConnectedAsync started.
[DBG] Found protocol implementation for requested protocol: json.
[DBG] Completed connection handshake. Using HubProtocol 'json'.

When the connection fails at the first attempt

Client logs
Debug: Starting connection with transfer format 'Text'.
Debug: Sending negotiation request: https://<server>...
Debug: Selecting transport 'WebSockets'.
Trace: (WebSockets transport) Connecting.
WebSocket connection to 'wss://<server>...' failed:
Information: (WebSockets transport) There was an error with the transport.
Error: Failed to start the transport 'WebSockets': Error: WebSocket failed to connect. The connection could not be found on the server, either the endpoint may not be a SignalR endpoint, the connection ID is not present on the server, or there is a proxy blocking WebSockets. If you have multiple servers check that sticky sessions are enabled.
Debug: Selecting transport 'ServerSentEvents'.
Debug: Sending negotiation request: https://<server>...
Trace: (SSE transport) Connecting.
Information: SSE connected to https://<server>...
Debug: The HttpConnection connected successfully.
Trace: (SSE transport) sending data. String data of length 32.
POST https://<server>... 404 (Not Found)
Debug: HttpConnection.stopConnection(undefined) called while in state Disconnecting.
Error: Connection disconnected with error 'Error: No Connection with that ID: Status code '404''.
Debug: Starting connection with transfer format 'Text'.
Debug: Sending negotiation request: https://<server>...
Debug: Selecting transport 'WebSockets'.
Trace: (WebSockets transport) Connecting.
Information: WebSocket connected to wss://<server>...
Debug: The HttpConnection connected successfully.

Server logs
(server 1)
[DBG] New connection _cm5IaOtqY7tD7suKOb08Q created.
[DBG] Sending negotiation response. (1)
[DBG] New connection GuXhVydEzL-8xXcSxibysA created.
[DBG] Sending negotiation response. (2)
[DBG] Establishing new connection.
[DBG] OnConnectedAsync started.
[DBG] Failed connection handshake.

(server 2)
[DBG] New connection RjoZW-BKBNMOa2UBW9yo-g created.
[DBG] Sending negotiation response.
[DBG] Establishing new connection.
[DBG] Socket opened using Sub-Protocol: 'null'.
[DBG] OnConnectedAsync started.
[DBG] Found protocol implementation for requested protocol: json.
[DBG] Completed connection handshake. Using HubProtocol 'json'.
cburaca
asked 2 years ago · 2736 views
1 Answer

First, don't consider the network (any network!) to be 100% reliable. There are so many things that can go wrong at different points in the lifetime of a session (WebSockets or otherwise) or of an individual packet. If you are seeing intermittent connectivity issues, there are many possible causes, far too many to go into here. That said...

The logs you have pasted show a 404 error being returned. That means the client successfully connected to a server, and that server responded that the resource (here, the connection ID) couldn't be found. End-to-end connectivity was achieved, but the request landed on a server that didn't recognise it (a quick probe for this is sketched below). If you have multiple servers, is it possible that they are not all configured identically?
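
One way to check where successive requests land is to call the negotiate endpoint repeatedly and record the responses. This sketch (Node 18+ or a browser) assumes a placeholder NLB URL and a custom X-Backend-Id response header that each backend would have to be configured to add; stock SignalR does not emit one:

// Hypothetical probe. X-Backend-Id is an assumed custom header, not part
// of SignalR; each backend would need to be configured to add it.
const url = "https://my-nlb.example.com/hub/negotiate?negotiateVersion=1";

for (let i = 0; i < 10; i++) {
    const res = await fetch(url, { method: "POST" });
    const { connectionId } = await res.json();
    console.log(res.status, res.headers.get("x-backend-id"), connectionId);
}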

Finally, NLB stickiness works when the client hits the same NLB node using the same source IP address. When the client does a DNS lookup for the NLB name, multiple IP addresses are returned (one per enabled Availability Zone) and the client chooses one to connect to. On a second connection attempt, the client (which could be SignalR, the browser or the operating system) may choose a different IP address, and that will not deliver a "sticky" session.
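
You can see how many NLB node addresses a client can land on by resolving the NLB's DNS name; under Node.js, for example (hypothetical NLB name):

import { resolve4 } from "node:dns/promises";

// Hypothetical NLB DNS name; expect one A record per enabled
// Availability Zone, each a separate NLB node.
const addresses = await resolve4("my-nlb-1234567890.elb.us-east-1.amazonaws.com");
console.log(addresses); // e.g. [ "52.0.0.10", "52.0.0.20" ]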

To the comments above: the network should never be considered 100% reliable. In a failure scenario (say, an NLB node going down) a client will definitely hit another node, and stickiness is not assured. At the application level you're going to have to deal with that, so I'd encourage a design that doesn't require sticky sessions if at all possible.

AWS
EXPERT
answered 2 years ago
  • Thanks for your reply.

    1. The servers are identical, because they are launched from the same AMI (in an auto-scaling group). Besides, the issue happens with both servers, no matter which one we hit first.

    2. So, AWS NLB stickiness depends not only on the source client IP address but also on the NLB node the client hits? Is this caused by the different Availability Zones the NLB is in? If I make the NLB available in only one zone, would that solve the issue?

    3. Unfortunately, SignalR communication does require sticky sessions, to ensure the final connection is established with the same host that handled the negotiation.

    1. Cool.
    2. It depends on both of those things; that's just the way it works. I would not recommend using only a single NLB node, because that would expose you to other failure conditions. We strongly recommend being multi-AZ.
    3. Being a networking nerd and having done a lot of redundant data centre design (but nowhere near as much as the people who design AWS Regions and Availability Zones), I think that's a very poor architectural choice by the designers of SignalR. If it were me, I would very carefully consider my choice of framework. While it may make your life easier in some ways, it will make it harder in others if the components you rely on to be reliable are not. In the end, everything is a compromise.
  • A great quote from Werner Vogels (Amazon CTO) is "everything fails all the time". Phrased another way: plan for failure. The Amazon Builders' Library is therefore highly recommended reading.

  • Ok, thanks, I see I still have to think about this.
