We have a SignalR JavaScript client connecting to a .NET Core hub hosted in AWS.
Both client and server use version 6.
Because more than one backend server may be running, there is an internet-facing Network Load Balancer (NLB) forwarding traffic to the backend servers.
The NLB is configured with these options:
- Stickiness
- Preserve client IP addresses
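For reference, these same target-group settings can also be applied programmatically. Below is a minimal sketch using the AWS SDK for JavaScript v3; the region and target group ARN are placeholders, not values from our setup:

```javascript
const {
  ElasticLoadBalancingV2Client,
  ModifyTargetGroupAttributesCommand,
} = require("@aws-sdk/client-elastic-load-balancing-v2");

async function enableSourceIpStickiness() {
  // Placeholder region -- replace with the region the NLB lives in.
  const client = new ElasticLoadBalancingV2Client({ region: "eu-west-1" });

  await client.send(
    new ModifyTargetGroupAttributesCommand({
      // Placeholder ARN -- replace with the real target group ARN.
      TargetGroupArn:
        "arn:aws:elasticloadbalancing:<region>:<account>:targetgroup/<name>/<id>",
      Attributes: [
        { Key: "stickiness.enabled", Value: "true" },
        // source_ip is the only stickiness type NLB target groups support.
        { Key: "stickiness.type", Value: "source_ip" },
        { Key: "preserve_client_ip.enabled", Value: "true" },
      ],
    })
  );
}
```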
Most of the time everything works great: both the negotiation and the connection upgrade succeed.
Sometimes, however, something strange happens: the WebSocket transport fails right after a successful negotiation, so the client retries with another transport (SSE).
That transport also fails and, while retrying, the client hits the other host and starts the negotiation again.
Finally, the connection succeeds.
The whole process takes no more than 5 seconds.
This was happening to our clients, from the outside, so we set up an isolated environment to reproduce and debug it, with the NLB and two backend hosts.
The environment is internal, so we are certain no one else is connecting and there is no chance the hosts are overloaded.
We are also certain our IP address does not change while the test is running.
We enabled client and server debug logs, shown below.
The issue still happens occasionally, no matter which host we hit on the first attempt.
We know we can configure the client to skip negotiation, but that would cost us about 10% of our clients, because we would be limited to the WebSockets transport.
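For completeness, this is roughly what skipping negotiation looks like in the JavaScript client (a sketch; the hub URL is a placeholder, and `skipNegotiation` requires forcing the WebSockets transport):

```javascript
const signalR = require("@microsoft/signalr");

const connection = new signalR.HubConnectionBuilder()
  .withUrl("https://<server>/hub", {
    // Skips the HTTP negotiate round-trip, so the NLB only ever sees
    // one request per connection -- but WebSockets becomes mandatory.
    skipNegotiation: true,
    transport: signalR.HttpTransportType.WebSockets,
  })
  .build();
```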
From the logs, IP stickiness seems to be failing somehow...
What is misconfigured in our setup?
How can the connection fail if just one client is connecting and its IP does not change?
What else can we configure in the AWS NLB to ensure the IP stickiness?
Thanks in advance!
When the connection succeeds at the first attempt
Client logs
Debug: Starting connection with transfer format 'Text'.
Debug: Sending negotiation request: https://<server>...
Debug: Selecting transport 'WebSockets'.
Trace: (WebSockets transport) Connecting.
Information: WebSocket connected to wss://<server>...
Debug: The HttpConnection connected successfully.
Server logs
[DBG] New connection r3Gv5PrBgTA2T6lijqwTrA created.
[DBG] Sending negotiation response.
[DBG] Establishing new connection.
[DBG] Socket opened using Sub-Protocol: 'null'.
[DBG] OnConnectedAsync started.
[DBG] Found protocol implementation for requested protocol: json.
[DBG] Completed connection handshake. Using HubProtocol 'json'.
When the connection fails at the first attempt
Client logs
Debug: Starting connection with transfer format 'Text'.
Debug: Sending negotiation request: https://<server>...
Debug: Selecting transport 'WebSockets'.
Trace: (WebSockets transport) Connecting.
WebSocket connection to 'wss://<server>...' failed:
Information: (WebSockets transport) There was an error with the transport.
Error: Failed to start the transport 'WebSockets': Error: WebSocket failed to connect. The connection could not be found on the server, either the endpoint may not be a SignalR endpoint, the connection ID is not present on the server, or there is a proxy blocking WebSockets. If you have multiple servers check that sticky sessions are enabled.
Debug: Selecting transport 'ServerSentEvents'.
Debug: Sending negotiation request: https://<server>...
Trace: (SSE transport) Connecting.
Information: SSE connected to https://<server>...
Debug: The HttpConnection connected successfully.
Trace: (SSE transport) sending data. String data of length 32.
POST https://<server>... 404 (Not Found)
Debug: HttpConnection.stopConnection(undefined) called while in state Disconnecting.
Error: Connection disconnected with error 'Error: No Connection with that ID: Status code '404''.
Debug: Starting connection with transfer format 'Text'.
Debug: Sending negotiation request: https://<server>...
Debug: Selecting transport 'WebSockets'.
Trace: (WebSockets transport) Connecting.
Information: WebSocket connected to wss://<server>...
Debug: The HttpConnection connected successfully.
Server logs
(server 1)
[DBG] New connection _cm5IaOtqY7tD7suKOb08Q created.
[DBG] Sending negotiation response. (1)
[DBG] New connection GuXhVydEzL-8xXcSxibysA created.
[DBG] Sending negotiation response. (2)
[DBG] Establishing new connection.
[DBG] OnConnectedAsync started.
[DBG] Failed connection handshake.
(server 2)
[DBG] New connection RjoZW-BKBNMOa2UBW9yo-g created.
[DBG] Sending negotiation response.
[DBG] Establishing new connection.
[DBG] Socket opened using Sub-Protocol: 'null'.
[DBG] OnConnectedAsync started.
[DBG] Found protocol implementation for requested protocol: json.
[DBG] Completed connection handshake. Using HubProtocol 'json'.
Thanks for your reply.
The servers are effectively identical, because they are launched from the same AMI (in an autoscaling group). Besides, the issue happens with both servers, regardless of which one we hit first.
So AWS NLB stickiness depends not only on the source client IP address, but also on the NLB node the client hits? Is this caused by the NLB spanning multiple availability zones? If I make the NLB available in only one zone, would that solve the issue?
Unfortunately, SignalR does require sticky sessions to be in place, to ensure the final connection is established with the same host that handled the negotiation.
A great quote from Werner Vogels (Amazon CTO) is "everything fails all the time". Phrased another way: plan for failure. The Amazon Builders' Library is therefore highly recommended reading.
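Applied to this case, "plan for failure" can mean letting the client recover from transient transport failures on its own. The SignalR JavaScript client's `withAutomaticReconnect` accepts a retry policy object; below is a minimal sketch with a plain exponential-backoff policy (the delay values and attempt limit are illustrative assumptions, not recommendations):

```javascript
// Plain object matching the shape SignalR expects from a retry policy:
// nextRetryDelayInMilliseconds(retryContext) returns a delay in ms,
// or null to stop retrying.
const backoffPolicy = {
  nextRetryDelayInMilliseconds(retryContext) {
    if (retryContext.previousRetryCount >= 5) {
      return null; // give up after 5 attempts
    }
    // Exponential backoff: 1 s, 2 s, 4 s, 8 s, 16 s (capped defensively at 30 s).
    return Math.min(1000 * 2 ** retryContext.previousRetryCount, 30000);
  },
};

// Usage with the SignalR client (assumes @microsoft/signalr is installed):
//   new signalR.HubConnectionBuilder()
//     .withUrl("https://<server>/hub")
//     .withAutomaticReconnect(backoffPolicy)
//     .build();
```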
Ok, thanks, I see I still have to think about this.