NLB is delaying first TCP message until TCP FIN

We are seeing a strange issue with a TLS NLB: some TCP clients fail, for several minutes, to connect to an idle server behind the NLB.

The client and server communicate using the FIX protocol, and on the server side we use the NLB for SSL termination. According to the FIX protocol, a client sends a special LOGON message immediately after the TCP connection is established. If the LOGON message is not received within a configurable number of seconds, the server closes the TCP connection as idle. In FIX, LOGON is always the first TCP {DATA} message.

What we observe is that for some clients this initial LOGON message is delayed. A client-side Wireshark capture shows that the LOGON message is sent on time. A server-side Wireshark capture shows that the LOGON message is delivered with a delay equal to the idle connection timeout: it arrives right after our server closes the supposedly idle connection with a TCP {FIN+ACK} message.
We tried replacing our server software with netcat (nc) listening on the same port, and the problem remains.
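
For anyone who wants to time this on the server side without Wireshark, here is a minimal sketch of a stand-in listener in Python. It is not our actual server: the function name `probe_first_message` and the `linger_s` parameter are made up for illustration. It accepts one connection, waits up to the idle timeout for the first bytes, and on timeout half-closes the socket (which sends a FIN) and keeps reading briefly to see whether late data shows up.

```python
import socket
import time


def probe_first_message(listener, idle_timeout_s, linger_s=2.0):
    """Accept one connection and report when its first bytes arrive.

    Returns (elapsed_seconds, data, fin_sent). fin_sent is True when the
    idle timeout expired first and data (possibly empty) was read only
    after this side had already sent its FIN.
    """
    conn, _addr = listener.accept()
    with conn:
        accepted_at = time.monotonic()
        conn.settimeout(idle_timeout_s)
        try:
            data = conn.recv(4096)
            return time.monotonic() - accepted_at, data, False
        except socket.timeout:
            pass
        # Idle timeout hit: half-close to send FIN, then keep reading
        # for a short while to catch a first message that arrives late.
        conn.shutdown(socket.SHUT_WR)
        conn.settimeout(linger_s)
        try:
            data = conn.recv(4096)
        except socket.timeout:
            data = b""
        return time.monotonic() - accepted_at, data, True
```

Called in a loop on a listening socket bound to the target port, this prints enough to tell a LOGON that arrived on time from one that only arrived after the FIN.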

Interestingly, if the client keeps trying, it eventually succeeds (within several minutes). What is also bizarre is that the problem is 100% reproducible for only one type of TCP client, while another type of TCP client always gets its LOGON delivered on the first attempt.

Everything works fine without the NLB, or when the NLB is set up without TLS. Setting TCP_NODELAY on the client side has no effect. We have tried this setup on two different AWS accounts with slightly different settings. We compared Wireshark captures of successful and unsuccessful LOGON attempts, and the initial {SYN;SYN+ACK;ACK} TCP exchanges look identical.

Any ideas are welcome. As a workaround, we are prepared to use an instance running stunnel in place of the NLB.

Kind Regards,
Andy

Additional information: We found the key difference between the two clients. The problem only happens when there is a delay of 500+ milliseconds between connection establishment and the first {DATA} TCP packet. If the first data message reaches the NLB sooner, it is delivered without problems.
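
A minimal sketch of a client that should reproduce the timing, assuming Python is available on the client machine. The function name `connect_then_send` is hypothetical, the payload is whatever bytes you pass in (a real test would send a valid FIX LOGON), and the host and port are placeholders. It connects (with TLS by default, since the NLB terminates TLS), waits a configurable delay, and only then sends the first data.

```python
import socket
import ssl
import time


def connect_then_send(host, port, payload, delay_s, use_tls=True, timeout_s=10.0):
    """Open a connection, wait delay_s, then send payload as the first data.

    Returns the seconds elapsed between connection setup and the send.
    """
    raw = socket.create_connection((host, port), timeout=timeout_s)
    if use_tls:
        # Default context verifies the server certificate against host.
        context = ssl.create_default_context()
        conn = context.wrap_socket(raw, server_hostname=host)
    else:
        conn = raw
    with conn:
        connected_at = time.monotonic()
        time.sleep(delay_s)  # the gap before the first {DATA} packet
        conn.sendall(payload)
        return time.monotonic() - connected_at
```

Running it once with `delay_s=0.1` and once with `delay_s=0.6` against the NLB address should show whether the 500 ms threshold is what separates the two cases.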

asked 5 years ago, 670 views
1 Answer

For bookkeeping: AWS Business Support acknowledged this as an NLB problem. They also suggested a temporary workaround: use TLS (rather than bare TCP) as the transport protocol between the NLB and the target service. We used stunnel with a self-signed certificate (trusted by the NLB). The workaround works.

answered 5 years ago
