NLB is delaying first TCP message until TCP FIN


We are experiencing a strange issue with a TLS NLB. Some TCP clients fail to connect, for several minutes, to an idle server behind the NLB.

The client and server use the FIX communication protocol, and on the server side we use the NLB for SSL termination. According to the FIX protocol, a client sends a special LOGON message immediately after the TCP connection is established. If the LOGON message is not received within a configurable number of seconds, the server closes the TCP connection as idle. In the FIX transmission protocol, LOGON is always the first TCP {DATA} message.

What we observe is that for some clients this initial LOGON message is delayed. A client-side Wireshark recording shows that the LOGON message is sent on time. A server-side Wireshark recording shows that the LOGON message is delivered with a delay equal to the idle connection timeout. We can see it delivered right after our server kills the supposedly idle connection with a TCP {FIN+ACK} message.
We tried replacing our server software with netcat (nc) listening on the same port, and the problem remains.
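For anyone who wants to observe the same thing without Wireshark, here is a minimal sketch of a netcat-like listener that reports how long the first data took to arrive after accept, and gives up after an idle timeout the way our server does. The function name `time_to_first_data` and the 4096-byte read size are my own choices for illustration, not part of our actual server.

```python
import socket
import time


def time_to_first_data(listener, idle_timeout_s):
    """Accept one connection and time the arrival of its first data.

    Returns (seconds_until_first_data, data), or (None, b"") if nothing
    arrived within idle_timeout_s (the connection is then closed as idle).
    """
    conn, _peer = listener.accept()
    with conn:
        accepted_at = time.monotonic()
        conn.settimeout(idle_timeout_s)
        try:
            data = conn.recv(4096)
        except socket.timeout:
            # No data before the idle timeout: close, as the FIX server would.
            return None, b""
        return time.monotonic() - accepted_at, data
```

Note that this only times data arriving before the close. The late delivery we saw right after the {FIN+ACK} still needs a packet capture to confirm.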

Interestingly, if the client keeps trying, it eventually succeeds (within several minutes). What is also bizarre is that the problem is 100% reproducible for only one type of TCP client, while another type of TCP client always gets its LOGON delivered on the first attempt.

Everything works fine without the NLB, or when the NLB is set up without TLS. Setting TCP_NODELAY on the client side has no effect. We have tried this setup on two different AWS accounts, using slightly different settings. We compared Wireshark recordings of both successful and unsuccessful LOGON attempts, and the initial {SYN; SYN+ACK; ACK} TCP exchanges look identical.

Any ideas are welcome. We are prepared to use an instance running stunnel in place of the NLB as a workaround.

Kind Regards,
Andy

Additional information: We found the key difference between the two clients. The problem only happens when there is a delay of 500+ milliseconds between connection establishment and the first {DATA} TCP packet. If the first data message reaches the NLB sooner, it is delivered without problems.
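A minimal sketch of a client that should reproduce the two behaviours by varying only the delay before the first message. The function name `send_first_message` and its parameters are hypothetical, and the payload is a placeholder rather than a valid FIX LOGON. It also assumes the delay is applied after the TLS handshake with the NLB completes, which is my reading of where our slow client pauses.

```python
import socket
import ssl
import time


def send_first_message(host, port, delay_s, payload, use_tls=True, timeout_s=10.0):
    """Connect, wait delay_s seconds, then send payload as the first data.

    Returns the seconds elapsed between the connection being ready
    (after the TLS handshake when use_tls is True) and the send completing.
    """
    with socket.create_connection((host, port), timeout=timeout_s) as raw:
        # TCP_NODELAY made no difference for us, but it rules out Nagle.
        raw.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        conn = raw
        if use_tls:
            # Default context verifies the certificate presented by the NLB.
            context = ssl.create_default_context()
            conn = context.wrap_socket(raw, server_hostname=host)
        with conn:
            ready_at = time.monotonic()
            time.sleep(delay_s)  # the gap before the first data message
            conn.sendall(payload)
            return time.monotonic() - ready_at
```

Calling it once with `delay_s=0.1` and once with `delay_s=1.0` against the NLB listener, while capturing on the server side, should show whether the 500 ms threshold holds in another setup.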

Asked 5 years ago · Viewed 696 times
1 Answer

For the record: AWS Business Support acknowledged this as an NLB problem. They also suggested a temporary workaround: use TLS (rather than bare TCP) as the transport protocol between the NLB and the target service. We used stunnel with a self-signed certificate (trusted by the NLB). The workaround works.

Answered 5 years ago
