TLS handshake failure on Gen5 HughesNet transport

Question

Hello,  
  
My company has built an IoT water treatment product that is currently in soft release (https://dropconnect.com).  Our WiFi-enabled hub device runs the AWS C IoT SDK and uses the ARM Mbed TLS library for TLS encryption.  We have many systems deployed that are working beautifully, but our system is unable to connect to AWS over HughesNet Gen5 satellite internet service.  Essentially, the TLS handshake proceeds to the point where the client (our device) sends its ChangeCipherSpec message and expects the server's ChangeCipherSpec message, but that message never arrives.  The implication is that either something about the HughesNet satellite link is causing a handshake message to get dropped, or something about the client's ChangeCipherSpec message is invalid and AWS is silently terminating the connection.  
  
I have ruled out network latency as a contributing factor, both by simulating a network connection with latencies much higher than those present on a satellite link, and by testing on a HughesNet Gen4 system, which, oddly, works just fine.  While using the HughesNet Gen5 link, the first 11 steps of the TLS handshake exactly match the progression seen during a successful connection over a different internet link (apart from the randomized data, which naturally differs).  I believe this rules out DNS lookups and basic network transport as the problem.  I have also disabled all caching and web acceleration features on the satellite modem, without any change.  
  
Is there any way I can learn why a TLS handshake has failed from the perspective of the server at AWS?  Any suggestions on how I might diagnose this problem would be appreciated.  
  
Thanks,  
  
Patrick

Answer

I believe my problem has been resolved.  I don't yet fully understand the root cause, but I found an anomaly in the way that the first TLS handshake response from the AWS server was being handled and traced it to the size of a TCP socket read buffer.  By increasing the buffer size from the default value of 1024 bytes to more than 1400, the connection problems over the HughesNet satellite link disappeared.  (I used a value of 1460, which is the typical Ethernet MTU of 1500 bytes less the 20-byte IP header and the 20-byte TCP header, i.e. the usual TCP maximum segment size; this seems to be fairly common practice.)  
  
The HughesNet transport must be doing something atypical with the packet framing that our project wasn't handling well.  Increasing this internal buffer size solved the problem.  
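
For anyone hitting something similar, here is a minimal, self-contained C sketch of one general reason a receive buffer size can matter: a TLS record can arrive split across several reads, so the reader has to keep reading until the full record is assembled. This is only an illustration, not the code from our project or from Microchip's TCP/IP stack, and it is not a confirmed diagnosis of the root cause. `fake_stream`, `fake_recv`, `read_exact`, and `read_tls_record` are hypothetical names invented for the example; the 5-byte record header layout (type, version, 16-bit big-endian length) is from RFC 5246.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a TCP socket: hands out at most seg_size
 * bytes per read, like a link that delivers data in smaller pieces. */
typedef struct {
    const uint8_t *data;   /* bytes "on the wire" */
    size_t len;            /* total bytes available */
    size_t pos;            /* bytes consumed so far */
    size_t seg_size;       /* maximum bytes returned per read */
} fake_stream;

/* Behaves like a recv()-style call: may return fewer bytes than asked for.
 * Returns 0 when the stream is exhausted. */
static size_t fake_recv(fake_stream *s, uint8_t *buf, size_t want)
{
    size_t left = s->len - s->pos;
    size_t n = want < left ? want : left;
    if (n > s->seg_size) {
        n = s->seg_size;
    }
    memcpy(buf, s->data + s->pos, n);
    s->pos += n;
    return n;
}

/* Keeps reading until `want` bytes have arrived or the stream ends.
 * Returns the number of bytes actually read. */
static size_t read_exact(fake_stream *s, uint8_t *buf, size_t want)
{
    size_t got = 0;
    while (got < want) {
        size_t n = fake_recv(s, buf + got, want - got);
        if (n == 0) {
            break;  /* no more data */
        }
        got += n;
    }
    return got;
}

/* Reads one TLS record: the 5-byte header, then the payload whose length
 * is carried in header bytes 3..4 (big-endian).
 * Returns the payload length, or -1 if the record is incomplete or does
 * not fit in `payload`. */
static long read_tls_record(fake_stream *s, uint8_t *payload, size_t cap)
{
    uint8_t hdr[5];
    size_t len;

    assert(s != NULL && payload != NULL);
    if (read_exact(s, hdr, sizeof hdr) != sizeof hdr) {
        return -1;
    }
    len = ((size_t)hdr[3] << 8) | hdr[4];
    if (len > cap) {
        return -1;
    }
    if (read_exact(s, payload, len) != len) {
        return -1;
    }
    return (long)len;
}
```

With a 2000-byte record and `seg_size` set to 1024, `read_tls_record` needs several calls to `fake_recv` before it has the whole payload; code that assumes one read returns one complete message would stall or misparse at that point.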
  
Patrick

Answer

Alex,  
  
Thanks for your reply; I have been working on pursuing your suggestions.  
  
We are using X.509 certificates, but the failure does not produce any alert messages.  I've turned on detailed logging in Mbed TLS and don't see anything useful in the output.  Using the HughesNet Gen5 link, the TLS handshake goes into a read-and-retry loop after sending the client ChangeCipherSpec message.  I've logged the network traffic with Wireshark and have confirmed that the client ChangeCipherSpec message is transmitted, but no response is ever sent by AWS.  
  
The latest thing I've been working on is modifying a sample Visual Studio project included with Mbed TLS to connect to our AWS endpoint and use the same X.509 certificates that are assigned to one of our test devices.  Running on my Windows workstation, that project is able to complete the TLS handshake both via our landline ISP and via the HughesNet Gen5 link.  The implication is that either the problem is timing-related, there is a subtle difference in the order of operations between the two projects, or something is wrong with Microchip's TCP/IP stack library.  I'll continue digging and will add to this thread if I have additional questions or find the underlying cause.  
  
Thanks,  
  
Patrick

Answer

Hi Patrick,  
  
During that phase of the handshake, the client will send the following messages sequentially (from the RFC https://tools.ietf.org/html/rfc5246#page-33):  
Certificate*  
ClientKeyExchange  
CertificateVerify*  
[ChangeCipherSpec]  
Finished  
  
The server will read those messages in order and could be cancelling the handshake while processing messages before ChangeCipherSpec. Is your client using X.509 certificates? If so, the Certificate and CertificateVerify messages will also be sent and are frequently the cause of handshake failures.  
  
Another way to get more information about the handshake failure would be to make sure you have debug logs turned on for Mbed TLS and see if any TLS alerts are sent. You could also use the same AWS IoT endpoint and the same client certificates with another simple client program, such as openssl s_client, and use the -debug or -msg flag to get more information about any TLS-layer errors. This would also rule out any issues with the certificates.  
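
If you end up reading raw bytes from a packet capture or a debug dump, the following minimal C sketch shows how a plaintext TLS alert record can be recognized. It is an illustration under stated assumptions, not part of Mbed TLS or the AWS SDK: `tls_content_type_name` and `tls_parse_alert` are hypothetical helper names, while the content type values and the alert layout (level byte, then description byte) come from RFC 5246. Alerts sent after ChangeCipherSpec are encrypted and will not parse this way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Maps the first byte of a TLS record to its RFC 5246 content type name. */
static const char *tls_content_type_name(uint8_t type)
{
    switch (type) {
    case 20: return "change_cipher_spec";
    case 21: return "alert";
    case 22: return "handshake";
    case 23: return "application_data";
    default: return "unknown";
    }
}

/* Checks whether `rec` holds a plaintext alert record:
 *   byte 0     content type (21 = alert)
 *   bytes 1-2  protocol version
 *   bytes 3-4  payload length, big-endian (2 for a plaintext alert)
 *   byte 5     alert level (1 = warning, 2 = fatal)
 *   byte 6     alert description (e.g. 40 = handshake_failure)
 * Returns 1 and fills *level and *desc on success, 0 otherwise. */
static int tls_parse_alert(const uint8_t *rec, size_t len,
                           uint8_t *level, uint8_t *desc)
{
    size_t payload_len;

    assert(rec != NULL && level != NULL && desc != NULL);
    if (len < 7 || rec[0] != 21) {
        return 0;
    }
    payload_len = ((size_t)rec[3] << 8) | rec[4];
    if (payload_len != 2) {
        return 0;  /* likely an encrypted alert, or malformed */
    }
    *level = rec[5];
    *desc = rec[6];
    return 1;
}
```

For example, the bytes 15 03 03 00 02 02 28 (hex) parse as an alert with level 2 (fatal) and description 40 (handshake_failure).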
  
-Alex
