Why does my application have latency and performance issues when I use Site-to-Site VPN?


My on-premises application has latency issues when I use an AWS Site-to-Site VPN to access resources in AWS.

Resolution

If you're experiencing performance or latency issues when you use Site-to-Site VPN to access resources, then follow these troubleshooting steps:

  • Isolate the source and destination systems one at a time.
  • Check the network path for issues that might be causing latency.
  • Check your application for common errors that cause latency issues.

Isolate your source and destination systems

To mitigate performance issues between your on-premises application and AWS, first isolate the source and destination systems. Then, use network tools to check for loss and latency outside the application that might be directly impacting performance.

1.    Change the source and destination. Use a different source and then a different destination, and check if the problem persists after each change. Then, check the device to determine if there's an operating system (OS) configuration issue or another issue.

2.    Perform a UDP bandwidth capabilities test. Performance issues can indicate throughput problems, so use the iperf3 tool to check your provisioned bandwidth. Perform this test bidirectionally. The following example UDP test uses the iperf3 tool.

Note: -i refers to the reporting interval, -u refers to UDP, -b refers to bandwidth (adjust accordingly), -p refers to the UDP port, and -V refers to verbose output.

Server: sudo iperf3 -s -p 33344
Client: sudo iperf3 -i 1 -u -p 33344 -b 1.2G -c <private IP> -V

Note: Make sure that your bandwidth credit is available for the Amazon Elastic Compute Cloud (Amazon EC2) instance that you're using. Or, try using a larger instance size, and then test again.

3.    Use iperf3 to perform a TCP throughput test on your Site-to-Site VPN. Perform this test bidirectionally. See the following example:

Note: For optimal performance, try different TCP receive window sizes to test the source and destination memory buffers when you're increasing the instance size.

Server: iperf3 -s [-p 5001]
Client:
sudo iperf3 -c <Private IP> -P 10 -w 128K -V
sudo iperf3 -c <Private IP> -P 10 -w 512K -V
sudo iperf3 -c <Private IP> -P 10 -w 1024K -V   

Check the network path for issues

Check the network path to identify the specific hop or device that's causing issues on the network:

  • Check for packet loss along the path between the Site-to-Site VPN peers.
  • Check the Site-to-Site VPN tunnel throughput.
  • Check the customer gateway router configuration.
  • Check the lowest MTU of the path.

Check for packet loss along the path between Site-to-Site VPN peers

Your Site-to-Site VPN tunnel is an encrypted communication between peers. But, the underlying network might be exhibiting loss that impacts the quality of the encrypted communication. Packet loss increases latency and directly impacts throughput.

See the following equation and example:

Mathis equation: Throughput = (MSS / RTT) * (1 / sqrt(p)), where p is the packet loss rate.

Example with a 1375-byte MSS, 33 ms RTT, and 0.1% packet loss (p = 0.001):
(1375 B / 0.033 s) * (1 / sqrt(0.001)) = 41,666.6 Bps * 31.6227 = 1,317,615.7 Bps * 8 (to convert Bps to bps) = 10,540,925 bps (about 10.5 Mbps)

The maximum throughput that a single TCP receive window allows is [TCP window size in bits] / [latency (RTT) in seconds]. See the following example:

Example with a 64 KB receive window and 33 ms RTT:
524,288 bits / 0.033 s = 15,887,515 bps (about 15.8 Mbps maximum possible throughput)
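
If you want to reproduce these calculations on the command line, the following bc commands are one convenient sketch that uses the same example values as above. The first command prints the Mathis estimate in bps, and the second prints the window-limited maximum in bps:

echo "(1375 / 0.033) * (1 / sqrt(0.001)) * 8" | bc -l
echo "(64 * 1024 * 8) / 0.033" | bc -l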

To check for packet loss on the public path between Site-to-Site VPN peers, use an ICMP test, such as MTR. For more information on installing and using MTR, see How do I troubleshoot network performance issues between EC2 Linux or Windows instances in a VPC and an on-premises host over the internet gateway?

See the following example:

Note: The MTR output in this example includes values with no data or 100% loss. This indicates that the device dropped the packet with a TTL of 0, but didn't reply with an ICMP time exceeded (Type 11, Code 0) message. So, these values don't indicate a problem.

[ec2-user@ip-10-7-10-67 ~]$ sudo mtr --no-dns --report --report-cycles 20 18.189.121.166
Start: 2023-04-07T16:28:28+0000
HOST: ip-10-7-10-67.ec2.internal  Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
  2.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
  3.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
  4.|-- 241.0.12.14                0.0%    20    0.4   0.4   0.3   0.8   0.1
  5.|-- 240.0.204.2                0.0%    20    0.4   0.4   0.3   0.5   0.0
  6.|-- 240.0.204.17               0.0%    20    0.4   0.4   0.3   0.5   0.0
  7.|-- 240.0.204.5                0.0%    20    0.4   0.4   0.4   0.5   0.0
  8.|-- 242.2.74.145               0.0%    20    1.2   4.0   0.4  23.9   5.7
  9.|-- 52.93.29.71                0.0%    20    0.8   2.3   0.7   9.2   2.8
 10.|-- 100.100.8.66               0.0%    20   10.8   2.5   0.7  12.8   4.0
 11.|-- 100.92.53.85               0.0%    20   26.0  13.3  11.0  26.0   4.4
 12.|-- 52.93.239.5                0.0%    20   11.6  12.8  11.4  23.7   2.7
 13.|-- 52.95.1.159                0.0%    20   11.0  12.0  11.0  18.3   1.7
 14.|-- 52.95.1.186                0.0%    20   11.5  14.1  11.2  32.6   5.9
 15.|-- 15.230.39.135              0.0%    20   11.6  11.9  11.1  15.5   1.1
 16.|-- 15.230.39.124              0.0%    20   11.7  12.8  11.2  27.2   3.6
 17.|-- 108.166.252.38             0.0%    20   11.2  11.2  11.1  11.3   0.0
 18.|-- 242.0.102.17               0.0%    20   12.1  12.4  11.2  23.9   2.8
 19.|-- 108.166.252.35             0.0%    20   11.3  11.3  11.2  12.3   0.2
 20.|-- 241.0.12.207               0.0%    20   11.2  11.3  11.1  13.2   0.5
 21.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
 22.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
 23.|-- ???                       100.0    20    0.0   0.0   0.0   0.0   0.0
 24.|-- 100.65.30.129              0.0%    20   57.2  24.9  11.6  76.4  17.9
 25.|-- 18.189.121.166             0.0%    20   11.3  11.8  11.2  17.6   1.6

Check the Site-to-Site VPN tunnel throughput

Check if your throughput is breaching the limit of 1.2 Gbps:

1.    Open the Amazon CloudWatch console to view the Site-to-Site VPN metrics.

2.    Choose the metrics for TunnelDataIn and TunnelDataOut.

3.    For Statistic, choose Sum, and then for Period, choose 5 minutes.

4.    Apply the following calculation to the data points at their peak. In this equation, m1 = TunnelDataIn and m2 = TunnelDataOut. The result is the average throughput in megabits per second (Mbps) over the 5-minute period.

Note: If throughput is more than 1.2 Gbps for a sustained period, then use a transit gateway with equal-cost multi-path (ECMP) routing across multiple BGP-based VPN tunnels.

(((m1+m2)/300)*8)/1000000
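
If you prefer to pull these data points with the AWS CLI instead of the CloudWatch console, the following commands are a sketch. They assume that the Site-to-Site VPN metrics are published in the AWS/VPN namespace with a VpnId dimension, and the VPN connection ID and time range shown are placeholders that you must replace with your own values:

aws cloudwatch get-metric-statistics --namespace AWS/VPN --metric-name TunnelDataIn --dimensions Name=VpnId,Value=vpn-1234567890abcdef0 --statistics Sum --period 300 --start-time 2023-04-07T00:00:00Z --end-time 2023-04-07T23:59:59Z
aws cloudwatch get-metric-statistics --namespace AWS/VPN --metric-name TunnelDataOut --dimensions Name=VpnId,Value=vpn-1234567890abcdef0 --statistics Sum --period 300 --start-time 2023-04-07T00:00:00Z --end-time 2023-04-07T23:59:59Z

Then apply the (((m1+m2)/300)*8)/1000000 calculation to the peak Sum values that the commands return.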

Check your customer gateway router configuration

Check your customer gateway device for the following configurations:

  • Make sure that there are no policing or shaping policies that limit throughput.
  • Reset the Don't Fragment (DF) flag in the IP packets.
  • Make sure that you fragment the IPsec packets before you encrypt them.
  • Confirm that the customer gateway clamps the MSS so that the IP, TCP or UDP, and ESP headers plus the data payload don't exceed 1500 bytes (see the example command after this list). Follow the MTU guidelines for the encryption algorithm that you're using. For more information, see Best practices for your customer gateway device.
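
As an example of MSS clamping, the following iptables rule adjusts the MSS of forwarded TCP connections to match the discovered path MTU on a Linux-based customer gateway. This is only a sketch; the equivalent configuration on a vendor router or firewall differs, so check your device documentation:

sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu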

Check the lowest MTU of the path

Test the path to make sure that the path's lowest MTU is what's expected:

To do this, run ping -s 1460 <DESTINATION> -M do:

[ec2-user@ip-10-7-10-67 ~]$ ping -s 1460 1.1.1.1 -M do
PING 1.1.1.1 (1.1.1.1) 1460(1488) bytes of data.
1468 bytes from 1.1.1.1: icmp_seq=1 ttl=51 time=1.06 ms
1468 bytes from 1.1.1.1: icmp_seq=2 ttl=51 time=1.04 ms
1468 bytes from 1.1.1.1: icmp_seq=3 ttl=51 time=1.10 ms
1468 bytes from 1.1.1.1: icmp_seq=4 ttl=51 time=1.07 ms
1468 bytes from 1.1.1.1: icmp_seq=5 ttl=51 time=1.10 ms
1468 bytes from 1.1.1.1: icmp_seq=6 ttl=51 time=1.06 ms

If a device along the path can't transport the payload, then it returns an ICMP "Fragmentation needed and DF set" (Type 3, Code 4) message:

[ec2-user@ip-10-7-10-67 ~]$ ping -s 1480 1.1.1.1 -M do
PING 1.1.1.1 (1.1.1.1) 1480(1508) bytes of data.
From 10.7.10.1 icmp_seq=1 Frag needed and DF set (mtu = 1500)
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
ping: local error: Message too long, mtu=1500
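
You can also use tracepath to discover the lowest MTU along the path hop by hop instead of probing with individual packet sizes. The destination below is a placeholder:

tracepath -n <DESTINATION>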

Check your application for common errors

Check your on-premises application for the most common issues:

  • Application configuration issues.
  • Use of parallel threads for the data transfer. If Site-to-Site VPN throughput appears slower than expected, then use parallel threads to confirm the achievable throughput independently of the application.
  • Implementation of retries with exponential backoff. If you see timeouts when your application calls AWS services, then use a retry algorithm with exponential backoff (see the example after this list).
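
If your application calls AWS services through the AWS CLI or an AWS SDK, you can turn on the built-in retry behavior, which includes exponential backoff, instead of writing your own. The following is a sketch for the AWS CLI; the S3 bucket name is a placeholder:

export AWS_RETRY_MODE=standard
export AWS_MAX_ATTEMPTS=5
aws s3 cp ./largefile s3://amzn-s3-demo-bucket/largefile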

Related information

Enhanced networking on Linux

AWS VPN FAQs
