Hi everyone,
I’m currently in the middle of refining a hybrid cloud architecture and I’ve run into a bit of a performance wall that I'm hoping the community can help me troubleshoot.
On the local side, I’m running a fairly intensive data-processing environment. We recently moved to a dual-socket setup using 52-core Xeon processors—specifically the 2.10GHz models with the 16GT/s UPI links. These chips are absolute monsters for our local multithreaded workloads, and the UPI speed has been a game-changer for keeping data moving between the sockets without the usual latency we saw on older builds. However, the "real world" problem starts as soon as that data needs to leave the local rack and hit our AWS VPC.
I was recently reading a thread where someone was asking if a standard consumer router could handle an AWS Site-to-Site VPN. It mentioned the "Tunnel interface" requirement, which really got me thinking about my own gateway setup. While I’m obviously not using a home router for this, I am seeing some very similar symptoms to what you’d expect from underpowered hardware. Even with a dedicated enterprise gateway, the IPsec overhead is becoming a massive bottleneck.
When my 52-core rig starts pushing processed chunks to the cloud, the encryption engine on the gateway seems to completely saturate. It’s frustrating because the internal bus speeds (thanks to that 16GT/s UPI) are so fast that the data is essentially waiting in line to be encrypted and sent through the tunnel interface. I’ve tried tweaking the IKE policies and moving to AES-GCM to reduce the overhead, but the throughput still doesn't feel like it’s reflecting the power of the compute sitting behind it.
I’m starting to wonder if I should ditch the hardware appliance approach and move toward a software-based Customer Gateway (CGW) running on a dedicated instance that can actually leverage higher core counts and modern instruction sets to handle the encryption.
My personal insight from years of building these rigs is that we often over-spec the compute and under-spec the "pipe" encryption capability. It feels like I've built a Ferrari but I’m trying to drive it through a narrow tunnel.
Has anyone else dealt with these kinds of throughput limitations when bridging high-performance on-prem Xeon clusters with AWS? Specifically, do you find that a software-based appliance scales better with high-core counts for the tunnel interface, or is there a specific hardware offload trick I might be missing to keep up with a 16GT/s interconnect?
Your diagnosis is spot-on: you've over-provisioned compute and under-provisioned crypto throughput at the edge. A few directions to explore:

Software-based CGW on dedicated cores
Yes, this tends to scale better than a fixed-ASIC appliance when you have cores to spare. Options:
- strongSwan or Libreswan on a dedicated Linux box with AES-NI: modern Xeons (including your 52-core parts) have hardware AES-NI and AVX-512 that software IPsec can exploit across multiple cores. A single core with AES-NI can typically push 5–10 Gbps of AES-GCM, depending on packet size and CPU generation. A single SA is usually processed on one core at a time, so scaling across cores means spreading traffic over multiple SAs or tunnels; with that in place, throughput scales roughly linearly.
- VPP (fd.io) with the IPsec plugin: a user-space data plane that can saturate 40–100 Gbps links using DPDK + AES-NI across multiple cores. Much higher throughput than kernel-based IPsec.
- Commercial software appliances (e.g., Cisco CSR1000v, Palo Alto VM-Series) that can be allocated more vCPUs.

Multi-tunnel / ECMP approach
AWS Site-to-Site VPN caps at ~1.25 Gbps per tunnel, and a faster CGW does not lift that AWS-side limit. If you need more:
- Use multiple VPN connections with ECMP routing (AWS Transit Gateway supports this, up to 50 Gbps aggregate with enough tunnels).
- Alternatively, consider AWS Direct Connect with MACsec encryption if you need consistent high throughput without the IPsec overhead.

Hardware offload tricks you might be missing
- Confirm your gateway appliance is actually using AES-NI / QAT (QuickAssist) offload. Some enterprise boxes have crypto accelerator cards (Intel QAT) that need to be explicitly enabled.
- Check if your appliance supports multi-SA parallelism. Some devices serialize all traffic through a single crypto engine regardless of tunnel count.
- NIC-level IPsec offload (inline IPsec on SmartNICs like Intel E810 or NVIDIA ConnectX-6 Dx) can move encryption entirely off the CPU. Confirm inline IPsec support for your exact NIC model, firmware, and driver before planning around it.
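
To put rough numbers on the multi-tunnel / ECMP option, here is a minimal sizing sketch in Python. It only uses the figures quoted above (~1.25 Gbps per tunnel, 50 Gbps aggregate on Transit Gateway). The function name `plan_tunnels` and the `efficiency` derating factor are my own illustration, not anything from AWS, and the two-tunnels-per-connection constant assumes both tunnels of each VPN connection are active under ECMP.

```python
import math

PER_TUNNEL_GBPS = 1.25       # per-tunnel cap quoted above
ECMP_AGGREGATE_GBPS = 50.0   # Transit Gateway ECMP ceiling quoted above
TUNNELS_PER_CONNECTION = 2   # assumption: both tunnels of a connection carry traffic


def plan_tunnels(target_gbps, efficiency=1.0):
    """Estimate (tunnels, vpn_connections) needed for a target throughput.

    `efficiency` is a hypothetical derating factor (0 < efficiency <= 1)
    for uneven ECMP flow hashing; 1.0 means every tunnel is perfectly filled.
    """
    if target_gbps <= 0:
        raise ValueError("target_gbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    if target_gbps > ECMP_AGGREGATE_GBPS:
        raise ValueError("target exceeds the quoted ECMP aggregate ceiling")

    usable_per_tunnel = PER_TUNNEL_GBPS * efficiency
    tunnels = math.ceil(target_gbps / usable_per_tunnel)
    connections = math.ceil(tunnels / TUNNELS_PER_CONNECTION)
    return tunnels, connections


if __name__ == "__main__":
    print(plan_tunnels(10))  # (8, 4)
```

Treat the result as a floor rather than a guarantee. ECMP balances per flow, so a single large flow is still limited to one tunnel's bandwidth, and the export job on the Xeon box would need to open enough parallel streams to actually fill the tunnels.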