DX to VPN Failover takes >2mins

Hi all,

We have a Hosted connection, deployed as a Transit VIF, using a Direct Connect Gateway connected to one Transit Gateway. We have BFD on. We also have a VPN backup connection. We are advertising the same prefixes over both, using local-pref and AS_PATH to prefer the DX link. During a failover test, BFD detects the failure successfully, and on the on-premises side the BGP session goes down instantly and the VPN routes take over successfully.

In the Transit GW route table, however, we observed that the routes received from the Direct Connect Gateway remain there for about 2.5 mins before they are replaced by the routes propagated by the VPN.
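
For anyone who wants to reproduce the measurement, here is a rough sketch of how the switchover could be timed by polling the TGW route table. It assumes boto3's `search_transit_gateway_routes` call with the `route-search.exact-match` filter; the route table ID and prefix are placeholders, and `measure_switchover` and `boto3_fetcher` are names made up for illustration. Start it at the moment the DX link is failed.

```python
import time
from typing import Callable, Optional


def attachment_type(route: dict) -> Optional[str]:
    """Return the resource type of the first attachment on a TGW route, if any."""
    attachments = route.get("TransitGatewayAttachments") or []
    return attachments[0].get("ResourceType") if attachments else None


def measure_switchover(fetch_route: Callable[[], Optional[dict]],
                       target_type: str = "vpn",
                       interval: float = 1.0,
                       timeout: float = 300.0,
                       clock=time.monotonic,
                       sleep=time.sleep) -> Optional[float]:
    """Poll until the route points at `target_type`; return elapsed seconds.

    Returns None if the route has not switched before `timeout` seconds.
    """
    start = clock()
    while clock() - start < timeout:
        route = fetch_route()
        if route is not None and attachment_type(route) == target_type:
            return clock() - start
        sleep(interval)
    return None


def boto3_fetcher(route_table_id: str, cidr: str) -> Callable[[], Optional[dict]]:
    """Build a fetcher backed by the EC2 API (requires boto3 and credentials)."""
    import boto3  # imported lazily so the timing logic can be tested offline

    ec2 = boto3.client("ec2")

    def fetch() -> Optional[dict]:
        resp = ec2.search_transit_gateway_routes(
            TransitGatewayRouteTableId=route_table_id,
            Filters=[{"Name": "route-search.exact-match", "Values": [cidr]}],
        )
        routes = resp.get("Routes", [])
        return routes[0] if routes else None

    return fetch


# Example wiring (placeholder values, not from our environment):
# fetch = boto3_fetcher("tgw-rtb-EXAMPLE", "10.0.0.0/16")
# print(measure_switchover(fetch))
```

The resolution is limited by the polling interval and API latency, so treat the result as approximate.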

Is this expected behaviour? Are there any settings we can tweak to reduce the failover time? It seems pointless to have BFD pulling the BGP down as fast as possible, only for the routes to remain active in the TGW for a long period of time.

PS. Upon further research, it seems that BGP attributes do not matter when the same prefixes are advertised. I don't think this has much of an effect on the problem though, which seems to be the time it takes for the BGP routes to be dropped from the TGW.
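
To illustrate what I mean in the PS, here is a small sketch of how I understand the selection to work for identical prefixes: the attachment type decides, not the BGP attributes. The preference order below is my reading of the Transit Gateway route evaluation documentation and should be treated as an assumption; `PREFERENCE` and `pick_route` are illustrative names only.

```python
# Assumed preference among propagated routes for the same prefix (lower wins).
# This ordering is an assumption based on my reading of the TGW docs.
PREFERENCE = {
    "vpc": 0,
    "direct-connect-gateway": 1,
    "connect": 2,
    "vpn": 3,
    "peering": 4,
}


def pick_route(candidates: list) -> dict:
    """Pick the route whose attachment type ranks highest.

    BGP attributes such as AS_PATH length are deliberately ignored here,
    which is the behaviour described in the PS.
    """
    return min(candidates, key=lambda r: PREFERENCE[r["resource_type"]])


routes = [
    {"resource_type": "vpn", "as_path_len": 1},
    {"resource_type": "direct-connect-gateway", "as_path_len": 5},
]
print(pick_route(routes)["resource_type"])
```

In this model the DX route wins even with the longer AS_PATH, so the VPN route can only take over once the DX route has actually been removed from the table.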

Thanks in advance!

2 Answers
Accepted Answer

I came across this link, which specifically says that the Graceful Restart capability interferes with BFD fast failover. It so happens that this is enabled in our config. Will give it a whirl and update.

UPDATE: Disabling Graceful Restart indeed reduced the failover time from 2.5 mins to 20s. The Amazon peer was probably waiting for the GR timer to expire after BFD went down.
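
A quick back-of-the-envelope check of those numbers, as a sketch. The two failover times are the ones we observed; the BFD interval and multiplier (300 ms x 3) are assumed values for illustration and may not match your configuration.

```python
def bfd_detection_seconds(interval_ms: int, multiplier: int) -> float:
    """BFD declares the neighbour down after `multiplier` missed intervals."""
    return interval_ms * multiplier / 1000.0


def implied_stale_window(with_gr_s: float, without_gr_s: float) -> float:
    """Time attributable to routes being retained while GR was enabled."""
    return with_gr_s - without_gr_s


observed_with_gr_s = 150.0    # ~2.5 mins, observed with Graceful Restart on
observed_without_gr_s = 20.0  # observed after disabling Graceful Restart

detect_s = bfd_detection_seconds(300, 3)  # assumed BFD settings
stale_s = implied_stale_window(observed_with_gr_s, observed_without_gr_s)

print(f"BFD detection (assumed settings): {detect_s:.1f}s")
print(f"Extra delay attributable to GR:   {stale_s:.0f}s")
print(f"Remaining delay without GR:       {observed_without_gr_s - detect_s:.1f}s")
```

So roughly 130s of the original delay disappears with GR off, and most of the remaining 20s is not BFD detection but whatever happens after the session drops, presumably route withdrawal and propagation.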

It would be great if someone could confirm whether failover between two DX connections would indeed be sub-second or 1-2 seconds, since no propagation would be needed in that case.

answered 2 months ago
EXPERT
reviewed 2 months ago

Yes, this behavior is expected. The ~2.5 minute delay you're observing is related to how Transit Gateway handles BGP route withdrawal from Direct Connect connections.

When a Direct Connect BGP session goes down, even with BFD enabled, Transit Gateway doesn't immediately remove the routes from its route table. Instead, it waits for the BGP hold timer to expire before withdrawing those routes. The default BGP hold timer is typically 180 seconds (3 minutes), which aligns closely with the 2.5 minutes you're experiencing.

While BFD does detect the link failure quickly and tears down the BGP session on the customer side, the Transit Gateway side still respects the BGP hold timer before considering those routes invalid and removing them from the route table. This is why your VPN routes don't take over immediately despite the fast BFD detection.

Unfortunately, there aren't configurable settings on the Transit Gateway side to reduce this hold timer. The BGP timers are managed by AWS and cannot be adjusted by customers for Transit Gateway attachments.

To work around this limitation and achieve faster failover, you might consider:

  1. Ensuring your VPN connection is using dynamic routing with BGP, which enables automatic failover capabilities
  2. Regularly testing your failover scenarios to understand the actual recovery time in your environment
  3. Designing your applications to be resilient to brief connectivity interruptions

The 2-3 minute failover window is a known characteristic of DX-to-VPN failover scenarios with Transit Gateway, and while BFD helps on the customer side, it doesn't eliminate the route withdrawal delay on the AWS side.
Sources
HNREL04-BP03 Use dynamic routing for automatic failover - Hybrid Networking Lens - AWS Well-Architected Framework

answered 2 months ago
EXPERT
reviewed 2 months ago
