I came across this link, which specifically says that the graceful restart capability interferes with BFD fast failover. It so happens that this is enabled in our config. I will give it a whirl and update.
UPDATE: Disabling Graceful Restart did reduce the failover time from 2.5 minutes to 20 seconds. The Amazon peer was probably waiting for the GR timer to expire after BFD went down.
It would be great if someone could confirm whether failover between two DX connections would indeed be sub-second, or 1-2 seconds, since no propagation would be needed in that case.
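A rough way to reason about the two numbers is a small timing model. This is only a sketch: the class and field names are made up for illustration, and the default values (300 ms BFD interval, multiplier of 3, 120 s graceful restart timer) are assumptions to replace with whatever your router and the AWS side actually negotiate. It does not model route propagation, which may account for the remaining 20 seconds.

```python
from dataclasses import dataclass


@dataclass
class FailoverModel:
    """Hypothetical model of when a BGP peer withdraws routes after a link failure."""

    bfd_interval_ms: int = 300        # assumed BFD transmit interval
    bfd_multiplier: int = 3           # assumed BFD detect multiplier
    gr_restart_time_s: float = 120.0  # assumed graceful restart timer
    graceful_restart: bool = True

    def bfd_detection_s(self) -> float:
        # BFD declares the session down after `multiplier` missed intervals.
        return self.bfd_interval_ms * self.bfd_multiplier / 1000.0

    def route_withdrawal_s(self) -> float:
        # With graceful restart, the peer keeps the stale routes until the
        # restart timer expires; without it, routes go as soon as BFD fires.
        detect = self.bfd_detection_s()
        if self.graceful_restart:
            return detect + self.gr_restart_time_s
        return detect


if __name__ == "__main__":
    for gr in (True, False):
        model = FailoverModel(graceful_restart=gr)
        print(f"graceful_restart={gr}: ~{model.route_withdrawal_s():.1f}s to withdrawal")
```

Under these assumed values, the graceful restart case lands in the two-minute range, which is at least consistent with what was observed before the change.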
Yes, this behavior is expected. The ~2.5 minute delay you're observing is related to how Transit Gateway handles BGP route withdrawal from Direct Connect connections.
When a Direct Connect BGP session goes down, even with BFD enabled, Transit Gateway doesn't immediately remove the routes from its route table. Instead, it waits for the BGP hold timer to expire before withdrawing those routes. The default BGP hold timer is typically 180 seconds (3 minutes), which aligns closely with the 2.5 minutes you're experiencing.
While BFD does detect the link failure quickly and tears down the BGP session on the customer side, the Transit Gateway side still respects the BGP hold timer before considering those routes invalid and removing them from the route table. This is why your VPN routes don't take over immediately despite the fast BFD detection.
Unfortunately, there aren't configurable settings on the Transit Gateway side to reduce this hold timer. The BGP timers are managed by AWS and cannot be adjusted by customers for Transit Gateway attachments.
To work around this limitation and achieve faster failover, you might consider:
- Ensuring your VPN connection is using dynamic routing with BGP, which enables automatic failover capabilities
- Regularly testing your failover scenarios to understand the actual recovery time in your environment
- Designing your applications to be resilient to brief connectivity interruptions
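For the failover testing suggested above, one simple approach is to run timestamped probes (for example, one ping per second) across the link during a test and then compute the longest outage from the results. The sketch below assumes you have already collected the probe log as `(timestamp_seconds, succeeded)` pairs; the function name `longest_outage` and the input format are hypothetical, not part of any AWS tooling.

```python
from typing import Iterable, Optional, Tuple


def longest_outage(samples: Iterable[Tuple[float, bool]]) -> float:
    """Return the longest outage in seconds found in a probe log.

    An outage runs from the first failed probe to the next successful one.
    An outage still in progress at the end of the log is not counted.
    """
    longest = 0.0
    outage_start: Optional[float] = None
    for timestamp, succeeded in samples:
        if not succeeded and outage_start is None:
            outage_start = timestamp              # outage begins
        elif succeeded and outage_start is not None:
            longest = max(longest, timestamp - outage_start)
            outage_start = None                   # outage ends
    return longest


if __name__ == "__main__":
    # Illustrative data only: probes fail from t=1 until recovery at t=21.
    log = [(0.0, True), (1.0, False), (2.0, False), (21.0, True), (22.0, True)]
    print(f"longest outage: {longest_outage(log):.1f}s")
```

The resolution of the result is limited by the probe interval, so a one-second probe cannot confirm sub-second failover.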
The 2-3 minute failover window is a known characteristic of DX-to-VPN failover scenarios with Transit Gateway, and while BFD helps on the customer side, it doesn't eliminate the route withdrawal delay on the AWS side.
Sources
HNREL04-BP03 Use dynamic routing for automatic failover - Hybrid Networking Lens - AWS Well-Architected Framework
