Site-to-Site VPN connection between AWS and Azure is unstable


Hi Everyone,

We currently have multiple Site-to-Site VPN connections between Azure and AWS (across multiple accounts).
One tunnel on one VPN connection constantly "flaps" because AWS fails to respond to DPD in time (as reported by Azure support).

The only difference between this VPN connection and the others is that the 'Local IPv4 Network Cidr' and 'Remote IPv4 Network Cidr' have a value set. On all our other VPN tunnels this setting is blank.
I am unable to remove this setting. When I try to remove the Local and Remote Network Cidr, the connection stays in the modifying state and then eventually goes back to an available state.

I am unsure whether that could actually cause an issue, but thought it was worth a mention.

I do not see why increasing the DPD timeout would solve my issue when one IPsec tunnel is stable and the other is not, on the same VPN connection.

Does anyone have any ideas?

asked 2 years ago, 203 views
2 Answers

By default AWS has the DPD timeout at 30 seconds, whereas Azure has it at 45 seconds. Increasing both to 120 seconds has produced a stable tunnel in the end: currently 18+ hours stable, which is better than the previous 2 hours.
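
For anyone who wants to script the AWS side of this change, here is a minimal sketch using the EC2 ModifyVpnTunnelOptions API through boto3. The helper names, the VPN ID "vpn-EXAMPLE", the tunnel IP 203.0.113.10 and the region are placeholders I made up for illustration, not values from my setup. As far as I know AWS rejects timeouts below 30 seconds, and the tunnel is briefly unavailable while the change is applied, so treat this as a starting point rather than a tested procedure.

```python
def build_dpd_update(vpn_connection_id, tunnel_outside_ip, dpd_timeout_seconds=120):
    """Build the keyword arguments for EC2 ModifyVpnTunnelOptions.

    The function name and defaults are hypothetical; only the parameter
    names inside the returned dict come from the EC2 API.
    """
    if dpd_timeout_seconds < 30:
        raise ValueError("AWS expects a DPD timeout of at least 30 seconds")
    return {
        "VpnConnectionId": vpn_connection_id,
        "VpnTunnelOutsideIpAddress": tunnel_outside_ip,
        "TunnelOptions": {"DPDTimeoutSeconds": dpd_timeout_seconds},
    }


def apply_dpd_update(params, region_name):
    """Send the change to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported here so build_dpd_update works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    return ec2.modify_vpn_tunnel_options(**params)


if __name__ == "__main__":
    # Placeholder identifiers; replace with your own before applying.
    params = build_dpd_update("vpn-EXAMPLE", "203.0.113.10", 120)
    print(params)
    # apply_dpd_update(params, region_name="eu-west-1")  # uncomment to apply
```

The Azure connection needs its DPD timeout raised separately; that side is not covered by this sketch.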

It would be interesting if someone has an idea why the initial configuration works on 3 of our other tunnels, while this tunnel was the only one that constantly failed every 2 hours due to AWS not responding to DPD (based on what Azure support says).

answered 2 years ago

Hello Tim,

A DPD timeout is generally the symptom of a problem. The fact that there was no DPD response, combined with the fact that it only happens on certain tunnels, suggests there may be an underlying problem with network connectivity. Since changing the timeout to 120 seconds seems to have fixed it, the blip most likely lasts between 30 and 120 seconds. It's worth noting that network blips may not impact applications that have built-in resiliency mechanisms and can re-establish connectivity and continue exchanging packets seamlessly, which may very well be the case here.
Further, if the DPD timeout is set to 120 seconds on the AWS end, the DPD "R_U_THERE" messages are sent every 10 seconds and the tunnel times out only if 12 consecutive messages go unanswered. This means that if you had an underlying network problem for 110 seconds, the tunnel would still remain online, since the 12th DPD message would be answered and the timer would reset. This could be problematic if you have network-sensitive applications, but may not be a problem if the application is able to recover or re-establish its connection as explained earlier.
My recommendation: if an application using this path is seeing problems, please get in touch with AWS Support via the Support portal from the account that the VPN lives in and mention:
a) The corresponding VPN ID(s) and region
b) Timestamps (with timezone) from when the problem was seen the last couple times and
c) Excerpts of the Azure logs that can be used to compare with that of our own logs

I'm confident we should be able to get to the bottom of this once we look at our logs.
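
To illustrate the DPD timing described earlier in this answer, here is a small Python sketch. It is a simplified model, not the actual AWS implementation: it assumes a probe is sent every 10 seconds, that a probe is lost only while the outage is in progress, and that the tunnel is declared dead once timeout / interval consecutive probes go unanswered. The function name and the `outage_start` offset are made up for the example.

```python
def tunnel_survives(outage_seconds, dpd_timeout_seconds,
                    probe_interval_seconds=10, outage_start=5):
    """Return True if the tunnel stays up through a single outage.

    Simplified model: probes go out at t = 0, 10, 20, ... and a probe is
    unanswered only if it is sent while the outage is in progress.
    """
    allowed_misses = dpd_timeout_seconds // probe_interval_seconds
    outage_end = outage_start + outage_seconds
    missed = 0
    t = 0
    while t <= outage_end + probe_interval_seconds:
        if outage_start <= t < outage_end:
            missed += 1
            if missed >= allowed_misses:
                return False  # enough consecutive misses: peer declared dead
        else:
            missed = 0  # an answered probe resets the counter
        t += probe_interval_seconds
    return True


if __name__ == "__main__":
    print(tunnel_survives(110, 120))  # True: only 11 probes are missed
    print(tunnel_survives(110, 30))   # False: 3 misses are enough at 30 s
    print(tunnel_survives(130, 120))  # False: the outage outlasts 12 probes
```

Under these assumptions a 110-second outage is invisible at the tunnel level with a 120-second timeout but tears the tunnel down with a 30-second timeout, which is why the longer timeout can hide a connectivity problem rather than remove it.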

NOTE: Please refrain from divulging any personal information around your AWS resources including Resource IDs, Public IPs and Security group rules to name a few since all posts are publicly available indefinitely. If you need pointed guidance, please reach out to us at AWS Support via the Support console.

answered 2 years ago
