NLB does not switch between targets when AWS performs VPN tunnel maintenance


I have Java Spring Boot microservices that access a remote service through an NLB endpoint that has 2 static routes as its target group. One static route goes through one site-to-site VPN (with just 1 active tunnel) and the other static route goes through a second site-to-site VPN (also with just 1 active tunnel). We use an NLB to balance between the VPNs because our customer doesn't support BGP. The problem occurs when AWS performs VPN tunnel maintenance: my NLB doesn't switch between targets (static routes), and my microservices lose the connection to the service. On the other hand, when a generic failure occurs on one of the tunnels, the NLB detects the outage and shifts connections to the other target, and my microservices do not lose the connection to the service.

3 Answers

Hi mdanieli20,

Please try the following approach; it may help resolve the issue.

To address your Network Load Balancer (NLB) not switching between targets during AWS VPN tunnel maintenance, you can customize the NLB's health check mechanism. Start by creating a custom health check endpoint on your remote service that accurately indicates the status of the VPN tunnel. This endpoint should report health based on the VPN's operational state, allowing the NLB to detect when the tunnel is under maintenance. Then configure the NLB health checks to use this custom endpoint, making sure that the health check settings (protocol, path, port, etc.) match your endpoint's configuration. Additionally, set up Amazon CloudWatch alarms to monitor VPN tunnel maintenance events and trigger an AWS Lambda function that deregisters the affected targets from the NLB during maintenance periods. Together, these steps can improve the NLB's ability to detect and respond to VPN tunnel issues, giving your microservices better connectivity and reliability.
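As a rough illustration of the Lambda part, here is a minimal Python sketch, not a tested implementation. It assumes the alarm reaches Lambda as an EventBridge "CloudWatch Alarm State Change" event (with `detail.alarmName` and `detail.state.value`). The alarm names, the target IP addresses, and the `TARGET_GROUP_ARN` environment variable are hypothetical placeholders; only port 13010 comes from the target group in the question. `AvailabilityZone` is set to `all` on the assumption that the targets are IP addresses outside the VPC.

```python
"""Sketch: deregister or re-register an NLB IP target when a tunnel alarm changes state."""

# Hypothetical mapping from CloudWatch alarm name to the NLB target behind that tunnel.
ALARM_TO_TARGET = {
    "vpn-a-tunnel-down": {"Id": "10.10.0.10", "Port": 13010, "AvailabilityZone": "all"},
    "vpn-b-tunnel-down": {"Id": "10.20.0.10", "Port": 13010, "AvailabilityZone": "all"},
}


def apply_alarm_state(elbv2, target_group_arn, alarm_name, state):
    """Deregister the target on ALARM, register it again on OK.

    `elbv2` is a boto3 "elbv2" client (or any object with the same two methods).
    Returns the action taken: "deregister", "register" or "ignore".
    """
    target = ALARM_TO_TARGET.get(alarm_name)
    if target is None:
        return "ignore"  # alarm is not one of ours
    if state == "ALARM":
        elbv2.deregister_targets(TargetGroupArn=target_group_arn, Targets=[target])
        return "deregister"
    if state == "OK":
        elbv2.register_targets(TargetGroupArn=target_group_arn, Targets=[target])
        return "register"
    return "ignore"  # e.g. INSUFFICIENT_DATA


def lambda_handler(event, context):
    # Imported here so the helper above can be exercised without AWS access.
    import os
    import boto3

    detail = event["detail"]  # assumed EventBridge alarm state change event
    return apply_alarm_state(
        boto3.client("elbv2"),
        os.environ["TARGET_GROUP_ARN"],  # hypothetical environment variable
        detail["alarmName"],
        detail["state"]["value"],
    )
```

Keeping the AWS client as a parameter of `apply_alarm_state` makes the decision logic easy to try locally with a stub before wiring it to a real alarm.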

For more information, please see the AWS documentation:

https://aws.amazon.com/about-aws/whats-new/2018/09/network-load-balancer-now-supports-aws-vpn/

https://aws.amazon.com/blogs/networking-and-content-delivery/application-load-balancer-type-target-group-for-network-load-balancer/

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html

EXPERT
answered a month ago

Hi! Could you explain a bit more about the target type you're using on the NLB? I assume you register some active IP address, correct? Which device, system, or endpoint owns the IP address that you register as a target?

Also, have you tracked what happens during AWS maintenance? I would assume that the IP stays active and reachable and the NLB keeps sending traffic to it, but I'd like an answer to the question above first so we can start working on a solution.

AWSAmir
answered a month ago
  • Hi, the target type is IP address. These IP addresses are services exposed from a provider's network that can be accessed via a site-to-site VPN.

    {
        "TargetGroups": [
            {
                "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:xxxxxxxxxxx:targetgroup/xxxxxxxxx-target-group/xxxxxxxxx",
                "TargetGroupName": "xxxxxxxxxx-target-group",
                "Protocol": "TCP",
                "Port": 13010,
                "VpcId": "vpc-xxxxxxxxxxxxx",
                "HealthCheckProtocol": "TCP",
                "HealthCheckPort": "13010",
                "HealthCheckEnabled": true,
                "HealthCheckIntervalSeconds": 5,
                "HealthCheckTimeoutSeconds": 2,
                "HealthyThresholdCount": 2,
                "UnhealthyThresholdCount": 2,
                "LoadBalancerArns": [
                    "arn:aws:elasticloadbalancing:us-east-1:xxxxxxxxx:loadbalancer/net/xxxxxxxxxxx/xxxxxxxxxxx"
                ],
                "TargetType": "ip",
                "IpAddressType": "ipv4"
            }
        ]
    }

    No, during AWS maintenance that IP is unreachable because the VPN tunnel is down. (Remember, the provider's network doesn't support BGP.)

  • These are the attributes configured in the target group:

    ATTRIBUTES proxy_protocol_v2.enabled false
    ATTRIBUTES target_group_health.unhealthy_state_routing.minimum_healthy_targets.count 1
    ATTRIBUTES preserve_client_ip.enabled false
    ATTRIBUTES stickiness.enabled false
    ATTRIBUTES target_group_health.unhealthy_state_routing.minimum_healthy_targets.percentage off
    ATTRIBUTES deregistration_delay.timeout_seconds 300
    ATTRIBUTES target_group_health.dns_failover.minimum_healthy_targets.count 1
    ATTRIBUTES stickiness.type source_ip
    ATTRIBUTES target_health_state.unhealthy.connection_termination.enabled true
    ATTRIBUTES deregistration_delay.connection_termination.enabled true
    ATTRIBUTES target_health_state.unhealthy.draining_interval_seconds 0
    ATTRIBUTES load_balancing.cross_zone.enabled false
    ATTRIBUTES target_group_health.dns_failover.minimum_healthy_targets.percentage off

  • Hi! From this config I can see/deduce the following:

    • There will be some delay (5 seconds + 2 seconds + 2 seconds, or something like that) until your target on the "primary" tunnel is reported as unhealthy. In principle, this will cause an interruption. Is this what you're seeing? You may want to try tuning the timeouts.
    • Another hypothesis is that during maintenance the other path does not get traffic forwarded quickly enough. Say the primary VPN tunnel is down and the customer side of the tunnel needs to re-route traffic through the secondary tunnel, which is now active. If that does not happen, or does not happen fast enough, you may have unidirectional traffic: AWS VPN already sends over the secondary tunnel, but return traffic does not yet enter the tunnel on the "other" side. You can check this with either a traffic capture or the IPsec counter stats on the customer side of the VPN tunnel. Is the VPN on the "far" end terminated on the same device? Do you see from the logs that the switchover happens quickly (you should see IPsec going down and then the phase 2 IPsec counters starting to increase for in/out traffic)? If they are different devices, how does the routing switchover happen?
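To put rough numbers on the first point, here is a small Python sketch of the detection delay. It is a simplified model and an assumption on my part, not documented NLB behavior: a single prober, checks starting every `interval` seconds, and each failed check only counted once its timeout expires (as when packets are silently dropped). The function name is hypothetical; the values are the ones from the target group posted above.

```python
def detection_window_seconds(interval, timeout, unhealthy_threshold):
    """Return (best_case, worst_case) seconds until a silent target is marked unhealthy.

    Best case: the outage starts right before a health check fires.
    Worst case: the outage starts right after a successful health check.
    """
    best = (unhealthy_threshold - 1) * interval + timeout
    worst = unhealthy_threshold * interval + timeout
    return best, worst


# HealthCheckIntervalSeconds=5, HealthCheckTimeoutSeconds=2, UnhealthyThresholdCount=2
print(detection_window_seconds(interval=5, timeout=2, unhealthy_threshold=2))  # (7, 12)
```

Under these assumptions, new connections can keep landing on the unreachable target for several seconds after the tunnel drops, even with these health check settings.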

Analysis

NLB Health Checks:

NLB Health Check Configuration: NLB relies on health checks to determine the status of its targets. If the health checks do not detect that a tunnel is down during maintenance (because the tunnel might be up but not forwarding traffic), the NLB will continue to route traffic to that tunnel.

Tunnel Maintenance: During AWS VPN tunnel maintenance, the tunnel might remain in a state where it technically isn't down (so the health check sees it as "healthy"), but it's not forwarding traffic properly, causing the connection issues you're seeing.

Health Check Sensitivity:

Health Check Port and Protocol: Ensure that the health check is configured on a port and protocol that accurately reflect the availability of the path through the VPN tunnel. Depending on which better reflects tunnel health, you might switch from HTTP/HTTPS to TCP health checks, or vice versa.

Health Check Interval and Unhealthy Threshold: Adjusting these settings might help the NLB detect issues faster, though this also depends on the maintenance behavior. A more aggressive health check could lead to quicker failover.
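When tuning these values, it can help to check them against the allowed ranges before calling the API. The sketch below builds the keyword arguments for boto3's `modify_target_group` call. The limits in `LIMITS` are the ranges as I understand them from the Elastic Load Balancing API reference, so please verify them before relying on this; the function name and the ARN in the usage comment are hypothetical.

```python
# Assumed valid ranges for NLB target group health check settings.
LIMITS = {
    "HealthCheckIntervalSeconds": (5, 300),
    "HealthCheckTimeoutSeconds": (2, 120),
    "HealthyThresholdCount": (2, 10),
    "UnhealthyThresholdCount": (2, 10),
}


def health_check_kwargs(target_group_arn, **settings):
    """Validate health check settings and return kwargs for modify_target_group."""
    for name, value in settings.items():
        if name not in LIMITS:
            raise KeyError(f"unsupported setting: {name}")
        low, high = LIMITS[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside {low}..{high}")
    return {"TargetGroupArn": target_group_arn, **settings}


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   kwargs = health_check_kwargs("arn:...", HealthCheckIntervalSeconds=5,
#                                UnhealthyThresholdCount=2)
#   boto3.client("elbv2").modify_target_group(**kwargs)
```

If these ranges are right, the settings posted in the question (interval 5, timeout 2, thresholds 2) are already at the lower bounds, so there may be little room left to make the health check more aggressive.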

Static Routes and NLB Behavior:

Static Route Behavior: Since you're using static routes without BGP, the NLB doesn't have dynamic feedback on the path availability. When AWS performs maintenance, the NLB might not have the immediate feedback needed to fail over, unlike in a generic failure scenario where the tunnel would go completely down.

Potential Solutions

Enhanced Health Check Configuration:

Use a Custom Health Check Endpoint: Consider setting up a custom health check endpoint on the remote service that actively tests the ability to reach the service through the tunnel, rather than just checking tunnel availability.

Increase Health Check Frequency: Increase the frequency of health checks and lower the unhealthy threshold to detect issues more quickly.

Fallback Mechanism in Microservices:

Application-Level Failover: Implement application-level logic in your microservices to detect and handle cases where the NLB is not switching as expected. This could include retries with backoff or even manual re-routing logic.
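The retry-with-backoff idea can be sketched as follows. This is a minimal Python illustration of the pattern (the services in the question are Java, where the same approach applies), and the function name and default values are hypothetical. Every retry opens a new TCP connection, which gives the load balancer a chance to pick a healthy target once the failed one has been marked unhealthy.

```python
import socket
import time


def connect_with_backoff(host, port, attempts=4, base_delay=0.5, timeout=2.0,
                         connect=socket.create_connection, sleep=time.sleep):
    """Open a TCP connection, retrying with exponential backoff on failure.

    `connect` and `sleep` are injectable so the logic can be tested offline.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return connect((host, port), timeout=timeout)
        except OSError as exc:  # refused, unreachable, timed out, ...
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    raise ConnectionError(
        f"no connection to {host}:{port} after {attempts} attempts"
    ) from last_error
```

With the defaults above the waits between attempts add up to 3.5 seconds, so the attempt count or base delay would likely need raising to outlast the time the health checks take to mark a target unhealthy.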

Alternative VPN Setup:

Redundant VPN Setup: Consider setting up redundant VPN tunnels and configuring your NLB to target both, rather than relying on static routes. This can be more robust but requires your customer to support such a configuration.

Multi-AZ Setup: Ensure your VPN configuration is spread across multiple Availability Zones to mitigate the impact of maintenance on any single zone.

Monitor and Automate Failover:

Monitoring and Alerts: Set up monitoring on the VPN tunnels and the NLB target groups, and configure automated scripts to intervene if a tunnel is in maintenance but not properly failing over.

Custom Routing Logic: If feasible, implement a custom routing mechanism or an additional layer of load balancing that can detect and handle these cases more effectively.
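For the monitoring part, one possible starting point is a CloudWatch alarm on the `TunnelState` metric in the `AWS/VPN` namespace, which reports 0 when a static-route tunnel is down and 1 when it is up. The Python sketch below only builds the arguments for boto3's `put_metric_alarm`; the function name, alarm name, tunnel IP address, and SNS topic ARN are hypothetical placeholders, and the period and statistic are assumptions to adjust to your needs.

```python
def tunnel_down_alarm(alarm_name, tunnel_ip, action_arns):
    """Return put_metric_alarm kwargs that fire when a VPN tunnel reports down."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/VPN",
        "MetricName": "TunnelState",
        "Dimensions": [{"Name": "TunnelIpAddress", "Value": tunnel_ip}],
        "Statistic": "Minimum",          # any "down" sample in the period counts
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "breaching",  # assumption: no data is treated as down
        "AlarmActions": list(action_arns),
    }


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **tunnel_down_alarm("vpn-a-tunnel-down", "203.0.113.10",
#                           ["arn:aws:sns:us-east-1:111122223333:vpn-alerts"]))
```

Using `Minimum` makes the alarm sensitive to short drops within the period; `Maximum` would instead require the tunnel to be down for the whole period, which is less noisy but slower to react.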

Next Steps

Review and adjust NLB health checks to be more aggressive and sensitive to the actual availability of the service.

Test the application behavior by simulating tunnel maintenance (or during a planned maintenance window) and confirm that failover works as expected.

Explore application-level failover mechanisms if adjusting the NLB configuration does not resolve the issue.

EXPERT
answered a month ago
