
Using Amazon Route 53 Traffic Flow Policy Records for Global Load Balancing

10 minute read
Content level: Advanced

In this article we show how to use Amazon Route 53 Traffic Flow policies for global load balancing and how to respond to events that require shifting load between regions.

Customers who operate in multiple AWS regions often look for ways to emulate Global Server Load Balancer (GSLB) behavior with AWS services. There are currently two main ways to do this: AWS Global Accelerator, which operates inline with customer traffic, and Amazon Route 53, which operates on requests to the Domain Name System (DNS). In this article we will discuss one way to do this using Route 53.

Route 53 has a feature called Traffic Flow, which allows you to create policy records using a visual interface or a structured JSON document. These records let you express, as structured intent, how your DNS requests should flow. So let's take a scenario where you are operating in six regions: us-east-1, us-west-2, eu-west-1, eu-central-1, ap-southeast-2, and ap-northeast-1. In each of these regions an AWS Application Load Balancer (ALB) fronts your application. You want to serve customers from the closest region based on latency data from the customer's public IP subnet to each AWS region. When a particular region becomes overwhelmed with requests, you want to redirect new requests to the two next-closest regions. You have no data residency concerns and care most about availability, followed by performance.

To accomplish this we will set up a Traffic Flow policy record for an A record (serving IPv4 requests) and then copy it to a second record for an AAAA record (serving IPv6 requests). In the policy, as shown in Figure-1, we create a rule called "FrontEnd-Record" which uses latency-based routing to forward to the six regions we are running in. From there we create a weighted rule in each region named "<region acronym>-Record"; each one targets the load balancer in its own region with a weight of 255 and the load balancers in the two next-closest regions with a weight of 1 each.

Figure-1: A traffic flow policy showing latency-based routing to six regions, with weighted record sets pointing nearly all traffic to the primary cell and a small fraction to two secondary cells in different regions. Health checks are not visible in this screenshot.

By setting a weight of 255 for the region's own load balancer and a weight of 1 for each of the fail-back regions' load balancers, 99.2% of requests will be served by the closest region and 0.8% will go to one of the other two regions. If the health check on the primary regional load balancer fails (we will discuss how we are health checking later), resolvers issuing new DNS requests will be given the IPs of the other two regions' load balancers. It is good to serve a small share of requests from the other two regions' load balancers under normal conditions so you can understand the behavior and resolve any challenges before an emergent failover condition. Note that you can instead set the weight of the failover regions to 0; if the health check for the primary region's load balancer then fails, Route 53 will answer requests with an approximately 50/50 distribution between the two zero-weight records in the record set.
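As a quick check of that arithmetic, the share of answers each endpoint receives is its weight divided by the sum of the weights of the healthy records in the set. A minimal Python calculation:

# Share of DNS answers implied by the weights used above:
# weight / sum(weights) across the healthy records in the set.
weights = {"primary": 255, "backup-1": 1, "backup-2": 1}
total = sum(weights.values())
for name, weight in weights.items():
    print(f"{name}: {weight / total:.1%}")
# primary: 99.2%, backup-1: 0.4%, backup-2: 0.4%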

There are two ways to initiate the migration of traffic from one region's load balancer to the others. One is Route 53 health checks, which are a highly available part of the Route 53 data plane; the other is calling the control plane to update the weights in the Traffic Flow policy records. Let's discuss the pros and cons of each of these methods.

Weighting away from a load balancer with a health check allows for both automatic and manual weight-away in a highly available fashion that is impervious to availability drops in the Route 53 control plane, which could occur on rare occasions. However, in the configuration we are using it is an all-or-nothing venture: a region either serves requests or serves no new requests at all; you cannot partially weight away. By using the control plane you can partially weight away by changing the weights to something like 100, 50, and 50, which would serve 50% of requests from the primary region but shift 50% of new requests to the two other closest regions (25% each). To do this you must use the Route 53 control plane, which means you must write your own automation to update the traffic flow policies, and you must account for possible availability drops in the control plane by falling back to health checks to weight away in emergent conditions. You can read this blog to better understand managing failover with Route 53.
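A sketch of what that weight-shifting automation could look like, assuming boto3 and the rule names from the example policy document at the end of this article; the policy ID, version, and instance ID are placeholders you would look up in your own account:

import json
import boto3

route53 = boto3.client("route53")

def shift_weights(policy_id, policy_version, instance_id, rule_name, new_weights):
    """Create a new traffic policy version with adjusted weights for one
    weighted rule, then repoint the existing policy record at it."""
    current = route53.get_traffic_policy(Id=policy_id, Version=policy_version)
    document = json.loads(current["TrafficPolicy"]["Document"])

    # Apply the new weights to the items of the named rule, e.g. "USE1-Record".
    for item, weight in zip(document["Rules"][rule_name]["Items"], new_weights):
        item["Weight"] = str(weight)

    new_version = route53.create_traffic_policy_version(
        Id=policy_id,
        Document=json.dumps(document),
        Comment=f"Shift {rule_name} weights to {new_weights}",
    )["TrafficPolicy"]["Version"]

    # Point the policy record (instance) in the hosted zone at the new version.
    instance = route53.get_traffic_policy_instance(Id=instance_id)["TrafficPolicyInstance"]
    route53.update_traffic_policy_instance(
        Id=instance_id,
        TTL=instance["TTL"],
        TrafficPolicyId=policy_id,
        TrafficPolicyVersion=new_version,
    )

# Example: serve 50% from us-east-1 and 25% from each backup region.
shift_weights("11111111-2222-3333-4444-555555555555", 1,
              "66666666-7777-8888-9999-000000000000",
              "USE1-Record", [100, 50, 50])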

The suggestion here is to use both of these methods to control routing to your application. Under normal circumstances you can set a CloudWatch alarm to trigger automation that changes the weights when a certain request threshold is met in a region. You must also be aware of the number of requests your two backup regions are already serving, and build some intelligence into how much load to shift from the primary region's load balancer to the backup load balancers. Then, if there is a complete availability drop on your primary load balancer, or an even higher request count is met, you can mark your load balancer as unhealthy and stop serving new traffic from it altogether.

Figure-2: CloudWatch alarm for more than 10k requests to the load balancer

To get started, create CloudWatch alarms in each region for the metrics you consider important to a load balancer's health. You can see in Figure-2 that I have set up an alarm for more than 10,000 requests to my load balancer using the default RequestCount metric emitted by ALB. You can then create a Route 53 health check and attach it to that alarm. In addition you can set up an HTTP health check against a path on your ALB. Both of these detect reasons to weight away from the ALB with hard data, but there can also be soft or gray failures that neither CloudWatch metrics nor HTTP health checks detect. One way to handle these failures is Route 53 Application Recovery Controller (ARC). You can create a cluster (be aware of the cost per cluster) and a control panel within that cluster, then a routing control for each region and a health check associated with each routing control. You can see in Figure-3 an example of the control panel with all of the routing controls. By using ARC we now have a highly available way to stop answering DNS queries for a particular ALB in the case of a gray/soft failure.
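A sketch of the alarm and the first two health checks in boto3, for us-east-1; the threshold mirrors Figure-2, while the load balancer dimension value, the resource names, and the /health path are placeholder assumptions:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
route53 = boto3.client("route53")  # Route 53 is a global service

# Alarm when the ALB serves more than 10,000 requests per minute.
# The LoadBalancer dimension value is the final portion of the ALB ARN.
cloudwatch.put_metric_alarm(
    AlarmName="use1-alb-request-count",
    Namespace="AWS/ApplicationELB",
    MetricName="RequestCount",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/alb1/1234567890abcdef"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=10000,
    ComparisonOperator="GreaterThanThreshold",
)

# Route 53 health check that follows the alarm's state.
alarm_check_id = route53.create_health_check(
    CallerReference="use1-alarm-check-1",
    HealthCheckConfig={
        "Type": "CLOUDWATCH_METRIC",
        "AlarmIdentifier": {"Region": "us-east-1", "Name": "use1-alb-request-count"},
        "InsufficientDataHealthStatus": "LastKnownStatus",
    },
)["HealthCheck"]["Id"]

# HTTP health check against a path on the ALB.
http_check_id = route53.create_health_check(
    CallerReference="use1-http-check-1",
    HealthCheckConfig={
        "Type": "HTTP",
        "FullyQualifiedDomainName": "internal-alb1-12345.us-east-1.elb.amazonaws.com",
        "Port": 80,
        "ResourcePath": "/health",
    },
)["HealthCheck"]["Id"]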

Figure-3: The Application Recovery Controller control panel with routing controls for each region and a safety rule that ensures at least 4 of the 6 regions are always on. Not pictured are the health checks that belong to each routing control, which are shown in Figure-4.
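To weight away during a gray failure, you flip the routing control off through the ARC cluster's dedicated endpoints (shown on the cluster in the console) rather than a standard regional endpoint. A minimal sketch; the endpoint URLs and routing control ARN are placeholders for your cluster's values:

import boto3

# The cluster exposes five regional endpoints; any one of them can change
# state, so try each in turn in case one endpoint's region is impaired.
cluster_endpoints = [
    ("https://aaaaaaaa.route53-recovery-cluster.us-east-1.amazonaws.com/v1", "us-east-1"),
    ("https://bbbbbbbb.route53-recovery-cluster.ap-southeast-2.amazonaws.com/v1", "ap-southeast-2"),
]
routing_control_arn = ("arn:aws:route53-recovery-control::111122223333:"
                       "controlpanel/abc123/routingcontrol/def456")

for endpoint_url, region in cluster_endpoints:
    try:
        arc = boto3.client("route53-recovery-cluster",
                           endpoint_url=endpoint_url, region_name=region)
        # "Off" makes the associated Route 53 health check unhealthy, so
        # Route 53 stops answering DNS queries with this region's ALB.
        arc.update_routing_control_state(
            RoutingControlArn=routing_control_arn,
            RoutingControlState="Off",
        )
        break
    except Exception:
        continue  # try the next cluster endpoint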

We now have an ARC health check, an HTTP health check, and a CloudWatch alarm-based health check. The final step is to create a calculated health check which sets how many of these three health checks must be healthy for the ALB to be considered healthy; I have set it to require 3 of 3 to be healthy, as shown in Figure-4.
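A sketch of creating that calculated health check with boto3; the three child health check IDs stand in for the checks created earlier:

import boto3

route53 = boto3.client("route53")

# IDs of the CloudWatch alarm, HTTP, and ARC health checks created earlier
# (placeholder values).
children = [
    "1234abcd-5678-efgh-90ij-abcdef012345",  # CloudWatch alarm check
    "aaaabbbb-cccc-dddd-eeee-ffff00001111",  # HTTP check
    "22223333-4444-5555-6666-777788889999",  # ARC routing control check
]

calculated_check_id = route53.create_health_check(
    CallerReference="use1-calculated-check-1",
    HealthCheckConfig={
        "Type": "CALCULATED",
        "ChildHealthChecks": children,
        "HealthThreshold": 3,  # require all three children to be healthy
    },
)["HealthCheck"]["Id"]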

Figure-4: Health checks including the CloudWatch alarm check, the HTTP health check, and the Application Recovery Controller health checks for each region, as well as the calculated health check that requires the HTTP, CloudWatch, and Application Recovery Controller health checks to all be healthy in order to report healthy.

Finally, we can attach the calculated health check to the ALB in the traffic flow policy and create the policy record in the hosted zone. Once created, we can use the control plane to manually shift weights gradually between primary and backup load balancers, and use health checks for hard weight-aways if a full failure happens or the control plane is unavailable. Note that when you initiate a weight-away, it only applies to new requests. Clients with established TCP sessions will not issue new DNS queries, and thus will not shift away from the load balancer unless their session fails and times out, prompting a new DNS request. Also note that resolvers and clients will cache a DNS answer for as long as the time-to-live (TTL) attribute of the resource record allows. If you use Route 53 alias records as shown here, the TTL will be the value set by ALB (you can see this value by using the dig utility to query your load balancer's DNS name); if you use non-alias records you can set the TTL yourself.
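A sketch of creating the policy and its record with boto3, assuming the JSON document below is saved as traffic-policy.json; the policy name, hosted zone ID, and record name are placeholders:

import boto3

route53 = boto3.client("route53")

# Create the traffic policy from the JSON document shown below.
with open("traffic-policy.json") as f:
    document = f.read()

policy = route53.create_traffic_policy(
    Name="frontend-gslb",
    Document=document,
)["TrafficPolicy"]

# Create the policy record ("instance") in the hosted zone. The TTL applies
# to non-alias records created by the policy; alias records to the ALBs use
# the TTL that ALB sets.
route53.create_traffic_policy_instance(
    HostedZoneId="Z0123456789EXAMPLE",
    Name="app.example.com",
    TTL=60,
    TrafficPolicyId=policy["Id"],
    TrafficPolicyVersion=policy["Version"],
)

For the AAAA record you would create a second policy with "RecordType": "AAAA" and a second policy record pointing at it, as described earlier.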

This post showed how to use Route 53 Traffic Flow policies to create GSLB-like behavior for load balancing your requests across regions. You can read more about Traffic Flow policies here, and about the various types of routing policies here, to see options other than the latency-based routing we used in this article. The following code example shows the JSON document for the traffic flow outlined in this article, with example values for the health check IDs and load balancers.

{
  "AWSPolicyFormatVersion": "2023-05-09",
  "Endpoints": {
    "USE1-Cell": {
      "Type": "application-load-balancer",
      "Value": "internal-alb1-12345.us-east-1.elb.amazonaws.com",
      "Region": "us-east-1"
    },
    "USW2-Cell": {
      "Type": "application-load-balancer",
      "Value": "internal-alb1-12345.us-west-2.elb.amazonaws.com",
      "Region": "us-west-2"
    },
    "EUW1-Cell": {
      "Type": "application-load-balancer",
      "Value": "internal-alb1-12345.eu-west-1.elb.amazonaws.com",
      "Region": "eu-west-1"
    },
    "EUC1-Cell": {
      "Type": "application-load-balancer",
      "Value": "internal-alb1-12345.eu-central-1.elb.amazonaws.com",
      "Region": "eu-central-1"
    },
    "APSE2-Cell": {
      "Type": "application-load-balancer",
      "Value": "internal-alb1-12345.ap-southeast-2.elb.amazonaws.com",
      "Region": "ap-southeast-2"
    },
    "APNE1-Cell": {
      "Type": "application-load-balancer",
      "Value": "internal-alb1-12345.ap-northeast-1.elb.amazonaws.com",
      "Region": "ap-northeast-1"
    }
  },
  "Rules": {
    "FrontEnd-Record": {
      "RuleType": "latency",
      "Regions": [
        {
          "EvaluateTargetHealth": true,
          "Region": "us-east-1",
          "RuleReference": "USE1-Record"
        },
        {
          "EvaluateTargetHealth": true,
          "Region": "us-west-2",
          "RuleReference": "USW2-Record"
        },
        {
          "EvaluateTargetHealth": true,
          "Region": "eu-west-1",
          "RuleReference": "EUW1-Record"
        },
        {
          "EvaluateTargetHealth": true,
          "Region": "eu-central-1",
          "RuleReference": "EUC1-Record"
        },
        {
          "EvaluateTargetHealth": true,
          "Region": "ap-southeast-2",
          "RuleReference": "APSE2-Record"
        },
        {
          "EvaluateTargetHealth": true,
          "Region": "ap-northeast-1",
          "RuleReference": "APNE1-Record"
        }
      ]
    },
    "USE1-Record": {
      "RuleType": "weighted",
      "Items": [
        {
          "EvaluateTargetHealth": false,
          "Weight": "255",
          "EndpointReference": "USE1-Cell",
          "HealthCheck": "1234abcd-5678-efgh-90ij-abcdef012345"
        },
        {
          "EvaluateTargetHealth": false,
          "Weight": "1",
          "EndpointReference": "USW2-Cell",
          "HealthCheck": "dcba3210-hgfe-7654-ji98-543210fedcba"
        },
        {
          "EvaluateTargetHealth": false,
          "Weight": "1",
          "EndpointReference": "EUW1-Cell",
          "HealthCheck": "efgh5678-1234-abcd-zy90-uvwxyz98765"
        }
      ]
    },
    "USW2-Record": {
      "RuleType": "weighted",
      "Items": [
        {
          "EvaluateTargetHealth": false,
          "Weight": "255",
          "EndpointReference": "USW2-Cell",
          "HealthCheck": "dcba3210-hgfe-7654-ji98-543210fedcba"
        },
        {
          "EvaluateTargetHealth": false,
          "Weight": "1",
          "EndpointReference": "USE1-Cell",
          "HealthCheck": "1234abcd-5678-efgh-90ij-abcdef012345"
        },
        {
          "EvaluateTargetHealth": false,
          "Weight": "1",
          "EndpointReference": "APSE2-Cell",
          "HealthCheck": "zyxw0987-vuts-6543-rq21-123456ponmlk"
        }
      ]
    },
    "EUW1-Record": {
      "RuleType": "weighted",
      "Items": [
        {
          "EvaluateTargetHealth": false,
          "Weight": "255",
          "EndpointReference": "EUW1-Cell",
          "HealthCheck": "efgh5678-1234-abcd-zy90-uvwxyz98765"
        },
        {
          "EvaluateTargetHealth": false,
          "Weight": "1",
          "EndpointReference": "EUC1-Cell",
          "HealthCheck": "8765hgfe-dcba-4321-09yz-56789zyxwvu"
        },
        {
          "EvaluateTargetHealth": false,
          "Weight": "1",
          "EndpointReference": "USE1-Cell",
          "HealthCheck": "1234abcd-5678-efgh-90ij-abcdef012345"
        }
      ]
    },
    "EUC1-Record": {
      "RuleType": "weighted",
      "Items": [
        {
          "EvaluateTargetHealth": false,
          "Weight": "255",
          "EndpointReference": "EUC1-Cell",
          "HealthCheck": "8765hgfe-dcba-4321-09yz-56789zyxwvu"
        },
        {
          "EvaluateTargetHealth": false,
          "Weight": "1",
          "EndpointReference": "EUW1-Cell",
          "HealthCheck": "efgh5678-1234-abcd-zy90-uvwxyz98765"
        },
        {
          "EvaluateTargetHealth": false,
          "Weight": "1",
          "EndpointReference": "USE1-Cell",
          "HealthCheck": "1234abcd-5678-efgh-90ij-abcdef012345"
        }
      ]
    },
    "APSE2-Record": {
      "RuleType": "weighted",
      "Items": [
        {
          "EvaluateTargetHealth": false,
          "Weight": "255",
          "EndpointReference": "APSE2-Cell",
          "HealthCheck": "zyxw0987-vuts-6543-rq21-123456ponmlk"
        },
        {
          "EvaluateTargetHealth": false,
          "Weight": "1",
          "EndpointReference": "USW2-Cell",
          "HealthCheck": "dcba3210-hgfe-7654-ji98-543210fedcba"
        },
        {
          "EvaluateTargetHealth": false,
          "Weight": "1",
          "EndpointReference": "APNE1-Cell",
          "HealthCheck": "7890wxyz-3456-stuv-12qr-klmnop654321"
        }
      ]
    },
    "APNE1-Record": {
      "RuleType": "weighted",
      "Items": [
        {
          "EvaluateTargetHealth": false,
          "Weight": "255",
          "EndpointReference": "APNE1-Cell",
          "HealthCheck": "7890wxyz-3456-stuv-12qr-klmnop654321"
        },
        {
          "EvaluateTargetHealth": false,
          "Weight": "1",
          "EndpointReference": "APSE2-Cell",
          "HealthCheck": "zyxw0987-vuts-6543-rq21-123456ponmlk"
        },
        {
          "EvaluateTargetHealth": false,
          "Weight": "1",
          "EndpointReference": "EUC1-Cell",
          "HealthCheck": "8765hgfe-dcba-4321-09yz-56789zyxwvu"
        }
      ]
    }
  },
  "RecordType": "A",
  "StartRule": "FrontEnd-Record"
}