Browser's DNS and socket cache prevents it from routing to the active load-balanced (Route 53) region


We have 2 AWS regions in active mode. Services in Region-1 and Region-2 have health checks registered with Route 53, which is set up for latency-based routing. The Route 53 TTL is 30 seconds.

Here is the scenario:

  • Step 1: I access the public URL of a service from a browser, and it is served from Region-1. This service is available in 2 AWS regions (us-west-2 and us-east-2), which we shall refer to as Region-1 and Region-2.
  • This service is configured with Route 53 to provide high availability. Health endpoints are also registered with Route 53.
  • Step 2: Now I take Region-1 down (to simulate latency).
  • Route 53 marks it unhealthy within a couple of minutes, as expected. Now Region-2 is the active region.
  • Step 3: I continue accessing the public URL of the service from the same browser, but requests keep failing for a good 15-18 minutes, after which they start hitting the correct active region, Region-2.
  • However, after step 3, if I open a new browser and access the same URL, it goes to the correct Region-2 as expected.

Upon investigation we found that browsers cache DNS and socket records, which do not expire for something like 15 minutes. The problem is that this results in a poor user experience despite the fact that we have 2 active AWS regions.

The expected behavior is that as soon as Region-1 goes down or faces significant latency, subsequent requests from the same browser should automatically go to the active region (or the region with lower latency).

Any comments on our expectations, and any fixes or workarounds other than deleting the browser's DNS and socket cache, which we can't expect our web app users to perform?

1 Answer

Hello,

Good Day!

As you mentioned, you have two AWS regions in active mode (Region-1 and Region-2) with health checks registered with Route 53, which is set up for latency-based routing with a TTL of 30 seconds.

Region-1 and Region-2 refer to the AWS regions us-west-2 and us-east-2 respectively.

Let's say you have a service with the public URL "abc.xyz.com" and it is being served from Region-1. The same service is available in Region-2 as well.

Now, as you mentioned above, if Region-1 goes down, Route 53 will mark it as unhealthy within a couple of minutes and will start answering queries with Region-2, subject to the TTL (time to live) value you set for the record.

Please note: If an alias record points to an AWS resource, you can't set the time to live (TTL); Route 53 uses the default TTL for the resource. If an alias record points to another record in the same hosted zone, Route 53 uses the TTL of the record that the alias record points to.

During the Region-1 failure, traffic will start flowing to the other region. If your browser keeps failing for 15 minutes before it starts hitting the correct active region (Region-2), then this is not an issue with Route 53. I am sorry to say that this is entirely due to the DNS and socket records cached by the client browser, and nothing can be done about it from the Route 53 end. The only fix we can suggest is deleting the browser's DNS and socket cache, which, as you note, you cannot expect your web app users to do.
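
As a rough check on where the 15 to 18 minutes might come from, the individual delays can be added up. This is a back-of-the-envelope sketch, not a Route 53 calculation: the health check interval (30 s) and failure threshold (3) are assumed values that are not stated in the question, the 15-minute client cache is the figure reported in the question, and the function name is hypothetical.

```python
def worst_case_failover_seconds(
    health_check_interval: float,
    failure_threshold: int,
    record_ttl: float,
    client_cache_seconds: float,
) -> float:
    """Upper bound on how long one client may keep hitting a failed region."""
    # Time for the health checker to declare the endpoint unhealthy.
    detection = health_check_interval * failure_threshold
    # A resolver may have cached the old answer just before the flip,
    # and the client may pick up that stale answer just before it expires.
    return detection + record_ttl + client_cache_seconds


window = worst_case_failover_seconds(
    health_check_interval=30,      # assumed, not given in the question
    failure_threshold=3,           # assumed, not given in the question
    record_ttl=30,                 # TTL stated in the question
    client_cache_seconds=15 * 60,  # "something like 15 minutes" in the question
)
print(f"{window / 60:.0f} minutes")  # prints "17 minutes"
```

Under these assumptions the client-side cache accounts for almost all of the window, which is consistent with a freshly opened browser reaching Region-2 straight away.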

Remember that DNS caching can occur anywhere from your network layer, through the operating system, to the application container. For example, Java virtual machines (JVMs) are notorious for caching DNS indefinitely unless configured otherwise. You should implement the following best practices.

  1. Set a low Time-to-Live (TTL) − The TTL is a field in the DNS response that specifies how long the response can be cached. By setting a low TTL, you can reduce the amount of time that cached results are stored, limiting the window of opportunity for attackers to manipulate the cache.

  2. Implement split-horizon DNS − Split-horizon DNS is a technique that uses different DNS servers and cache settings for internal and external networks. By implementing split-horizon DNS, you can prevent internal DNS queries from being cached on external servers, reducing the risk of cache poisoning attacks.

  3. Regularly clear DNS caches − Regularly clearing DNS caches can help ensure that cached results are up to date and reduce the risk of cache poisoning attacks. You can clear DNS caches manually on both the client and the server side, or configure automatic cache flushing in your DNS server software.
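
A low TTL (item 1) only shortens failover if every caching layer actually honors it. The sketch below shows what honoring a TTL could look like in a client-side cache. `TtlDnsCache`, its parameters, and the addresses are hypothetical; the resolver and clock are passed in so the behavior can be shown without real DNS traffic. It does not model how any particular browser or JVM caches DNS.

```python
import time
from typing import Callable, Dict, Tuple


class TtlDnsCache:
    """Caches one address per hostname and re-resolves once the TTL has passed."""

    def __init__(
        self,
        resolve: Callable[[str], str],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        # hostname -> (address, expiry time on the injected clock)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def lookup(self, hostname: str) -> str:
        now = self._clock()
        entry = self._entries.get(hostname)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh: serve from cache
        # Missing or expired: ask the resolver again and restart the TTL.
        address = self._resolve(hostname)
        self._entries[hostname] = (address, now + self._ttl)
        return address


# Simulated failover: the "authoritative" answer changes while time advances.
fake_now = [0.0]
current_answer = ["192.0.2.1"]  # stands in for Region-1 (documentation address)
cache = TtlDnsCache(
    resolve=lambda host: current_answer[0],
    ttl_seconds=30.0,
    clock=lambda: fake_now[0],
)

first = cache.lookup("abc.xyz.com")         # resolved and cached at t=0
current_answer[0] = "198.51.100.1"          # stands in for Region-2 after failover
fake_now[0] = 29.0
still_cached = cache.lookup("abc.xyz.com")  # TTL not yet expired: old address
fake_now[0] = 30.0
after_ttl = cache.lookup("abc.xyz.com")     # TTL expired: new address
```

A cache like this bounds staleness to the TTL. A cache that holds entries for a fixed period regardless of TTL would keep returning the first address for that whole period, which matches the behavior described in the question.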

AWS
answered 7 months ago
