What is the best way to configure Availability Zones for a fault-tolerant ALB?

We have a basic ALB with four Availability Zones, all in us-east-1[abcd]. Last week, we were affected by this outage at Amazon:

[03:42 PM PDT] Between 11:49 AM PDT and 3:37 PM PDT, we experienced increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. Our engineering teams were immediately engaged and began investigating. We quickly narrowed down the root cause to be an issue with a subsystem responsible for capacity management for AWS Lambda, which caused errors directly for customers (including through API Gateway) and indirectly through the use of other AWS services. Additionally, customers may have experienced authentication or sign-in errors when using the AWS Management Console, or authenticating through Cognito or IAM STS.

My question is: how fault-tolerant is an ALB if all of its Availability Zones are in the same Region? For anyone who has knowledge of this outage, would selecting a zone in Boston or Atlanta have given us better failover than choosing all the zones from us-east-1*?

asked 10 months ago, 343 views
2 Answers

Hi,

An Application Load Balancer is a regional resource, meaning that you can only configure Availability Zones in the same Region as your load balancer. To achieve greater resiliency, you can run multiple load balancers in multiple Regions. However, you also need to think about the resources behind these load balancers and how they can be kept in sync (if that's a requirement for you). If a multi-Region solution makes sense for your workload, have a look at Global Accelerator, which lets you distribute traffic across multiple load balancers in one or more AWS Regions (https://docs.aws.amazon.com/global-accelerator/latest/dg/what-is-global-accelerator.html), as well as cross-Region DNS-based load balancing and failover (https://docs.aws.amazon.com/whitepapers/latest/real-time-communication-on-aws/cross-region-dns-based-load-balancing-and-failover.html).
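
For the DNS-based failover option, here is a minimal sketch of what a pair of Route 53 failover alias records for two regional ALBs could look like. It only builds the ChangeBatch payload; nothing is sent to AWS. The record name, the ALB DNS names, and the hosted zone IDs are placeholders I made up for illustration, and failover_change_batch is a hypothetical helper, not part of any AWS SDK.

```python
import json


def failover_change_batch(record_name, primary, secondary):
    """Build a Route 53 ChangeBatch with PRIMARY and SECONDARY alias records.

    `primary` and `secondary` are dicts describing one ALB each:
    its region, its DNS name, and its canonical hosted zone ID.
    All values used below are placeholders, not real resources.
    """

    def change(role, alb):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                # SetIdentifier must be unique among records sharing a name.
                "SetIdentifier": f"{role.lower()}-{alb['region']}",
                "Failover": role,
                "AliasTarget": {
                    "HostedZoneId": alb["zone_id"],
                    "DNSName": alb["dns_name"],
                    # Let Route 53 follow the health of the ALB's targets.
                    "EvaluateTargetHealth": True,
                },
            },
        }

    return {
        "Comment": "Cross-Region failover between two ALBs (illustrative)",
        "Changes": [change("PRIMARY", primary), change("SECONDARY", secondary)],
    }


# Placeholder ALBs in two Regions (assumed values).
primary_alb = {
    "region": "us-east-1",
    "dns_name": "primary-alb.example.invalid.",
    "zone_id": "ZPRIMARYPLACEHOLDER",
}
secondary_alb = {
    "region": "us-east-2",
    "dns_name": "secondary-alb.example.invalid.",
    "zone_id": "ZSECONDARYPLACEHOLDER",
}

batch = failover_change_batch("app.example.com.", primary_alb, secondary_alb)
print(json.dumps(batch, indent=2))
```

With boto3, a payload of this shape would be passed as the ChangeBatch argument of the Route 53 client's change_resource_record_sets call, together with your own hosted zone ID. Records like these are best created ahead of time, since editing Route 53 configuration goes through a control plane hosted in us-east-1.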

AWS EXPERT, answered 10 months ago
  • Thank you. I understand all that. In us-east-1 there are several Availability Zones you can select from, for example us-east-1a, us-east-1b, etc. Apparently, all the AZs in us-east-1* were impacted by last week's outage and the ALBs were not able to compensate. So, my question is: if we had selected one of the additional zones, like us-east-1-bos-1, would that have given us better fault tolerance in light of last week's Region-wide outage in the AWS Lambda service?

  • Hi, thanks for the clarification. It's hard to say, because a Local Zone is not fully independent of its parent zone. The control plane (as opposed to the data plane) of some services, such as EC2, runs in the parent Region (see https://aws.amazon.com/blogs/networking-and-content-delivery/hybrid-inspection-architectures-with-aws-local-zone/).

AWS has two layers of fault isolation: AZ and Region. AZs are separate facilities within a metropolitan area, connected by low-latency network links; collectively they make up a Region. No single power outage, truck crash, or backhoe should be able to take an entire Region offline. Above that sits the control plane, one per Region. A bad control plane can affect multiple AZs within a Region, though AWS has some mitigations against that.
Operating in multiple AZs protects you from many physical failures. Operating in multiple Regions protects you from most control-plane failures. I think someone did a study showing that us-east-1 is more likely than other Regions to suffer a control-plane failure like this, but it can happen in any Region. If you operate in a single Region, you are always susceptible to a control-plane failure.
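
To make the two layers concrete, here is a toy model (a sketch, not a description of how AWS actually fails). It assumes a simplification of my own: an AZ failure removes only the targets in that AZ, while a regional control-plane failure removes every target in that Region. Target and survivors are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    region: str
    az: str


def survivors(targets, failed_azs=frozenset(), failed_regions=frozenset()):
    """Return the targets that are in neither a failed AZ nor a failed Region."""
    return [
        t for t in targets
        if t.az not in failed_azs and t.region not in failed_regions
    ]


# One Region, two AZs.
single_region = [
    Target("us-east-1", "us-east-1a"),
    Target("us-east-1", "us-east-1b"),
]
# Same deployment plus one target in a second Region.
multi_region = single_region + [Target("us-east-2", "us-east-2a")]

# Losing one AZ leaves the other AZ serving traffic.
print(survivors(single_region, failed_azs={"us-east-1a"}))
# A Region-wide failure leaves a single-Region deployment with nothing.
print(survivors(single_region, failed_regions={"us-east-1"}))
# The multi-Region deployment still has its us-east-2 target.
print(survivors(multi_region, failed_regions={"us-east-1"}))
```

Under these assumptions, adding more AZs inside us-east-1 never changes the outcome of the second case; only a target in another Region does.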

answered 10 months ago
  • Thank you. That's useful information. Regarding the Lambda failures in last week's outage: is there anything, short of setting up a Route 53 multi-Region load balancer, that would have mitigated the problem for us? For example, if we had chosen us-east-1-bos-1 or us-east-1-atl for an AZ, would we have had more protection from an outage like that? We were somewhat shocked that, in spite of our best efforts to design for fault tolerance, we suffered an outage.

  • I'm less familiar with Local Zones and how their control plane works; my inclination is to say that using a us-east-1* LZ does not insulate you from a us-east-1 failure, but I can't say for sure. Being in us-east-2 would insulate you from a us-east-1 failure, but not from a failure of us-east-2. If you need that level of reliability, then a GTM or Global Accelerator load balancer across two disparate Regions is needed. Note that you can't rely on being able to change the Route 53 configuration if us-east-1 dies, but behavior changes triggered by health-check failures or Application Recovery Controller switches still work.
