Amazon RDS Maintenance Window - How do you handle High Availability

1

Hi,

We currently run our own MySQL instances, and I am looking to migrate to the RDS managed service. Reading through the documentation, I see it has a 30-minute maintenance window per week. This roughly translates to 99.7% availability for the app.

I understand that in a Multi-AZ architecture such as Aurora, a read replica can be deployed in a different AZ and promoted to primary during that period. But it looks like cross-AZ traffic to the read replica (which gets promoted to primary) is charged.

From a strategy perspective, would this be an accurate summary:

  1. Have different maintenance windows for the standby and the read replica. Ideally the read replica will have an earlier window so it is upgraded first.
  2. After the read replica or standby is upgraded, fail over the primary to it.
  3. Let the primary go through the maintenance window.
  4. Fail back to the original primary (to avoid the cross-Availability Zone data transfer charges).

Is this a correct understanding?

How do you handle higher availability requirements with RDS? Or is it just not possible?

asked 4 years ago, 2062 views
5 Answers
2

Let's dig in on several different points here.

First, the maintenance window. The maintenance window is a designated window that RDS can use to perform maintenance when it is necessary. Think of the RDS philosophy as "we don't mess with your instance, but when we have to (or you direct us to), we do it during the window that you tell us will have the least potential impact." There are five critical pieces of data that you aren't taking into account.

First, most maintenance windows go unused, as there is no regular maintenance required.

Second, when maintenance is required, the actual impact is a small fraction of the time in the available window: for example, just the time necessary to reboot the instance. Or it might be an action that degrades performance but doesn't cause an outage.

Third, if you are using Multi-AZ, then many of the disruptive maintenance actions just require a failover to the standby, which is a ~60 second outage. For example, if hardware maintenance to the primary's host is required, a failover to the secondary would occur, making it the primary, and then a new secondary would be spun up in the background.

Fourth, most things done during the maintenance window are not at all disruptive. For example, there is a management agent that runs on every instance. It is designed to be replaceable by new versions without any disruption to the normal operation of the instance. Out of an abundance of caution, RDS only applies updates to this agent during the maintenance window. The agent gets updated a few times throughout the year, and these updates are completely benign to the availability of the customer instance.

Fifth, if potentially more disruptive maintenance is in the offing, you will generally be notified in advance and will usually have mechanisms to defer or reschedule that maintenance (at least for a short while).

You mention Read Replicas, but these are very different from Multi-AZ. Read Replicas are not intended for high availability; that's what Multi-AZ is for. Unlike with Read Replicas, you are not charged for the data replication that takes place to maintain a Multi-AZ secondary. Read Replicas are primarily used for read scaling, but they can be used for Disaster Recovery. In that scenario you usually have the RR in another region, and would only promote it to a master if your normal region experiences a severe outage. Read Replicas are asynchronous, so you will lose the most recent committed transactions when you do this. It is also a very manual process, and your applications have to be capable of being repointed to a new database. Neither of these is necessary with Multi-AZ, which uses a synchronous replication scheme and uses DNS to hide the actual instance IP address from the application.
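To illustrate what riding through a Multi-AZ failover can look like from the application side, here is a minimal sketch of a reconnect-and-retry wrapper. Everything in it is an assumption for illustration: the name `run_with_reconnect`, the 90-second budget (chosen to be a bit longer than the ~60 second failover mentioned above), and the use of Python's built-in `ConnectionError`. A real database driver raises its own exception types, so you would catch those instead.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_reconnect(
    operation: Callable[[], T],
    max_wait_seconds: float = 90.0,  # assumed budget, a bit above a ~60 s failover
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` with exponential backoff while connections fail.

    `operation` is a hypothetical callable that opens a connection through
    the database's DNS endpoint and runs one unit of work.
    """
    delay = initial_delay
    waited = 0.0
    while True:
        try:
            return operation()
        except ConnectionError:
            # Give up once the next sleep would exceed the total budget.
            if waited + delay > max_wait_seconds:
                raise
            sleep(delay)
            waited += delay
            delay = min(delay * 2, 15.0)  # cap the backoff step at 15 s
```

Because the application connects through the DNS name rather than an IP address, the retried `operation` would reach the new primary once the failover completes, with no repointing needed.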

One point on availability calculations: unless otherwise specified, the industry standard for availability is to only take into account unplanned outages. So when people throw out 99.99% or whatever, they don't include planned maintenance. A true best practice would be to have two numbers, one for just unplanned outages and one for planned and unplanned outages. The variability in planned downtime makes that difficult. Just putting this out there because it is important for doing comparisons. One advantage of a self-managed server over RDS is that you have more control over planned outages. For example, there are super critical security patches that appear every few years that RDS will force to be applied in a very short timeframe, and that requires a reboot during your maintenance window. With a self-managed instance you could make the determination that you don't need the security patch and defer applying it for weeks, months, or years beyond when RDS insists on patching.

AWS has taken many actions over the years to reduce the need for planned outages to address infrastructure issues. A great deal of hardware/firmware maintenance that required a reboot 5 years ago can now be done without visible impact to the customer instance. Whenever possible this maintenance is done during your maintenance window so that even a minor impact, like a brief latency spike, occurs when you've said it should be least impactful.

A brief mention of Aurora: it has many additional high availability features that I won't go into here, but it is focused on addressing planned outages as well. The Zero-Downtime Patching feature, for example, can often eliminate planned downtime when new versions of Aurora MySQL are installed.

RDS has a 99.95% SLA when using Multi-AZ. Again that is against unplanned outages and you can read the rest of the details at https://aws.amazon.com/rds/sla/. In practice RDS' availability is far higher than the SLA, though I'm not at liberty to say what actual measured availability is. Aurora is designed to support 99.99% availability, but I don't believe they've published an SLA. See https://hal2020.com/2017/12/13/service-level-agreements-sla/ for more on SLAs.

Now to your calculations. Reboots are a great proxy for outages during the maintenance window because they are the only things that cause a true outage. The RDS team has had a philosophy of trying to limit instance reboots to no more than once a year. I recall there was a period on RDS MySQL where they went multiple years without a forced reboot. Then there was a series of industry-wide/MySQL security issues (e.g., Heartbleed) that caused multiple reboots over a one-year period. But on average there are only 1 or 2 forced reboots a year. Any other reboots during the year are the result of actions you took, such as changing the instance type or upgrading the database software (minor) version. A reboot with Multi-AZ takes about 60 seconds (because you fail over first); a reboot without Multi-AZ takes a few minutes. Perhaps just 1-2, but maybe you want to model it as 5 or 10 minutes. Let's even be conservative and model it as 15 minutes. Assume AWS forces 1 or 2 reboots, and you take actions requiring another 1 or 2, each year. That's 30-60 minutes per year for planned maintenance, not the 30 minutes per week implied by your assumption that every maintenance window is an outage. And in all honesty, I suspect most customers see more like 5 minutes (or less) of planned outages per year (unless they do a major version upgrade, but those are totally up to you unless a version goes out of vendor support).
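The arithmetic above can be checked with a short sketch. The function and variable names are just illustrative; the inputs are the numbers from the question (30 minutes every week) and from the conservative model in this answer (2 to 4 reboots a year at 15 minutes each, or about 60 seconds each with Multi-AZ).

```python
MINUTES_PER_WEEK = 7 * 24 * 60
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(downtime_minutes: float, period_minutes: float) -> float:
    """Return availability as a percentage for the given downtime in a period."""
    return 100.0 * (1.0 - downtime_minutes / period_minutes)


# The question's assumption: every weekly 30-minute window is a full outage.
weekly_window = availability(30, MINUTES_PER_WEEK)

# The conservative model: 2 to 4 reboots a year, 15 minutes each.
conservative_worst = availability(4 * 15, MINUTES_PER_YEAR)  # 60 minutes/year
conservative_best = availability(2 * 15, MINUTES_PER_YEAR)   # 30 minutes/year

# With Multi-AZ: the same 4 reboots at about 1 minute each.
multi_az_worst = availability(4 * 1, MINUTES_PER_YEAR)

for label, value in [
    ("every window is an outage", weekly_window),
    ("4 reboots x 15 min", conservative_worst),
    ("2 reboots x 15 min", conservative_best),
    ("4 reboots x 1 min (Multi-AZ)", multi_az_worst),
]:
    print(f"{label}: {value:.4f}%")
```

The weekly-window assumption comes out at about 99.7%, while even the conservative reboot model stays at roughly 99.99% for planned downtime.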

HalTemp
answered 4 years ago
1
Accepted Answer

You're welcome!

"Replication" is a pretty generic term that covers making copies of data, so its use can be confusing.

For going between non-Aurora instances for the purpose of creating Read Replicas, RDS uses the replication facilities of the individual database engines. It uses them in asynchronous mode for two reasons. First, in synchronous mode the Primary will stall when the two instances can't talk. Given that the Read Replica doesn't participate in the automated failover scheme, that means failures degrade or stop processing. Second, synchronous replication has a performance impact, and the more synchronous replicas you have the worse the impact; you also really don't want synchronous replication across regional boundaries. For these reasons synchronous Read Replicas are not offered as an option. In theory it could be made to work across AZs within a region, but the first problem would still apply: the Read Replicas don't participate in the automated failover scheme. So RDS separates out the High Availability (Multi-AZ) and Read Scaling/Disaster Recovery (Read Replicas) solutions.

Multi-AZ is A LOT more than just replication, but under the covers it uses various forms of replication. For all non-Aurora engines other than Microsoft SQL Server it uses volume replication, and for SQL Server it uses Mirroring or Always-On Availability Groups. In all three cases the replication is synchronous, so the replica is always holding the last committed transaction, and it is only supported across AZs within a region. Multi-AZ itself is a set of very sophisticated monitoring and decision-making software that decides when failover is needed and how failover should be done. It isn't just trying to maintain quorum to deal with network partitioning (which is the traditional failover technique); it also looks at soft-failure scenarios, such as volume latency indicating a storage issue, and asks whether it makes sense to fail over to the standby instance whose volumes are healthy. It also tries to understand if there is a global problem, for example a failing AZ rather than just the one instance failing, when deciding whether to fail over. So it's trying to make sure the instance that survives is in the healthy AZ. That isn't always clear just from the lone instance's behavior. Sometimes it decides that rebooting the instance will actually be faster than failing over. Etc. It is one cool piece of software.

Aurora is a totally different beast that uses asynchronous processing for just about everything. You can find details in a SIGMOD 2018 paper, deep dive sessions from re:Invent, or in a set of blog posts that Anurag Gupta wrote (https://aws.amazon.com/search/?searchQuery=anurag_gupta_aurora+quorum). The way Aurora avoids synchronous replication for Multi-AZ is with the use of quorums for replicating the data at the storage layer. Since a transaction commit is never acknowledged until quorum is achieved, there is no possibility of transactions being lost even though the underlying mechanism is asynchronous.
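As a rough illustration of the quorum idea (not Aurora's actual implementation), here is a toy model in which storage copies acknowledge a write asynchronously and the commit is only reported once enough of them have confirmed. The class name `QuorumLog` and the default values of six copies with a write quorum of four are assumptions chosen for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class QuorumLog:
    """Toy model: a commit is acknowledged only after a write quorum confirms."""

    total_copies: int = 6  # assumed number of storage copies (illustrative)
    write_quorum: int = 4  # assumed acks needed before commit (illustrative)
    acks: dict = field(default_factory=dict)  # txn id -> set of confirming copy ids

    def record_ack(self, txn_id: int, copy_id: int) -> bool:
        """Record an ack from one storage copy; return True once the txn is committed."""
        if not 0 <= copy_id < self.total_copies:
            raise ValueError("unknown storage copy")
        # A set ignores duplicate acks from the same copy.
        self.acks.setdefault(txn_id, set()).add(copy_id)
        return self.is_committed(txn_id)

    def is_committed(self, txn_id: int) -> bool:
        """A transaction counts as committed once a write quorum has confirmed it."""
        return len(self.acks.get(txn_id, set())) >= self.write_quorum


log = QuorumLog()
# Acks arrive in arbitrary order; only the fourth distinct ack completes the commit.
results = [log.record_ack(txn_id=1, copy_id=c) for c in (0, 3, 5, 1)]
print(results)
```

The writer never waits for any particular copy, only for enough of them, which is why slow or unreachable copies don't stall the commit the way a single synchronous replica would.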

HalTemp
answered 4 years ago
EXPERT
reviewed a month ago
0

Hi @HalTemp,

I am new to this forum, and I have no idea how to mark this as helpful and thank you properly. My apologies.

I am so impressed by your reply. It was extremely thorough and gave me insight that I could not gain by just reading the documents. Thank you once again.

One area of confusion for me in the documentation is replication. I understand that in general the replication between the master and standby is "synchronous" and master to read replica is "asynchronous".

I find this document confusing: https://aws.amazon.com/rds/features/read-replicas/

It lists Multi-AZ replication for Aurora as async (which makes sense), but for non-Aurora it says it is sync. Is this just a typo?

"Non-Aurora: synchronous replication; Aurora: asynchronous replication"

Once again, thank you! I could not have asked for a better response.

answered 4 years ago
0

Hi @HalTemp,

My profound thanks for helping me understand Multi-AZ replication and standbys. I appreciate you taking the time on this.

answered 4 years ago
0

You're welcome.

HalTemp
answered 4 years ago
