Aurora MySQL Automatic Recovery Loop

Our cluster, with a single reader/writer instance, had been humming along for some time without issue. Then last week, in the middle of the night, it got stuck in an automatic recovery loop, going through the full recovery process about 6 times. It did the same thing again this morning. I know we are responsible for managing these sorts of outages, and we plan to add a second instance per the AWS recommendation and the recommendations on similar threads here, but this seems a bit abnormal? A single recovery due to bad hardware or an underlying systems change is one thing; getting stuck in a loop for several hours, recovering multiple times, is another.

The main message that kicks it off is: "Recovery of the DB instance has started. Recovery time will vary with the amount of data to be recovered."

We don't see any issues in the underlying MySQL logs, just repeated startups. The recovery happens in the middle of the night (us-west-2), and we haven't made any recent changes except upgrading to 5.7 several weeks ago.

2 Answers

Thanks for the quick reply! I just finished creating a reader, so our cluster now has more than one instance across more than one AZ. Our last change was to move up to the latest available version of 2.x.x (5.7.mysql_aurora.2.11.0). We are using the default parameter group (default.aurora-mysql5.7) with max_connections, I believe, unchanged from the default (GREATEST({log(DBInstanceClassMemory/805306368)*45},{log(DBInstanceClassMemory/8187281408)*1000})).
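For anyone who wants to sanity-check what that default formula works out to, here is a rough sketch in Python. It rests on two assumptions that are not stated in this thread: that `log` in the parameter group formula is a base-2 logarithm, and that DBInstanceClassMemory is a byte count somewhat below the instance's nominal RAM. The function name `default_max_connections` is made up for illustration, and the result is only an estimate of the value the engine applies.

```python
import math


def default_max_connections(db_instance_class_memory: float) -> float:
    """Evaluate the default Aurora MySQL max_connections formula.

    Assumption: "log" in the parameter group formula is log base 2.
    db_instance_class_memory is in bytes.
    """
    term_small = math.log2(db_instance_class_memory / 805306368) * 45
    term_large = math.log2(db_instance_class_memory / 8187281408) * 1000
    # GREATEST(a, b) simply picks the larger of the two terms.
    return max(term_small, term_large)


if __name__ == "__main__":
    # Nominal memory sizes; the real DBInstanceClassMemory is assumed to be
    # lower, so treat these numbers as upper-bound estimates.
    for gib in (2, 4, 8, 16, 32):
        memory_bytes = gib * 1024 ** 3
        print(gib, "GiB ->", round(default_max_connections(memory_bytes)))
```

The actual value in effect can be confirmed on the instance itself with `SHOW VARIABLES LIKE 'max_connections';`.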

I will certainly take a closer look at the metrics and logs. As far as I know, the overall workload hasn't changed. The only reason I didn't dig deeper into instance limits is that these recovery actions only happen in the middle of the night (us-west-2) and just barely impact our primary workloads as they ramp up in the morning. But, like I said, the initial recovery happens hours after our day is over and hours before it begins.

I will look closer at the documents you provided to see if they point to any issues on our end. And, just to confirm, re:Post is our only option for support without a set support agreement on our account, correct? As in, 'open a support case' isn't really an option for us in this instance, right? Thanks, again!

answered a year ago
  • Correct, you need at least a Developer plan to create a technical support case. The Basic support plan includes 24x7 access to customer service, documentation, whitepapers, and AWS re:Post, but not technical support. Please remember to accept the answer if it helped you so that we can flag this question as answered. Hope that helps!

First, you are correct that we recommend Multi-AZ deployments, so that if one host fails you can always fail over to the other instance. This also protects against data loss and keeps downtime to a minimum.

Now, you had been stable for months and then faced multiple recoveries recently. Without seeing the instance logs or similar information, it is difficult to determine the root cause. It is possible there was a hardware failure. However, was there any change at all to the workload, or any change to any MySQL parameter (like max_connections), other than the upgrade to 5.7? Are you using the latest Aurora MySQL database engine version [1]?

Have you checked the instance resources as well (CPU, memory, disk) to ensure they are below the instance limits [2]? Here is an article you can go through to review memory usage [3]. It specifies RDS MySQL but applies to Aurora MySQL as well.
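If it helps, the resource check can also be scripted. The sketch below pulls CPUUtilization, FreeableMemory, and DatabaseConnections from CloudWatch for the last 24 hours so they can be lined up against the recovery timestamps. It assumes boto3 is installed and credentials are configured; the instance identifier `my-aurora-instance` and the helper names are hypothetical placeholders, not anything taken from this thread.

```python
from datetime import datetime, timedelta, timezone

# Instance-level metrics published in the AWS/RDS namespace.
METRICS = ("CPUUtilization", "FreeableMemory", "DatabaseConnections")


def build_metric_query(instance_id, metric_name, end_time, hours=24, period_seconds=300):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    return {
        "Namespace": "AWS/RDS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "StartTime": end_time - timedelta(hours=hours),
        "EndTime": end_time,
        "Period": period_seconds,
        "Statistics": ["Average", "Minimum", "Maximum"],
    }


def fetch_metrics(instance_id, region="us-west-2", hours=24):
    """Return datapoints per metric, sorted by time (needs boto3 and credentials)."""
    import boto3  # third-party; imported here so the query builder works without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    now = datetime.now(timezone.utc)
    results = {}
    for name in METRICS:
        query = build_metric_query(instance_id, name, now, hours)
        response = cloudwatch.get_metric_statistics(**query)
        results[name] = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return results


if __name__ == "__main__":
    # "my-aurora-instance" is a placeholder identifier.
    for metric, points in fetch_metrics("my-aurora-instance").items():
        print(metric, len(points), "datapoints")
```

A sharp drop in the FreeableMemory minimum just before each recovery event would point toward memory pressure rather than a hardware fault.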

If possible for you, opening a support case is a good way to get the instance reviewed by our support engineers who will be able to clarify if the issue is due to hardware failure or something else.

Hope it helps.

[1] https://docs.aws.amazon.com/AmazonRDS/latest/AuroraMySQLReleaseNotes/AuroraMySQL.Updates.2110.html

[2] https://aws.amazon.com/premiumsupport/knowledge-center/view-cpu-memory-aurora/

[3] https://aws.amazon.com/premiumsupport/knowledge-center/low-freeable-memory-rds-mysql-mariadb/

AWS EXPERT
answered a year ago
