Upgrading Aurora MySQL 5.6 to 5.7 on cross region replica stuck in pre-upgrade checks and reboots every 30 minutes


We have a two-region Aurora MySQL cluster. To upgrade from MySQL 5.6 to 5.7, the documented process indicated the cross-region replica must be upgraded first. After we initiated that process, the cluster took a snapshot, and since then it has emitted only the event "Upgrade in progress: Performing online pre-upgrade checks."

The single reader instance in the cross-region cluster has been recording a "DB instance shutdown" event every 30 minutes. Its log keeps recording a single line: "/bin/cat: /etc/rds/huge_page_size.dat: No such file or directory".

This process has continued for 10 hours. A previous "trial run" migration with a cloned copy finished in 20 minutes.

I am unable to stop the upgrade process. I cannot remove the replica because it is mid-upgrade, and the "Promote" option is disabled, so I cannot remove the lone reader instance that keeps rebooting. I am therefore stuck: I cannot remove the rebooting cross-region replica cluster, which blocks upgrading our primary cluster.

Is there any way to stop the upgrade process or force it to be promoted or otherwise remove this malfunctioning cluster? Or perhaps is there some way to address the "no such file or directory" error that may be preventing the upgrade from continuing?
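While waiting for an answer, the replica's state can at least be monitored from the CLI rather than the console. A minimal sketch, assuming the cross-region cluster is named my-cross-region-cluster-name and its reader instance my-cross-region-reader-instance (placeholder identifiers, not from the original post):

```shell
# Recent events for the stuck replica cluster (e.g. the pre-upgrade-checks message).
aws rds describe-events \
    --source-type db-cluster \
    --source-identifier my-cross-region-cluster-name \
    --duration 720

# Recent events for the reader instance (e.g. the repeating "DB instance shutdown").
aws rds describe-events \
    --source-type db-instance \
    --source-identifier my-cross-region-reader-instance \
    --duration 720
```

This only observes the situation; it does not interrupt the upgrade.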

asked 2 years ago · 644 views
1 Answer

I was able to remove the "stuck" upgrading cross-region replica by making an AWS CLI call to promote the read replica to a standalone cluster. The CLI was required because the "Promote" menu option was disabled in the console.

aws rds promote-read-replica-db-cluster --db-cluster-identifier my-cross-region-cluster-name

Once this completed, I was able to delete the reader instance, which then allowed the cluster itself to be deleted. I can now upgrade the primary cluster to MySQL 5.7 and create a new cross-region replica AFTER the primary cluster has finished upgrading.
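For reference, the full sequence can be sketched as below. This is a sketch under assumptions: the cluster and instance identifiers are placeholders, and --skip-final-snapshot is added here only to avoid a final snapshot prompt on deletion; adjust for your environment.

```shell
# 1. Promote the stuck cross-region replica to a standalone cluster.
#    (Done via CLI because the console "Promote" option was disabled.)
aws rds promote-read-replica-db-cluster \
    --db-cluster-identifier my-cross-region-cluster-name

# 2. Wait until the cluster reports "available" before deleting anything.
aws rds wait db-cluster-available \
    --db-cluster-identifier my-cross-region-cluster-name

# 3. Delete the lone reader instance, then the now-empty cluster.
aws rds delete-db-instance \
    --db-instance-identifier my-cross-region-reader-instance
aws rds delete-db-cluster \
    --db-cluster-identifier my-cross-region-cluster-name \
    --skip-final-snapshot
```

With the replica cluster gone, the primary cluster's upgrade can proceed normally.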

This does not fix the stuck upgrade itself; it is a workaround. Had our primary cluster been the one stuck, we would have had to restore from a snapshot.

answered 2 years ago
