Why is my Amazon Aurora DB cluster clone, snapshot restore, or point-in-time restore taking so long?
I'm performing a cluster clone, snapshot restore, or point-in-time restore operation on my Amazon Aurora DB cluster, and it's taking longer than I expect.
Amazon Aurora’s continuous backup and restore techniques are optimized to avoid variation in restore times. They also help the cluster’s storage volume to reach full performance as soon as the cluster becomes available. Long restore times are generally caused by long-running transactions in the source database at the time that the backup is taken.
Amazon Aurora backs up your cluster volume's changes automatically and continuously. The backups are retained for the length of your backup retention period. This continuous backup allows you to restore your data to a new cluster at any point in time within the specified retention period. This avoids the need for a lengthy binlog roll-forward process. Because you create a new cluster, there is no performance impact or interruption to your original database.
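For example, a point-in-time restore always creates a new cluster rather than modifying the source. A minimal AWS CLI sketch (all identifiers and the timestamp are placeholders):

```shell
# Restore the source cluster to a specific point in time within the
# backup retention period. Identifiers and timestamp are placeholders.
aws rds restore-db-cluster-to-point-in-time \
    --db-cluster-identifier my-restored-cluster \
    --source-db-cluster-identifier my-source-cluster \
    --restore-to-time 2024-05-01T12:00:00Z

# Or restore to the latest restorable time instead of a fixed timestamp:
aws rds restore-db-cluster-to-point-in-time \
    --db-cluster-identifier my-restored-cluster \
    --source-db-cluster-identifier my-source-cluster \
    --use-latest-restorable-time
```

Note that `--restore-to-time` and `--use-latest-restorable-time` are mutually exclusive; use one or the other.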
When you initiate a clone, snapshot, or point in time restore, Amazon Relational Database Service (Amazon RDS) calls the following APIs on your behalf:
Either RestoreDBClusterFromSnapshot or RestoreDBClusterToPointInTime. These APIs create a new cluster and restore the volume from Amazon Simple Storage Service (Amazon S3). This can take up to a few hours to complete. When you restore data to an Aurora cluster, all data is brought in parallel from Amazon S3 to the six copies across your three Availability Zones.
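The snapshot variant looks like the following with the AWS CLI. The identifiers and engine value are placeholders; match them to your own snapshot:

```shell
# Restore a new Aurora cluster from an existing cluster snapshot.
# Identifiers and engine are example values, not real resources.
aws rds restore-db-cluster-from-snapshot \
    --db-cluster-identifier my-restored-cluster \
    --snapshot-identifier my-cluster-snapshot \
    --engine aurora-mysql
```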
When this step completes, the cluster changes into the Available state. You can check your cluster state by refreshing the console or checking with the AWS CLI.
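One way to check the state from the AWS CLI, assuming a cluster identifier of `my-restored-cluster`:

```shell
# Poll the cluster status; it reports "available" when this step is done.
aws rds describe-db-clusters \
    --db-cluster-identifier my-restored-cluster \
    --query 'DBClusters[0].Status' \
    --output text
```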
The instance creation process starts only when the cluster is Available. This happens in two stages: setting up the instance configuration and database crash recovery.
You can check if the API has finished setting up the instance by looking for the MySQL error log file. You can do this even if the instance is in the Creating status. If the error log file is available to download, then the instance is set up and the engine is now performing crash recovery. The error log file is also the best resource to check on the progress of your database crash recovery, along with Amazon CloudWatch metrics.
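A hedged AWS CLI sketch of that check (the instance identifier is a placeholder, and the error log file name can differ by engine version):

```shell
# List the log files the instance has produced so far. If the error log
# appears here, instance setup is done and crash recovery is underway.
aws rds describe-db-log-files \
    --db-instance-identifier my-restored-instance

# Download the latest portion of the error log to follow recovery progress.
# The log file name below is typical for Aurora MySQL but may differ.
aws rds download-db-log-file-portion \
    --db-instance-identifier my-restored-instance \
    --log-file-name error/mysql-error.log \
    --output text
```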
Note: If you use the AWS CLI or API to perform a restore operation, then you must also call CreateDBInstance yourself, because instance creation isn't automatic in that case.
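For example, after a CLI-initiated restore you can add an instance to the new cluster like this (identifiers, instance class, and engine are example values):

```shell
# Create the DB instance in the restored cluster yourself; a CLI or API
# restore creates only the cluster. All values below are placeholders.
aws rds create-db-instance \
    --db-instance-identifier my-restored-instance \
    --db-cluster-identifier my-restored-cluster \
    --db-instance-class db.r6g.large \
    --engine aurora-mysql
```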
Check for long-running write operations on the source database
It's a best practice to confirm that there aren't long-running write operations on the source database at the time of the snapshot, point-in-time restore, or clone. Any long-running DCL, DDL, or DML statements (open write transactions) might lengthen the time it takes for the restored database to become available.
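One way to spot such transactions on an Aurora MySQL source before you take a snapshot or clone is to query `information_schema.innodb_trx`. The endpoint, user, and 10-minute threshold below are placeholders chosen for illustration:

```shell
# Connect to the source cluster and list InnoDB transactions that have
# been open for more than 10 minutes. Endpoint, user, and threshold are
# example values; adjust them to your environment.
mysql -h my-source-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com \
      -u admin -p -e "
  SELECT trx_id, trx_started, trx_mysql_thread_id, trx_rows_modified
  FROM information_schema.innodb_trx
  WHERE trx_started < NOW() - INTERVAL 10 MINUTE;"
```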
For example, if you activate the binary log for an Aurora cluster, then recovery takes longer. This is because InnoDB automatically checks the logs and performs a roll-forward of the database to the present. It then rolls back any uncommitted transactions that are present at the time of the recovery. For more information on InnoDB crash recovery, see InnoDB recovery.
When the instance finishes the creation and recovery processes, the cluster and the instance are then ready to accept incoming connections.
Note: Aurora doesn't require the binary log. It's a best practice to deactivate it unless it's required. For cross-Region replication, consider Aurora global databases instead. Aurora global databases also don't require binary logs.
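On Aurora MySQL, deactivating the binary log means setting `binlog_format` to `OFF` in the cluster's custom DB cluster parameter group. A sketch with the AWS CLI (the parameter group name is a placeholder, and the change takes effect after a reboot):

```shell
# Turn off the binary log by setting binlog_format to OFF in a custom
# DB cluster parameter group. The group name is a placeholder.
aws rds modify-db-cluster-parameter-group \
    --db-cluster-parameter-group-name my-cluster-params \
    --parameters "ParameterName=binlog_format,ParameterValue=OFF,ApplyMethod=pending-reboot"
```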