How do I troubleshoot high replica lag with Amazon RDS for MySQL?
I want to know why there’s replica lag when I use Amazon Relational Database Service (Amazon RDS) for MySQL.
Short description
Because Amazon RDS for MySQL uses asynchronous replication, sometimes the replica can't progress with the primary DB instance and causes a replication lag.
To monitor a replication lag, use an RDS for MySQL read replica with binary log file position-based replication.
In Amazon CloudWatch, check the ReplicaLag metric for Amazon RDS. The ReplicaLag metric reports the value of the Seconds_Behind_Master field of the SHOW SLAVE STATUS command.
The Seconds_Behind_Master field shows the current timestamp on the replica DB instance. It also shows the original timestamp that's logged on the primary DB instance for the event that processes on the replica DB instance.
MySQL replication uses the binary log dump, replication I/O receiver, and replication SQL applier threads. For more information about how the threads function, see Replication threads on the MySQL website. If there's a delay in replication, then identify whether IO_THREAD replica or SQL_THREAD replica causes the lag. You can then identify the root cause of the lag.
Resolution
Identify the replication thread that's lagging
Run the SHOW MASTER STATUS command on the primary DB instance:
mysql> SHOW MASTER STATUS;
Example output:
+----------------------------+----------+--------------+------------------+-------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | +----------------------------+----------+--------------+------------------+-------------------+ | mysql-bin.066552 | 521 | | | | +----------------------------+----------+--------------+------------------+-------------------+ 1 row in set (0.00 sec)
Note: In the preceding example output, the source or primary DB instance writes the binary logs to the mysql-bin.066552 file.
Run the SHOW SLAVE STATUS command on the replica DB instance:
mysql> SHOW SLAVE STATUS\G;
Example output 1:
*************************** 1. row *************************** Master_Log_File: mysql-bin.066548 Read_Master_Log_Pos: 10050480 Relay_Master_Log_File: mysql-bin.066548 Exec_Master_Log_Pos: 10050300 Slave_IO_Running: Yes Slave_SQL_Running: Yes
Note: In the preceding example output, the Master_Log_File: mysql-bin.066548 shows that the IO_THREAD replica reads from the mysql-bin.066548 binary log file. The primary DB instance writes the binary logs to the mysql-bin.066552 file. The IO_THREAD replica is four binlogs behind. However, because the Relay_Master_Log_File is mysql-bin.066548, the SQL_THREAD replica reads from same file as IO_THREAD. The SQL_THREAD replica maintains speed, but the replica IO_THREAD lags.
Example output 2:
*************************** 1. row *************************** Master_Log_File: mysql-bin.066552 Read_Master_Log_Pos: 430 Relay_Master_Log_File: mysql-bin.066530 Exec_Master_Log_Pos: 50360 Slave_IO_Running: Yes Slave_SQL_Running: Yes
The preceding example output shows that the primary instance's log file is mysql-bin.066552. The IO_THREAD maintains speed with the primary DB instance. In the replica output, the SQL thread performs Relay_Master_Log_File: mysql-bin.066530. As a result, SQL_THREAD lags 22 binary logs behind.
Typically, IO_THREAD doesn't cause large replication delays because the IO_THREAD reads the binary logs only from the primary or source instance. However, network connectivity and network latency can affect the speed of the reads between the servers. High bandwidth usage can cause the IO_THREAD replica to perform more slowly.
If the SQL_THREAD replica is causing the replication delays, then use the following troubleshooting steps to resolve your issue.
Long-running write queries on the primary instance
Long-running write queries on the primary DB instance that take an equal amount of time to run on the replica DB instance can increase seconds_behind_master. For example, if a change to the primary instance takes 1 hour to run, then the lag is 1 hour. If the change also takes 1 hour to complete on the replica, then the total lag is approximately 2 hours.
To minimize lag, you can monitor the slow query log on the primary instance. You can also reduce long-running statements to smaller statements or transactions.
Insufficient DB instance class size or storage
If the replica DB instance's class or storage configuration is lower than the primary instance's, then the replica might throttle because of insufficient resources. The replica can't maintain the number of changes on the primary instance.
To resolve this issue, make sure that the replica's DB instance type is the same as or higher than the primary DB instance. For replication to operate effectively, each read replica requires the same number of compute and storage resources as the source DB instance. For more information, see DB instance classes.
Parallel queries run on the primary DB instance
The MySQL replication is single threaded by default. So when you run queries in parallel on the primary instance, the queries commit on the replica in a serial order. When a high volume of writes to the source instance occurs in parallel, the writes to the read replica use a single SQL_THREAD to serialize. A lag between the source DB instance and read replica might then occur.
Multi-threaded (parallel) replication is available for MySQL 5.6 and later versions. For more information about multi-threaded replication, see Binary logging options and variables on the MySQL website.
Multi-threaded replication can cause gaps in replication. For example, multi-threaded replication isn't a best practice when you skip the replication errors because it's difficult to identify the transactions that you skip. Gaps in data consistency between the primary and replica DB instances might occur.
Binary logs synced to the disk on the replica DB instance
When you turn on automatic backups on the replica, overhead occurs to sync the binary logs to the disk on the replica. The default value of the sync_binlog parameter is set to 1. If you change the value to 0, then you also turn off the MySQL server's ability to synchronize the binary log to the disk. Instead, the operating system occasionally flushes the binary logs to disk.
To reduce the performance overhead that's required to sync the binary logs to disk on every commit, turn off the binary log synchronization. However, if there's a power failure or the OS crashes, then some of the commits might not synchronize to the binary logs. Asynchronization can affect point-in-time restore (PITR) capabilities. For more information, see sync_binlog on the MySQL website.
Binlog_format is set to ROW
The SQL thread performs a full table scan when it's replicating in the following scenarios:
- The binlog_format on the primary DB instance is set to ROW.
- The source table doesn't have a primary key.
This happens because the default value of the slave_rows_search_algorithms parameter is TABLE_SCAN,INDEX_SCAN.
To temporarily resolve this issue, change the search algorithm to INDEX_SCAN,HASH_SCAN to reduce the overhead of a full table scan. For a more permanent solution, it's a best practice to add an explicit primary key to each table.
For more information about the slave-rows-search-algorithms parameter, see slave_rows_search_algorithms on the MySQL website.
Replica creation lag
Amazon RDS takes a DB snapshot to create a read replica of a MySQL primary instance. Then, Amazon RDS restores the snapshot to create a new DB instance and establishes replication between the two.
After you establish replication, a lag occurs when Amazon RDS is creating a backup of the primary DB instance. To minimize the lag, create a manual backup before you call for the replica creation. The DB snapshot is then an incremental backup.
When you restore a read replica from a snapshot, the replica doesn't wait for all the data to transfer from the source. The replica DB instance is available to perform the DB operations. The existing Amazon Elastic Block Store (Amazon EBS) snapshots create a new volume in the background.
Note: For Amazon RDS for MySQL replicas (Amazon EBS-backed volumes), the replica lag might initially increase because lazy loading can affect replication performance.
To reduce the effects of lazy loading on tables for your new read replica, you can perform operations that use full-table scans. For example, run mysqldump on your read replica for specific tables or databases so that Amazon RDS prioritizes all backed-up table data from Amazon Simple Storage Service (Amazon S3).
You can also use the on demand InnoDB cache warming feature. The InnoDB cache warming feature saves the buffer pool state on disk in a file that's named ib_buffer_pool in the InnoDB data directory. Performance improves because Amazon RDS dumps the current state of the primary DB instance buffer pool before you create the read replica. You can then reload the buffer pool after you create the read replica.
Related information
Related videos


Relevant content
- asked 3 years agolg...
- asked 2 years agolg...
- asked 3 years agolg...
- asked 9 months agolg...
- AWS OFFICIALUpdated 4 months ago
- AWS OFFICIALUpdated 8 months ago
- AWS OFFICIALUpdated 2 years ago
- AWS OFFICIALUpdated 3 years ago