I have an Amazon Relational Database Service (Amazon RDS) for Microsoft SQL Server instance with read replica. I want to troubleshoot the replica lag in my Amazon RDS for SQL Server instance.
Short description
Amazon RDS for SQL Server Enterprise Edition supports read replica creation in the same AWS Region and across Regions. Data replication is asynchronous and uses Always-On technology to replicate data from a primary instance to a replica instance. RDS for SQL Server doesn't automatically reduce high replica lag between a source DB instance and its read replicas.
Resolution
Check your resource utilization
Use the Amazon CloudWatch Enhanced Monitoring and Performance Insights to check the resource utilization on your primary and replicate instances at a detailed level.
Confirm that CPU utilization isn't throttled. If you use a burstable instance type, then make sure that you have available CPU credits or that you turned on Unlimited mode.
Make sure that there's sufficient FreeableMemory, and ReadIOPS and WriteIOPS meet the provisioned quotas. If you use a gp2 volume, then confirm that you have available burst balance.
Check whether ReadThroughput and WriteThroughput reached the instance type quotas.
Note: A lack of resources in the replica instance can cause replica lag. It's a best practice to create the primary and replica instances with the same instance type, storage type, and number of IOPS. You can also scale up or down the read replica when usage is minimal compared with the primary instance.
Identify the timeframe when the replica lag started to increase, and then take the following actions:
Identify waits and deadlocks
Check whether there are any significant lock waits or deadlocks on the replica instance. Deadlocks occur between Select and DDL/DML transactions and cause delays in the application of transaction logs from the primary instance.
To check for blocking, run a query that's similar to the following example:
SELECT * FROM sys.sysprocesses WHERE blocked > 0;
Check for replica lag
Run a query on the primary instance to check for replica lag and maximum replica lag.
Replica lag
Run the following query:
SELECT
AR.replica_server_name,
DB_NAME (ARS.database_id) 'database_name',
AR.availability_mode_desc,
ARS.synchronization_health_desc,
ARS.last_hardened_lsn,
ARS.last_redone_lsn,
ARS.secondary_lag_seconds,
FROM
sys.dm_hadr_database_replica_states ARS
INNER JOIN
sys.availability_replicas AR ON ARS.replica_id = AR.replica_id
WHERE
DB_NAME(ARS.database_id) = 'database_name'
ORDER BY
AR.replica_server_name;
Note: Replace database_name with your database name.
Verify that the last_hardened_lsn value progresses on the read replica.
Maximum replica lag
For SQL Server, the ReplicaLag metric is the maximum lag of databases that are behind in seconds. For example, if you have two databases that lag 5 seconds and 10 seconds, then ReplicaLag is 10 seconds. To calculate the ReplicaLag metric, run the following query on the primary instance:
SELECT max(secondary_lag_seconds) max_lag
FROM sys.dm_hadr_database_replica_states;
Manage data synchronization and instance health
When you create a read replica, Amazon RDS takes a snapshot from the primary instance and then restores the snapshot to create a read replica instance. Amazon RDS replays transactions logs to synchronize the data with the primary instance. However, after you create a new instance, the instance experiences lazy loading that causes replica lag. This is expected behavior. To reduce the effects of lazy loading, use io1 or io2 volume type when you create the read replica. After you create the replica, you can convert back to a gp2 or gp3 volume type.
Run transactions in batches on the primary instance to avoid long transactions, and keep the transactions log file size small. Restart the replica instance only when required during high replica lag. Otherwise, Amazon RDS delays transactions log replays and your databases might to go into the Recovery state.
Because logs process from the primary instance, replica lag might temporarily occur when you modify the instance type on the primary or replica instance.
Also, when you modify the storage type or storage size, replica lag might temporarily occur until the storage optimization is completed. You can't monitor the progress of the storage optimization.
If you continue to experience replica lag, then check the status of user databases on the replica instance. To replay logs, the database status must be Online.
Note:
- Amazon RDS includes newly created databases in the lag calculation only after they're accessible on the read replica.
- ReplicaLag returns -1 when Amazon RDS can't determine the lag, such as during replica setup or when the read replica is in the Error state.
Related information
Working with read replicas for Microsoft SQL Server in Amazon RDS