How do I troubleshoot lag in my RDS for SQL Server read replica?

5 minute read
0

I have an Amazon Relational Database Service (Amazon RDS) for Microsoft SQL Server instance with read replica. I'm seeing one of the following in my DB instance:

There is a sudden increase in replica lag. Modification of the instance started to cause replica lag. The database on the read replica instance isn't accessible.

How can I troubleshoot these issues?

Short description

Amazon RDS for SQL Server enterprise edition supports creation of a read replica within the same Region. Data replication is asynchronous and uses Always-On technology to replicate data from a master to a replica instance. RDS for SQL Server doesn't intervene to mitigate high replica lag between a source DB instance and its read replicas.

Resolution

1.    Check resource utilization on the master and on the replica instance using Amazon CloudWatch. Use the Enhanced Monitoring and Performance Insights features to check resource usage at a granular level.

Important considerations for metrics on the master and replica instances:

2.    It's a best practice to create the master and replica instances with the same instance class, storage type, and number of IOPS. This avoids replica lag due to lack of resources in the replica instance. Additionally, depending on the workload, the read replica can be scaled up or scaled down if usage is minimal compared to the master instance.

3.    Identify the timeframe when replica lag started to increase and then do the following:

Check the WriteIOPS, WriteThroughput, NetworkReceiveThroughput and NetworkTrasmitThroughput metrics on the master instance based on start time of replica lag. Determine if the lag is due to write activity. Check the same metrics in the same time period on the read replica.

Check if there are long-running transactions on the master instance. The following is an example query to the check status of active transactions:

SELECT * FROM sys.sysprocesses WHERE open_tran = 1;

4.    On the replica instance, check if there are any significant lock waits or deadlocks. Deadlocks occur between Select and DDL/DML transactions and cause delays in applying transaction logs from the master instance.

The following is an example query to check for blocking:

SELECT * FROM sys.sysprocesses WHERE blocked > 0;

5.    Query to check for replica lag and maximum replica lag.

Replica lag

SELECT AR.replica_server_name
     , DB_NAME (ARS.database_id) 'database_name'
     , AR.availability_mode_desc
     , ARS.synchronization_health_desc
     , ARS.last_hardened_lsn
     , ARS.last_redone_lsn
     , ARS.secondary_lag_seconds
FROM sys.dm_hadr_database_replica_states ARS
INNER JOIN sys.availability_replicas AR ON ARS.replica_id = AR.replica_id
--WHERE DB_NAME(ARS.database_id) = 'database_name'
ORDER BY AR.replica_server_name;

Verify that the 'last_hardened_lsn' value is progressing on the read replica.

Maximum replica lag

For SQL Server, the ReplicaLag metric is the maximum lag of databases that have fallen behind, in seconds. For example, if you have two databases that lag 5 seconds and 10 seconds, respectively, then ReplicaLag is 10 seconds. The ReplicaLag metric returns the value of the following query. Run the query on the master instance.

select max(secondary_lag_seconds) max_lag  from sys.dm_hadr_database_replica_states;

6.    After you initiate read replica creation, a snapshot is taken from the master instance, and then restored to create a read replica instance. Transactions logs are replayed to synchronize the data with the master instance. However, after you create new instance, that instance experiences lazy loading, which causes replica lag. This is expected behavior. To minimize the effect of lazy loading, use IO1 type storage during read replica creation and then convert it back to GP2, if required.

7.    Run transactions in batches on the master instance. This avoids running long transactions and keeps the transaction log file size minimal. Don't restart the replica instance unless required during high replica lag as doing this further delays replay of transaction logs.

8.    Modification of the instance class on the master or replica instance might cause temporary replica lag. This is expected behavior because logs are processing from the master instance.

Changing the storage type or storage size has a longer impact on replica lag until the storage optimization is completed. It isn't possible to find how much percentage of storage optimization is completed on RDS instances.

9.    If the read replica reaches the storage full state, then the transaction logs from the master instance aren't processed and replica lag increases.

If you suspect that storage space is due to TempDB or temporary tables, then restart the replica instance to temporarily release space.

10.    If you're experiencing no progress in replica lag status, check the status of user databases on the replica instance. To replay logs, the database status must be Online.

Be aware of the following:

  • Newly created databases aren't included in the lag calculation until they're accessible on the read replica.
  • ReplicaLag returns -1 if RDS can't determine the lag, such as during replica setup, or when the read replica is in the error state.

Related information

Working with read replicas for Microsoft SQL Server in Amazon RDS