Skip to content

How do I resolve high HADR_SYNC_COMMIT wait times in Amazon RDS for SQL Server Multi-AZ instances?

5 minute read
1

My Amazon Relational Database Service (Amazon RDS) for Microsoft SQL Server Multi-AZ instance has high HADR_SYNC_COMMIT wait times that cause slow query performance.

Short description

High HADR_SYNC_COMMIT wait times occur when transactions on the primary replica wait for the secondary replica to acknowledge a written transaction log.

The following issues can cause high HADR_SYNC_COMMIT wait times:

  • Slow disk I/O or I/O throttling
  • High CPU utilization
  • Network latency
  • Resource-intensive index operations that generate high transaction log volume

For more information about HADR_SYNC_COMMIT wait times, see HADR_SYNC_COMMIT wait updates on the Microsoft website.

Resolution

Prerequisites:

Monitor replication lag and log send queue issues

Run the following SQL query:

SELECT
  ag.name AS ag_name,
  ar.replica_server_name,
  db_name(drs.database_id) AS database_name,
  drs.log_send_queue_size  AS log_send_queue_kb,
  drs.log_send_rate AS log_send_rate_kb_s,
  drs.redo_queue_size AS redo_queue_kb,
  drs.redo_rate AS redo_rate_kb_s,
  drs.last_sent_time,
  drs.last_received_time,
  drs.last_hardened_time,
  DATEDIFF(MILLISECOND, drs.last_sent_time,
       drs.last_hardened_time) AS send_to_harden_lag_ms,
  drs.synchronization_state_desc,
  drs.synchronization_health_desc
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id
JOIN sys.availability_groups ag  ON ar.group_id = ag.group_id
 WHERE drs.is_primary_replica = 0
ORDER BY drs.log_send_queue_size DESC;

Review the query results for high values in log_send_queue_kb and send_to_harden_lag_ms for replication lag. For more information, see Troubleshooting log send queueing in an Always On availability group on the Microsoft website.

Identify queries that cause high HADR_SYNC_COMMIT wait times

Complete the following steps:

  1. Open the Database Instance Dashboard on the Amazon CloudWatch console.
  2. Select the time frame that corresponds to the period of high wait times.
  3. Check the Top SQL tab to identify queries with the longest execution times and high HADR_SYNC_COMMIT wait times.
  4. Review the resource-intensive queries and the wait events during this period.

Check metrics for your DB instance in CloudWatch

To identify resource bottlenecks, use the CloudWatch console to check the following metrics for your DB instance:

  • Confirm that the ReadIOPS and WriteIOPS metrics don't exceed the allocated IOPS for the DB instance.
  • Confirm that the ReadThroughput and WriteThroughput metrics don't exceed the allocated throughput for the DB instance.
  • Check whether the CPUUtilization metric is consistently high during periods of high HADR_SYNC_COMMIT wait times.
  • Check whether the FreeableMemory metric is low. Low memory causes an increase in disk I/O operations when the buffer pool can't effectively cache data.
  • Check whether the DiskQueueDepth metric is high. High DiskQueueDepth values show that I/O requests accumulate faster than the storage subsystem can process them.

Use Enhanced Monitoring to identify resource bottlenecks

To determine whether read or write IOPS exceed the allocated IOPS and throughput or the instance throttles I/O operations, monitor OS metrics.

Complete the following steps:

  1. Open the Amazon Aurora and RDS console.
  2. In the navigation pane, choose Databases, and then select the database that you want to monitor.
  3. Choose the Monitoring tab.
  4. Choose the Monitoring menu, and then choose Enhanced Monitoring.
  5. On the Monitoring menu, choose Manage graphs.
  6. Select readIOsPS, writeIOsPS, readKb, and writeKb disk I/O metrics.
  7. Choose the corresponding settings icon to switch between the primary and secondary instance. Then, review the disk I/O metrics for both instances.

Reduce transaction log volume

To reduce the volume of transaction log records that are generated and the associated HADR_SYNC_COMMIT wait times, take the following actions:

  • To reduce the volume of transaction log records that are generated during modification operations, remove unnecessary indexes.
  • Schedule index rebuild and reorganization operations during low-traffic periods. Use index reorganization instead of full rebuilds when possible. Reorganization generates fewer log records than full rebuilds.
  • Reduce the maximum degree of parallelism (MAXDOP) for index maintenance operations. A lower MAXDOP limits parallelism and reduces the volume of log data that must be written to the secondary replica. However, a lower MAXDOP also reduces the operation speed.

Increase IOPS and throughput allocation

If CloudWatch or Enhanced Monitoring metrics confirm that IOPS or throughput is throttled, then take the following actions:

Note: If workload doesn't increase with high HADR_SYNC_COMMIT wait times, then upgrade the DB instance to the latest supported minor version to improve replication performance. As an immediate workaround, reboot the DB instance without failover to restore normal operation.

Related information

SQL Server, availability replica object on the Microsoft website

Monitoring Amazon RDS metrics with Amazon CloudWatch

Monitoring OS metrics with Enhanced Monitoring

Monitoring DB load with Performance Insights on Amazon RDS

Amazon Elastic Block Store (Amazon EBS) volume performance

AWS OFFICIALUpdated 2 months ago