How do I troubleshoot the high write or read latency of the Amazon EBS volumes in my Amazon RDS instance?

I want to troubleshoot the latency of the Amazon Elastic Block Store (Amazon EBS) volumes in my Amazon Relational Database Service (Amazon RDS) instance.

Short description

Before you begin troubleshooting latency issues, set up monitoring with the following:

  • Amazon CloudWatch metrics
  • Enhanced Monitoring
  • Performance Insights

After you set up monitoring, troubleshoot your latency issues based on the following causes:

  • Micro-bursting
  • Lazy load
  • Amazon RDS storage
  • Amazon RDS instance
  • Throttling

Resolution

Set up monitoring

To set up monitoring for latency issues, complete the following steps:

Amazon CloudWatch metrics

  • Check ReadIOPS and WriteIOPS to determine whether sufficient IOPS is provisioned. If these metrics reach the limit, then switch to gp3 or Provisioned IOPS storage, or increase the allocated storage for gp2 volumes.
  • Check EBSIOBalance% to monitor for intensive I/O and EBSByteBalance% to check whether the throughput is too high for the instance type. Consistently low values for these metrics indicate an IOPS or throughput issue at the instance level.
  • For instance classes with burst bucket I/O, check whether the IOPS or throughput values are limited to the baseline because the burst credits are exhausted.
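
The checks above can be sketched in a few lines of Python. This is only an illustration: it assumes you already exported per-minute ReadIOPS, WriteIOPS, and EBSIOBalance% datapoints from CloudWatch, and the `MinuteSample` type, the function names, and the 90% and 20% thresholds are hypothetical choices, not AWS recommendations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MinuteSample:
    """One per-minute datapoint exported from CloudWatch (hypothetical shape)."""
    read_iops: float
    write_iops: float
    io_balance_pct: float  # EBSIOBalance%


def minutes_near_iops_limit(samples: List[MinuteSample], iops_limit: float,
                            threshold: float = 0.9) -> List[int]:
    """Return indexes of samples whose total IOPS is at or above threshold * limit."""
    return [
        i for i, s in enumerate(samples)
        if s.read_iops + s.write_iops >= threshold * iops_limit
    ]


def balance_consistently_low(samples: List[MinuteSample], low_pct: float = 20.0,
                             min_fraction: float = 0.8) -> bool:
    """True when at least min_fraction of the samples have a balance below low_pct."""
    if not samples:
        return False
    low = sum(1 for s in samples if s.io_balance_pct < low_pct)
    return low / len(samples) >= min_fraction
```

Pass the IOPS limit that applies to your own storage configuration as `iops_limit`. The sketch doesn't derive it.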

Enhanced Monitoring

Throttling of IOPS or throughput indicates that the IOPS or throughput is inadequate for the workload at the storage or instance level. To resolve this issue, complete the following steps:

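Optimize the SQL queries that create the most load on the database. If you can't tune the queries further, then you might need to increase the provisioned IOPS. Enhanced Monitoring can help you locate the thread ID of a process that generates heavy I/O. To map that thread ID to a database session, see the following examples: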

SQL Server:

select * from master..sysprocesses where kpid = <example-thread-id>

MySQL (Performance Insights or Performance Schema must be turned on):

select p.* from information_schema.processlist p
  join performance_schema.threads t on p.id = t.processlist_id
  where t.thread_os_id = <Thread ID from EM processlist>;

If throttling of the IOPS or throughput occurs at the instance level, then scale up the instance class to achieve a higher capacity. For more information, see Viewing OS metrics in the RDS console.

Performance Insights

To identify queries that impact the performance of the database, combine Enhanced Monitoring with Performance Insights. Review the OS metrics at a granularity of 1 second and look for patterns in the SQL Server disk metrics, such as:

  • totalKb
  • usedKb
  • usedPc
  • availKb
  • availPc
  • rdCountPS
  • rdBytesPS
  • wrCountPS
  • wrBytesPS

For more information, see OS metrics in Enhanced Monitoring.
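
As a rough illustration of what "looking for patterns" can mean, the following Python sketch scans a series of per-second disk samples for the busiest second and the lowest free-space percentage. The field names come from the list above. Representing each sample as a plain dictionary, and the two function names, are assumptions made for this example.

```python
from typing import Dict, List, Tuple

DiskSample = Dict[str, float]


def peak_io_sample(disk_samples: List[DiskSample]) -> Tuple[int, float]:
    """Return (index, operations per second) of the sample with the
    highest combined rdCountPS + wrCountPS."""
    totals = [s["rdCountPS"] + s["wrCountPS"] for s in disk_samples]
    idx = max(range(len(totals)), key=totals.__getitem__)
    return idx, totals[idx]


def min_free_space_pct(disk_samples: List[DiskSample]) -> float:
    """Return the lowest availPc value seen across the samples."""
    return min(s["availPc"] for s in disk_samples)


# Made-up sample values, for illustration only.
samples = [
    {"rdCountPS": 120.0, "wrCountPS": 80.0, "availPc": 41.0},
    {"rdCountPS": 2400.0, "wrCountPS": 500.0, "availPc": 40.5},
    {"rdCountPS": 150.0, "wrCountPS": 90.0, "availPc": 40.5},
]
busiest_index, busiest_ops = peak_io_sample(samples)
```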

Troubleshoot latency issues

Micro-bursting

Micro-bursting occurs when an EBS volume bursts to high IOPS or throughput for periods that are much shorter than the metric collection period. CloudWatch collects these metrics every 60 seconds, so a burst that lasts only a few seconds is averaged out and doesn't appear in the CloudWatch metrics.
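
To see why a one-minute average hides a short burst, consider this minimal Python sketch. The numbers are invented for illustration: a volume with an assumed limit of 3,000 IOPS runs at that limit for 5 seconds and at 240 IOPS for the other 55 seconds of the minute.

```python
from typing import Dict, List


def summarize_minute(per_second_iops: List[float], iops_limit: float) -> Dict[str, float]:
    """Compare the one-minute average with the per-second peak."""
    average = sum(per_second_iops) / len(per_second_iops)
    peak = max(per_second_iops)
    seconds_at_limit = sum(1 for v in per_second_iops if v >= iops_limit)
    return {
        "average": average,
        "peak": peak,
        "seconds_at_limit": seconds_at_limit,
    }


# 55 quiet seconds followed by a 5-second burst at the assumed limit.
one_minute = [240] * 55 + [3000] * 5
summary = summarize_minute(one_minute, iops_limit=3000)
```

In this example the per-minute average is 470 IOPS, which looks far below the limit, even though the volume was saturated for 5 seconds.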

Turn on Enhanced Monitoring with a granularity of 1 second to determine if micro-bursting causes the latency issues:

  • Use the Read IO/s and Write IO/s metrics to determine the actual IOPS utilization.
  • Use the Read Kb/s and Write Kb/s metrics to determine the actual throughput utilization.

For more information, see OS metrics in Enhanced Monitoring.

Lazy load

When you restore a DB instance from a snapshot, the storage blocks are loaded lazily from Amazon S3, so the first access to each block has a higher latency. Use Enhanced Monitoring to identify whether any storage volumes aren't performing at a normal level. For more information, see Why is it taking so long to restore a snapshot of my Amazon RDS for MySQL DB instance?

If you still have issues, then contact AWS Support.

Amazon RDS storage

  • Check the configuration information of the Amazon RDS instance. Check whether the DB instance uses gp2, gp3, or Provisioned IOPS storage.
  • As a guideline, expect a DiskQueueDepth of about one per minute for every 1,000 IOPS, and ReadLatency or WriteLatency within 10 milliseconds. If spikes occur, then note the time of each spike.
  • With gp3 and io1 storage, you can allocate the desired IOPS. With gp3, you can also choose the amount of throughput to provision. gp2 volumes that are smaller than 1,000 GB can burst up to 3,000 IOPS. For more information, see gp3 storage.
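
The queue depth and latency guidelines above can be expressed as a small Python sketch. It assumes that you already have a list of ReadLatency or WriteLatency datapoints from CloudWatch, which reports these metrics in seconds. The function names are hypothetical.

```python
from typing import List, Tuple


def expected_queue_depth(iops: float) -> float:
    """Guideline from the text: a queue depth of one for every 1,000 IOPS."""
    return iops / 1000.0


def latency_spikes(latency_seconds: List[float],
                   threshold_ms: float = 10.0) -> List[Tuple[int, float]]:
    """Return (index, latency in ms) for datapoints above the threshold.

    CloudWatch reports ReadLatency and WriteLatency in seconds, so each
    value is converted to milliseconds before the comparison.
    """
    spikes = []
    for i, value in enumerate(latency_seconds):
        latency_ms = value * 1000.0
        if latency_ms > threshold_ms:
            spikes.append((i, latency_ms))
    return spikes
```

For example, `expected_queue_depth(3000)` returns `3.0`, and `latency_spikes([0.004, 0.025, 0.003])` flags only the second datapoint.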

Amazon RDS instance

  • Check the configuration information of the Amazon RDS instance. Check the DB instance class and the provisioned IOPS to determine the IOPS limit or throughput limit for the DB instance class.
  • Use CloudWatch graphs to check for spikes in the DiskQueueDepth, ReadLatency, and WriteLatency values. It's a best practice to expect a DiskQueueDepth of about one per minute for every 1,000 IOPS. ReadLatency or WriteLatency is expected to be within 10 milliseconds. If you notice spikes, then identify the time of each spike.
  • Use CloudWatch graphs to view the ReadIOPS and WriteIOPS metrics. Check whether IOPS reached or exceeded the limit during the time frame when the spikes occurred in the DiskQueueDepth, ReadLatency, and WriteLatency values.
  • Use CloudWatch graphs to view the ReadThroughput and WriteThroughput metrics. Check whether throughput reached or exceeded the limit during the time frame when the spikes occurred in the ReadThroughput and WriteThroughput values.
  • If you're using an EBS-optimized RDS instance class, then use CloudWatch graphs to check for throttling of IOPS or throughput. For instance classes with burst capacity, view the EBSIOBalance% and EBSByteBalance% metrics in the CloudWatch graphs. Consistently low percentage values indicate an IOPS or throughput bottleneck at the instance level.
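
The correlation steps above can be sketched as follows. This Python example assumes that you exported per-minute datapoints into dictionaries with the hypothetical keys shown in the code, with latency already converted to milliseconds and throughput to MB/s. The limits are values that you look up for your own storage and instance class.

```python
from typing import Dict, List, Tuple

Datapoint = Dict[str, object]


def classify_spikes(datapoints: List[Datapoint], iops_limit: float,
                    throughput_limit_mbps: float,
                    latency_threshold_ms: float = 10.0) -> List[Tuple[str, List[str]]]:
    """For each minute with a latency spike, report whether IOPS or
    throughput was at its limit during the same minute."""
    findings = []
    for p in datapoints:
        worst_latency = max(p["read_latency_ms"], p["write_latency_ms"])
        if worst_latency <= latency_threshold_ms:
            continue  # no latency spike in this minute
        causes = []
        if p["read_iops"] + p["write_iops"] >= iops_limit:
            causes.append("iops")
        if p["read_mbps"] + p["write_mbps"] >= throughput_limit_mbps:
            causes.append("throughput")
        # "other" means neither limit explains the spike in this minute.
        findings.append((p["time"], causes or ["other"]))
    return findings
```

A spike that is labeled "other" isn't explained by either limit in that minute, so look at micro-bursting, lazy loading, or instance-level balance metrics instead.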

Throttling

Throttling of IOPS, throughput, or both indicates that the IOPS or throughput is inadequate for your workload at the storage level. To resolve this issue, complete the following steps:

  • Identify the SQL queries that create more load on the database and then optimize these queries. If the workload is as expected or there's no scope for tuning the SQL queries, then increase the storage size for a higher IOPS capacity.

Note: After you increase the storage size of an RDS instance, you can't reduce the size to the previous value.

  • Switch the volume from General Purpose (gp2) to Provisioned IOPS (io1). During this modification, there's a brief suspension of I/O before the instance enters the storage-optimizing state. After the brief suspension, the volume delivers the expected performance.
  • If throttling of IOPS or throughput occurs at the instance level, then scale up the instance class to get a higher capacity.
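
To estimate how much gp2 storage a target baseline requires before you increase the storage size, you can use a sketch like the following. It relies on the gp2 baseline of 3 IOPS for each GiB with a minimum of 100 IOPS. The `max_iops` default of 16,000 is the figure for a single gp2 volume and is an assumption here, because the effective ceiling for an RDS instance depends on the engine and on how the storage is laid out. Verify the limits for your configuration before you rely on these numbers.

```python
import math

GP2_IOPS_PER_GIB = 3   # gp2 baseline ratio
GP2_MIN_IOPS = 100     # gp2 baseline floor for small volumes


def gp2_baseline_iops(size_gib: int, max_iops: int = 16000) -> int:
    """Baseline IOPS for a gp2 allocation of size_gib.

    max_iops is an assumed ceiling; adjust it for your configuration.
    """
    return min(max(GP2_MIN_IOPS, GP2_IOPS_PER_GIB * size_gib), max_iops)


def gp2_size_for_iops(target_iops: int) -> int:
    """Smallest allocation in GiB whose baseline reaches target_iops."""
    return math.ceil(target_iops / GP2_IOPS_PER_GIB)
```

For example, `gp2_baseline_iops(500)` returns `1500`, and `gp2_size_for_iops(4000)` returns `1334`. Because you can't reduce the storage size later, compare this estimate with the cost of moving to gp3 or io1.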
AWS OFFICIAL. Updated 8 months ago.