How can I identify if my Amazon EBS volume is micro-bursting and then prevent this from happening?

I have an Amazon Elastic Block Store (Amazon EBS) volume that isn't breaching its throughput or IOPS limit in Amazon CloudWatch. But the volume appears throttled and experiences high latency and queue length.

Short description

CloudWatch monitors the IOPS (op/s) and throughput (byte/s) for all Amazon EBS volume types by collecting a sample once every minute.

Micro-bursting occurs when an EBS volume bursts to high IOPS or throughput for a period that's significantly shorter than the collection period. Because the burst is averaged over the whole collection period, CloudWatch doesn't reflect it.

Example: An IO1 volume (one-minute collection period) with 950 provisioned IOPS has an application that pushes 1,000 IOPS for five seconds. Amazon EBS throttles the application when it reaches the volume's IOPS limit. At this point, the volume can't handle the workload, causing increased queue length and higher latency.

CloudWatch doesn't show that the volume breached the IOPS limit because the collection period is 60 seconds, and the 1,000 IOPS occurred for only 5 seconds. For the remaining 55 seconds of the one-minute collection period, the volume remains idle. This means that the number of VolumeReadOps+VolumeWriteOps over the whole minute is 5,000 operations (1,000 IOPS * 5 seconds). This equates to an average of 83.33 IOPS over one minute (5,000 operations / 60 seconds). An average this low usually isn't a concern.

In this case, the VolumeIdleTime at the same sample time is 55 seconds because the volume is idle for the remainder of the collection period. This means that the 5,000 operations (VolumeReadOps+VolumeWriteOps) at that sample time occur over only five seconds. If you divide 5,000 by 5 to calculate the average IOPS while the volume is active, then you get 1,000 IOPS, which exceeds the volume's provisioned limit of 950 IOPS.
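
The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch of the numbers quoted above. The variable names are illustrative and aren't part of any AWS API.

```python
# Worked numbers from the example: 1,000 IOPS for 5 seconds, then idle for 55 seconds.
PERIOD_SECONDS = 60
burst_iops = 1000
burst_seconds = 5
provisioned_iops = 950

total_ops = burst_iops * burst_seconds           # VolumeReadOps + VolumeWriteOps
idle_seconds = PERIOD_SECONDS - burst_seconds    # VolumeIdleTime

# What a per-minute average shows, versus the rate while the volume was active.
average_over_period = total_ops / PERIOD_SECONDS
average_while_active = total_ops / (PERIOD_SECONDS - idle_seconds)

print(f"Average over the full period: {average_over_period:.2f} IOPS")
print(f"Average while active: {average_while_active:.2f} IOPS")
print(f"Exceeds provisioned IOPS: {average_while_active > provisioned_iops}")
```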

To determine if micro-bursting is occurring on your volume, do the following:

  1. Use CloudWatch metrics to identify possible micro-bursting.
  2. Use CloudWatch to get the micro-bursting event.
  3. Confirm micro-bursting using an OS-level tool.
  4. Prevent micro-bursting by changing your volume size or type to accommodate your applications.

Resolution

Use CloudWatch metrics to identify possible micro-bursting

  1. Check the VolumeIdleTime metric. This metric indicates the total number of seconds in a specified period of time when no read or write operations are submitted. If VolumeIdleTime is high, then the volume remained idle for most of the collection period. Sufficiently high IOPS or throughput at the same sample time indicates that micro-bursting potentially occurred.
    To check throughput, use VolumeIdleTime together with the VolumeReadBytes and VolumeWriteBytes metrics.
  2. Use the following formula to calculate the average throughput that's reached when the volume is active:
    Estimated average throughput in Bytes/s = (Sum(VolumeReadBytes) + Sum(VolumeWriteBytes)) / (Period - Sum(VolumeIdleTime))
    To check IOPS, use VolumeIdleTime together with the VolumeReadOps and VolumeWriteOps metrics.
  3. Use the following formula to calculate the average IOPS that's reached when the volume is active:
    Estimated average IOPS in Ops/s = (Sum(VolumeReadOps) + Sum(VolumeWriteOps)) / (Period - Sum(VolumeIdleTime))
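
Both formulas share the same shape, so they can be sketched as a single helper. The function name estimated_active_rate and its parameters are hypothetical. The sketch assumes that you already have the Sum statistics for one collection period.

```python
from typing import Optional


def estimated_active_rate(
    read_sum: float,
    write_sum: float,
    idle_seconds: float,
    period_seconds: float = 60.0,
) -> Optional[float]:
    """Return (read_sum + write_sum) / (period - idle time).

    Pass byte sums to get Bytes/s, or operation sums to get Ops/s.
    Returns None when the volume was idle for the whole period,
    because the rate is undefined in that case.
    """
    active_seconds = period_seconds - idle_seconds
    if active_seconds <= 0:
        return None
    return (read_sum + write_sum) / active_seconds


# Numbers from the example in the short description: 5,000 operations, 55 seconds idle.
print(estimated_active_rate(5000, 0, 55))
```

Compare the returned value with the volume's provisioned IOPS or throughput limit. A result above the limit suggests that micro-bursting potentially occurred.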

Use CloudWatch to get the micro-bursting event

  1. Open the CloudWatch console.
  2. Choose All Metrics.
  3. Use the volume ID to search for the volume that's affected.
  4. To view throughput metrics, choose Browse, and then add VolumeReadBytes, VolumeWriteBytes, and VolumeIdleTime.
  5. Choose Graphed metrics.
  6. For Statistics, choose Sum, and for Period, choose 1 minute.
  7. For Add Math, choose Start with empty expression.
  8. In the Details of Expression, enter the graph IDs for the Estimated Average Throughput in Bytes/s formula. For example, (m1+m2)/(60-m3).

If the formula calculates a value that's greater than the maximum throughput for the volume, then micro-bursting occurred. To check the IOPS metrics, follow the preceding steps, and add VolumeReadOps, VolumeWriteOps, and VolumeIdleTime for step 4.
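
If you prefer to query the same expression programmatically, the following sketch builds a metric math request that mirrors the console steps. The function name build_microburst_queries and the volume ID are hypothetical placeholders. The sketch only constructs the request and doesn't call AWS.

```python
def build_microburst_queries(volume_id: str, period: int = 60) -> list:
    """Build CloudWatch metric queries for the estimated throughput expression."""

    def metric(query_id: str, metric_name: str) -> dict:
        # One Sum-statistic query per EBS metric; hidden so only the expression is returned.
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EBS",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
                },
                "Period": period,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return [
        metric("m1", "VolumeReadBytes"),
        metric("m2", "VolumeWriteBytes"),
        metric("m3", "VolumeIdleTime"),
        {
            "Id": "e1",
            # Same expression as in step 8 of the console procedure.
            "Expression": f"(m1+m2)/({period}-m3)",
            "Label": "Estimated average throughput (Bytes/s)",
            "ReturnData": True,
        },
    ]


# Placeholder volume ID; replace it with the ID of the affected volume.
queries = build_microburst_queries("vol-0123456789abcdef0")
print(queries[-1]["Expression"])
```

Assuming boto3 is installed and credentials are configured, you can pass the list as the MetricDataQueries argument of the CloudWatch client's get_metric_data call, together with StartTime and EndTime. For IOPS, substitute VolumeReadOps and VolumeWriteOps for the byte metrics.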

Confirm micro-bursting using an OS-level tool

The preceding formulas don't always identify micro-bursting in real time. This is because the volume might be micro-bursting even if the VolumeIdleTime is low.

Example: Your volume spikes to a level that breaches the volume's limits. The volume then reduces to a very low level of activity without being completely idle for the remainder of the collection period. The VolumeIdleTime metric doesn't reflect the low activity, even though micro-bursting occurred.
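
A small simulation illustrates why the formula can miss this case. The per-second trace below is invented for illustration: five seconds at 1,000 IOPS followed by 55 seconds of light activity, so the volume is never idle.

```python
# Hypothetical per-second operation counts for one 60-second collection period.
per_second_ops = [1000] * 5 + [10] * 55

# VolumeIdleTime counts only seconds with no submitted operations.
idle_seconds = sum(1 for ops in per_second_ops if ops == 0)
total_ops = sum(per_second_ops)

# The idle-time formula from the previous section.
estimated_iops = total_ops / (len(per_second_ops) - idle_seconds)
peak_iops = max(per_second_ops)

print(f"Idle seconds: {idle_seconds}")
print(f"Estimated average IOPS while active: {estimated_iops}")
print(f"Peak one-second IOPS: {peak_iops}")
```

Because the idle time is zero, the estimate falls back to the plain one-minute average and hides the spike. Only per-second data shows the peak.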

To confirm micro-bursting, use an OS-level tool that has a finer granularity than CloudWatch.

Linux

Use the iostat command. For more information, see iostat(1) on the Linux man page website.

1.    To report I/O statistics for all your mounted volumes with one-second granularity, run the following command:

iostat -xdmzt 1

Note: The iostat tool is part of the sysstat package. If you can't find the iostat command, then run the following command to install sysstat on Amazon Linux AMIs:

$ sudo yum install sysstat -y

2.    To determine whether you're reaching the throughput limit, review the rMB/s and wMB/s in the output. If rMB/s + wMB/s is greater than the volume's maximum throughput, then micro-bursting is occurring.

To determine whether you're reaching the IOPS limit, review the r/s and w/s in the output. If r/s + w/s is greater than the volume's maximum IOPS, then micro-bursting is occurring.
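
To avoid reading the output by eye, you can scan captured iostat output for samples that exceed the volume's limits. The following is a minimal sketch. The function name find_bursts, the device name, the limits, and the sample text are all hypothetical, and the sample is trimmed to four columns. Real iostat output has more columns, and their order varies between sysstat versions, so the sketch looks up columns by header name.

```python
def find_bursts(iostat_text: str, device: str, max_iops: float, max_mbps: float) -> list:
    """Return (iops, mbps) for each sample of the device that exceeds a limit."""
    header = None
    bursts = []
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device"):
            header = fields  # remember the column names for the rows that follow
            continue
        # Skip banner lines, timestamps, and other devices.
        if header is None or fields[0] != device or len(fields) != len(header):
            continue
        row = dict(zip(header[1:], map(float, fields[1:])))
        iops = row["r/s"] + row["w/s"]
        mbps = row["rMB/s"] + row["wMB/s"]
        if iops > max_iops or mbps > max_mbps:
            bursts.append((iops, mbps))
    return bursts


# Invented, trimmed sample of two one-second reports.
SAMPLE = """\
Device            r/s     w/s     rMB/s     wMB/s
nvme1n1        120.00  900.00      1.50     60.00

Device            r/s     w/s     rMB/s     wMB/s
nvme1n1          5.00   10.00      0.10      0.20
"""

print(find_bursts(SAMPLE, "nvme1n1", max_iops=950, max_mbps=500))
```

Note that the first report that iostat prints normally covers the time since boot rather than the last second, so you might want to discard it before scanning.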

Windows

Run the perfmon command to open Windows Performance Monitor. For more information, see Determine your IOPS and throughput requirements.

Prevent micro-bursting by changing your volume size or type to accommodate your applications

Change the volume to a type and size that accommodates your required IOPS and throughput. For more information on volume types and their respective IOPS and throughput limits, see Amazon EBS volume types. Note that the instance also has limits on the total IOPS and throughput that it can push to all attached EBS volumes.

It's a best practice to benchmark your volumes against your workload to verify which volume types can safely accommodate your workload in a test environment. For more information, see Benchmark EBS volumes.

AWS OFFICIAL - Updated a year ago
1 Comment

Got latency? This was very helpful in identifying micro-bursting on our MS SQL cluster as the culprit. The normal volume monitoring metrics do not show accurate IOPS use (at least they did not for us). When we followed these steps (with the help of an AWS rep), we were able to see that where we thought we were over-provisioning the SQL data volume at 4K IOPS (since normal IOPS monitoring was saying we never came close to 4K), it was actually bursting between 6K and 10K. That is why we were seeing a lot of latency (using SQL queries to display disk performance). Why AWS does not include this metric in disk monitoring is beyond me, but I'm glad that AWS support was able to assist us in getting to the truth.

replied 5 months ago