What to monitor to decide on provisioning higher IOPS or Throughput on gp3

TL;DR:

gp3 storage allows you to provision extra IOPS or throughput. We do not know which metrics to monitor to decide whether that is a good idea.

Situation:

We have an AWS RDS for PostgreSQL instance for a data warehousing application. Not many users, but heavy load during the overnight batch jobs.

The current instance type is m6g.large (2 vCPU, 8 GiB RAM) with 1300 GB of storage on gp3 SSD, provisioned with 12,000 IOPS and 500 MiB/s of storage throughput.

Complication

We have issues in the nightly batches with EBS Byte Balance (EBSByteBalance%) depleting.

Questions

  • Should we increase either the IOPS or the storage throughput?
  • Which metrics should we look at to decide whether to increase IOPS or throughput?
  • Or should we upgrade the instance type to m6g.xlarge? The latter has a higher EBS-optimized baseline bandwidth.
4 Answers

Hello.

You can determine whether it is an IOPS or a throughput issue by looking at the metrics below (a minimal query sketch follows the list).
If these metrics are consistently high relative to what you have provisioned, it means that a large share of the configured IOPS and throughput is being consumed.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-metrics.html

  • ReadIOPS
    • The average number of disk read I/O operations per second.
  • WriteIOPS
    • The average number of disk write I/O operations per second.
  • ReadThroughput
    • The average number of bytes read from disk per second.
  • WriteThroughput
    • The average number of bytes written to disk per second.
  • DiskQueueDepth
    • The number of outstanding I/Os (read/write requests) waiting to access the disk.
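
As an illustration, here is a minimal sketch that pulls these metrics from CloudWatch for a batch window. It assumes Python with boto3 and credentials already configured; "my-dwh-instance" is a placeholder DBInstanceIdentifier, not a value from the question.

    # Sketch: fetch the RDS storage metrics listed above from CloudWatch.
    # "my-dwh-instance" is a placeholder DBInstanceIdentifier.
    # Note: ReadThroughput/WriteThroughput are reported in bytes per second.
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=8)  # roughly the overnight batch window

    for name in ["ReadIOPS", "WriteIOPS", "ReadThroughput",
                 "WriteThroughput", "DiskQueueDepth"]:
        stats = cloudwatch.get_metric_statistics(
            Namespace="AWS/RDS",
            MetricName=name,
            Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-dwh-instance"}],
            StartTime=start,
            EndTime=end,
            Period=300,                      # 5-minute datapoints
            Statistics=["Average", "Maximum"],
        )
        datapoints = stats["Datapoints"]
        if datapoints:
            peak = max(p["Maximum"] for p in datapoints)
            avg = sum(p["Average"] for p in datapoints) / len(datapoints)
            print(f"{name}: peak={peak:.1f}, average={avg:.1f}")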

When considering a change of instance type, I think it is better to also check CPU and memory utilization, which you can see with Enhanced Monitoring.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring.OS.Enabling.html

Performance Insights may also be helpful.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.Overview.ActiveSessions.html

answered 9 months ago
  • You may also find that upgrading the instance type naturally speeds up your overnight job because of the additional RAM and CPU available.


Hi,

You probably want to read this very detailed article on the matter: https://blog.purestorage.com/purely-technical/an-analysis-of-io-size-modalities-on-pure-storage-flasharrays

Throughput and IOPS are interrelated, but there is a subtle difference between them. Throughput is a measurement of the bits or bytes per second that can be processed by a storage device. IOPS refers to the number of read/write operations per second. Both IOPS and throughput can be used together to describe performance.
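
To make the relationship concrete, here is a tiny illustration: throughput is roughly IOPS multiplied by the average I/O size. The I/O sizes below are illustrative assumptions, not measurements from your workload, and 12,000 is simply the provisioned IOPS from your question.

    # Rough illustration: throughput ~= IOPS x average I/O size.
    # The I/O sizes are illustrative assumptions, not measurements.
    iops = 12000  # provisioned gp3 IOPS from the question
    for io_size_kib in (8, 64, 128):
        throughput_mib_s = iops * io_size_kib / 1024
        print(f"{iops} IOPS x {io_size_kib} KiB ~= {throughput_mib_s:.0f} MiB/s")
    # 8 KiB   -> ~94 MiB/s   (small I/O: IOPS is the binding dial)
    # 128 KiB -> ~1500 MiB/s (large I/O: throughput is the binding dial)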

To determine the right gp3 setup from the RDS metrics, you want to look at the specific metrics described at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-metrics.html

  • ReadThroughput + ReadIOPS and WriteThroughput + WriteIOPS: to see what you actually consume
  • DiskQueueDepth: to see whether I/O operations accumulate without being served immediately
  • NetworkReceiveThroughput & NetworkTransmitThroughput: to make sure that the network is not in fact the bottleneck between RDS and your requesting clients
  • ReplicaLag (if you have replicas): to make sure that replicas do not add latency to write operations

DiskQueueDepth is essential to monitor to reach optimal performance: if it keeps growing, it means you can improve your performance either by increasing the number of IOPS or by increasing the size of the data in each I/O operation (which reduces the number of required IOPS as a consequence). A small sketch of that decision logic follows.
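
Here is a minimal sketch of that logic, assuming Python. The inputs are plain numbers read off the ReadIOPS/WriteIOPS and Read/WriteThroughput graphs; the provisioned defaults match the question, and the example numbers at the bottom are made up.

    # Sketch: decide which gp3 dial looks tighter from observed averages.
    # Throughput inputs are in bytes per second, as CloudWatch reports them.
    def suggest_dial(read_iops, write_iops, read_tput_bytes, write_tput_bytes,
                     provisioned_iops=12000, provisioned_tput_mib=500):
        total_iops = read_iops + write_iops
        total_tput_mib = (read_tput_bytes + write_tput_bytes) / (1024 ** 2)
        avg_io_kib = (total_tput_mib * 1024) / total_iops if total_iops else 0.0

        iops_pct = 100 * total_iops / provisioned_iops
        tput_pct = 100 * total_tput_mib / provisioned_tput_mib

        print(f"average I/O size ~= {avg_io_kib:.0f} KiB, "
              f"IOPS at {iops_pct:.0f}%, throughput at {tput_pct:.0f}% of provisioned")
        return ("throughput is the tighter dial" if tput_pct > iops_pct
                else "IOPS is the tighter dial")

    # Made-up example: 3,000 combined IOPS moving 400 MiB/s in total
    print(suggest_dial(1000, 2000, 300 * 1024**2, 100 * 1024**2))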

Hope it helps!

Didier

answered 9 months ago

Based on your complication, "We have issues in the nightly batches with EBS Byte Balance depleting," it appears that during the batch job your byte balance was depleted. EBSByteBalance% is a metric telling us that the instance (m6g.large) has already used up the burst capability of its EBS instance bandwidth. The burst limit for this instance type is up to 4750 Mbps [1], while the baseline is 630 Mbps [2].

Addressing your questions: Should we increase either IOPS or the storage throughput? A: The limiting factor is not the volume's IOPS or throughput, so I don't think you should increase these.

Which metrics should we look at to decide whether to increase IOPS or throughput? A: You can see the throughput as ReadThroughput + WriteThroughput [3].

Or should we upgrade the instance type to m6g.xlarge? The latter has a higher EBS-optimized baseline bandwidth. A: You can evaluate this option. Please be aware that moving to m6g.xlarge results in a 1188 Mbps baseline throughput; verify the throughput you are getting now (combined read and write throughput [3]) and decide based on that. A back-of-the-envelope comparison of those limits is sketched below.
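
As a rough sketch of that comparison (the 500 MiB/s figure is the gp3 throughput from the question, the Mbps figures come from the documentation cited below, and treating the effective limit as the minimum of the two is only an approximation of how the limits interact):

    # Sketch: compare the gp3 volume throughput with the instance EBS bandwidth.
    # 500 MiB/s is the provisioned gp3 throughput from the question;
    # the Mbps figures are the documented EBS-optimized limits for m6g instances.
    def mbps_to_mib_s(mbps):
        return mbps * 1_000_000 / 8 / (1024 ** 2)

    volume_tput_mib = 500
    for label, mbps in [("m6g.large baseline", 630),
                        ("m6g.large burst", 4750),
                        ("m6g.xlarge baseline", 1188)]:
        instance_mib = mbps_to_mib_s(mbps)
        effective = min(volume_tput_mib, instance_mib)
        print(f"{label}: ~{instance_mib:.0f} MiB/s "
              f"-> effective limit ~{effective:.0f} MiB/s")
    # The m6g.large baseline (~75 MiB/s) is far below the volume's 500 MiB/s,
    # so a long batch drains EBSByteBalance% once the burst window is used up.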

[1] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html#Concepts.DBInstanceClass.Summary

[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html#current-general-purpose

[3] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-metrics.html#rds-cw-metrics-instance

Hope it helps

Attha

answered 9 months ago

Thank you all for your help.

I found that read and write throughput was indeed part of the issue, given the correlation shown below:

[image attached]

Thank you @Attha, I think your answer was the clearest.

First of all: the issue was 'solved' by moving from db.m6g.large to db.m6g.xlarge.

[image attached]

I think this is because of:

  1. more memory, reducing the need for temporary disk writes in large queries (as Gary Mclean already pointed out)
  2. the higher EBS-optimized baseline bandwidth [1]

What I still find confusing is that both the instance and the gp3 storage have bandwidth limits [2][3]. It made me think that, since you have two 'dials' you can turn on the gp3 storage, increasing either IOPS or throughput could have worked as well? Basically, I thought I could either try to find metrics that pointed me in the right direction, or start a trial-and-error approach, hence the question... I ended up with the latter, and luckily it seemed to work :-(

Do you think that increasing the gp3 throughput could have helped as well? I'm not even sure whether it would have been cheaper than increasing the instance size.

Anyway, I still don't feel there is good guidance on when to turn the dial on the gp3 throughput limit (for EBS IOPS there is at least a red line available; see image below).

[image attached]

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html#current

[2] https://repost.aws/questions/QUuek6dVHVSI-gb1i9a9rnSg/understanding-rds-throughput-limits

[3] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html#Concepts.Storage.GeneralSSD

answered 9 months ago
