Hello.
If it's an IOPS or throughput issue, you can determine that by looking at the metrics below (there is a small sketch after the list for pulling them from CloudWatch).
If the values of these metrics are high, it means that a large portion of the provisioned IOPS and throughput is being consumed.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-metrics.html
- ReadIOPS
- The average number of disk read I/O operations per second.
- WriteIOPS
- The average number of disk write I/O operations per second.
- ReadThroughput
- The average number of bytes read from disk per second.
- WriteThroughput
- The average number of bytes written to disk per second.
- DiskQueueDepth
- The number of outstanding I/Os (read/write requests) waiting to access the disk.
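If you prefer scripting over the console, here is a minimal sketch of pulling these metrics with boto3. The instance identifier, region, and time window are placeholders to adjust to your setup.

```python
# Minimal sketch: pull the RDS storage metrics listed above from CloudWatch.
# "my-db-instance" and the region are placeholders for your own values.
# Throughput metrics are in bytes/second, IOPS in operations/second,
# DiskQueueDepth is a plain count of outstanding I/Os.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")  # adjust region

METRICS = ["ReadIOPS", "WriteIOPS", "ReadThroughput", "WriteThroughput", "DiskQueueDepth"]
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=24)  # e.g. cover last night's batch window

for metric in METRICS:
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName=metric,
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db-instance"}],
        StartTime=start,
        EndTime=end,
        Period=300,                       # 5-minute buckets
        Statistics=["Average", "Maximum"],
    )
    peak = max((p["Maximum"] for p in stats["Datapoints"]), default=0.0)
    print(f"{metric}: peak over the window = {peak:.1f}")
```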
When changing the instance type, I think it is better to also check the CPU and memory utilization, which you can see with Enhanced Monitoring.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_Monitoring.OS.Enabling.html
Performance Insights may also be helpful.
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_PerfInsights.Overview.ActiveSessions.html
Hi,
You probably want to read this very detailed article on that matter: https://blog.purestorage.com/purely-technical/an-analysis-of-io-size-modalities-on-pure-storage-flasharrays
Throughput and IOPS are interrelated but there is a subtle difference between them.
Throughput is a measurement of bits or bytes per second that can be processed by a storage device. IOPS refers to the number of read/write operations per second. Both IOPS and throughput can be used together to describe performance.
To determine the right gp3 setup, you want to look at some specific RDS metrics described at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-metrics.html
- ReadThroughput + ReadIOPS and WriteThroughput + WriteIOPS: to see what you are actually getting
- DiskQueueDepth: to see if I/O operations accumulate without being served immediately
- NetworkReceiveThroughput & NetworkTransmitThroughput: to make sure that the network is not in fact the bottleneck between RDS and your requesting clients.
- ReplicaLag (if you have replicas): to make sure that replicas do not add latency to write operations.
DiskQueueDepth is essential to monitor to reach optimal performance: if it keeps increasing, it means you can still improve your performance, either by increasing the number of IOPS or by increasing the size of the data in each I/O operation (which reduces the number of required IOPS as a consequence).
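One way to decide which dial matters is to derive the average I/O size from the throughput and IOPS metrics. A small illustrative sketch follows; the sample numbers are made up and should be replaced with your CloudWatch values.

```python
# Sketch: derive the average I/O size from throughput and IOPS.
# Throughput (bytes/s) = IOPS * average I/O size (bytes), so dividing the two
# shows whether you are IOPS-bound (many small I/Os) or throughput-bound
# (fewer, larger I/Os). Sample values below are for illustration only.

read_throughput_bytes_per_s = 90_000_000   # e.g. peak ReadThroughput from CloudWatch
read_iops = 1_500                          # e.g. peak ReadIOPS from CloudWatch

avg_io_size_kib = read_throughput_bytes_per_s / read_iops / 1024
print(f"Average read I/O size: {avg_io_size_kib:.0f} KiB")

# Small average I/O sizes (a few KiB) with a growing DiskQueueDepth suggest
# provisioning more IOPS; large I/Os (hundreds of KiB) pushing the byte limit
# suggest raising throughput (or the instance's EBS bandwidth) instead.
```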
Hope it helps!
Didier
Based on your complication, "We have issues in the nightly batches with EBS Byte Balance depleting," it appears that during the batch job your byte balance was depleted. EBS byte balance (the EBSByteBalance% metric) tells us that the instance (m6g.large) has already used up the burst capability of its EBS instance bandwidth. The limit for this instance type is up to 4750 Mbps [1], while the baseline is 630 Mbps [2].
Addressing your questions: Should we increase either IOPS or the storage throughput? A: You are not limited by the volume's IOPS or throughput, so I don't think you should increase these.
To what metrics do we need to look to decide if we should increase IOPS or Throughput? A: You can check your throughput as ReadThroughput + WriteThroughput [3] (a quick comparison sketch follows the references below).
Or should we upgrade the instance type to m6g.xlarge? The latter has higher EBS-Optimized baseline bandwidth. A: You can evaluate this option; please be aware that moving to m6g.xlarge will result in a 1188 Mbps baseline throughput. Verify the throughput you are seeing now (combined read and write throughput [3]) and decide based on that.
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html#current-general-purpose
[3] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-metrics.html#rds-cw-metrics-instance
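As mentioned above, here is a rough sketch for comparing your combined read + write throughput against the instance baselines quoted earlier (630 Mbps for m6g.large, 1188 Mbps for m6g.xlarge). The sample throughput numbers are placeholders; substitute the peaks you observe in CloudWatch during the batch window.

```python
# Sketch: check combined Read+Write throughput against the instance's
# EBS baseline bandwidth (baselines as quoted in this answer).
# Sample throughput numbers below are placeholders.

BASELINE_MBPS = {"db.m6g.large": 630, "db.m6g.xlarge": 1188}

read_bytes_per_s = 45_000_000    # peak ReadThroughput (bytes/s)
write_bytes_per_s = 35_000_000   # peak WriteThroughput (bytes/s)

combined_mbps = (read_bytes_per_s + write_bytes_per_s) * 8 / 1_000_000
for instance, baseline in BASELINE_MBPS.items():
    ratio = combined_mbps / baseline
    print(f"{instance}: using {combined_mbps:.0f} of {baseline} Mbps baseline ({ratio:.0%})")

# Sustained usage near or above the baseline drains the EBS byte balance,
# which is the depletion you saw during the nightly batch.
```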
Hope it helps
Attha
Thank you all for your help.
I found that read & write throughput was indeed part of the issue, because of the correlation:
Thank you @Attha, I think your answer was the most clear.
First of all: the issue was 'solved' by moving from db.m6g.large to db.m6g.xlarge,
I think because of:
- more memory, reducing the need for temporary disk writes in large queries (as Gary Mclean already pointed out)
- higher EBS Optimized Baseline bandwidth [1]
What I still find confusing is that both the instance and the gp3 storage have bandwidth limits [2][3]. It made me think that, since you have two 'dials' you can turn on the gp3 storage, increasing either IOPS or throughput could have worked as well? Basically, I thought I could either try to find some metrics that pointed me in the right direction, or start a trial-and-error approach, hence the question... I ended up with the latter, and luckily it seemed to work :-(
Do you think that increasing the gp3 throughput could have helped as well? I'm not even sure it would have been cheaper than increasing the instance size (see the sketch after the references below for how I picture the two limits interacting).
Anyway, I still don't have a good feeling that there is solid guidance on when to turn the dial on the gp3 throughput limit (for EBS IOPS there is at least a red line available, see image below).
[1] : https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-optimized.html#current
[2] : https://repost.aws/questions/QUuek6dVHVSI-gb1i9a9rnSg/understanding-rds-throughput-limits
[3]: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html#Concepts.Storage.GeneralSSD
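For what it's worth, here is a small sketch of how I now picture the 'two dials' interacting: the effective ceiling is roughly the minimum of the gp3 volume's provisioned throughput and the instance's EBS baseline bandwidth [1][3]. The gp3 default of 125 MiB/s and the instance baselines quoted earlier are assumptions to verify against the documentation for your exact configuration.

```python
# Sketch of the "two dials" interaction: the effective throughput ceiling is
# roughly min(gp3 provisioned throughput, instance EBS baseline bandwidth).
# 125 MiB/s is assumed as the gp3 default; the instance baselines are the
# values quoted in the answers above.

GP3_DEFAULT_THROUGHPUT_MIBS = 125  # assumed gp3 baseline throughput per volume

def effective_throughput_mibs(instance_baseline_mbps: float,
                              volume_throughput_mibs: float = GP3_DEFAULT_THROUGHPUT_MIBS) -> float:
    """Convert the instance EBS baseline from Mbps to MiB/s and take the minimum."""
    instance_mibs = instance_baseline_mbps * 1_000_000 / 8 / (1024 * 1024)
    return min(instance_mibs, volume_throughput_mibs)

print(f"db.m6g.large : {effective_throughput_mibs(630):.0f} MiB/s")   # instance is the ceiling
print(f"db.m6g.xlarge: {effective_throughput_mibs(1188):.0f} MiB/s")  # volume becomes the ceiling
```

Under these assumed numbers, the db.m6g.large's ~75 MiB/s instance ceiling sits below the gp3 default of 125 MiB/s, which would explain why turning up only the gp3 throughput dial might not have helped, while the xlarge moves the bottleneck back to the volume.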
You may also find that upgrading the instance type naturally speeds up your overnight job because of the additional RAM and CPU available.