Write latency elevated without any obvious cause

We are running MySQL on a db.r3.4xlarge with 4000 provisioned IOPS on SSD. This morning we started seeing increased write latency for long periods, with no apparent increase in load to explain it. CPU was under 12% and write IOPS were below 500; they actually dropped to around 100 once the average write latency rose from its normal value of around 2 milliseconds to 150 milliseconds. It almost looked as if something were throttling the writes, even though we are paying for provisioned capacity and staying well below that limit.

After thorough investigation of our queries, we concluded that it might be caused by hardware degradation of some kind. Since it's a Multi-AZ instance, we rebooted with failover and the problem went away, but then returned a little over an hour later. If this is something we're causing, we'd like to know how to avoid it, but we haven't seen it before in the three years we've been running this instance. Does anyone know what might cause write latency to be increased? It would also be good to understand more about how write latency is measured if that information is available.

Asked 5 years ago, 610 views
2 answers

Unfortunately, one of the underlying storage volumes backing your disks was defective. The defect was detected automatically and the volume was replaced. Your performance should be back to normal.
You should be able to look at the Enhanced Monitoring physical I/O metrics from this period and see that one of your volumes had elevated latencies for a period.
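
For anyone wanting to check this on their own instance, here is a minimal sketch of how the per-device figures could be pulled out of Enhanced Monitoring data. It assumes the metrics are delivered as JSON messages to the CloudWatch Logs group `RDSOSMetrics`, with the log stream named after the instance's resource ID, and that each message carries a `physicalDeviceIO` list whose entries include `device` and `await` (in milliseconds). The function names, the 20 ms threshold, and the device names in the usage note are illustrative assumptions, not anything from the original thread.

```python
import json
from collections import defaultdict


def max_await_by_device(messages):
    """Return {device: highest await in ms} seen across the given JSON messages."""
    worst = defaultdict(float)
    for raw in messages:
        sample = json.loads(raw)
        # Each sample lists one entry per physical device on the host.
        for dev in sample.get("physicalDeviceIO", []):
            name = dev["device"]
            worst[name] = max(worst[name], float(dev.get("await", 0.0)))
    return dict(worst)


def slow_devices(messages, threshold_ms=20.0):
    """Keep only devices whose worst await reached the (assumed) threshold."""
    return {
        name: await_ms
        for name, await_ms in max_await_by_device(messages).items()
        if await_ms >= threshold_ms
    }


def fetch_messages(resource_id, start_ms, end_ms):
    """Fetch raw Enhanced Monitoring messages for one instance and time window.

    resource_id is the instance's resource ID (used as the log stream name);
    start_ms and end_ms are epoch milliseconds. Requires boto3 and credentials.
    """
    import boto3  # imported lazily so the parsing helpers work without it

    logs = boto3.client("logs")
    paginator = logs.get_paginator("filter_log_events")
    pages = paginator.paginate(
        logGroupName="RDSOSMetrics",
        logStreamNames=[resource_id],
        startTime=start_ms,
        endTime=end_ms,
    )
    return [event["message"] for page in pages for event in page["events"]]
```

Calling `slow_devices(fetch_messages(resource_id, start_ms, end_ms))` over the affected window should then list only the volume or volumes whose latency spiked, for example a hypothetical `nvme2n1` at 150 ms while the other devices stay near 2 ms.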

Sorry for the inconvenience.

-Phil

AWS
Moderator
philaws
Answered 5 years ago

As far as I know, the defect was not detected automatically, but required creating a case with AWS support to correct. I'm noting this so that if anyone else runs into the same issue, they won't wait too long expecting it to resolve on its own. We experienced approximately 8 hours of degraded service because of this problem.

It's also worth noting that because RDS performs a synchronous write to both the primary and standby EBS volumes, elevated latency in either AZ can cause this problem. Manual failover will not help because the same two volumes are still being written to by the new primary instance.
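
To make that reasoning concrete, here is a deliberately simplified model, not a description of how RDS is actually implemented. It assumes a write is acknowledged only once both volumes have persisted it, so the effective latency is the slower of the two, and it ignores replication network overhead. The function name, AZ labels, and latency figures are illustrative.

```python
def commit_latency_ms(primary_volume_ms, standby_volume_ms):
    """Latency of a synchronous write: it completes when the slower volume does."""
    return max(primary_volume_ms, standby_volume_ms)


# One healthy volume (~2 ms) and one degraded volume (~150 ms), as in the question.
volume_latency_ms = {"az_a": 2.0, "az_b": 150.0}

# Before failover: the instance in az_a is primary, az_b is standby.
before_failover = commit_latency_ms(volume_latency_ms["az_a"], volume_latency_ms["az_b"])

# After failover: the roles swap, but the same two volumes are still written.
after_failover = commit_latency_ms(volume_latency_ms["az_b"], volume_latency_ms["az_a"])

print(before_failover, after_failover)
```

Under this model the two values are identical, which matches what we saw: swapping the roles does not change which volumes are in the write path.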

Answered 5 years ago