Write latency elevated without any obvious cause


We are running MySQL on a db.r3.4xlarge with 4,000 provisioned IOPS on SSD storage. This morning we started seeing elevated write latency for long periods, with no apparent increase in load to explain it. CPU was under 12% and write IOPS were below 500; they actually dropped to around 100 once average write latency rose from its normal value of around 2 milliseconds to 150 milliseconds. It looked almost as if something were throttling the writes, even though we're paying for provisioned capacity and staying well below that limit.
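As a rough sanity check on those numbers, Little's law (average requests in flight = throughput × average latency) gives an estimate of how many writes were outstanding at once. This is only a back-of-the-envelope sketch using the figures quoted above; the function name is made up for illustration.

```python
def outstanding_writes(write_iops: float, avg_latency_ms: float) -> float:
    """Little's law: average in-flight requests = rate (per second) * latency (seconds)."""
    return write_iops * avg_latency_ms / 1000.0


# Figures from the description above.
normal = outstanding_writes(write_iops=500, avg_latency_ms=2)      # about 1 write in flight
degraded = outstanding_writes(write_iops=100, avg_latency_ms=150)  # about 15 writes in flight

print(f"normal: {normal:.1f} in flight, degraded: {degraded:.1f} in flight")
```

Throughput fell while the number of writes waiting on storage grew, which looks more like each write taking longer to complete than like extra load hitting the IOPS limit.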

After a thorough investigation of our queries, we concluded that it might be caused by hardware degradation of some kind. Since it's a Multi-AZ instance, we rebooted with failover and the problem went away, but it returned a little over an hour later. If this is something we're causing, we'd like to know how to avoid it, but we haven't seen it before in the three years we've been running this instance. Does anyone know what might cause write latency to increase like this? It would also be good to understand more about how write latency is measured, if that information is available.

asked 5 years ago, 567 views
2 Answers

Unfortunately, there was a defective underlying storage volume on one of your disks. The defect was detected automatically and the volume was replaced, so your performance should be back to normal.
You should be able to look at the Enhanced Monitoring physical I/O metrics from this period and see that one of your volumes had elevated latency for a while.
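For anyone wanting to check this themselves, here is a minimal sketch of scanning Enhanced Monitoring samples for a slow volume. It assumes you have already exported the JSON log events (Enhanced Monitoring publishes them to the `RDSOSMetrics` log group in CloudWatch Logs) and that each event carries a `physicalDeviceIO` list with `device` and `await` fields; verify those field names against your own events. The device names, sample values, and 20 ms threshold below are made up for illustration.

```python
import json


def slow_devices(samples, await_threshold_ms=20.0):
    """Return {device: worst await in ms} for devices that exceeded the threshold."""
    worst = {}
    for sample in samples:
        # Assumed layout: one entry per physical volume in each sample.
        for dev in sample.get("physicalDeviceIO", []):
            name = dev.get("device")
            await_ms = dev.get("await")
            if name is None or await_ms is None:
                continue  # skip entries that lack the fields we need
            worst[name] = max(worst.get(name, 0.0), float(await_ms))
    return {name: ms for name, ms in worst.items() if ms > await_threshold_ms}


# Hypothetical exported log events, one JSON document per sample.
raw_events = [
    '{"physicalDeviceIO": [{"device": "xvdf", "await": 1.9}, {"device": "xvdg", "await": 2.1}]}',
    '{"physicalDeviceIO": [{"device": "xvdf", "await": 2.0}, {"device": "xvdg", "await": 148.0}]}',
]
samples = [json.loads(event) for event in raw_events]

print(slow_devices(samples))
```

A single device standing out while the others stay near their usual values is the pattern described above.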

Sorry for the inconvenience.

-Phil

AWS
MODERATOR
philaws
answered 5 years ago

As far as I know, the defect was not detected automatically, but required creating a case with AWS support to correct. I'm noting this so that if anyone else runs into the same issue, they won't wait too long expecting it to resolve on its own. We experienced approximately 8 hours of degraded service because of this problem.

It's also worth noting that because RDS performs a synchronous write to both the primary and standby EBS volumes, elevated latency in either AZ can cause this problem. Manual failover will not help because the same two volumes are still being written to by the new primary instance.
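To illustrate why failover doesn't help, here is a deliberately simplified model: if a write is only acknowledged once both volumes have it, the latency you observe is bounded below by the slower volume, whichever instance is currently primary. The volume names and numbers are hypothetical, and the model ignores network and replication overhead.

```python
def observed_write_latency_ms(primary_volume_ms: float, standby_volume_ms: float) -> float:
    """Synchronous write to both volumes: the write finishes when the slower one does."""
    return max(primary_volume_ms, standby_volume_ms)


# Hypothetical per-volume latencies: the volume in AZ "b" is the degraded one.
volume_latency_ms = {"volume_az_a": 2.0, "volume_az_b": 150.0}

# Before failover: the instance in AZ "a" is primary.
before = observed_write_latency_ms(
    volume_latency_ms["volume_az_a"], volume_latency_ms["volume_az_b"]
)

# After failover: roles swap, but the same two volumes are still written to.
after = observed_write_latency_ms(
    volume_latency_ms["volume_az_b"], volume_latency_ms["volume_az_a"]
)

print(before, after)
```

Under this model the degraded volume sets the latency in both cases, so only replacing that volume brings the number back down.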

answered 5 years ago
