Write latency elevated without any obvious cause


We are running MySQL on a db.r3.4xlarge with 4000 provisioned IOPS on SSD. This morning we started seeing increased write latency for long periods, with no apparent increase in load to explain it. CPU was under 12% and write IOPS were below 500; they actually dropped to around 100 once the average write latency rose from its normal value of around 2 milliseconds to 150 milliseconds. It almost looked as if something were throttling the writes, even though we're paying for provisioned capacity and staying well below that limit.

After thorough investigation of our queries, we concluded that it might be caused by hardware degradation of some kind. Since it's a Multi-AZ instance, we rebooted with failover and the problem went away, but it returned a little over an hour later. If this is something we're causing, we'd like to know how to avoid it, but we haven't seen it before in the three years we've been running this instance. Does anyone know what might cause write latency to increase like this? It would also be good to understand more about how write latency is measured, if that information is available.

asked 5 years ago, 610 views
2 Answers

Unfortunately, one of your disks had a defective underlying storage volume. The defect was detected automatically and the volume was replaced. Your performance should be back to normal.
You should be able to look at the Enhanced Monitoring physical I/O metrics from this period and see that one of your volumes had elevated latencies for part of that time.
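For anyone wanting to run that check themselves, here is a minimal sketch of scanning one Enhanced Monitoring sample for a slow volume. It assumes the sample is a JSON document with a `physicalDeviceIO` array whose entries carry `device` and `await` (milliseconds) fields; verify those names against your own Enhanced Monitoring log events. The sample payload, the device names, and the 20 ms threshold are all made up for illustration.

```python
import json
from typing import List, Tuple

# Hypothetical threshold; the question cites ~2 ms as the normal write latency.
AWAIT_THRESHOLD_MS = 20.0


def slow_devices(event_message: str,
                 threshold_ms: float = AWAIT_THRESHOLD_MS) -> List[Tuple[str, float]]:
    """Return (device, await_ms) pairs whose await exceeds threshold_ms.

    Assumes the Enhanced Monitoring sample exposes per-volume stats under
    "physicalDeviceIO" with "device" and "await" keys (an assumption to
    check against real log events).
    """
    sample = json.loads(event_message)
    flagged = []
    for dev in sample.get("physicalDeviceIO", []):
        await_ms = float(dev.get("await", 0.0))
        if await_ms > threshold_ms:
            flagged.append((dev.get("device", "unknown"), await_ms))
    return flagged


# Made-up sample: one healthy volume and one degraded volume.
sample_message = json.dumps({
    "physicalDeviceIO": [
        {"device": "xvdf", "await": 1.9},
        {"device": "xvdg", "await": 148.5},
    ],
})

print(slow_devices(sample_message))
```

Run over the samples from the affected window, a check like this would point at the single volume whose latency stands out from the rest.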

Sorry for the inconvenience.

-Phil

AWS
MODERATOR
philaws
answered 5 years ago

As far as I know, the defect was not detected automatically, but required creating a case with AWS support to correct. I'm noting this so that if anyone else runs into the same issue, they won't wait too long expecting it to resolve on its own. We experienced approximately 8 hours of degraded service because of this problem.

It's also worth noting that because RDS performs a synchronous write to both the primary and standby EBS volumes, elevated latency in either AZ can cause this problem. Manual failover will not help because the same two volumes are still being written to by the new primary instance.
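A toy model of why failover does not help, under the simplifying assumption that a synchronous write is acknowledged only once both volumes have persisted it, so the slower volume sets the observed latency. The function name is hypothetical, the numbers are the ones from the question, and network time between AZs is ignored.

```python
def commit_latency_ms(primary_ms: float, standby_ms: float) -> float:
    """Latency of one synchronous write in this simplified model.

    The write completes only after both volumes acknowledge it, so the
    slower of the two dominates.
    """
    return max(primary_ms, standby_ms)


healthy_ms, degraded_ms = 2.0, 150.0

# Before failover: the degraded volume happens to sit on the standby side.
before = commit_latency_ms(primary_ms=healthy_ms, standby_ms=degraded_ms)

# After failover the roles swap, but the same two volumes are still written.
after = commit_latency_ms(primary_ms=degraded_ms, standby_ms=healthy_ms)

print(before, after)
```

In this model both values come out at 150 ms: swapping which instance is primary changes nothing until the degraded volume itself is replaced.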

answered 5 years ago
