Skip to content

AWS RDS is giving us IO waits on our queries, system is borderline completely unresponsive

0

Our production RDS Postgres is suddenly showing severe IO waits, holding queries indefinitely. These are the same queries that have run fine for years, and EXPLAIN shows nothing unusual.

Rebooting the RDS instance temporarily helps, but the system pretty much freezes right after. We're using gp3 for our storage.

No recent code or schema changes.

Could this be a memory leak or resource exhaustion? What RDS/Postgres issues commonly cause sudden IO wait spikes? It really seems like we're having disk issues. What should we do about them.

  • 2nd dev here.

    We’ve been running a stable RDS PostgreSQL instance (db.t3.medium) for quite a while without issues. Starting yesterday, we noticed a sudden and sustained slowdown in query performance across our application, without any major changes on our side.

    Key observations so far:

    • Queries that normally complete instantly are now taking significantly longer.
    • Autovacuum and query plans haven’t changed; nothing in our schema or workload has shifted recently.
    • We checked cache hit ratios by table, and several high-read tables now show unexpectedly low cache hit percentages (e.g. Table A at ~35%, Table B at ~12%). This suggests far more data is being read from storage instead of memory.
    • Our workload has not grown enough to explain this drastic change.
    • Instance-level monitoring shows increased IO wait, aligning with what feels like degraded underlying EBS or network throughput to storage.

    So far we’ve:

    • upgraded gp2 to gp3 to remove any IOPS throttling
    • Compared before/after cache hit ratios, confirming the workload pattern is the same but performance is degraded.

    Given these findings, we are concerned this may be related to the underlying EBS volumes or networking between RDS and EBS, which is outside of our control.

    Question: Can AWS confirm whether there are any known issues with the underlying EBS or networking in our region/availability zone, or if our instance may be experiencing degraded I/O capacity?

  • Can't even download the error log files. Just downloads a 0 byte file.

asked 10 months ago408 views

1 Answer
-2

Based on your description of sudden IO wait spikes in your RDS PostgreSQL instance, this sounds like an I/O bottleneck issue. Here are some likely causes and solutions:

Possible causes:

  1. Connection spikes: A sudden increase in database connections can overwhelm your instance, causing I/O waits. Check your CloudWatch metrics for spikes in the DatabaseConnections metric.

  2. Storage throughput limits: Your RDS instance may be hitting the IOPS limits of your storage volume. If you're using gp2 storage, you might be reaching its burstable IOPS limit during peak load.

  3. Sequential scans: Your queries might be performing sequential scans on large tables, which can cause IO:DataFileRead wait events when the required pages aren't available in shared memory.

  4. Buffer pool size: If your shared buffer pool is too small for your workload, it will cause excessive disk reads.

  5. Background processes: Multiple vacuum workers running simultaneously or maintenance operations can cause I/O contention.

Recommended actions:

  1. Check wait events: Look for IO:DataFileRead events, which occur when a connection waits for a backend process to read a page from storage because it's not in shared memory.

  2. Monitor storage metrics: Review CloudWatch metrics like DiskQueueDepth, ReadLatency, WriteLatency, ReadIOPS, WriteIOPS, ReadThroughput, and WriteThroughput to confirm I/O bottlenecks.

  3. Upgrade storage: If you're using gp2 storage, consider upgrading to gp3, which allows scaling IOPS independently from storage and has a higher baseline IOPS.

  4. Scale up your instance: A higher-capacity instance class provides more memory for the shared buffer pool and more vCPUs to handle connections.

  5. Optimize for read workloads: If applicable, take advantage of reader nodes or read replicas to distribute read operations and reduce I/O pressure on the writer node.

  6. Check for table bloat: Tables consuming unnecessary space can cause performance issues. Use the pgstattuple extension to identify tables with bloat.

  7. Review autovacuum settings: Ensure autovacuum is properly configured to prevent table bloat.

The fact that rebooting temporarily helps suggests that this could be related to memory management or buffer cache issues that are temporarily resolved by clearing memory, but then quickly recur as the system operates.
Sources
Capture and diagnose I/O bottlenecks on Amazon RDS for SQL Server | AWS Database Blog
IO:DataFileRead - Amazon Aurora
RDS for PostgreSQL wait events - Amazon Relational Database Service
IO:DataFileRead - Amazon Relational Database Service
RDS for PostgreSQL wait events - Amazon Relational Database Service

answered 10 months ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.