Sudden Drop in Aurora Cache Hit Rate Leading to Major I/O Bottleneck — Seeking Root Cause


We recently experienced a major service disruption that lasted for a significant period due to a database-related issue. While we have identified the initial symptoms of the incident, we have not yet been able to determine the root cause and are seeking assistance from the community.

The disruption appears to have been caused by a sudden drop in the cache hit rate on the data access path, resulting in most query requests reaching Aurora’s underlying storage (network-based distributed storage), which we suspect led to severe I/O bottlenecks. We observed a sharp increase in cache misses in both PostgreSQL’s primary cache (shared_buffers) and Aurora’s secondary cache (NVMe Tiered Cache). As a result, dependency on the slower storage I/O path increased significantly. This led to a state where CPU utilization remained low, but queries were left waiting on I/O, causing system-wide delays. (We’ve attached a relevant image for reference.)
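
For context, the shared_buffers hit ratio we refer to can be computed from the standard pg_stat_database view; a minimal sketch (the counters are cumulative since the last statistics reset, so deltas between snapshots are what matter):

```sql
-- Minimal sketch: shared_buffers hit ratio per database from the standard
-- pg_stat_database view. blks_hit = reads served from shared_buffers,
-- blks_read = reads that had to go below the buffer cache.
-- Counters are cumulative, so compare deltas between two snapshots.
SELECT datname,
       blks_hit,
       blks_read,
       round(blks_hit::numeric / NULLIF(blks_hit + blks_read, 0), 4) AS hit_ratio
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1')
ORDER BY blks_read DESC;
```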

Our primary question is: what could cause such a sudden and sharp drop in cache hit rate? The incident began at approximately 21:00 on 2025-03-18 (UTC+9). No deployments or changes were made around that time, and incoming traffic was consistent with normal patterns, which makes it difficult for us to identify the root cause.

Ultimately, we were able to recover the system by scaling up the Aurora instance, which alleviated the I/O bottleneck. However, we are still seeking to understand why the cache hit rate dropped so drastically in the first place. **We would greatly appreciate any insights or similar experiences regarding sudden drops in Aurora cache hit rates.**

What we’ve tried / checked so far:

  • Verified no application deployments occurred during the incident window
  • Confirmed no spike in traffic volume or changes in query patterns (snapshot query sketched below)
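
For the query-pattern check, snapshots of pg_stat_statements taken before and during the incident window can be compared; a sketch along these lines (this assumes the pg_stat_statements extension is enabled):

```sql
-- Top statements by blocks read from below shared_buffers (assumes pg_stat_statements
-- is enabled). A sudden change in which queryids dominate shared_blks_read between
-- snapshots would indicate a shift in the query mix.
SELECT queryid,
       calls,
       shared_blks_hit,
       shared_blks_read,
       round(shared_blks_hit::numeric
             / NULLIF(shared_blks_hit + shared_blks_read, 0), 4) AS hit_ratio
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 20;
```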

[Four attached images]

asked 2 months ago · 55 views
1 Answer

The sudden drop in cache hit rate leading to a major I/O bottleneck in your Aurora database could be attributed to several factors:

  1. Query pattern changes: Even without a spike in traffic volume, there might have been a shift in the types of queries being executed. If suddenly more queries were accessing data not present in the cache, it could lead to increased cache misses.

  2. Data growth or schema changes: If there was a significant increase in data volume or changes to the database schema, it could affect the effectiveness of existing caches and indexes.

  3. Background processes: Certain background processes like backups, maintenance tasks, or analytics jobs might have started running, causing the cache to be flushed or filled with less frequently accessed data (a query sketch for checking this follows the list).

  4. Configuration changes: Unintended changes to database configuration parameters, particularly those related to caching and memory allocation, could impact cache performance.

  5. Resource constraints: If the database was approaching memory limits, it might have started evicting cached data more aggressively, leading to more cache misses.

  6. Storage system issues: Problems with the underlying storage layer could slow I/O operations, so even a modest increase in cache misses would translate into disproportionately severe query delays.

  7. Outdated statistics: If the query planner's statistics were outdated, it might have chosen suboptimal plans (for example, sequential scans instead of index scans) that read far more data than the caches can hold, driving the hit rate down.
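
For point 3, a quick way to see whether maintenance or other background work is competing with application queries is to inspect pg_stat_activity; a minimal sketch (column names assume PostgreSQL 10 or later):

```sql
-- Non-client backends and any manual VACUUM/ANALYZE running right now.
-- Long-running autovacuum or maintenance work can evict hot pages from the cache.
SELECT pid,
       backend_type,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS runtime,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE backend_type <> 'client backend'
   OR query ILIKE 'vacuum%'
   OR query ILIKE 'analyze%'
ORDER BY runtime DESC NULLS LAST;
```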

To further investigate and prevent future occurrences:

  1. Review query patterns around the time of the incident to identify any unusual activity.
  2. Check for any scheduled jobs or background processes that might have run during that time.
  3. Verify if there were any recent changes to the database schema or significant data growth.
  4. Examine database configuration parameters, especially those related to caching and memory management.
  5. Analyze Performance Insights and CloudWatch metrics (e.g., BufferCacheHitRatio, ReadIOPS, ReadLatency) for any correlating events or anomalies.
  6. Consider running ANALYZE to update statistics and potentially REINDEX important tables (see the sketch after this list).
  7. Monitor storage performance metrics to ensure there are no underlying I/O issues.
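
For point 6, it is worth checking how stale the statistics are before deciding whether a full ANALYZE is needed; a sketch using the standard pg_stat_user_tables view (ANALYZE samples every table and adds read load, so a quiet window is preferable):

```sql
-- How stale are the planner statistics? Large n_mod_since_analyze values on hot
-- tables suggest the planner may be working from outdated row estimates.
SELECT relname,
       last_analyze,
       last_autoanalyze,
       n_mod_since_analyze
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC
LIMIT 20;

-- Refresh statistics for the current database; prefer per-table ANALYZE on very
-- large schemas to limit the extra read load.
ANALYZE;
```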

While scaling up the instance provided a temporary solution, it's important to identify and address the root cause to prevent similar incidents in the future. If the issue persists or recurs, engaging AWS support for a more in-depth analysis of your specific Aurora cluster configuration and workload patterns may be beneficial.

Sources
How Mindbody improved query latency and optimized costs using Amazon Aurora PostgreSQL Optimized Reads | AWS Database Blog
Implicit IO limits on Aurora | AWS re:Post
Aurora RDS slow performance | AWS re:Post
Increased wait/io/redo_log_flush on RDS Aurora | AWS re:Post

answered 2 months ago
