Skip to content

Intermittent 0 Records Issue During Parallel Queries on Aurora MySQL Reader Instance

0

Hi everyone,

We are experiencing an intermittent issue with our Amazon Aurora MySQL cluster (version: 8.0.mysql_aurora.3.08.0).

During parallel execution (via AWS Lambda), some queries occasionally return 0 records, even though the data exists. When the same queries are retried after a short delay, they return the correct results. A few key observations:

This issue occurs only when queries are executed in parallel. The problem is observed on the reader instance, while the same queries on the writer instance consistently return correct results. Retrying the same query after a brief delay resolves the issue, indicating that the data is not missing but temporarily inconsistent.

And it should not be due to the replication lag because the data is there from before, not changing when the query runs.

Environment details:

Engine: Amazon Aurora MySQL Version: 8.0.mysql_aurora.3.08.0 Setup: Writer + Reader instances Execution: Parallel Lambda invocations

Has anyone faced a similar issue with Aurora reader instances under parallel load?

Any insights on this or how best to handle this scenario would be really helpful. Thanks in advance!

Sample Query: SELECT DISTINCT t1.id, t1.store_id, t2.marketplace_type, t2.id AS marketplace_id,

COALESCE(t4.product_sku, t3.product_sku) AS productSku,
COALESCE(t4.brand, t3.brand) AS productBrand,
COALESCE(t4.id, t3.id) AS listing_id,
COALESCE(t4.product_type, t3.product_type) AS product_type,
COALESCE(t4.listing_master_id, t3.listing_master_id) AS kitted_listing_master_id,

COALESCE(t5.marketplace_posting_id, t1.marketplace_posting_id) AS marketplace_posting_id,
COALESCE(t5.price_type, t1.price_type) AS price_type,
COALESCE(t5.default_selling_price, t1.default_selling_price) AS default_selling_price,

t3.product_sku AS parent_product_sku,
t3.brand AS parent_product_brand,
t3.category_id AS parent_category_id,
t3.default_vendor AS parent_default_vendor,
COALESCE(t4.default_vendor, t3.default_vendor) AS default_vendor

FROM table_1 AS t1

LEFT JOIN table_2 AS t2 ON t1.marketplace_id = t2.id

LEFT JOIN table_3 AS t3 ON t1.listing_id = t3.id AND t3.tenant_id = :tenant_id

LEFT JOIN table_4 AS t4 ON t3.id = t4.variant_parent_id AND t4.product_type = :child_product_type AND t4.deleted_at IS NULL AND t4.tenant_id = :tenant_id

LEFT JOIN table_5 AS t5 ON t4.id = t5.listing_id AND t1.store_id = t5.store_id AND t5.store_id = :store_id

WHERE t1.store_id = :store_id AND t3.product_type NOT IN (:type1, :type2, :type3) AND t3.deleted_at IS NULL AND t1.delist_status NOT IN (:status)

ORDER BY t1.id DESC

LIMIT 30000 OFFSET 60000;

asked a month ago71 views
2 Answers
3

I guess, the most probable cause is a edge case with Aurora Parallel Query (PQ) on reader instances. Under high concurrency, the coordination between the DB engine and the Storage Layer can occasionally fail to reconstruct the read-view correctly for a parallel scan, resulting in an empty result set instead of an error.

Try disable Parallel Query at the session level within your Lambda function to test this hypothesis:

SET aurora_parallel_query = OFF;

If the intermittent "0 records" issue disappears, you have identified the root cause. This is a much more likely culprit than transaction isolation levels, especially since your data isn't changing during execution.

EXPERT
answered a month ago
1

Based on your description, this intermittent issue with parallel queries returning 0 records on Aurora MySQL reader instances is likely related to transaction isolation levels and how parallel query execution interacts with reader instances.

Key Issue: Isolation Level on Reader Instances

Aurora MySQL reader instances have specific isolation level requirements for parallel query execution. By default, reader instances use the REPEATABLE READ isolation level. However, parallel query on reader instances can also work with READ COMMITTED isolation level (available in Aurora MySQL version 2.09 and higher, which includes your version 3.08.0).

The critical point is that READ COMMITTED isolation level behaves differently on reader instances compared to writer instances - it's less strict on readers. This difference in isolation behavior could explain why you're seeing inconsistent results during parallel execution on readers while the writer returns consistent results.

Potential Causes:

  1. Long-running transactions: If there are long-running transactions on your reader instance, they can prevent parallel query execution for subsequent queries, potentially causing inconsistent behavior.

  2. Temporary table issues: Your query uses DISTINCT, multiple LEFT JOINs, and LIMIT with OFFSET, which can generate temporary tables during execution. Under parallel load, there have been known issues with temporary table handling that can cause unexpected behavior, including database restarts or incorrect query results in certain scenarios.

  3. Parallel query execution plan instability: There have been documented cases where Aurora MySQL reader instances can experience issues while executing queries that utilize parallel query execution plans, particularly under concurrent load.

Recommended Solutions:

  1. Set explicit isolation level: Try setting the isolation level explicitly at the session level in your Lambda functions when connecting to the reader. You might want to test with READ COMMITTED isolation level using the aurora_read_replica_read_committed configuration option.

  2. Optimize your query: Your query has several characteristics that could be problematic:

  • The LIMIT with large OFFSET (60000) is inefficient and can cause issues
  • Multiple LEFT JOINs with COALESCE operations increase complexity
  • Consider using appropriate indexes for the JOIN conditions and WHERE clauses
  • The DISTINCT operation combined with ORDER BY and LIMIT can generate large temporary tables
  1. Consider disabling parallel query temporarily: Test whether disabling parallel query for this specific query resolves the issue. This would help confirm if parallel query execution is the root cause.

  2. Monitor for specific conditions: Check if the issue correlates with:

  • Large UPDATE or DDL operations running on the writer instance
  • High concurrent query load on the reader
  • Buffer pool state and table statistics
  1. Query rewrite: Consider rewriting the query to avoid the large OFFSET. Use a keyset pagination approach instead (WHERE id < last_seen_id ORDER BY id DESC LIMIT 30000), which is more efficient and less prone to consistency issues.

The fact that retrying resolves the issue suggests a timing-related problem with how the reader instance handles parallel execution under load, possibly related to temporary table materialization or isolation level handling during concurrent access.
Sources
SQL constructs for parallel query in Aurora MySQL - Amazon Aurora
Troubleshoot table errors and storage full issues in Aurora My-SQL Compatible databases | AWS re:Post
Aurora MySQL database engine updates 2023-04-17 (version 3.02.3) (Deprecated) - Amazon Aurora

answered a month ago
EXPERT
reviewed a month ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.