DocumentDB: unexpected query latency spikes


I'm currently using DocumentDB 4.0 with the following configuration and stats for a single collection.

  • instances: 3
  • instance size: db.r6g.2xlarge
  • collection stats (see the stats sketch after this list):
    • count: 23M
    • size: 157G
    • indexes: 6
    • index size: 26G (3G to 7G each)
    • collscans: 38
    • idxscans: 301M
    • cache collHitRatio: 97.3%
    • cache idxHitRatio: 99.8%
  • typical metrics (5min avg)
    • opcounter-query: 7000
    • readIOPs: 24,000
    • The number of read operations is quite consistent
    • freeable memory: 15G to 16G
    • CPU utilization: 25% to 56% (66% at maximum)
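
For reference, this is roughly how I gathered the collection stats above. A minimal mongosh sketch: "mycoll" stands in for the real collection name, and the collScans/idxScans counters and cacheStats hit ratios are what DocumentDB 4.0's collStats returns, as far as I can tell.

    // Minimal sketch: where the stats above come from (mongosh).
    // "mycoll" is a placeholder for the real collection name.
    const s = db.mycoll.stats(1024 * 1024 * 1024); // scale sizes to GB
    print("count:            " + s.count);
    print("size (GB):        " + s.size);
    print("index size (GB):  " + s.totalIndexSize);
    print("collScans:        " + s.collScans);
    print("idxScans:         " + s.idxScans);
    printjson(s.cacheStats); // collHitRatio, idxHitRatio, ...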

Under these conditions, I see slow queries in the profiler like this:

  • execStats.executionTimeMillisEstimate: unpredictable; I sometimes see over 30 s
  • execStats.nReturned: 0 to 100
  • execStats.stage: IXSCAN; nothing more than FETCH with IXSCAN. There's no COLLSCAN at all.

The slow queries show up consistently, at around 1,700 queries per minute. That's almost 25% of all queries. Huge!

I ran explain() with the same query directly on DocumentDB, but it never showed a slow execution time estimate. It's around 0.015 s, which is exactly what I expected, because almost all queries are covered by the index.
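
This is the kind of comparison I ran; a sketch, where "orders" and the filter fields are hypothetical stand-ins for the real query:

    // Sketch in mongosh: re-run the same filter the profiler flagged.
    // "orders", customerId, and status are placeholder names.
    db.orders.find({ customerId: "c-123", status: "open" })
             .explain("executionStats")
    // Fields I compared against the profiler entry:
    //   executionStats.executionTimeMillis -> ~15 ms here, 30+ s in the profiler
    //   queryPlanner.winningPlan           -> IXSCAN (plus FETCH), never COLLSCAN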

I understand that this will vary with the query, the document structure, and the data volume. But regardless of the details, this doesn't seem like an 'expected result'. My questions are:

  • Can you guess why query execution time is so unpredictable?
  • Could a maximum of 66% CPU utilization affect query execution this much, causing latencies over 30 s that I can't reproduce with an execution plan?
  • There's plenty of freeable memory, so why does it perform so much disk I/O (ReadIOPS) instead of serving data from the memory buffer cache?
  • Regarding the instance size, should 20,000+ read IOPS be considered beyond its capacity?

Thanks.

GLY
asked 3 months ago · 344 views
1 Answer

Wow, that's a lot of detail. It sounds like the working set of your queries is larger than the available memory. DocumentDB reserves part of instance memory as a buffer cache for data and index pages, but if the working set doesn't fit, it has to fetch data from disk repeatedly.
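
As a back-of-the-envelope check with the numbers from your question (the buffer-cache fraction below is an assumption on my part, not a documented constant for your setup):

    // Rough working-set math using the numbers from the question.
    // ASSUMPTION: roughly 2/3 of instance RAM is available as buffer
    // cache; treat this fraction as an estimate, not a guarantee.
    const ramGiB = 64;                 // db.r6g.2xlarge
    const cacheGiB = ramGiB * 2 / 3;   // assumed buffer cache, ~43 GiB
    const indexGiB = 26;               // total index size (from the question)
    const dataGiB = 157;               // collection size (from the question)
    print("cache ~" + cacheGiB.toFixed(0) + " GiB; indexes alone use " + indexGiB + " GiB");
    // Indexes take more than half of the assumed cache, so even a small
    // hot fraction of the 157 GiB of documents spills to disk, which
    // shows up as ReadIOPS even while the reported hit ratios stay high.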

You may have queries that are not optimally written and are not taking advantage of indexes. Without seeing the actual queries it is hard to say for certain, but non-optimal queries could lead to more disk I/O.
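
One thing worth checking is whether the hot queries are covered by an index. A sketch; the collection, index, and field names are hypothetical, and whether the engine can keep a plan index-only depends on the query shape:

    // Hypothetical example: the same filter, covered vs. not covered.
    db.orders.createIndex({ customerId: 1, status: 1 });

    // Covered: filter and projection touch only indexed fields, so the
    // plan can often stay at IXSCAN without fetching full documents.
    db.orders.find({ customerId: "c-123" },
                   { _id: 0, customerId: 1, status: 1 })
             .explain("executionStats");

    // Not covered: returning whole documents forces a FETCH after the
    // IXSCAN, and every buffer-cache miss on that FETCH becomes disk I/O.
    db.orders.find({ customerId: "c-123" }).explain("executionStats");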

Concurrency and lock contention could be causing queries to wait on each other. With a high volume of queries, some may be queued behind others. Temporary spikes in load could be causing queueing and slower response times for queries during those periods. The instance size of db.r6g.2xlarge may be insufficient to handle the workload, though the average CPU utilization does not seem exceptionally high.
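
You can look for operations queued behind others while a spike is happening. A sketch using currentOp, which DocumentDB supports; the 5-second threshold is arbitrary:

    // Sketch: list operations that have been running for 5+ seconds.
    db.adminCommand({
      currentOp: 1,
      secs_running: { $gte: 5 }
    });
    // In the output, secs_running and (on DocumentDB) WaitState hint at
    // whether an operation is doing I/O or waiting behind other work.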

To troubleshoot further, I would recommend using the DocumentDB profiler to analyze the slowest queries; it acts as the slow query log, capturing details of operations that exceed a configurable threshold.
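
If the profiler isn't enabled yet, it's turned on through the cluster parameter group. A sketch with the AWS SDK for JavaScript; the region, parameter group name, and 100 ms threshold are placeholders for your own values:

    // Sketch (Node.js, AWS SDK v3): enable the DocumentDB profiler.
    import { DocDBClient, ModifyDBClusterParameterGroupCommand }
      from "@aws-sdk/client-docdb";

    const client = new DocDBClient({ region: "us-east-1" });
    await client.send(new ModifyDBClusterParameterGroupCommand({
      DBClusterParameterGroupName: "my-docdb-params",  // placeholder
      Parameters: [
        { ParameterName: "profiler", ParameterValue: "enabled",
          ApplyMethod: "immediate" },
        { ParameterName: "profiler_threshold_ms", ParameterValue: "100",
          ApplyMethod: "immediate" },
      ],
    }));
    // Slow operations then appear in CloudWatch Logs, provided the
    // cluster exports the "profiler" log type.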

EXPERT
answered 3 months ago
  • Wow! Thank you for your advice. Indeed, there might be some inefficient queries. I'm sure there are no transactions. One thing I couldn't understand: when I copied a slow query (one that took over 100 ms) from the profiler and pasted it into the console with explain() to see the execution plan, it estimated far less time, say 0.017 s or less. Anyway, it seems the instance size is insufficient, as you advised. I'll try upgrading the instance size. Thanks again.
