DocumentDB: unexpected query latency spikes


I'm currently using DocumentDB 4.0 with the following configuration and stats for a single collection.

  • instances: 3
  • instance size: db.r6g.2xlarge
  • collection stats:
    • count: 23M
    • size: 157G
    • indexes: 6
    • index size: 26G (3G to 7G each)
    • collscans: 38
    • idxscans: 301M
    • cache: collection hit ratio: 97.3%
    • cache: index hit ratio: 99.8%
  • typical metrics (5min avg)
    • opcounter-query: 7000
    • readIOPs: 24,000
    • The number of read operations is quite consistent
    • freeable memory: 15G to 16G
    • CPU utilization: 25% to 56% (66% at maximum)
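A quick back-of-the-envelope check puts these numbers in context. It assumes a db.r6g.2xlarge has 64 GiB of RAM and that roughly two-thirds of that is usable as buffer cache; the two-thirds fraction is an assumption worth verifying against the DocumentDB documentation:

```python
# Rough working-set vs. buffer-cache estimate, using the stats above.
# Assumption: db.r6g.2xlarge has 64 GiB RAM; the usable cache fraction (2/3)
# is a guess -- check the DocumentDB docs for your engine version.
instance_ram_gib = 64
buffer_cache_gib = instance_ram_gib * 2 / 3   # ~42.7 GiB

data_gib = 157     # collection size
index_gib = 26     # total index size
working_set_gib = data_gib + index_gib        # 183 GiB if everything is hot

print(f"buffer cache ~{buffer_cache_gib:.1f} GiB, working set up to {working_set_gib} GiB")
print(f"indexes alone ({index_gib} GiB) fit in cache; data + indexes do not")
```

Even under this rough model, the indexes fit in memory (which matches the 99.8% index hit ratio) but the full data set does not, so fetching documents can still hit disk.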

In this condition, the profiler shows slow queries like this:

  • execStats.executionTimeMillisEstimate: unpredictable; I sometimes see over 30s
  • execStats.nReturned: 0 to 100
  • execStats.stage: IXSCAN; nothing more than FETCH with IXSCAN. There's no COLLSCAN at all.

The slow queries show up consistently, around 1,700 q/m. That's almost 25% of all queries. HUGE!
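The ~25% figure checks out if the 7,000 opcounter is read as queries per minute (an assumption on my part; the post labels it a 5-minute average, so the unit is worth confirming):

```python
# Fraction of queries that are slow, assuming both rates are per minute.
slow_per_min = 1700
total_per_min = 7000
slow_fraction = slow_per_min / total_per_min
print(f"{slow_fraction:.1%} of queries are slow")  # ~24.3%
```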

I tried explain() with the same query directly on DocumentDB, but it never showed a slow execution time estimate. It's around 0.015s, which is exactly what I expected, because almost all queries are covered by the indexes.
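For reference, running explain with the executionStats verbosity asks the server for measured runtime stats rather than just the winning plan. A minimal pymongo sketch; the collection name, filter, and endpoint are hypothetical placeholders, and the actual call is commented out since it needs a reachable cluster:

```python
# Sketch: request executionStats from explain, not just queryPlanner output.
# "mycoll" and the filter are placeholders for the real slow query.
explain_cmd = {
    "explain": {"find": "mycoll", "filter": {"userId": 12345}},
    "verbosity": "executionStats",
}
# from pymongo import MongoClient
# db = MongoClient("mongodb://<cluster-endpoint>:27017").get_database("mydb")
# result = db.command(explain_cmd)
# print(result)  # inspect the execution stats for the winning plan
```

Note that an ad-hoc explain runs the query in isolation, so it may not reproduce latency that only appears under concurrent load.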

I understand that results will vary by query, document structure, and volume. But regardless of the details, this doesn't seem like an expected result. My questions are:

  • Can you guess why the query execution time is so unpredictable?
  • Could CPU utilization peaking at 66% affect query execution enough to cause 30+ second queries, ones that can't be reproduced with explain()?
  • There's plenty of freeable memory, so why does it perform so much disk I/O (ReadIOPS) instead of serving data from the memory buffer?
  • Regarding the instance size, is 20,000 IOPS considered over capacity?

Thanks.

GLY
Asked 3 months ago · 361 views
1 answer

Wow, that's a lot of detail. The working set of your queries is likely larger than the available memory. DocumentDB reserves some memory for caching indexes and query results, but if the working set does not fit, it has to fetch data from disk repeatedly.

You may have queries that are not optimally written and are not taking advantage of indexes. Without seeing the actual queries it is hard to say for certain, but non-optimal queries could lead to more disk I/O.

Concurrency and lock contention could be causing queries to wait on each other. With a high volume of queries, some may be queued behind others. Temporary spikes in load could be causing queueing and slower response times for queries during those periods. The instance size of db.r6g.2xlarge may be insufficient to handle the workload, though the average CPU utilization does not seem exceptionally high.

To better troubleshoot, I would recommend using the DocumentDB profiler to analyze the slowest queries. You can also enable the slow query log to capture query details.
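If the profiler isn't on yet, it is enabled through the cluster parameter group, and DocumentDB writes profiler output to CloudWatch Logs. Below is a sketch of the parameters involved; the parameter names `profiler` and `profiler_threshold_ms` are my recollection of the DocumentDB settings (verify them against the docs), and the boto3 call is commented out since it needs AWS credentials and a real parameter group name:

```python
# Cluster parameters that control the DocumentDB profiler (names hedged above).
# Threshold of 100 ms means operations slower than that get logged.
profiler_params = [
    {"ParameterName": "profiler", "ParameterValue": "enabled", "ApplyMethod": "immediate"},
    {"ParameterName": "profiler_threshold_ms", "ParameterValue": "100", "ApplyMethod": "immediate"},
]
# import boto3
# docdb = boto3.client("docdb")
# docdb.modify_db_cluster_parameter_group(
#     DBClusterParameterGroupName="my-docdb-params",  # hypothetical group name
#     Parameters=profiler_params,
# )
```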

Expert
Answered 3 months ago
  • Wow! Thank you for your advice. Indeed, there might be some inefficient queries; I'm sure there are no transactions. One thing I couldn't understand: when I copied a slow query (over 100ms in the profiler) and ran it in the console with explain() to see the execution plan, the estimated time was far lower, say 0.017s or less. Anyway, it seems the instance size is insufficient, as you advised. I'll try upgrading the instance size. Thanks again.
