How do I troubleshoot and resolve high CPU utilization on my Amazon DocumentDB instances?
I want to troubleshoot the high CPU utilization on my Amazon DocumentDB (with MongoDB compatibility) instances.
Short description
The CPU utilization of your Amazon DocumentDB instances helps you understand how your currently allocated resources perform for the ongoing workload.
You might see an increase in CPU utilization for the following reasons:
- User-initiated heavy workloads
- Inefficient queries
- The writer in the cluster is overburdened because the read load isn't balanced in the cluster
- The reader has a lower hardware configuration than the writer and can't keep up with the high write workload
- Internal tasks such as garbage collection in the Amazon DocumentDB cluster
- Too many idle database connections
- Short bursts of connections
Resolution
Use Amazon CloudWatch metrics
Use CloudWatch to gather and analyze operational metrics for your clusters. Use these metrics to identify patterns in CPU utilization and in the metrics that rise and fall with it over extended periods.
Review and monitor the following metrics in the CloudWatch console:
- Use DatabaseConnections and DatabaseConnectionsMax to identify the number of open connections during the relevant time period.
- Use WriteIOPS, ReadIOPS, ReadThroughput, and WriteThroughput to understand the overall workload on your Amazon DocumentDB instance.
- Use DocumentsDeleted, DocumentsInserted, DocumentsReturned, and DocumentsUpdated to understand the user workload on your Amazon DocumentDB instance.
- If you use the T3 or T4g instance classes, then review CPUCreditBalance and CPUSurplusCreditBalance to check for compute throttling.
Use Performance Insights metrics
Use Amazon DocumentDB Performance Insights to identify queries that contribute to database load and wait states. Under the Manage Metrics option, use the average active sessions to review the load and CPU distribution (system, user, or total).
A heavy load occurs when the load average exceeds the number of vCPUs on the instance. However, if the load average is less than the vCPU count for the DB instance class, then CPU throttling might not be the cause of your application latency. To identify the cause of the increased CPU usage, review the load average and analyze wait states related to I/O, locks, and latches.
Use native database queries
Use native queries to analyze the workload and check the CPU usage. To list all the operations that currently run on an Amazon DocumentDB instance, use the MongoDB shell to run the following query:
db.adminCommand({currentOp: 1, $all: 1});
To list all queries that are either blocked or have run for more than 10 seconds, run the following query that uses the $currentOp aggregation stage:
db.adminCommand({
  aggregate: 1,
  pipeline: [
    { $currentOp: {} },
    { $match: { $or: [ { secs_running: { $gt: 10 } }, { WaitState: { $exists: true } } ] } },
    { $project: { _id: 0, opid: 1, secs_running: 1, WaitState: 1, blockedOn: 1, command: 1 } }
  ],
  cursor: {}
});
To analyze the system usage results, run the following query on the instance that has high CPU usage:
db.adminCommand({
  aggregate: 1,
  pipeline: [
    { $currentOp: { allUsers: true, idleConnections: true } },
    { $group: { _id: { desc: "$desc", ns: "$ns", WaitState: "$WaitState" }, count: { $sum: 1 } } }
  ],
  cursor: {}
});
The preceding query returns a count of all operations, grouped by description, namespace, and wait state. The output also includes internal system tasks.
Note: The GARBAGE_COLLECTION task that's listed under the internal tasks is part of the multi-version concurrency control (MVCC) implementation in the Amazon DocumentDB cluster. It's a background sweeper that removes dead document versions, and its activity correlates with the number of updates or deletes in your database. Amazon DocumentDB starts the sweeping process based on internal thresholds at the collection level, and the process consumes read or write IOPS and CPU.
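When the grouped output is long, it can help to rank it by count. The following is a minimal sketch in plain JavaScript that should run in the MongoDB shell or in Node.js. It assumes that the grouped results arrive in the cursor.firstBatch array of the command response. The helper name rankOperationGroups and the sample values are hypothetical and only illustrate the shape of the data.

```javascript
// Hypothetical helper: ranks the groups returned by the grouped $currentOp
// query from the most operations to the fewest.
function rankOperationGroups(response) {
  // Assumption: the grouped documents are returned in cursor.firstBatch.
  const groups = (response.cursor && response.cursor.firstBatch) || [];
  return groups
    .map((g) => ({
      desc: g._id.desc || null,
      ns: g._id.ns || null,
      waitState: g._id.WaitState || null,
      count: Number(g.count),
    }))
    .sort((a, b) => b.count - a.count);
}

// Illustrative sample shaped like the command response (not real output).
const sampleResponse = {
  cursor: {
    firstBatch: [
      { _id: { desc: "Conn", ns: "app.orders", WaitState: "IO" }, count: 3 },
      { _id: { desc: "GARBAGE_COLLECTION" }, count: 1 },
      { _id: { desc: "Conn", ns: "app.users" }, count: 7 },
    ],
  },
  ok: 1,
};

const ranked = rankOperationGroups(sampleResponse);
console.log(ranked);
```

In the shell, you could pass the value returned by the db.adminCommand call directly to the helper instead of the sample object.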
Check the efficiency of queries
Check index overhead for write queries
Excessive or unused indexes can slow down write operations. To improve performance, check your index usage statistics to identify and remove unnecessary indexes.
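One way to review index usage is the $indexStats aggregation stage, for example db.<collection>.aggregate([{ $indexStats: {} }]). The sketch below filters that output for indexes that have no recorded operations. It assumes that each result document has a name field and an accesses.ops counter. The helper names and the sample data are hypothetical.

```javascript
// Converts shell numeric wrappers (for example NumberLong) or plain numbers.
function toNumber(value) {
  return value && typeof value.toNumber === "function"
    ? value.toNumber()
    : Number(value);
}

// Hypothetical helper: returns the names of indexes with zero recorded
// operations. The default _id_ index is skipped because it can't be dropped.
function findUnusedIndexes(indexStats) {
  return indexStats
    .filter((ix) => ix.name !== "_id_" && toNumber(ix.accesses.ops) === 0)
    .map((ix) => ix.name);
}

// Illustrative sample shaped like $indexStats output (not real output).
const sampleStats = [
  { name: "_id_", accesses: { ops: 0 } },
  { name: "status_1", accesses: { ops: 1520 } },
  { name: "legacyField_1", accesses: { ops: 0 } },
];

const unused = findUnusedIndexes(sampleStats);
console.log(unused);
```

The counter likely covers only the period since the accesses.since timestamp on the instance that you query, so consider checking every instance in the cluster before you drop an index.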
Check explain-plan of the query
When a query needs to search through every document in a collection, it becomes slow. Create appropriate indexes to improve the speed of the query.
Use the explain command to identify the fields that you want to create indexes on. You can also use profiler logs to capture long running queries and the details of their operations.
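To check a plan programmatically, you can inspect the output of db.<collection>.find(<filter>).explain(). The following is a minimal sketch that walks the plan and reports whether it contains a COLLSCAN stage. It assumes that the plan is under queryPlanner.winningPlan and that nested stages appear under inputStage or inputStages. The helper names and the sample plan are hypothetical.

```javascript
// Hypothetical helper: collects stage names by walking inputStage and
// inputStages. Assumption: nested stages use these two field names.
function collectStages(plan, stages = []) {
  if (!plan || typeof plan !== "object") return stages;
  if (plan.stage) stages.push(plan.stage);
  if (plan.inputStage) collectStages(plan.inputStage, stages);
  (plan.inputStages || []).forEach((child) => collectStages(child, stages));
  return stages;
}

// Returns true when any stage of the winning plan is a full collection scan.
function usesCollectionScan(explainOutput) {
  const plan =
    explainOutput.queryPlanner && explainOutput.queryPlanner.winningPlan;
  return collectStages(plan).includes("COLLSCAN");
}

// Illustrative sample shaped like explain output (not real output).
const sampleExplain = {
  queryPlanner: {
    namespace: "app.orders",
    winningPlan: { stage: "SUBSCAN", inputStage: { stage: "COLLSCAN" } },
  },
};

console.log(usesCollectionScan(sampleExplain));
```

If the helper reports a collection scan, the fields in the query filter are the first candidates for an index.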
Check statistics of collections
Check the following statistics for the collections you use:
- Review the Top Queries section in Performance Insights to identify the collections that contribute the most to the load.
- Review the collection's statistics to understand how many insert, update, and delete operations DocumentDB performs. You can also check how many index scans and full collection scans occur.
- Split your collections to reduce the document size you need to process, particularly if you have a large number of update operations.
Check for aggressive logging settings
Amazon DocumentDB prioritizes event auditing over database traffic. If you don't need auditing, then you can turn it off. If you require auditing, then set the audit_logs parameter to log only necessary events. Plan for increased load, and switch to a bigger instance class when needed.
To avoid aggressive logging of profiler logs, make sure that you set an appropriate value for the profiler_threshold_ms parameter. Review your application workload to identify the threshold above which you consider a query to be long-running.
Activate the log exports option for the logs that you want to export to CloudWatch.
Use best practices
Offload the read workload to reader instances
If you have multiple DB instances in your Amazon DocumentDB cluster, then offload the read workload to your reader instances. When you connect as a replica set, specify the readPreference for the connection. If you specify a read preference of secondaryPreferred, then the client tries to route read queries to your replicas. The client routes write queries to your primary DB instance.
Note: Reads from reader instances are eventually consistent. If a workload requires read-after-write consistency, then set a default read preference at the connection level and override it at the query level. For example, you might default to secondaryPreferred at the connection level so that read queries go to the replicas. For queries that require read-after-write consistency, override the default and read from the primary instance.
Example:
db.collection.find().readPref("primary")
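To set the default at the connection level, the read preference usually goes in the connection string. The following is a minimal sketch that builds such a string. The helper name, the host name, and the credentials are hypothetical, and the sketch omits TLS and other options that your cluster might require.

```javascript
// Hypothetical helper: builds a replica set connection string that sends
// reads to replicas by default. User name and password are URL-encoded.
function buildReplicaSetUri({
  host,
  port = 27017,
  user,
  password,
  readPreference = "secondaryPreferred",
}) {
  const credentials = `${encodeURIComponent(user)}:${encodeURIComponent(password)}`;
  const options = new URLSearchParams({ replicaSet: "rs0", readPreference });
  return `mongodb://${credentials}@${host}:${port}/?${options.toString()}`;
}

// Placeholder values for illustration only.
const uri = buildReplicaSetUri({
  host: "example-cluster.cluster-abc123.us-east-1.docdb.amazonaws.com",
  user: "appuser",
  password: "p@ss",
});
console.log(uri);
```

Use the cluster endpoint as the host so that the driver can discover the replicas and apply the read preference.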
Add one or more reader instances to the cluster
If you have an Amazon DocumentDB cluster with a single DB instance (writer only), then add one or more reader DB instances to the cluster. Then, use readPreference=secondaryPreferred to distribute the read load across the readers.
Use Amazon DocumentDB Profiler to identify slow queries
Use the Amazon DocumentDB Profiler to log slow queries. If a query appears repeatedly in the slow query logs, then you might need an additional index to improve performance.
Check for long-running queries that include COLLSCAN stages in their execution plan. A COLLSCAN stage means that the query must read every document in the collection to return a response.
For more information, see Profiling slow-running queries in Amazon DocumentDB (with MongoDB compatibility).
Create an alarm notification with CloudWatch
Create a CloudWatch alarm that notifies you when the CPUUtilization metric exceeds a specific threshold.
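As an illustration, the following sketch builds the input for the CloudWatch PutMetricAlarm API for one DB instance. The helper name, the alarm name, the 80 percent threshold, the evaluation window, and the sample identifiers are assumptions to adapt to your workload.

```javascript
// Hypothetical helper: builds PutMetricAlarm input for the CPUUtilization
// metric of one Amazon DocumentDB instance.
function buildCpuAlarmParams({ instanceId, thresholdPercent = 80, snsTopicArn }) {
  return {
    AlarmName: `docdb-high-cpu-${instanceId}`, // illustrative naming scheme
    Namespace: "AWS/DocDB",
    MetricName: "CPUUtilization",
    Dimensions: [{ Name: "DBInstanceIdentifier", Value: instanceId }],
    Statistic: "Average",
    Period: 300, // seconds per datapoint
    EvaluationPeriods: 3, // alarm after three consecutive breaching periods
    Threshold: thresholdPercent,
    ComparisonOperator: "GreaterThanThreshold",
    AlarmActions: snsTopicArn ? [snsTopicArn] : [],
  };
}

// Placeholder identifiers for illustration only.
const alarmParams = buildCpuAlarmParams({
  instanceId: "example-instance-1",
  snsTopicArn: "arn:aws:sns:us-east-1:111122223333:example-topic",
});
console.log(JSON.stringify(alarmParams, null, 2));
```

You could pass the resulting object to PutMetricAlarmCommand in the AWS SDK for JavaScript v3, or map its fields to the options of the aws cloudwatch put-metric-alarm command.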
Scale up the instance class of your DB instances
If you can't tune your queries any further, then scale up the instance class of the instances in the cluster to handle the workload.
Note: If you scale up an instance class, then your costs increase. For more information, see Amazon DocumentDB (with MongoDB compatibility) pricing.
Related information
Scaling Amazon DocumentDB clusters
Troubleshooting performance and resource utilization
How to index on Amazon DocumentDB (with MongoDB compatibility)