How do I troubleshoot high CPU utilization on my Amazon OpenSearch Service cluster?

My data nodes are showing high CPU usage on my Amazon OpenSearch Service cluster.

Short description

It's a best practice to keep CPU utilization at a level that leaves OpenSearch Service enough resources to perform its tasks. A cluster that consistently runs at high CPU utilization can degrade in performance. When your cluster is overloaded, OpenSearch Service stops responding, resulting in request timeouts.

To troubleshoot high CPU utilization on your cluster, consider the following approaches:

  • Use the nodes hot threads API.
  • Check the write operation or bulk API thread pool.
  • Check the search thread pool.
  • Check the Apache Lucene merge thread pool.
  • Check the JVM memory pressure.
  • Review your sharding strategy.
  • Optimize your queries.

Resolution

Use the nodes hot threads API

If there are constant CPU spikes in your OpenSearch Service cluster, then use the nodes hot threads API. The nodes hot threads API acts as a task manager, showing you the breakdown of all resource-intensive threads that are running on your cluster.

Example output of the nodes hot threads API:

GET _nodes/hot_threads

100.0% (131ms out of 500ms) cpu usage by thread 'opensearch[xxx][search][T#62]'
10/10 snapshots sharing following 10 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
org.opensearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

Note: The nodes hot threads output lists information for each node. The length of your output depends on how many nodes are running in your OpenSearch Service cluster.

Additionally, use the cat nodes API to view the current breakdown of resource utilization. The following command sorts the nodes by CPU utilization in descending order, so that you can identify the nodes with the highest usage:

GET _cat/nodes?v&s=cpu:desc

The last column in your output displays your node name. For more information, see cat nodes API on the Elasticsearch website.

Then, pass on the relevant node name to your hot threads API:

GET _nodes/<node-name>/hot_threads

For more information, see Hot threads API on the Elasticsearch website.

Example nodes hot threads output:

<percentage> of cpu usage by thread 'opensearch[<nodeName>][<thread-name>]'

The thread name indicates which OpenSearch Service processes are consuming high CPU.

For more information, see Nodes hot threads API on the Elasticsearch website.

Check the write operation or bulk API thread pool

A 429 error in OpenSearch Service might indicate that your cluster is handling too many bulk indexing requests. When there are constant CPU spikes in your cluster, OpenSearch Service rejects the bulk indexing requests.

The write thread pool handles indexing requests, which include Bulk API operations. To confirm whether your cluster is handling too many bulk indexing requests, check the IndexingRate metric in Amazon CloudWatch.
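
You can also check the write thread pool directly for active threads, queued requests, and rejections. For example, the following cat thread pool API call shows the write thread pool on each node (the column selection here is illustrative):

GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected

A growing queue or a nonzero rejected count suggests that indexing requests are arriving faster than the nodes can process them.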

If your cluster is handling too many bulk indexing requests, then consider the following approaches:

  • Reduce the number of bulk requests on your cluster.
  • Reduce the size of each bulk request so that your nodes can process requests more efficiently (see the example after this list).
  • If Logstash is used to push data into your OpenSearch Service cluster, then reduce the batch size or the number of workers.
  • If your cluster's ingestion rate slows down, then scale your cluster horizontally or vertically. To scale horizontally, add more nodes; to scale vertically, move to a larger instance type. Either way, OpenSearch Service gets more resources to process the incoming requests.
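
A smaller bulk request simply batches fewer documents per call. For example, the following request indexes two documents in one bulk operation (the index name and documents are placeholders; choose a batch size that your nodes can handle, depending on document size):

POST _bulk
{ "index": { "_index": "my-index" } }
{ "message": "document 1" }
{ "index": { "_index": "my-index" } }
{ "message": "document 2" }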

For more information, see Bulk API on the Elasticsearch website.

Check the search thread pool

A search thread pool that consumes high CPU indicates that search queries are overwhelming your OpenSearch Service cluster. Your cluster can be overwhelmed by a single long-running query. An increase in queries performed by your cluster can also affect your search thread pool.
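
To see how busy the search thread pool is on each node, you can query the cat thread pool API (the column selection here is illustrative):

GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected

Queued or rejected search requests indicate that the thread pool can't keep up with the incoming query load.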

To check whether a single query is increasing your CPU usage, use the task management API. For example:

GET _tasks?actions=*search&detailed

The task management API gets all active search queries that are running on your cluster. For more information, see Task management API on the Elasticsearch website.

Note: The output only includes the description field if there's a search task listed by the task management API.

Example output:

{
    "nodes": {
        "U4M_p_x2Rg6YqLujeInPOw": {
            "name": "U4M_p_x",
            "roles": [
                "data",
                "ingest"
            ],
            "tasks": {
                "U4M_p_x2Rg6YqLujeInPOw:53506997": {
                    "node": "U4M_p_x2Rg6YqLujeInPOw",
                    "id": 53506997,
                    "type": "transport",
                    "action": "indices:data/read/search",
                    "description": """indices[*], types[], search_type[QUERY_THEN_FETCH], source[{"size":10000,"query":{"match_all":{"boost":1.0}}}]""",
                    "start_time_in_millis": 1541423217801,
                    "running_time_in_nanos": 1549433628,
                    "cancellable": true,
                    "headers": {}
                }
            }
        }
    }
}

Check the description field to identify the specific query that's running. The running_time_in_nanos field indicates how long a query has been running. To decrease your CPU usage, cancel the search query that's consuming high CPU. The task management API also supports a _cancel call.

Note: Make sure to record the task ID from your output to cancel a particular task. In this example, the task ID is "U4M_p_x2Rg6YqLujeInPOw:53506997".

Example task management POST call:

POST _tasks/U4M_p_x2Rg6YqLujeInPOw:53506997/_cancel

The task management POST call marks the task as "cancelled", releasing any dependent AWS resources. If you have multiple queries running on your cluster, then use the POST call to cancel queries one at a time until your cluster returns to a normal state. To verify that the number of active queries has decreased, check the SearchRate metric in Amazon CloudWatch.

Note: Canceling all active search queries at the same time in your OpenSearch Service cluster can cause errors on the client application side.

It's also a best practice to set a timeout value in the query body to prevent high CPU spikes. For more information, see Request body search parameters on the Elasticsearch website.
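
For example, a request body timeout might look like the following (the index name and timeout value are placeholders; adjust them for your workload):

GET my-index/_search
{
  "timeout": "30s",
  "query": {
    "match_all": {}
  }
}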

For more information, see Thread pools on the Elasticsearch website.

Check the Apache Lucene merge thread pool

OpenSearch Service uses Apache Lucene for indexing and searching documents on your cluster. Apache Lucene runs merge operations to reduce the effective number of segments needed for each shard and to remove any deleted documents. This process is run whenever new segments are created in a shard.

If you observe an Apache Lucene merge thread operation affecting CPU usage, then increase the refresh_interval setting of your OpenSearch Service cluster indices. A higher refresh_interval slows down segment creation on your cluster.
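
For example, the following request raises the refresh interval on an index (the index name and interval value are placeholders):

PUT <index-name>/_settings
{
  "index": {
    "refresh_interval": "60s"
  }
}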

Note: A cluster that's migrating indices to UltraWarm storage can increase your CPU utilization. An UltraWarm migration usually involves a force merge API operation, which can be CPU-intensive.

To check for UltraWarm migrations, use the following command:

GET _ultrawarm/migration/_status?v

For more information, see Merge on the Elasticsearch website.

Check the JVM memory pressure

Review your JVM memory pressure, which is the percentage of the Java heap that's used on each cluster node. If JVM memory pressure reaches 75%, then Amazon OpenSearch Service triggers the Concurrent Mark Sweep (CMS) garbage collector. If JVM memory pressure reaches 100%, then the OpenSearch Service JVM is configured to exit and eventually restarts on OutOfMemory (OOM) errors.
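
To view each node's current heap usage alongside its CPU, you can use the cat nodes API (the column selection here is illustrative). You can also monitor the JVMMemoryPressure metric in Amazon CloudWatch:

GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu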

In the following example log, the JVM is within the recommended range, but the cluster is impacted by long-running garbage collection:

[2022-06-28T10:08:12,066][WARN ][o.o.m.j.JvmGcMonitorService] 
[515f8f06f23327e6df3aad7b2863bb1f] [gc][6447732] overhead, spent [9.3s] 
collecting in the last [10.2s]

For more information, see How do I troubleshoot high JVM memory pressure on my Amazon OpenSearch Service cluster?

Review your sharding strategy

Depending on the cluster size, performance might degrade if your cluster has too many shards. It's a best practice to have no more than 25 shards per GiB of Java heap.

By default, Amazon OpenSearch Service has a sharding strategy of 5:1, where each index is divided into five primary shards. Within each index, each primary shard also has its own replica. OpenSearch Service automatically assigns primary shards and replica shards to separate data nodes, and makes sure that there's a backup in case of failure.
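
To review how many shards each data node currently holds, you can use the cat allocation API:

GET _cat/allocation?v

Compare the shard count per node against that node's Java heap to check whether you're within the recommended limit of 25 shards per GiB of heap.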

For more information, see How do I rebalance the uneven shard distribution in my Amazon OpenSearch Service cluster?

Optimize your queries

Heavy aggregations, wildcard queries (especially leading wildcards), and regex queries can be computationally expensive and cause CPU utilization spikes. Search slow logs and indexing slow logs can help you diagnose expensive and problematic queries.
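
Search slow log thresholds are set per index. For example, the following request logs queries that exceed the warn and info thresholds (the threshold values are illustrative; on OpenSearch Service, you must also turn on slow log publishing to CloudWatch Logs for the domain):

PUT <index-name>/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s"
}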

For more information, see Monitoring OpenSearch logs with Amazon CloudWatch Logs.

Related information

How can I improve the indexing performance on my Amazon OpenSearch Service cluster?

How do I resolve search or write rejections in Amazon OpenSearch Service?

Sizing Amazon OpenSearch Service domains
