How can I improve the indexing performance on my Amazon OpenSearch Service cluster?

5 minute read

I want to optimize indexing operations in Amazon OpenSearch Service for maximum ingestion throughput. How can I do this?


Be sure that the shards are distributed evenly across the data nodes for the index that you're ingesting into

Use the following formula to confirm that the shards are evenly distributed:

Number of shards for index = k * (Number of data nodes), where k is the number of shards per node

For example, if there are 24 shards in the index, and there are eight data nodes, then OpenSearch Service assigns three shards to each node. For more information, see Get started with Amazon OpenSearch Service: How many shards do I need?

Increase the refresh_interval to 60 seconds or more

Refresh your OpenSearch Service index so that your documents are available for search. Note that refreshing your index requires the same resources that are used by indexing threads.

The default refresh interval is one second. When you increase the refresh interval, the data node makes fewer API calls. The refresh interval can be shorter or faster, depending on the length of the refresh interval. To prevent 429 errors, it's a best practice to increase the refresh interval.

Note: The default refresh interval is one second for indices that receive one or more search requests in the last 30 seconds. For more information about the updated default interval, see _refresh API version 7.x on the Elasticsearch website.

Change the replica count to zero

If you anticipate heavy indexing, consider setting the index.number_of_replicas value to "0." Each replica duplicates the indexing process. As a result, disabling the replicas improves your cluster performance. After the heavy indexing is complete, reactivate the replicated indices.

Important: If a node fails while replicas are disabled, you might lose data. Disable the replicas only if you can tolerate data loss for a short duration.

Experiment to find the optimal bulk request size

Start with the bulk request size of 5 MiB to 15 MiB. Then, slowly increase the request size until the indexing performance stops improving. For more information, see Using and sizing bulk requests on the Elasticsearch website.

Note: Some instance types limit bulk requests to 10 MiB. For more information, see Network limits.

Use an instance type that has SSD instance store volumes (such as I3)

I3 instances provide fast and local memory express (NVMe) storage. I3 instances deliver better ingestion performance than instances that use General Purpose SSD (gp2) Amazon Elastic Block Store (Amazon EBS) volumes. For more information, see Petabyte scale for Amazon OpenSearch Service.

Reduce response size

To reduce the size of OpenSearch Service's response, use the filter_path parameter to exclude unnecessary fields. Be sure that you don't filter out any fields that are required to identify or retry failed requests. Those fields can vary by client.

In the following example, the index-name, type-name, and took fields are excluded from the response:

curl -XPOST "es-endpoint/index-name/type-name/_bulk?pretty&filter_path=-took,-items.index._index,-items.index._type" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "test2", "_id" : "1" } }
{ "user" : "testuser" }
{ "update" : {"_id" : "1", "_index" : "test2"} }
{ "doc" : {"user" : "example"} }

For more information, see Reducing response size.

Increase the value of index.translog.flush_threshold_size

By default, index.translog.flush_threshold_size is set to 512 MB. This means that the translog is flushed when it reaches 512 MB. The weight of the indexing load determines the frequency of the translog. When you increase index.translog.flush_threshold_size, the node performs the translog operation less frequently. Because OpenSearch Service flushes are resource-intensive operations, reducing the frequency of translogs improves indexing performance. By increasing the flush threshold size, the OpenSearch Service cluster also creates fewer large segments (instead of multiple small segments). Large segments merge less often, and more threads are used for indexing instead of merging.

Note: An increase in index.translog.flush_threshold_size can also increase the time that it takes for a translog to complete. If a shard fails, then recovery takes more time because the translog is larger.

Before increasing index.translog.flush_threshold_size, call the following API operation to get current flush operation statistics:

curl -XPOST "os-endpoint/index-name/_stats/flush?pretty"

Replace the os-endpoint and index-name with your respective variables.

In the output, note the number of flushes and the total time. The following example output shows that there are 124 flushes, which took 17,690 milliseconds:

     "flush": {
          "total": 124,
          "total_time_in_millis": 17690

To increase the flush threshold size, call the following API operation:

$ curl -XPUT "os-endpoint/index-name/_settings?pretty" -d "{"index":{"translog.flush_threshold_size" : "1024MB"}}"

In this example, the flush threshold size is set to 1024 MB, which is ideal for instances that have more than 32 GB of memory.

Note: Choose the appropriate threshold size for your OpenSearch Service domain.

Run the _stats API operation again to see whether the flush activity changed:

$ curl _XGET "os-endpoint/index-name/_stats/flush?pretty"

Note: It's a best practice to increase the index.translog.flush_threshold_size only for the current index. After you confirm the outcome, apply the changes to the index template.

Related information

Best practices for Amazon OpenSearch Service

AWS OFFICIALUpdated 4 years ago