How do I resolve the "Courier fetch: n of m shards failed" error in OpenSearch Dashboards on Amazon OpenSearch Service?


When I try to load a dashboard in OpenSearch Dashboards on my Amazon OpenSearch Service domain, it returns a Courier fetch error.

Short description

When you load a dashboard in OpenSearch Dashboards, a search request is sent to the OpenSearch Service domain. The search request is routed to a cluster node that acts as the coordinating node for the request. For more information, see Creating a cluster on the OpenSearch website. The "Courier fetch: n of m shards failed" error occurs when the coordinating node fails to complete the fetch phase of the search request. For more information, see Fetch Phase on the Elasticsearch website. This error occurs because of one of the following causes:

  • Persistent issues: Mapping conflicts or unassigned shards. You might get a Courier fetch error when the indices in your index pattern contain fields with the same name but different mapping types. For more information, see Mapping on the Elasticsearch website. If your cluster is in red cluster status, then at least one shard is unassigned. Because OpenSearch Service can't fetch documents from unassigned shards, a cluster in red status results in a Courier fetch error. If the value of "n" in the Courier fetch error message is the same each time that you receive the error, then it's a persistent issue. To troubleshoot the issue, check the application error logs.
    Note: You can't resolve persistent issues by retrying or provisioning more cluster resources.
  • Transient issues: Transient issues include thread pool rejections, search timeouts, and tripped field data circuit breakers. For more information, see Thread pools, timeout, and Field data circuit breaker on the Elasticsearch website. These issues occur when you don't have enough compute resources on the cluster. A transient issue is likely the cause when you receive the error message intermittently and with a different value of "n" each time. To determine whether a transient issue causes the Courier fetch error, monitor the following Amazon CloudWatch metrics: CPUUtilization, JVMMemoryPressure, and ThreadpoolSearchRejected.
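The rule of thumb in the list above (a constant "n" points to a persistent issue, a varying "n" points to a transient one) can be sketched as a small helper. This is only an illustration of the heuristic, not an OpenSearch Service feature: the function name classify_courier_errors and the parsing pattern are assumptions based solely on the error text quoted in this article.

```python
import re

# Matches the error text quoted in this article:
# "Courier fetch: n of m shards failed".
COURIER_RE = re.compile(r"Courier fetch: (\d+) of (\d+) shards failed")


def classify_courier_errors(messages):
    """Return 'persistent', 'transient', or 'unknown' for a list of messages.

    Hypothetical helper: a constant failed-shard count ("n") suggests a
    persistent issue, a varying count suggests a transient one.
    """
    failed_counts = []
    for message in messages:
        match = COURIER_RE.search(message)
        if match:
            failed_counts.append(int(match.group(1)))

    if len(failed_counts) < 2:
        return "unknown"  # Not enough observations to compare.
    if len(set(failed_counts)) == 1:
        return "persistent"
    return "transient"
```

Treat the result as a hint only. The application error logs and the CloudWatch metrics remain the way to confirm the cause.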

Resolution

Activate application error logs for the domain. The logs can help you identify the root cause and solution for both persistent and transient issues. For more information, see Viewing Amazon OpenSearch Service error logs.

Persistent issues

The following example is a log entry for a Courier fetch error caused by a persistent issue:

[2019-07-01T12:54:02,791][DEBUG][o.e.a.s.TransportSearchAction] [ip-xx-xx-xx-xxx] [1909731] Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [ip-xx-xx-xx-xx][xx.xx.xx.xx:9300][indices:data/read/search[phase/fetch/id]]
Caused by: java.lang.IllegalArgumentException: Fielddata is disabled on text fields by default. Set fielddata=true on [request_departure_date] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.

In this example, the issue is caused by the request_departure_date field. The log entry shows that to resolve this issue, you can either set fielddata=true in the mapping of that field or use a keyword field instead. Note that enabling fielddata can use significant memory.
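The two options named in the log entry can be sketched as follows. This is a minimal illustration, not an official procedure: the helper names fielddata_field and mapping_options are hypothetical, and the "keyword" sub-field name is a common convention rather than a requirement. The mapping bodies follow the standard text and keyword field mapping syntax.

```python
import re

# Pattern for the field name in the "Fielddata is disabled" message shown above.
FIELDDATA_RE = re.compile(r"Set fielddata=true on \[([^\]]+)\]")


def fielddata_field(log_entry):
    """Return the field named in a 'Fielddata is disabled' log entry, or None."""
    match = FIELDDATA_RE.search(log_entry)
    return match.group(1) if match else None


def mapping_options(field):
    """Build two candidate mapping bodies for the field (illustrative only)."""
    return {
        # Option 1: enable fielddata on the text field (can use significant memory).
        "enable_fielddata": {
            "properties": {field: {"type": "text", "fielddata": True}}
        },
        # Option 2: add a keyword sub-field to aggregate or sort on instead.
        "keyword_subfield": {
            "properties": {
                field: {
                    "type": "text",
                    "fields": {"keyword": {"type": "keyword"}},
                }
            }
        },
    }
```

Either body could be sent as the request body of a mapping update on the affected index. With the second option, documents that were indexed before the change would likely need to be reindexed before the new sub-field contains values.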

Transient issues

To resolve most transient issues, either provision more compute resources or reduce the resource utilization of your queries.

Provisioning more compute resources

Reducing the resource utilization for your queries

  • Confirm that you follow best practices for shard and cluster architecture. A poorly designed cluster can't use all available resources. Some nodes get overloaded while other nodes sit idle. OpenSearch Service can't fetch documents from overloaded nodes.
  • Reduce the scope of your query. For example, if you query on a time frame, then reduce the date range, or filter the results by configuring the index pattern in OpenSearch Dashboards.
  • Avoid running select * queries on large indices. Instead, use filters to query a part of the index and search as few fields as possible. For more information, see Tune for search speed and Query and filter context on the Elasticsearch website.
  • Re-index and reduce the number of shards. The more shards that you have in your cluster, the more likely you are to get a Courier fetch error. Because each shard has its own resource allocation and overheads, a large number of shards places excessive strain on your cluster. For more information, see Why is my Amazon OpenSearch Service domain stuck in the "Processing" state?
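As an illustration of a narrowly scoped query, the following sketch builds a search request body that filters on a date range and returns only selected fields. The function name build_scoped_query, the default time field "@timestamp", and the example field names are assumptions for illustration. The body uses the standard query DSL (a bool query with a range filter, plus source filtering).

```python
def build_scoped_query(start, end, fields, time_field="@timestamp", size=100):
    """Build a search body limited to a date range and a small set of fields.

    Hypothetical helper: the range clause runs in filter context, so it is
    not scored, and _source limits which fields are returned per hit.
    """
    return {
        "size": size,
        "_source": list(fields),
        "query": {
            "bool": {
                "filter": [
                    {"range": {time_field: {"gte": start, "lte": end}}}
                ]
            }
        },
    }


# Example: the last hour only, returning two fields (field names are made up).
body = build_scoped_query("now-1h", "now", ["status", "request_departure_date"])
```

A shorter date range means that the request touches fewer documents, and for time-based indices it can also touch fewer shards. That reduces load on the search thread pool.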

The following example is a log entry for a Courier fetch error caused by a transient issue:

Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@26fdeb6f on QueueResizingEsThreadPoolExecutor[name = __PATH__ queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 2.9ms, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@1968ac53[Running, pool size = 2, active threads = 2, queued tasks = 1015, completed tasks = 96587627]]

In this example, the issue is caused by search thread pool queue rejections. To resolve this issue, choose a larger instance type to scale up your domain.
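To read the numbers in a rejection entry such as the one above, a small parser can help. The following is a minimal sketch: the function names parse_rejection and queue_is_saturated are hypothetical, and the patterns assume the log format shown in this article.

```python
import re

# Counters that appear as "<name> = <integer>" in the log entry above.
KEYS = ("queue capacity", "pool size", "active threads", "queued tasks", "completed tasks")


def parse_rejection(log_entry):
    """Extract queue and pool counters from an EsRejectedExecutionException entry."""
    stats = {}
    for key in KEYS:
        # The lookbehinds skip "min queue capacity" and "max queue capacity".
        pattern = r"(?<!min )(?<!max )" + re.escape(key) + r" = (\d+)"
        match = re.search(pattern, log_entry)
        if match:
            stats[key] = int(match.group(1))
    return stats


def queue_is_saturated(stats):
    """Return True when the queued tasks have reached the queue capacity."""
    return stats["queued tasks"] >= stats["queue capacity"]
```

In the entry above, the queue holds 1015 tasks against a capacity of 1000 while both threads in the pool are active, which is consistent with search requests being rejected.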

Related information

Operational best practices for Amazon OpenSearch Service

Troubleshooting Amazon OpenSearch Service

AWS OFFICIAL, updated 3 months ago