Why is my Amazon OpenSearch Service cluster in a red or yellow status?


My Amazon OpenSearch Service cluster is in a red or yellow cluster status.

Short description

The Monitoring tab in your OpenSearch Service console indicates the status of the least healthy index in your cluster. A red cluster status doesn't mean that your cluster is down. It means that at least one primary shard and its replicas aren't allocated to a node. A yellow cluster status means that the primary shards for all indices are allocated to nodes in your cluster, but one or more replica shards aren't allocated to any node.

Note: Don't reconfigure your domain until you first resolve the red cluster status. If you try to reconfigure your domain when it's in a red cluster status, it could get stuck in a "Processing" state. For more information about clusters stuck in a "Processing" state, see Why is my OpenSearch Service domain stuck in the "Processing" state?
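
To check the overall status and the number of unassigned shards yourself, you can query the cluster health API. This is a minimal check that uses the same domain-endpoint placeholder as the other commands in this article; add level=indices to the query string to see the health of each index:

$ curl -XGET 'domain-endpoint/_cluster/health?pretty'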

Your cluster can enter a red status for the following reasons:

  • Multiple data node failures
  • A corrupted or red shard in an index
  • High JVM memory pressure or CPU utilization
  • Low disk space or disk skew
  • No replica shards for the unassigned shard

Note: In some cases, you might be able to resolve your red cluster status by deleting and then restoring the index from an automated snapshot.

Your cluster can enter a yellow health status for the following reasons:

  • Creation of a new index
  • Not enough nodes to allocate the shards to, or disk skew
  • High JVM memory pressure
  • Single node failure
  • Exceeded the maximum number of shard allocation retries
  • The number of replica shards is more than the number of data nodes
  • An ongoing blue/green deployment that relocates data shards

Note: If your yellow cluster status doesn't resolve itself, then identify and troubleshoot the root cause. You can resolve the status by updating the index settings or by manually rerouting the unassigned shards. To prevent a yellow cluster status, apply the Cluster health best practices.

Resolution

Identifying the reason for your unassigned shards

To identify the unassigned shards, perform the following steps:

1.    List the unassigned shards:

$ curl -XGET 'domain-endpoint/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

2.    Retrieve the details for why the shard is unassigned:

$ curl -XGET 'domain-endpoint/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d'
{
     "index": "<index name>",
     "shard": <shardId>,
     "primary": <true or false>
}'

3.    (Optional) For a red cluster status, delete the affected indices, and then identify and address the root cause:

$ curl -XDELETE 'domain-endpoint/<index names>'

Then, identify the available snapshots so that you can restore your indices from a snapshot:

$ curl -XGET 'domain-endpoint/_snapshot?pretty'
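
After you identify a snapshot, restore the deleted indices from it. The following call is a sketch; the repository name (for automated snapshots, this is usually cs-automated or cs-automated-enc), the snapshot name, and the index list depend on your domain:

$ curl -XPOST 'domain-endpoint/_snapshot/<repository>/<snapshot name>/_restore' -H 'Content-Type: application/json' -d'
{
     "indices": "<index names>"
}'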

For yellow cluster status, address the root cause so that your shards are assigned.

Troubleshooting your red or yellow cluster status

Not enough nodes to allocate the shards to

A replica shard won't be assigned to the same node as its primary shard. A single node cluster with replica shards therefore always initializes in a yellow cluster status, because there are no other available nodes to which OpenSearch Service can assign a replica.

There is also a default limit of 1,000 for the cluster.max_shards_per_node setting in OpenSearch Service versions 7.x and later. It's a best practice to keep the cluster.max_shards_per_node setting at this default value. If you set shard allocation filters to control how OpenSearch Service allocates shards, a shard can become unassigned because there aren't enough filtered nodes. To prevent this node shortage, increase your node count, and make sure that the number of replicas for every primary shard is less than the number of data nodes. You can also reduce the number of replica shards. For more information, see Sizing OpenSearch Service domains and Demystifying OpenSearch Service shard allocation.
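
For example, if an index is configured with more replicas than there are additional data nodes to hold them, you can lower the replica count so that every replica shard can be assigned. The following sketch sets one replica per primary shard for a placeholder index name; choose a value that is less than your data node count:

$ curl -XPUT 'domain-endpoint/<index name>/_settings' -H 'Content-Type: application/json' -d'
{
     "index" : {
          "number_of_replicas" : 1
     }
}'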

Low disk space or disk skew

If there isn't enough disk space, your cluster can enter a red or yellow health status. There must be enough disk space to accommodate shards before OpenSearch Service distributes the shards.

To check how much storage space is available for each node in your cluster, use the following syntax:

$ curl domain-endpoint/_cat/allocation?v

For more information about storage space issues, see How do I troubleshoot low storage space in my OpenSearch Service domain?

Heavy disk skew can also lead to low storage space issues for some data nodes. If you decide to re-allocate any shards, the shards can become unassigned during the shard distribution. To resolve this issue, see How do I rebalance the uneven shard distribution in my OpenSearch Service cluster?

The disk-based shard allocation settings can also lead to unassigned shards. For example, if the cluster.routing.allocation.disk.watermark.low setting is set to 50 GB, then a node must have at least 50 GB of free disk space for OpenSearch Service to allocate shards to it.

To check the current disk-based shard allocation settings, use the following syntax:

$ curl -XGET 'domain-endpoint/_cluster/settings?include_defaults=true&flat_settings=true'

To resolve your disk space issues, consider the following approaches:

  • Delete any unwanted indices for yellow and red clusters.
  • Delete red indices for red clusters (see the example after this list).
  • Scale up the EBS volume.
  • Add more data nodes.
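
To find the indices that are in a red status before you delete them, you can filter the index list by health. This is a minimal check that uses the same domain-endpoint placeholder as the other commands in this article; the health parameter also accepts yellow and green:

$ curl -XGET 'domain-endpoint/_cat/indices?v&health=red'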

Note: Avoid making any configuration changes to your cluster when it's in a red health status. If you try to reconfigure your domain when it's in a red cluster status, it could get stuck in a "Processing" state.

High JVM memory pressure

Every shard allocation uses CPU, heap space, and disk and network resources. Consistently high levels of JVM memory pressure can cause shard allocation to fail. For example, if JVM memory pressure exceeds 95%, then the parent circuit breaker is triggered. The allocation thread is then cancelled, leaving shards unassigned.

To resolve this issue, reduce the JVM memory pressure level first. After your JVM memory pressure has been reduced, consider these additional tips to bring your cluster back to a green health status:

  • Increase the default shard allocation retry value of "5" to a higher value.
  • Deactivate and activate the replica shard.
  • Manually retry the unassigned shards.

Example API to increase the retry value:

PUT <index-name>/_settings
{
     "index.allocation.max_retries" : <value>
}
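
To manually retry the unassigned shards (the last tip in the preceding list), you can call the cluster reroute API with the retry_failed parameter, which retries allocations that have failed too many times:

$ curl -XPOST 'domain-endpoint/_cluster/reroute?retry_failed=true'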

For more information about reducing your JVM memory pressure, see How do I troubleshoot high JVM memory pressure on my OpenSearch Service cluster?

Node failure

When your cluster experiences a node failure, shards that are allocated to that node become unassigned. When there are no replica shards available for a given index, even a single node failure can cause a red health status. Having two replica shards and a Multi-AZ deployment protects your cluster against data loss during a hardware failure.

If all your indices have a replica shard, then a single node failure can cause your cluster to temporarily enter a yellow health status. In that case, OpenSearch Service recovers automatically as soon as the node is healthy again or when the shards are assigned to a new node.

You can confirm node failures by checking your Amazon CloudWatch metrics. For more information about identifying a node failure, see Failed cluster nodes.
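
You can also compare the number of nodes that the cluster currently reports against the number of nodes that you provisioned. The following check lists the nodes that are currently part of the cluster:

$ curl -XGET 'domain-endpoint/_cat/nodes?v'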

Note: It's also a best practice to assign one replica shard for each index or to use dedicated primary nodes and activate zone awareness. For more information, see Coping with failure on the Elasticsearch website.

Exceeded the maximum number of retries

In OpenSearch Service, a shard stays unassigned when allocation exceeds the maximum time limit (5,000 ms) or the maximum number of retries (5). If your cluster has reached these thresholds, then you must manually trigger a shard allocation. To manually trigger a shard allocation, deactivate and then reactivate the replica shards for your indices.

A configuration change on your cluster can also trigger shard allocation. For more information about shard allocation, see Every shard deserves a home on the Elasticsearch website.

Note: It's not a best practice to manually trigger shard allocation if your cluster has a heavy workload. If you remove all your replicas from an index, the index must rely only on primary shards. When a node fails, your cluster then enters a red health status because the primary shards are left unassigned.

To deactivate a replica shard, update the number_of_replicas value to "0":

$ curl -XPUT 'domain-endpoint/<indexname>/_settings' -H 'Content-Type: application/json' -d'
{
     "index" : {
          "number_of_replicas" : 0
     }
}'

Also, check to make sure that the index.auto_expand_replicas setting is set to "false". When your cluster returns to a green status, set the index.number_of_replicas value back to the desired value to trigger allocation of the replica shards, as shown in the following example. If the shard allocation is successful, your cluster remains in a green health status.
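
The following sketch turns off auto_expand_replicas and sets the replica count back to "1"; use the replica count that your index had before you deactivated it:

$ curl -XPUT 'domain-endpoint/<indexname>/_settings' -H 'Content-Type: application/json' -d'
{
     "index" : {
          "auto_expand_replicas" : false,
          "number_of_replicas" : 1
     }
}'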

Cluster health best practices

To resolve your yellow or red cluster status, consider the following best practices:

  • Set a recommended Amazon CloudWatch alarm for AutomatedSnapshotFailure (see the example after this list). With the alarm, you can make sure that you have a snapshot available to restore your indices from when your cluster enters a red status.
  • If your cluster is under a sustained heavy workload, scale your cluster. For more information about scaling your cluster, see How can I scale up an OpenSearch Service domain?
  • Monitor your disk usage, JVM memory pressure, and CPU usage and make sure they are not exceeding set thresholds. For more information, see Recommended CloudWatch alarms and Cluster metrics.
  • Make sure all primary shards have replica shards to protect against node failures.
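
For example, you could create the AutomatedSnapshotFailure alarm with the AWS CLI. The following sketch uses the recommended settings (Maximum of AutomatedSnapshotFailure >= 1 for one 1-minute period); the alarm name, domain name, account ID, and SNS topic ARN are placeholders:

$ aws cloudwatch put-metric-alarm \
    --alarm-name automated-snapshot-failure \
    --namespace AWS/ES \
    --metric-name AutomatedSnapshotFailure \
    --dimensions Name=DomainName,Value=<domain name> Name=ClientId,Value=<account ID> \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 1 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions <SNS topic ARN>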

For more information, see Operational best practices for Amazon OpenSearch Service.

