How do I troubleshoot unassigned shards in my OpenSearch Service cluster?

I have unassigned shards in my Amazon OpenSearch Service cluster.

Short description

You might have unassigned shards in your OpenSearch Service cluster for one of the following reasons:

  • Failed cluster nodes
  • Misconfigured replica count
  • Shard failed to get an in-memory lock
  • Shard limit exceeded
  • Disk space issues
  • Skewed disk usage and sharding strategy
  • ClusterBlockException error

Resolution

Identifying the reason for your unassigned shards

To identify the unassigned shards and get additional details, perform the following steps:

1.    List the unassigned shards:

curl -XGET 'domain-endpoint/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

Note: If you use AWS Identity and Access Management (IAM) credentials or database credentials with fine-grained access control (FGAC) activated, then additional steps are required. Make sure that you sign requests to the OpenSearch Service APIs with your IAM or database user credentials.
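
For example, if your domain uses fine-grained access control with the internal user database, one option is to pass the master user credentials with HTTP basic authentication. The user name and password shown here are placeholders for your own credentials:

curl -XGET -u 'master-username:master-password' 'domain-endpoint/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED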

2.    Retrieve details for why the shard is unassigned:

curl -XGET 'domain-endpoint/_cluster/allocation/explain?pretty'

3.    (Optional) If you use OpenSearch Dashboards or Kibana, run the following API call:

GET _cluster/allocation/explain
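
To explain a specific shard rather than the first unassigned shard that OpenSearch Service finds, you can pass a request body. The index name, shard number, and primary flag here are placeholders for your own values:

GET _cluster/allocation/explain
{
  "index": "index-name",
  "shard": 0,
  "primary": true
}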

Failed cluster nodes

If a cluster node fails, then shards might become unassigned. Node failures can occur because of high CPU usage on the cluster or a hardware failure. To check whether the cluster is overloaded, review the CPUUtilization and JVMMemoryPressure metrics. If the cluster is overloaded, then reduce traffic to the cluster. For instructions, see How do I troubleshoot high JVM memory pressure on my OpenSearch Service cluster?
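
For example, you can retrieve the maximum CPUUtilization for your domain over a recent time window with the AWS CLI. The domain name, account ID, and time range shown here are placeholders for your own values:

aws cloudwatch get-metric-statistics \
  --namespace AWS/ES \
  --metric-name CPUUtilization \
  --dimensions Name=DomainName,Value=my-domain Name=ClientId,Value=111122223333 \
  --start-time 2022-03-18T00:00:00Z --end-time 2022-03-18T03:00:00Z \
  --period 300 --statistics Maximum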

When the node comes back up, the shards are reassigned automatically. If the cluster remains in a red or yellow status, then you might receive an error similar to the following:

 "unassigned_info" : {  "reason" : "MANUAL_ALLOCATION",  "at" : "2022-03-18T02:45:42.730Z",  "details" : """failed shard on node [xxxxxxxxxx]: shard  failure, reason [corrupt file (source: [flush])], failure  FlushFailedEngineException[Flush failed]; nested:  CorruptIndexException[codec footer mismatch (file truncated?): actual  footer=0 vs expected footer=-1071082520   """,  "last_allocation_status" : "no_valid_shard_copy"  },  "can_allocate" : "no_valid_shard_copy",  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",

You can delete the affected indices and then restore them from snapshots with the following steps:

1.    Identify and delete the red index:

GET _cat/indices?health=red
DELETE /index-name

2.    Check for successful snapshots:

GET _cat/snapshots/cs-automated-enc

3.    Restore an index from snapshots:

POST _snapshot/Repository-name/snapshot-ID/_restore
{
 "indices": "index-name"
}

For more information, see Restoring snapshots.
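
After the restore completes, you can confirm that the cluster health is no longer red and watch the recovery progress:

GET _cluster/health?pretty
GET _cat/recovery?v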

Misconfigured replica count

If the number of replica shards is greater than or equal to the number of data nodes, then some shards remain unassigned. This is because a primary shard and its replica can't be allocated on the same node.

To resolve this issue, either increase the number of nodes or reduce the replica count using one of the following commands:

Note: Change the "n" value to your desired value:

curl -XPUT 'domain-endpoint/<index-name>/_settings' -H 'Content-Type: application/json' -d'
{
  "index" : {
    "number_of_replicas" : n
  }
}'

PUT <index-name>/_settings
{
  "index" : {
    "number_of_replicas" : n
  }
}

Note: A single-node cluster with replica shards always initializes with a yellow cluster status. Single-node clusters are initialized this way because there are no other available nodes to which OpenSearch Service can assign a replica.
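
To compare your replica count with the number of data nodes, you can list the nodes in the cluster and the per-index primary and replica settings:

GET _cat/nodes?v
GET _cat/indices?v&h=index,health,pri,rep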

Shard failed to get an in-memory lock

If a shard fails to get an in-memory lock during shard allocation, then you receive an error similar to the following:

"failed_allocation_attempts" : 5,  "details" : "failed shard on node []: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[][5]: obtaining shard lock timed out after 5000ms]; ", . . "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[], failed_attempts[5], delayed=false, details[failed shard on node [xxxxxxxxxxx]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[xxxxxxxxx][5]: obtaining shard lock timed out after 5000ms]; ], allocation_status[no_attempt]]]"

To resolve this error, increase the maximum retry setting:

PUT /<index-name>/_settings
{
  "index.allocation.max_retries": 10
}
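
You can also retry the failed allocations manually, as the error message suggests:

POST _cluster/reroute?retry_failed=true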

Shard limit exceeded

For OpenSearch Service versions 7.x and later, the default limit for the cluster.max_shards_per_node setting is 1,000 shards per data node, and it's a best practice to keep this setting at its default value. If you set shard allocation filters to control how OpenSearch Service allocates shards, then shards can become unassigned because there aren't enough filtered nodes. To prevent this node shortage, increase your node count. If the cluster instead exceeds the shard limit, then you can raise the cluster.max_shards_per_node setting:

Note: Change the "n" value to your desired value. Make sure the number of replicas for every primary shard is less than the number of data nodes.

PUT _cluster/settings
{
  "persistent" : {
    "cluster.max_shards_per_node" : n
  }
}
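
To get a rough sense of how close you are to the limit, you can count the shards in the cluster and compare the total against the number of data nodes multiplied by cluster.max_shards_per_node:

curl -XGET 'domain-endpoint/_cat/shards' | wc -l
curl -XGET 'domain-endpoint/_cat/nodes?v'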

For more information, see Choosing the number of shards.

Disk space issues

The disk-based shard allocation settings can also lead to unassigned shards. For example, if the cluster.routing.allocation.disk.watermark.low setting is set to 50 GB, then at least that much disk space must be free on a node before OpenSearch Service allocates shards to it. For more information, see disk-based shard allocation settings (on the Elasticsearch website).

To check the current disk-based shard allocation settings, use the following syntax:

curl -XGET 'domain-endpoint/_cluster/settings?include_defaults=true&flat_settings=true'
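
To see how much disk space each node is using relative to the watermarks, you can also check the per-node allocation breakdown:

curl -XGET 'domain-endpoint/_cat/allocation?v'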

To resolve your disk space issues, consider the following approaches:

  • Delete any unwanted indices for yellow and red clusters
  • Delete red indices for red clusters
  • Scale up the EBS volume
  • Add more data nodes

Note: Avoid making any configuration changes to your cluster when it's in a red health status. If you try to reconfigure your domain when it's in a red cluster status, it could get stuck in a "Processing" state.

Skewed disk usage and sharding strategy

Disk usage can be heavily skewed for the following reasons:

  • Uneven shard sizes in a cluster.
  • Differences in available disk space across nodes.
  • An incorrect shard allocation strategy.

By default, Amazon OpenSearch Service has a sharding strategy of 5:1, where each index is divided into five primary shards. Within each index, each primary shard also has its own replica. OpenSearch Service automatically assigns primary shards and replica shards to separate data nodes, and makes sure that there's a backup in case of failure.

You can rebalance the shard allocation in your OpenSearch Service cluster and update your sharding strategy. For more information, see How do I rebalance the uneven shard distribution in my Amazon OpenSearch Service cluster?
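
For example, when you create a new index, you can set the number of primary shards and replicas explicitly. The index name and shard counts shown here are placeholders; choose values that fit your data volume and node count:

PUT new-index-name
{
  "settings": {
    "index.number_of_shards": 3,
    "index.number_of_replicas": 1
  }
}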

ClusterBlockException error

If you tried to create an index or write data to your OpenSearch Service domain, then you might receive a ClusterBlockException error similar to the following:

"reason": "blocked by: [FORBIDDEN/6/cluster read-only (api)];",
"type": "cluster_block_exception"

To resolve this error, see How do I resolve the 403 "index_create_block_exception" or "cluster_block_exception" error in OpenSearch Service?
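
To confirm which block is in place before you follow that resolution, you can check the cluster settings and look for cluster.blocks.read_only in the output. The FORBIDDEN/6 block in the error above corresponds to the cluster read-only (api) block:

GET _cluster/settings?flat_settings=true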

Related information

Troubleshooting Amazon OpenSearch Service
