My Amazon OpenSearch Service cluster turned yellow with the "failed to obtain in-memory shard lock" error message. Why am I receiving this error message, and how do I resolve it?
Short description
If a shard fails to obtain an in-memory lock for allocation within the thresholds that OpenSearch Service sets, then you receive the following error message:
"failed_allocation_attempts" : 5,
"details" : "failed shard on node []: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[][5]: obtaining shard lock timed out after 5000ms]; ",
.
.
"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[], failed_attempts[5], delayed=false, details[failed shard on node [lga-THKoSXykhSDbghN57A]: failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[evelog-zdn-2020.04.28][5]: obtaining shard lock timed out after 5000ms]; ], allocation_status[no_attempt]]]"
Note: In OpenSearch Service, each shard allocation attempt must obtain the shard lock within the time limit (5000 ms), and allocation is retried up to the maximum number of retries (5).
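As a rough illustration, the following Python sketch checks whether an allocation explain response describes this failure and extracts the attempt count and timeout. It assumes that the response is already parsed into a dictionary and that the "failed_allocation_attempts" and "details" fields sit under "unassigned_info" (with a fallback to the top level, as in the excerpt above). The function name summarize_lock_failure is hypothetical.

```python
import re
from typing import Optional

LOCK_MESSAGE = "failed to obtain in-memory shard lock"


def summarize_lock_failure(explain: dict) -> Optional[dict]:
    """Return attempt count and timeout if the shard failed on the shard lock, else None."""
    # Assumption: the fields live under "unassigned_info"; otherwise read the top level.
    info = explain.get("unassigned_info") or explain
    details = info.get("details", "")
    if LOCK_MESSAGE not in details:
        return None
    # The timeout appears in the message text, for example "timed out after 5000ms".
    match = re.search(r"timed out after (\d+)ms", details)
    return {
        "failed_attempts": info.get("failed_allocation_attempts"),
        "timeout_ms": int(match.group(1)) if match else None,
    }
```

A result of None means that the shard is unassigned for a different reason, so the steps in this article might not apply.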
To resolve the error message, try the following approaches:
- Troubleshoot your yellow cluster status.
- Increase the maximum retry setting.
- Update the replica count.
Note: It's not a best practice to update replica count for OpenSearch Service clusters with heavy workloads.
Resolution
Troubleshoot your yellow cluster status
Your OpenSearch Service cluster can enter the yellow state because of a node or network failure. If nodes in your cluster fail because of an internal hardware issue, then OpenSearch Service automatically detects the failure and replaces the failed nodes with new nodes. However, in some cases, the replica shards that were on the faulty nodes are left unassigned because the resources that they previously used haven't been freed. During this time, the leader node makes five attempts to allocate the replica shards. If all five attempts are unsuccessful, then your cluster enters red or yellow health status.
Note: It's a best practice to run the cluster allocation explain API to diagnose unassigned shards. For more information, see cluster allocation explain API on the Elasticsearch website.
To identify which indices are causing your cluster to enter yellow status, use the following query:
GET /_cat/indices?v&health=yellow
Then, use the following query to identify the root cause of your cluster's unassigned shards:
GET _cluster/allocation/explain
Note: The cluster reroute API isn't recognized by OpenSearch Service. For more information about supported API operations, see Notable API differences.
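The two diagnostic queries above could be scripted along the following lines. This is a minimal Python sketch, not an official client: it only builds the request path and body and parses the response text, so you can send the requests with whichever HTTP client and authentication method your domain uses. It assumes that you add format=json to the _cat/indices request to get JSON rows with "index" and "health" fields. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode


def yellow_indices_path() -> str:
    """Path for the _cat/indices query, asking for JSON instead of the text table."""
    return "/_cat/indices?" + urlencode({"health": "yellow", "format": "json"})


def yellow_index_names(cat_indices_body: str) -> list:
    """Extract index names from the JSON body returned by the path above."""
    rows = json.loads(cat_indices_body)
    return [row["index"] for row in rows if row.get("health") == "yellow"]


def explain_request_body(index: str, shard: int, primary: bool = False) -> str:
    """Body for GET _cluster/allocation/explain that targets one specific shard copy."""
    return json.dumps({"index": index, "shard": shard, "primary": primary})
```

When you send an explain request without a body, the API explains the first unassigned shard that it finds. Passing a body lets you target a specific shard, such as a replica of shard 5 of a yellow index.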
Increase the maximum retry setting
To return your OpenSearch Service cluster to the green state, increase the maximum number of retries for each yellow index:
PUT /<yellow-index-name>/_settings
{
  "index.allocation.max_retries": 10
}
When you run this API call, the leader node retries shard allocation for the specified index.
Note: When you increase the maximum retry setting, shards aren't always automatically assigned. You might have to manually assign the shards.
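If several indices are yellow, you might prefer to build the settings request in code. The following Python sketch constructs (but doesn't send) the PUT request with the standard library. The endpoint value, the function name max_retries_request, and the default of 10 are illustrative assumptions. Authentication, such as request signing or basic authentication, is omitted and depends on how your domain is configured.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def max_retries_request(endpoint: str, index: str, max_retries: int = 10) -> Request:
    """Build a PUT /<index>/_settings request that raises index.allocation.max_retries."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    body = json.dumps({"index.allocation.max_retries": max_retries}).encode("utf-8")
    return Request(
        # quote() percent-encodes characters in the index name that are unsafe in a URL path.
        url=f"{endpoint.rstrip('/')}/{quote(index, safe='')}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

To send the request, pass it to urllib.request.urlopen or translate it to the HTTP client that you already use, after you add the credentials that your domain requires.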
Update the replica count
Important: Don't use this approach if your OpenSearch Service cluster load is high. When you remove all replicas from an index, the index must rely only on primary shards. If a node goes down, then your cluster might enter red cluster status because the primary shards are left unassigned.
To change your replica count, perform the following steps:
1. Remove all replicas so that the replica count of the affected index becomes 0:
PUT /<yellow-index-name>/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}
2. Change the replica count back to the desired count:
PUT /<yellow-index-name>/_settings
{
  "index": {
    "number_of_replicas": 1
  }
}
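The two steps above can be sketched in Python as an ordered list of requests. This sketch only produces the method, path, and JSON body for each step. Sending them, and confirming that the first call succeeded before you make the second, is left to your HTTP client. The function name replica_reset_steps is hypothetical.

```python
import json


def replica_reset_steps(index: str, desired_replicas: int = 1) -> list:
    """Return the two PUT requests, in order, as (method, path, body) tuples."""
    if desired_replicas < 1:
        raise ValueError("desired_replicas must be at least 1 to restore redundancy")
    path = f"/{index}/_settings"
    return [
        # Step 1: drop all replicas so that the stuck replica shards are removed.
        ("PUT", path, json.dumps({"index": {"number_of_replicas": 0}})),
        # Step 2: restore the replica count so that new replica shards are allocated.
        ("PUT", path, json.dumps({"index": {"number_of_replicas": desired_replicas}})),
    ]
```

Between the two steps, the index has no replicas, so keep the interval short and avoid running this approach when the cluster is under heavy load.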
Related information
Why is my Amazon OpenSearch Service cluster in red or yellow status?
Why did my Amazon OpenSearch Service node crash?
How do I troubleshoot high JVM memory pressure on my Amazon OpenSearch Service cluster?