Red Elasticsearch 5.5 cluster due to NODE_LEFT with running snapshot
There is a cluster that, due to losing a couple of nodes, has a single shard in UNASSIGNED state.
TL;DR: the shard cannot be rerouted due to AWS limitations, the index cannot be deleted because of a running snapshot (over 18 hours now), the cluster has scaled to double its regular size for no obvious reason, and the snapshot cannot be cancelled because it is one of the automated ones.
What could be done to get the cluster back to green health? Data loss on that single index would not be a problem.
Detailed explanation
Symptom
The cluster is in red health status due to a single unassigned shard. A call to /_cluster/allocation/explain returns the following:
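$ curl '/_cluster/allocation/explain?pretty'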
{
  "index": "REDACTED",
  "shard": 1,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2021-12-01T21:27:04.905Z",
    "details": "node_left[REDACTED]",
    "last_allocation_status": "no_valid_shard_copy"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  ...
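To double-check that no copy of the shard survives on any node, the shard store information can also be queried (this is the standard Elasticsearch shard stores API; I am assuming the service exposes it like the other read-only endpoints):
$ curl '/REDACTED/_shard_stores?status=all&pretty'
In this case it would be expected to report no usable store for shard 1, matching the no_valid_shard_copy status above.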
Cluster rerouting
Standard troubleshooting for this situation suggests accepting the data loss and allocating an empty primary in place of the lost one, using something like:
$ curl -XPOST '/_cluster/reroute' -d '{"commands": [{ "allocate_empty_primary": { "index": "REDACTED", "shard": 1, "node": "REDACTED", "accept_data_loss": true }}] }'
{"Message":"Your request: '/_cluster/reroute' is not allowed."}
But, as the response shows, that endpoint is not available on Amazon Elasticsearch Service.
Closing/Deleting the index
Other suggestions include closing the index to all operations, but that is not supported by Amazon Elasticsearch Service either:
$ curl -X POST '/REDACTED/_close'
{"Message":"Your request: '/REDACTED/_close' is not allowed by Amazon Elasticsearch Service."}
Another option is to delete the index. But, as there is a running snapshot, it cannot be deleted:
$ curl -X DELETE '/REDACTED'
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[REDACTED][indices:admin/delete]"}],"type":"illegal_argument_exception","reason":"Cannot delete indices that are being snapshotted: [[REDACTED]]. Try again after snapshot finishes or cancel the currently running snapshot."},"status":400}
Cancelling the snapshot
As the error message above suggests, you can try cancelling the snapshot by deleting it:
$ curl -X DELETE '/_snapshot/cs-automated-enc/REDACTED'
{"Message":"Your request: '/_snapshot/cs-automated-enc/REDACTED' is not allowed."}
Apparently that is because the snapshot is part of the automated ones. Had it been a manual snapshot, I would have been able to cancel it.
The problem is that the snapshot has been running for over 10 hours and is still initializing:
$ curl '/_snapshot/cs-automated-enc/REDACTED/_status'
{ "snapshots": [
{
"snapshot": "2021-12-12t20-38-REDACTED",
"repository": "cs-automated-enc",
"uuid": "REDACTED",
"state": "INIT",
"shards_stats": {
"initializing": 0,
"started": 0,
"finalizing": 0,
"done": 0,
"failed": 0,
"total": 0
},
"stats": {
"number_of_files": 0,
"processed_files": 0,
"total_size_in_bytes": 0,
"processed_size_in_bytes": 0,
"start_time_in_millis": 0,
"time_in_millis": 0
},
"indices": {}
}
]}
As can be seen from the timestamp, it has been that way for almost 20 hours now (for reference, previous snapshots completed in a couple of minutes).
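For comparison, the timings of earlier snapshots in the automated repository can be pulled with the standard list API (again assuming the service exposes it):
$ curl '/_snapshot/cs-automated-enc/_all?pretty'
Each entry includes start_time and duration_in_millis, which is one way to see the couple-of-minutes durations mentioned above.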
This seems like a case in which AWS Support should be able to help you.
Your best next step would be to open a Support case. If you do not have a Support subscription at the moment, consider upgrading to Developer or Business Support, depending on the urgency of your case.
While I agree that this is an issue for AWS to look into, we did pay for support in the past and stopped doing so after 3 years of not opening a single ticket with them. I don't think I should have to pay up to 10% of our bill for the privilege of filing a bug report with them.
@Matias, while I hear you, I am just saying that to solve this issue your best path is to (even if temporarily) upgrade your support tier again, even at the Developer tier ($29 or 3%).
@Fabrizio I understand, but do you think it is reasonable to have to pay a few thousand USD (which is what a single 30-day period on the Developer Support tier would cost us) to have AWS look at something that is evidently an issue on their side, and that we did nothing to trigger?
Update: after the latest AWS outage in EC2, the snapshot was cancelled, which allowed us to delete the index with the unassigned shard, and the cluster is back in a healthy status :)