Red Elasticsearch 5.5 cluster due to NODE_LEFT with running snapshot

A cluster that lost a couple of nodes now has a single shard in the UNASSIGNED state.

TL;DR: The shard cannot be rerouted due to AWS limitations, the index cannot be deleted because of a running snapshot (for over 18 hours now), the cluster has scaled to double its regular size for no obvious reason, and the snapshot cannot be cancelled because it is one of the automated ones.

What can be done to get the cluster back to green health? Losing the data in that single index would not be a problem.

Detailed explanation

Symptom

The cluster is in red health status due to a single unassigned shard. A call to /_cluster/allocation/explain returns the following:

{
    "index": "REDACTED",
    "shard": 1,
    "primary": true,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "NODE_LEFT",
        "at": "2021-12-01T21:27:04.905Z",
        "details": "node_left[REDACTED]",
        "last_allocation_status": "no_valid_shard_copy"
    },
    "can_allocate": "no_valid_shard_copy",
    "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
    ...
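The fields that matter in this output can be checked programmatically. The following is a minimal sketch, assuming the explain response has already been fetched and parsed; the helper name `primary_is_lost` is hypothetical and the sample is an abridged copy of the response above.

```python
import json

# Abridged copy of the allocation explain output quoted in the question.
EXPLAIN_RESPONSE = """
{
    "index": "REDACTED",
    "shard": 1,
    "primary": true,
    "current_state": "unassigned",
    "unassigned_info": {
        "reason": "NODE_LEFT",
        "last_allocation_status": "no_valid_shard_copy"
    },
    "can_allocate": "no_valid_shard_copy"
}
"""


def primary_is_lost(explain: dict) -> bool:
    """Return True when the response describes an unassigned primary
    for which no valid copy remains on any node in the cluster."""
    return (
        explain.get("primary") is True
        and explain.get("current_state") == "unassigned"
        and explain.get("can_allocate") == "no_valid_shard_copy"
    )


explain = json.loads(EXPLAIN_RESPONSE)
print(primary_is_lost(explain))
```

When this returns True, waiting will not bring the shard back on its own: the only ways out are restoring the index from a snapshot or accepting the data loss.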

Cluster rerouting

The usual troubleshooting advice is to accept the data loss and allocate an empty primary for the shard, using something like:

$ curl -XPOST '/_cluster/reroute' -d '{"commands": [{ "allocate_empty_primary": { "index": "REDACTED", "shard": 1, "node": "REDACTED",  "accept_data_loss": true  }}]  }'
{"Message":"Your request: '/_cluster/reroute' is not allowed."}

But that endpoint is not available on Amazon Elasticsearch Service.

Closing/Deleting the index

Other suggestions include closing the index, but that is not supported by AWS either:

$ curl -X POST '/REDACTED/_close'
{"Message":"Your request: '/REDACTED/_close' is not allowed by Amazon Elasticsearch Service."}

Another solution is to delete the index. However, because a snapshot is running, the index cannot be deleted:

$ curl -X DELETE '/REDACTED'
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[REDACTED][indices:admin/delete]"}],"type":"illegal_argument_exception","reason":"Cannot delete indices that are being snapshotted: [[REDACTED]]. Try again after snapshot finishes or cancel the currently running snapshot."},"status":400}
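Since the delete is only refused while the snapshot is running, one option is to retry it periodically. The following is a rough sketch, not a tested recovery procedure: `send_delete` is a hypothetical callable that issues the DELETE request and returns the parsed JSON body, and the retry count and wait time are arbitrary.

```python
import time
from typing import Callable

# Substring of the error reason shown in the response above.
SNAPSHOT_BLOCK_MARKER = "being snapshotted"


def blocked_by_snapshot(response: dict) -> bool:
    """Return True if a delete-index response was refused because of a running snapshot."""
    error = response.get("error")
    if not isinstance(error, dict):
        return False
    return (
        response.get("status") == 400
        and error.get("type") == "illegal_argument_exception"
        and SNAPSHOT_BLOCK_MARKER in error.get("reason", "")
    )


def delete_when_unblocked(
    send_delete: Callable[[], dict],
    attempts: int = 5,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call send_delete up to `attempts` times, waiting between calls
    while the response says the index is still being snapshotted.
    Returns the last response received."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    response = send_delete()
    for _ in range(attempts - 1):
        if not blocked_by_snapshot(response):
            break
        sleep(wait_seconds)
        response = send_delete()
    return response
```

Passing the request as a callable keeps the retry logic independent of the HTTP client and of any request signing the domain may require.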

Cancelling the snapshot

As the previous error message states, you can try cancelling the snapshot:

$ curl -X DELETE '/_snapshot/cs-automated-enc/REDACTED'
{"Message":"Your request: '/_snapshot/cs-automated-enc/REDACTED' is not allowed."}

Apparently that is because the snapshot is one of the automated ones. Had it been a manual snapshot, I would have been able to cancel it.

The problem is that the snapshot has been running for over 10 hours and is still initializing:

$ curl '/_snapshot/cs-automated-enc/REDACTED/_status'
{ "snapshots": [
    {
        "snapshot": "2021-12-12t20-38-REDACTED",
        "repository": "cs-automated-enc",
        "uuid": "REDACTED",
        "state": "INIT",
        "shards_stats": {
            "initializing": 0,
            "started": 0,
            "finalizing": 0,
            "done": 0,
            "failed": 0,
            "total": 0
        },
        "stats": {
            "number_of_files": 0,
            "processed_files": 0,
            "total_size_in_bytes": 0,
            "processed_size_in_bytes": 0,
            "start_time_in_millis": 0,
            "time_in_millis": 0
        },
        "indices": {}
    }
]}

As can be seen from the timestamp in the snapshot name, it has been that way for almost 20 hours now (for reference, previous snapshots completed in a couple of minutes).
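A stuck snapshot like this one can be detected from the status response. The sketch below assumes the response has already been parsed into a dict; both helper names are hypothetical. The second helper additionally assumes that the first 16 characters of an automated snapshot's name encode its start time in UTC, which is inferred from the name shown above and may not hold for every domain.

```python
from datetime import datetime
from typing import List


def stuck_in_init(status: dict) -> List[str]:
    """Return the names of snapshots that are in INIT state
    and have not registered a single shard yet."""
    return [
        snap["snapshot"]
        for snap in status.get("snapshots", [])
        if snap.get("state") == "INIT"
        and snap.get("shards_stats", {}).get("total", 0) == 0
    ]


def hours_since_name_timestamp(snapshot_name: str, now: datetime) -> float:
    """Hours elapsed between the timestamp prefix of the snapshot name
    (assumed format: YYYY-MM-DDtHH-MM, assumed UTC) and `now` (naive UTC)."""
    started = datetime.strptime(snapshot_name[:16], "%Y-%m-%dt%H-%M")
    return (now - started).total_seconds() / 3600


# Abridged copy of the status response quoted in the question.
status = {
    "snapshots": [
        {
            "snapshot": "2021-12-12t20-38-REDACTED",
            "repository": "cs-automated-enc",
            "state": "INIT",
            "shards_stats": {"started": 0, "done": 0, "failed": 0, "total": 0},
        }
    ]
}

for name in stuck_in_init(status):
    print(name, hours_since_name_timestamp(name, datetime.utcnow()))
```

The name is parsed because `start_time_in_millis` is 0 in the response, so the status itself gives no start time to work from.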

  • Update: after the latest AWS outage in EC2, the snapshot was cancelled, which allowed us to delete the index with the unallocated shard, and the cluster is back in a healthy status :)

Matias
asked 2 years ago, 889 views
1 Answer

This seems like a case in which AWS Support should be able to help you.

Your best next step would be to open a Support case. If you do not yet have a Support subscription, consider upgrading to Developer Support or Business Support, depending on the urgency of your case.

https://aws.amazon.com/premiumsupport/plans/

AWS
EXPERT
answered 2 years ago
  • While I agree that this is an issue for AWS to look into, we did pay for support in the past and stopped doing so after 3 years of not opening a single ticket with them. I don't think that I should have to pay up to 10% for the privilege of filing a bug report with them.

  • @Matias, while I hear you, I am just stating that to solve this issue your best path is to upgrade your support tier again (even if temporarily), even if only to the Developer tier ($29 or 3%).

  • @Fabrizio I understand, but do you think it would be reasonable to have to pay a few thousand USD (which would be a single 30-day period on the Developer support tier) to have AWS look at something that is evidently an issue on their side that we did nothing to trigger?
