Why can't I delete an index or upgrade my OpenSearch Service cluster?


I can't delete an index, or upgrade my Amazon OpenSearch Service cluster.

Short description

If you try to delete an index or upgrade your OpenSearch Service cluster, the change can fail for the following reasons:

  • Snapshot is already in progress.
  • Snapshot in progress is stuck.
  • Snapshot in progress has a cluster in red status.
  • Snapshot timeout or failure.

For more information about OpenSearch Service upgrade failures, see Troubleshooting an upgrade.

Resolution

Note: In the following steps, replace domain-endpoint with your OpenSearch Service endpoint. This configuration is based on your domain's use of a virtual private cloud (VPC).

Snapshot is already in progress

If a snapshot is in progress, then you might encounter one of the following error messages:

  • During a cluster upgrade: Prior snapshot operation has not yet completed
  • When you delete an index: Cannot delete indices that are being snapshotted

To resolve these errors, take the following steps.

For encrypted domains, to check if an automated snapshot is in progress, run the following command:

curl -XGET "https://domain-endpoint/_snapshot/cs-automated-enc/_status"

For unencrypted domains, to check if an automated snapshot is in progress, run the following command:

curl -XGET "https://domain-endpoint/_snapshot/cs-automated/_status"

If there are no snapshots already in progress, then you receive the following output:

{
  "snapshots": []
}

When the brackets are empty, you can safely delete the index or perform an upgrade. If OpenSearch Service can't check if a snapshot is in progress, then the operation fails.
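
The empty-brackets check can also be scripted. The following is a minimal sketch, not an official tool: the function name no_snapshot_running is hypothetical, and it assumes that the status response has the shape shown above.

```shell
#!/bin/bash
# Hypothetical helper: reads a _snapshot/<repository>/_status response on stdin.
# Succeeds (exit 0) only when the "snapshots" array is empty.
no_snapshot_running() {
  # Strip all whitespace so that pretty-printed and compact JSON match the same pattern.
  tr -d '[:space:]' | grep -q '"snapshots":\[\]'
}

# Example (replace domain-endpoint; use cs-automated-enc for encrypted domains):
# curl -s -XGET "https://domain-endpoint/_snapshot/cs-automated/_status" \
#   | no_snapshot_running && echo "No snapshot in progress"
```

An empty or failed response doesn't match the pattern, so the helper treats a failed check as "not safe to proceed".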

Snapshot in progress is stuck

To check for this issue, complete the following steps:

  1. To check the start and end times of your hourly snapshots, run the following command:

    curl -XGET "https://domain-endpoint/_cat/snapshots/cs-automated?v&s=id"
  2. To print only the start times, pipe the curl output to the awk command:

    curl -XGET "https://domain-endpoint/_cat/snapshots/cs-automated?v&s=id" | awk -F" " ' { print $4 } '

    The output of this command shows when the hourly snapshots start. In the following example, OpenSearch Service takes a snapshot around the 52nd minute of each hour:

    22:51:11
    23:51:18
    00:51:19
    01:51:14
    02:51:16
    03:51:18
    04:51:16
    05:51:11

    Important: Don't run the upgrade eligibility check until the snapshot is complete.

  3. Use the snapshot status API to check whether the snapshot is complete. For more information, see Get snapshot status on the OpenSearch website. When OpenSearch Service captures your snapshot, the snapshot status API returns an empty set. If the current status is in progress and doesn't change after several minutes, then your snapshot might be stuck or stopped. Stuck and stopped snapshots can prevent the cluster from taking other snapshots. If the cluster is in red status, or there is a write block, then you must clear the status or block to resolve the failure.
    Note: The data from your snapshot can change after you make configuration changes. Don't use the snapshot for scheduled jobs.
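
The wait in step 3 can be sketched as a polling loop. This is an illustration under assumptions, not part of OpenSearch Service: the function name wait_for_snapshot, the attempt count, and the delay are all hypothetical, and the loop treats an empty "snapshots" array as "no snapshot running".

```shell
#!/bin/bash
# Hypothetical helper: polls a status command until it reports no running snapshot.
# $1 = maximum number of checks, $2 = seconds between checks,
# remaining arguments = a command that prints the _snapshot/<repository>/_status response.
wait_for_snapshot() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # An empty "snapshots" array means that no snapshot is running.
    if "$@" | tr -d '[:space:]' | grep -q '"snapshots":\[\]'; then
      return 0
    fi
    # Pause before the next check, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  echo "Snapshot still in progress after $attempts checks; it might be stuck." >&2
  return 1
}

# Example: check once a minute for up to 30 minutes (replace domain-endpoint):
# wait_for_snapshot 30 60 curl -s -XGET "https://domain-endpoint/_snapshot/cs-automated/_status"
```

A nonzero exit status only suggests that the snapshot might be stuck. Confirm the cluster health and any write block before you take action.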

Snapshot in progress has a cluster in red status

For more information about the red health status of an OpenSearch Service cluster, see Red cluster status.

To resolve this issue, complete the following steps:

  1. To list only the names of the repositories that are registered to your domain, run the following command:

    curl -XGET "https://domain-endpoint/_cat/repositories?v&h=id"
  2. To list the repository names, types, and other settings registered to your domain, run the following command:

    curl -XGET "https://domain-endpoint/_snapshot?pretty"
    curl -XGET "https://domain-endpoint/_cluster/state/metadata"
  3. Check if you can list the snapshots in each repository, except the cs-automated or cs-automated-enc repositories. To list the snapshots in several repositories, run the following bash script:

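    #!/bin/bash
    # List the registered repositories, excluding the automated ones.
    repos=$(curl -s "https://domain-endpoint/_cat/repositories" 2>&1 | grep -v "cs-automated" | awk '{print $1}')

    for i in $repos; do
      # Append the repository name and its snapshot list to /tmp/snapshot.
      echo "Snapshots in ... :" "$i" >> /tmp/snapshot
      curl -s -XGET "https://domain-endpoint/_cat/snapshots/$i" >> /tmp/snapshot
      echo "done..."
    done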

    Important: You can't manually delete stuck snapshots in the cs-automated or cs-automated-enc repository.

  4. To view the output in the /tmp/snapshot file, run the following command:

    cat /tmp/snapshot

    The command returns a response similar to this:

    Snapshots in ... : snapshot-manual-repo
    axa_snapshot-1557497454881 SUCCESS 1557639400 05:36:40 1557639405 05:36:45  4.6s  7 31 0 31
    2019-05-15                 SUCCESS 1560503610 09:13:30 1560503622 09:13:42 11.8s  4 16 0 16
    epoch_test                 SUCCESS 1569151317 11:21:57 1569151335 11:22:15 18.1s 15 56 0 56

    An error message similar to the following one indicates that the Amazon Simple Storage Service (Amazon S3) bucket was deleted but is still registered as a snapshot repository:

    Snapshots in ... : snapshot-manual-repo
    {
      "error": {
        "root_cause": [
          {
            "type": "repository_exception",
            "reason": "[snapshot-manual-repo] could not read repository data from index blob"
          }
        ],
        "type": "repository_exception",
        "reason": "[snapshot-manual-repo] could not read repository data from index blob",
        "caused_by": {
          "type": "i_o_exception",
          "reason": "Exception when listing blobs by prefix [index-]",
          "caused_by": {
            "type": "a_w_s_security_token_service_exception",
            "reason": "a_w_s_security_token_service_exception: User: arn:aws:sts::999999999999:assumed-role/cp-sts-grant-role/swift-us-east-1-prod-666666666666 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::666666666666:policy/my-manual-es-snapshot-creator-policy (Service: AWSSecurityTokenService; Status Code: 403; Error Code: AccessDenied; Request ID: 6b9374fx-11xy-11yz-ff66-918z9bb08193)"
          }
        }
      },
      "status": 500
    }
  5. To verify that the Amazon S3 bucket for the manual snapshot repository was deleted, run the following command:

    aws s3 ls | grep -i "snapshot-manual-repo"

    Note: Replace snapshot-manual-repo with your bucket name.

  6. To delete the repository from your cluster, run the following command:

    curl -XDELETE "https://domain-endpoint/_snapshot/snapshot-example-manual-repo"

    Note: Replace snapshot-example-manual-repo with your repository name.
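
The steps above can be condensed into a check that flags repositories whose snapshots can't be listed. The following is a rough sketch: the function names manual_repos and listing_failed are hypothetical, and the sketch assumes that a failed listing returns a JSON body with an "error" key, as in the example response in step 4.

```shell
#!/bin/bash
# Hypothetical helper: reads _cat/repositories rows on stdin and prints the
# repository names, skipping cs-automated and cs-automated-enc.
manual_repos() {
  awk '$1 !~ /^cs-automated/ { print $1 }'
}

# Hypothetical helper: reads a _cat/snapshots/<repository> response on stdin.
# Succeeds (exit 0) when the response is a JSON error body instead of a snapshot list.
listing_failed() {
  grep -q '"error"'
}

# Example (replace domain-endpoint):
# for repo in $(curl -s "https://domain-endpoint/_cat/repositories" | manual_repos); do
#   if curl -s "https://domain-endpoint/_cat/snapshots/$repo" | listing_failed; then
#     echo "Cannot list snapshots in repository: $repo"
#   fi
# done
```

The sketch only reports repositories. Deleting a repository is left as a deliberate manual step.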

Snapshot timeout or failure

If you get a snapshot timeout or failure, then complete the following steps:

Check if you can take a manual snapshot. If you get a Can't take manual snapshot error, then run the _cat/snapshots API:

curl -XGET "https://domain-endpoint/_cat/snapshots/s3_repository"

Note: Replace s3_repository with the name of your snapshot repository.

This command shows how long the current snapshot has been running. If the duration is within the expected time frame, then wait for the snapshot to complete. Then, take the snapshot again. Snapshot duration depends on the size of your indices and the resource consumption of your cluster.
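
To see at a glance how long a snapshot has been running, you can filter the _cat/snapshots output. The following is a minimal sketch: the function name running_snapshots is hypothetical, and it assumes the column order shown in this article's sample output (id, status, start_epoch, start_time, and so on) and the IN_PROGRESS status value.

```shell
#!/bin/bash
# Hypothetical helper: reads _cat/snapshots/<repository> rows on stdin and prints
# the ID and elapsed seconds of each snapshot whose status is IN_PROGRESS.
# $1 = current time in epoch seconds (for example, the output of `date +%s`).
running_snapshots() {
  # Column 2 is the status and column 3 is the start time in epoch seconds.
  awk -v now="$1" '$2 == "IN_PROGRESS" { print $1, now - $3 }'
}

# Example (replace domain-endpoint and s3_repository):
# curl -s -XGET "https://domain-endpoint/_cat/snapshots/s3_repository" \
#   | running_snapshots "$(date +%s)"
```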

To check the health status of your cluster, run the following command:

curl -XGET "https://domain-endpoint/_cluster/health?pretty"

If your cluster's health status is red, identify and address the root cause of your red cluster status. If OpenSearch Service is currently relocating or initializing shards, first wait for the process to complete. Then, configure your access policies. Shard reallocation can significantly strain the computing resources of your cluster.
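
The health check can be scripted along the following lines. This is a sketch under assumptions: the function names cluster_status and shards_in_motion are hypothetical, and the field names "status", "relocating_shards", and "initializing_shards" are the ones that the _cluster/health API returns.

```shell
#!/bin/bash
# Hypothetical helper: reads a _cluster/health response on stdin and prints the
# value of its "status" field (green, yellow, or red).
cluster_status() {
  tr -d '[:space:]' | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p'
}

# Hypothetical helper: reads a _cluster/health response on stdin.
# Succeeds (exit 0) when any shards are relocating or initializing.
shards_in_motion() {
  tr -d '[:space:]' | grep -Eq '"(relocating|initializing)_shards":[1-9]'
}

# Example (replace domain-endpoint):
# health=$(curl -s -XGET "https://domain-endpoint/_cluster/health")
# echo "Cluster status: $(printf '%s' "$health" | cluster_status)"
# printf '%s' "$health" | shards_in_motion && echo "Wait for shard movement to finish"
```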

Related information

How can I improve the indexing performance on my Amazon OpenSearch Service cluster?

AWS OFFICIAL
Updated 10 months ago