Why is my OpenSearch Service domain stuck in the "Modifying" state?

I want to troubleshoot my Amazon OpenSearch Service cluster that's stuck in the "Modifying" state.

Resolution

Note: If you encounter errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.

When you make configuration changes, your OpenSearch Service cluster enters the Modifying state. Configuration changes include when you add new data nodes, provision input/output operations per second (IOPS), or set up AWS Key Management Service (AWS KMS) keys.

Note: It's a best practice to check whether your cluster supports blue/green deployment and to perform a dry run before you submit a configuration change.
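
For example, the following AWS CLI command submits a dry run of a change to the data node count for a hypothetical domain named my-domain, without applying the change:

aws opensearch update-domain-config \
    --domain-name my-domain \
    --cluster-config InstanceCount=4 \
    --dry-run

Note: Replace the domain name and instance count with your values. The dry run response indicates whether the change initiates a blue/green deployment.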

A validation check fails with errors

When you initiate a configuration change, OpenSearch Service performs validation checks to make sure that your domain is eligible for an upgrade. If validation fails, then your domain remains in the Modifying state. To resolve this issue, complete the troubleshooting steps for your error. Then, retry your configuration change.

A new set of resources fails to launch

If you submit multiple configuration changes simultaneously, then your cluster can get stuck. When you submit a configuration change, wait until the change completes before you submit another configuration change.

Validation checks completed in the Validation stage remain valid for the duration of the configuration change. If your configuration passes the Validation stage, then don't modify the resources that your domain requires. For example, don't deactivate the AWS KMS key you use for encryption.

Shard migration to the new set of data nodes doesn't complete

After OpenSearch Service creates the new resources, the shard migration from the old set of data nodes to the new set begins. This stage can take several minutes to several hours based on the cluster load and size.

To monitor the current migration of shards between the old nodes and the new nodes, use the following API operation:

GET /DOMAIN_ENDPOINT/_cat/recovery?active_only=true

Note: Replace DOMAIN_ENDPOINT with your domain endpoint.

If your OpenSearch Service cluster is in red cluster status, then the shard migration fails. To troubleshoot your red health status, see Why is my Amazon OpenSearch Service cluster in a red or yellow status?

An overloaded cluster can't allocate the resources that it needs to handle the shard migration. High CPU utilization and JVM memory pressure indicate an overloaded cluster. To troubleshoot this issue, monitor the JVMMemoryPressure and CPUUtilization Amazon CloudWatch metrics.
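
For example, the following AWS CLI command retrieves the recent average JVMMemoryPressure for a hypothetical domain named my-domain in account 111122223333. OpenSearch Service publishes metrics to the AWS/ES CloudWatch namespace:

aws cloudwatch get-metric-statistics \
    --namespace AWS/ES \
    --metric-name JVMMemoryPressure \
    --dimensions Name=DomainName,Value=my-domain Name=ClientId,Value=111122223333 \
    --start-time 2025-01-01T00:00:00Z \
    --end-time 2025-01-01T01:00:00Z \
    --period 300 \
    --statistics Average

Note: Replace the domain name, account ID, and time range with your values.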

If there's a lack of free storage space in the new set of nodes, then shard migration can fail. This issue can occur when you add new data to the cluster during the blue/green deployment process. This issue also occurs when old nodes have large shards that OpenSearch Service can't allocate to the new nodes.

To free up storage, use the delete index API operation to delete old indexes that you no longer need. For more information, see Delete index API on the Elastic website.
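
For example, the following request deletes an index that you no longer need:

DELETE /DOMAIN_ENDPOINT/INDEX_NAME

Note: Replace DOMAIN_ENDPOINT and INDEX_NAME with your values.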

To view the size of your shards, use the cat shards API operation. Then, to view each node's number of assigned shards, use the cat allocation API operation. If the new nodes don't have all of the required shards, then use the cluster allocation explain API operation to identify the cause. For more information, see cat shards API, cat allocation API, and Cluster allocation explain API on the Elastic website.
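
For example, the following requests list each shard's size and each node's shard count and disk use:

GET /DOMAIN_ENDPOINT/_cat/shards?v
GET /DOMAIN_ENDPOINT/_cat/allocation?v

Note: Replace DOMAIN_ENDPOINT with your domain endpoint.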

If your shard exceeds the maximum number of retries and remains unassigned to a node, then retry the allocation.
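
One way to retry failed allocations is the cluster reroute API operation with the retry_failed parameter:

POST /DOMAIN_ENDPOINT/_cluster/reroute?retry_failed=true

Note: Replace DOMAIN_ENDPOINT with your domain endpoint.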

By default, the cluster tries to allocate a shard a maximum of 5 times in a row. To increase the index.allocation.max_retries setting for the shard's index, use the following API operation:

PUT /DOMAIN_ENDPOINT/INDEX_NAME/_settings
{
    "index.allocation.max_retries": 10
}

Note: Replace DOMAIN_ENDPOINT and INDEX_NAME with your values.

Internal hardware failures can cause shards on old data nodes to get stuck during migration. Depending on the hardware issue, OpenSearch Service runs self-healing scripts to return the nodes to a healthy state.

If you pin shards to the older set of nodes, then shard relocation can get stuck. To make sure that you don't have shards pinned to any nodes, check the index settings. Or, check whether your cluster has a ClusterBlockException error.

To identify the shards that can't be allocated to the new nodes and the corresponding index settings, run the following commands:

GET /DOMAIN_ENDPOINT/_cluster/allocation/explain?pretty
GET /DOMAIN_ENDPOINT/INDEX_NAME/_settings?pretty

Note: Replace DOMAIN_ENDPOINT and INDEX_NAME with your values.

Check whether the following settings appear in the index settings output:

  • "index.routing.allocation.require._name": "NODE_NAME"
  • "index.blocks.write": true

If you see "index.routing.allocation.require._name": "NODE_NAME" in your index settings, then run the following command to reset the setting:

PUT /DOMAIN_ENDPOINT/INDEX_NAME/_settings  
{
    "index.routing.allocation.require._name": null
}

Note: Replace DOMAIN_ENDPOINT and INDEX_NAME with your values.

For more information, see Index-level shard allocation filtering on the Elastic website.

If you see "index.blocks.write": true in your index settings, then your index has a write block. This write block issue can occur because of a ClusterBlockException error. For more information, see How do I resolve the 403 "index_create_block_exception" or "cluster_block_exception" error in OpenSearch Service?
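
After you resolve the underlying cause, one way to remove an index-level write block is to reset the setting:

PUT /DOMAIN_ENDPOINT/INDEX_NAME/_settings
{
    "index.blocks.write": null
}

Note: Replace DOMAIN_ENDPOINT and INDEX_NAME with your values.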

To monitor the progress of your configuration change, run the DescribeDomainChangeProgress API operation.
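
For example, the following AWS CLI command returns the progress stages of the most recent configuration change for a hypothetical domain named my-domain:

aws opensearch describe-domain-change-progress --domain-name my-domain

Note: Replace the domain name with your value.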

If your domain is stuck in the Modifying state or the Deleting older resources state for more than 24 hours, then contact AWS Support.
