Why did my OpenSearch Service node crash?

One of the nodes in my Amazon OpenSearch Service cluster is down. Or, my OpenSearch Service nodes keep crashing.

Resolution

Cluster nodes can fail when high Java virtual machine (JVM) memory pressure or high CPU usage overloads the node. Nodes can also fail when hardware failures cause health check failures.

Check for failed nodes

Complete the following steps:

  1. Open the OpenSearch Service console.
  2. In the navigation pane, under Managed clusters, choose Domains.
  3. Select your OpenSearch Service domain.
  4. Choose the Cluster health tab, and then choose Nodes. If the number of nodes is fewer than the number that you configured for your cluster, then a node is down.
    Note: The Nodes metric might be inaccurate during changes to your cluster configuration or routine maintenance for the service. This behavior is expected.
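As a command-line counterpart to the console steps above, the following minimal sketch compares the node count that the _cluster/health API reports with the count that you expect. The function names node_count and check_nodes are hypothetical helpers, domain-endpoint is a placeholder, and the sketch assumes that your access policy allows the request. It only parses a response that you pass in and doesn't call the domain on its own.

```shell
#!/bin/sh
# Hypothetical helper: extract "number_of_nodes" from a _cluster/health JSON response.
node_count() {
  printf '%s' "$1" | sed -n 's/.*"number_of_nodes"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: report whether fewer nodes are present than expected.
# $1 = expected node count, $2 = JSON body returned by _cluster/health
check_nodes() {
  expected="$1"
  actual="$(node_count "$2")"
  if [ -z "$actual" ]; then
    echo "could not read number_of_nodes"
    return 2
  fi
  if [ "$actual" -lt "$expected" ]; then
    echo "node down: $actual of $expected nodes present"
    return 1
  fi
  echo "ok: $actual of $expected nodes present"
}

# Example usage (not run here; replace domain-endpoint and the expected count):
#   check_nodes 3 "$(curl -s 'domain-endpoint/_cluster/health')"
```

As with the Nodes metric, the reported count might be temporarily inaccurate during configuration changes or service maintenance.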

Identify and troubleshoot overloaded nodes

Under high traffic, high CPU utilization and JVM memory pressure can cause nodes to drop from the cluster. When a node can't manage the load, it can become unresponsive and crash.

To troubleshoot this issue, reboot the node. Make sure that you adhere to the node reboot requirements.

If you still encounter issues, then check and reduce the CPU utilization and JVM memory pressure on your OpenSearch Service cluster.
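To see which nodes are under load, you can read per-node CPU and heap usage from the _cat/nodes API. The following sketch filters that output for nodes above a threshold. The function name flag_overloaded is a hypothetical helper, and the thresholds of 80 percent CPU and 75 percent heap are illustrative assumptions, not official limits.

```shell
#!/bin/sh
# Hypothetical helper: read "name cpu heap.percent" lines on stdin and print
# the nodes whose CPU or heap usage exceeds the given thresholds.
# $1 = CPU threshold (percent), $2 = heap threshold (percent)
flag_overloaded() {
  awk -v cpu_max="$1" -v heap_max="$2" '
    NF >= 3 && ($2 + 0 > cpu_max + 0 || $3 + 0 > heap_max + 0) {
      printf "%s cpu=%s heap=%s\n", $1, $2, $3
    }'
}

# Example usage (not run here; thresholds are assumptions, adjust for your workload):
#   curl -s 'domain-endpoint/_cat/nodes?h=name,cpu,heap.percent' | flag_overloaded 80 75
```

A single reading is only a snapshot. Sustained values in the CPUUtilization and JVMMemoryPressure metrics are a better indicator of an overloaded node.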

Identify and troubleshoot hardware failure issues

Hardware failures can affect the availability of cluster nodes. OpenSearch Service performs periodic health checks on each node. If a node fails its health checks, then OpenSearch Service either allows the node to rejoin the cluster or automatically replaces it with a new, healthy node.

Use replication to reduce the risk of data loss

Run the following command to activate replicas for your indices to serve as a backup in case OpenSearch Service replaces a node that crashed:

curl -XPUT 'domain-endpoint/indexname/_settings' -H 'Content-Type: application/json' -d '{ "index" : { "number_of_replicas" : 1 }}'

Note: Replace domain-endpoint with your domain endpoint and indexname with your index name.

Replica shards provide data redundancy and allow the cluster to continue to serve requests even if a primary shard becomes unavailable. It's a best practice to configure at least one replica for each index. Multi-node clusters without replica shards are at risk of data loss. For more information, see Sizing Amazon OpenSearch Service domains.
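To find indices that have no replica configured, you can list the replica count for each index with the _cat/indices API. The following sketch filters that output. The function name indices_without_replicas is a hypothetical helper, and domain-endpoint is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: read "index rep" lines on stdin and print the names of
# the indices whose replica count is 0.
indices_without_replicas() {
  awk 'NF >= 2 && $2 == 0 { print $1 }'
}

# Example usage (not run here):
#   curl -s 'domain-endpoint/_cat/indices?h=index,rep' | indices_without_replicas
```

Each index that the function prints is a candidate for the number_of_replicas setting change.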

It's a best practice to use more than one data node in each cluster. You can't use replica shards for single-node clusters because you can't assign primary and replica shards to the same node. If the node crashes, then you experience data loss. This occurs even if you activated fine-grained access control for your cluster. If your single-node cluster crashes, then use an index snapshot to restore the lost data.

Important: You can only recover the data that you captured in your most recent snapshot.
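The following sketch shows what a restore request for a single index can look like. It only prints the curl command so that you can review it before you run it. The function name restore_command is a hypothetical helper, and snapshot-name and my-index are placeholders. The repository name cs-automated is an assumption about your domain: automated snapshots typically use cs-automated, or cs-automated-enc for domains with encryption at rest.

```shell
#!/bin/sh
# Hypothetical helper: print (but don't run) a restore request for one index.
# $1 = domain endpoint, $2 = snapshot repository, $3 = snapshot name, $4 = index name
restore_command() {
  endpoint="$1"; repo="$2"; snapshot="$3"; index="$4"
  printf "curl -XPOST '%s/_snapshot/%s/%s/_restore' -H 'Content-Type: application/json' -d '{\"indices\": \"%s\"}'\n" \
    "$endpoint" "$repo" "$snapshot" "$index"
}

# Example usage (all arguments are placeholders):
#   restore_command domain-endpoint cs-automated snapshot-name my-index
```

A restore fails if an open index with the same name already exists on the domain, so you might need to delete or rename the existing index first.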

Configure a Multi-AZ domain

When you configure a Multi-AZ domain, OpenSearch Service launches data nodes in multiple Availability Zones. OpenSearch Service distributes primary shards and their corresponding replica shards to different Availability Zones. If there's a failure in one node or zone, then your data is still available.

Related information

Operational best practices for Amazon OpenSearch Service

How do I improve the fault tolerance of my OpenSearch Service domain?

How do I scale up or scale out an OpenSearch Service domain?

Why is my OpenSearch Service domain stuck in the "Modifying" state?

AWS OFFICIAL | Updated 4 months ago