
OpenSearch domain nodes out of service - connection refused on port 443

0

Region: ap-south-1
OpenSearch Version: 3.3
Issue: Our OpenSearch domain is completely unreachable. The cluster health shows "-" in the AWS console and the domain is not accepting any connections.

What we have tried:

  • Curl from within the VPC returns "Connection refused" on port 443
  • SSM port forwarding tunnel opens successfully but every connection to port 443 fails
  • Tags operation returns: "DP Nodes are OOS, Tags operation is not allowed"
  • Rollback failed with "No previous software version available for rollback"
  • Domain status shows Active but cluster is unresponsive
  • Attempted to apply service software update but no update option was available due to OOS state

Curl output from within the VPC:

  • Trying [private-ip]:443...

  • connect to [private-ip] port 443 failed: Connection refused

  • Failed to connect to [opensearch-endpoint] port 443

What is working:

  • DNS resolves correctly to the VPC private IP
  • EC2 node appears up (Nodes=1 in CloudWatch)
  • All other application services running normally
  • Only OpenSearch is unreachable

Question: Is there any way to recover the OpenSearch cluster without upgrading the AWS support plan? Has anyone faced "DP Nodes are OOS" and recovered without contacting AWS support?

asked 4 days ago, 18 views
2 Answers
2

As I understand it, the other answer overlooks a critical detail in your metrics: your cluster has Nodes=1. Because this is a single-node setup, the administrative process and node restarts will not work (they require at least three data nodes), and the SSM runbook will fail because it has to call the OpenSearch APIs over port 443, which is refusing connections.
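Before changing anything, it may be worth pulling the domain's CloudWatch metrics for the period around the outage, for example Nodes and JVMMemoryPressure, to see whether memory pressure was climbing before the node went dark. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the function names are made up for illustration, and the domain name and account ID are placeholders you would supply. OpenSearch Service metrics are published under the AWS/ES namespace with DomainName and ClientId (your account ID) dimensions.

```python
from datetime import datetime, timedelta, timezone


def build_metric_query(domain_name, account_id, metric_name, hours=24, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ES",  # OpenSearch Service publishes under AWS/ES
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},  # AWS account ID
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Maximum"],
    }


def fetch_max_datapoints(domain_name, account_id, metric_name, region="ap-south-1"):
    """Return the metric's datapoints, oldest first (hypothetical helper)."""
    import boto3  # imported lazily so the builder above works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    response = cloudwatch.get_metric_statistics(
        **build_metric_query(domain_name, account_id, metric_name)
    )
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

If the node stopped reporting, the datapoints will simply end, so the interesting values are the last few before the gap. Widen the hours argument if the outage started more than a day ago.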

"DP Nodes are OOS" on a single node means the underlying instance or Java process crashed completely (likely due to an Out-of-Memory event). Since the domain state is still "Active", your only self-service option is to force AWS to provision a new underlying host by triggering a Blue/Green deployment.

Try applying one of these configuration changes via the Console or AWS CLI:

  • Instance Type Modification: Change the instance type minimally (e.g., from t3.medium.search to another supported instance family/size and back).

  • Storage Adjustment: Increase the EBS volume size by a few gigabytes to force a volume modification and host re-evaluation.

  • Toggle Dedicated Master Nodes: Temporarily enable dedicated master nodes (or disable them if they are already enabled). This changes the cluster topology and should force the service to rebuild the cluster.
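As an illustration of the storage option, here is a minimal boto3 sketch. It assumes an EBS-backed domain, configured credentials, and the DryRun parameter of update_domain_config, which (as far as I know) reports the deployment type the change would trigger without applying it. The function names and the 5 GiB default are made up for the example.

```python
def build_volume_bump_request(domain_name, current_ebs_options, extra_gib=5, dry_run=True):
    """Build update_domain_config arguments that grow the EBS volume slightly."""
    ebs = dict(current_ebs_options)  # copy so the caller's dict is not mutated
    ebs["VolumeSize"] = int(ebs["VolumeSize"]) + extra_gib
    return {"DomainName": domain_name, "EBSOptions": ebs, "DryRun": dry_run}


def request_volume_bump(domain_name, region="ap-south-1", extra_gib=5, dry_run=True):
    """Read the current EBS options and submit the change (hypothetical helper)."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("opensearch", region_name=region)
    status = client.describe_domain(DomainName=domain_name)["DomainStatus"]
    request = build_volume_bump_request(
        domain_name, status["EBSOptions"], extra_gib=extra_gib, dry_run=dry_run
    )
    response = client.update_domain_config(**request)
    # With DryRun=True the response should carry DryRunResults;
    # otherwise it carries the updated DomainConfig.
    return response.get("DryRunResults", response.get("DomainConfig"))
```

Running it with dry_run=True first shows whether the control plane accepts the change at all while the node is out of service. Only if it does would you repeat the call with dry_run=False.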

Note: If the AWS Control Plane blocks these modifications with a 400 Bad Request or an internal error due to the OOS state, you cannot recover this yourself. If upgrading to a paid support plan is absolutely off the table, your only option is to delete and recreate the domain from a snapshot, or wait and hope that the automated AWS hypervisor health checks eventually cycle the unresponsive hardware.

EXPERT
answered 4 days ago
0

When your OpenSearch domain shows "DP Nodes are OOS" (Data Plane Nodes are Out of Service) and the cluster is completely unresponsive with connection refused errors, this indicates a critical issue where the OpenSearch process on your data nodes has stopped or failed.

Here are some steps you can take:

Immediate Actions:

  1. Restart the OpenSearch Process: Amazon OpenSearch Service provides administrative options to restart the OpenSearch process on a data node. You can perform this action through the AWS Management Console, AWS CLI, or AWS SDK. However, note that your domain must not be in a red state and all indexes must have replicas configured before initiating a restart.

  2. Restart the Data Node: If restarting the process doesn't resolve the issue, you can restart the data node itself. This is a more aggressive action but may be necessary when the OpenSearch process is completely unresponsive.

Important Prerequisites:

  • Your domain needs at least three data nodes for process and node restarts to be supported
  • Multi-AZ with Standby configurations are not supported for these administrative actions
  • These options are available for Elasticsearch versions 7.x and higher
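The first two prerequisites can be checked programmatically before attempting a restart. This is a rough sketch, assuming boto3 and the StartDomainMaintenance API (start_domain_maintenance in boto3); the helper names are invented for the example, and the node ID is a placeholder you would have to look up yourself.

```python
def restart_blockers(cluster_config):
    """Return the reasons an administrative restart would not be supported."""
    blockers = []
    if cluster_config.get("InstanceCount", 0) < 3:
        blockers.append("fewer than three data nodes")
    if cluster_config.get("MultiAZWithStandbyEnabled", False):
        blockers.append("Multi-AZ with Standby is enabled")
    return blockers


def restart_search_process(domain_name, node_id, region="ap-south-1"):
    """Restart the OpenSearch process on one node if the prerequisites hold."""
    import boto3  # imported lazily so restart_blockers works without boto3

    client = boto3.client("opensearch", region_name=region)
    config = client.describe_domain(DomainName=domain_name)["DomainStatus"]["ClusterConfig"]
    blockers = restart_blockers(config)
    if blockers:
        raise RuntimeError("restart not supported: " + "; ".join(blockers))
    return client.start_domain_maintenance(
        DomainName=domain_name,
        Action="RESTART_SEARCH_PROCESS",  # REBOOT_NODE restarts the whole node
        NodeId=node_id,
    )
```

For the domain in the question (Nodes=1), restart_blockers would report the node-count blocker, so the restart call would never be reached.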

Diagnostic Approach:

You can use the AWSSupport-TroubleshootOpenSearchRedYellowCluster automation runbook through AWS Systems Manager to help diagnose the issue. This runbook can identify causes of cluster health problems and provide recommended resolution steps. It works by creating a Lambda function that runs OpenSearch APIs against your cluster to determine the root cause.

If Administrative Actions Don't Work:

Given that your domain shows as "Active" but is completely unresponsive, and rollback has failed, this suggests a severe underlying issue. The "DP Nodes are OOS" message combined with connection refused errors typically indicates the data plane is not functioning at all.
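It helps to confirm that the failure really is a refusal and not a timeout, because the two point in different directions: a refused connection means the host is reachable but nothing is listening on the port, while a timeout more often suggests a security group or routing problem. A small standard-library sketch (the function name is arbitrary; run it from a host inside the VPC against the domain endpoint):

```python
import socket


def probe_port(host, port=443, timeout=5.0):
    """Classify a TCP connection attempt as open, refused, timeout, or error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but no listener on the port.
        return "refused"
    except socket.timeout:
        # No answer at all: more typical of filtering or routing issues.
        return "timeout"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"
```

A "refused" result matches the curl output in the question and is consistent with the OpenSearch process being down rather than with a network misconfiguration.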

Unfortunately, when the administrative restart options are unavailable or ineffective and the domain is in this state, you may need to open a case with AWS Support through the AWS Support Center (https://console.aws.amazon.com/support/) and describe the situation there. You might also want to review your AWS Support plan options, as higher-tier plans provide faster response times for production-impacting issues like this.

Prevention for Future:

answered 4 days ago
