OpenSearch blue/green deployment stuck at cleanup phase - 6 nodes instead of target 2

Issue: Blue/green deployment stuck during node cleanup phase

Domain Details:

  • Domain: opensearch
  • Region: us-east-1
  • Account: 028162106728
  • Change ID: 8f40c2ca-d95c-4803-a21e-b048d4da4490

Current Status:

  • Console shows: "Copying shards to new nodes - 2994 of 2997 completed"
  • Actual cluster state: 6 nodes running (should be 2)
  • All shards are STARTED and healthy
  • No pending cluster tasks
  • Cluster is green and fully functional

Target Configuration:

  • Instance count: 2 nodes
  • EBS: 4TB gp3 with 25,000 IOPS
  • Zone awareness: Disabled

Evidence:

  1. _cat/shards shows all shards in STARTED state
  2. _cluster/pending_tasks returns empty array
  3. _cat/nodes shows 6 healthy nodes with 1 master
  4. Applications working normally
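
The checks above could be scripted roughly as follows. This is a minimal sketch, assuming the three responses were fetched with `format=json` and already parsed into Python objects. The function name `summarize_evidence` and the `target_nodes` parameter are hypothetical, and the column that marks the elected node is named `master` or `cluster_manager` depending on the engine version.

```python
from collections import Counter


def summarize_evidence(shards, pending_tasks, nodes, target_nodes=2):
    """Condense parsed _cat/shards, _cluster/pending_tasks and _cat/nodes output."""
    # Count shards by state; anything other than STARTED deserves a closer look.
    states = Counter(row.get("state") for row in shards)
    not_started = sum(n for state, n in states.items() if state != "STARTED")

    # _cluster/pending_tasks wraps its list in a "tasks" key.
    pending = len(pending_tasks.get("tasks", []))

    # The elected node is marked "*"; the column name varies by version.
    elected = [
        row for row in nodes
        if (row.get("cluster_manager") or row.get("master")) == "*"
    ]

    return {
        "shards_not_started": not_started,
        "pending_tasks": pending,
        "node_count": len(nodes),
        "extra_nodes": len(nodes) - target_nodes,
        "elected_nodes": len(elected),
    }
```

For the state described here, the expected result would be zero shards outside STARTED, zero pending tasks, six nodes, and four extra nodes relative to the target of two.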

Request: Please manually complete the blue/green deployment cleanup phase. The deployment appears functionally complete but AWS console tracking is stuck.

Timeline: Deployment started ~24 hours ago, stuck at same progress for 12+ hours.

asked 2 months ago, 40 views
1 Answer

I understand your OpenSearch Service domain is stuck in a blue/green deployment with 6 nodes running instead of the target 2 nodes, even though shard migration appears complete and the cluster is healthy.

A domain shows a "Processing" or "Modifying" state while a configuration change is in progress. During a blue/green deployment, OpenSearch Service creates a new environment with new nodes and migrates shards from the old nodes to the new ones. The domain can get stuck in this state if:

  • The shard relocation process encounters issues, even if shards appear healthy
  • Validation checks fail with errors
  • The system is unable to properly terminate the old nodes after migration

Several factors can cause shard relocation to stall during blue/green deployments:

  • Large shard sizes (over 50 GB) that increase copy time
  • High index or search load causing elevated resource usage
  • Insufficient disk space on new nodes
  • Reaching shard count quotas
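
To check the first factor, the sketch below flags shards larger than 50 GB. It assumes the rows come from `_cat/shards?format=json&bytes=b`, so that `store` is a byte count given as a string, and it treats the threshold as 50 GiB. The function name `large_shards` is hypothetical.

```python
GIB = 1024 ** 3


def large_shards(shard_rows, threshold_gib=50):
    """Return (index, shard, prirep, size_gib) for shards above the threshold."""
    flagged = []
    for row in shard_rows:
        store = row.get("store")
        if store is None:  # unassigned shards report no store size
            continue
        size_gib = int(store) / GIB
        if size_gib > threshold_gib:
            flagged.append(
                (row["index"], row["shard"], row["prirep"], round(size_gib, 1))
            )
    # Largest first, since those take the longest to copy.
    return sorted(flagged, key=lambda item: item[3], reverse=True)
```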

To help diagnose the root cause, I recommend:

  1. Monitor CloudWatch metrics for your domain, specifically:
  • FreeStorageSpace
  • ClusterStatus
  • ClusterIndexWritesBlocked
  • CPUUtilization
  • JVMMemoryPressure
  2. Check the domain description in the console or via API for any failed activities or validation-related error messages

  3. Monitor shard migration using the _cat/recovery API to see if any recovery operations are still in progress
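
The API check and the recovery check could look roughly like this. It is a sketch under assumptions, not a definitive procedure: it uses the boto3 `opensearch` client's `describe_domain_change_progress` call, it assumes `_cat/recovery` rows were fetched with `format=json`, and the helper names are hypothetical. The pure helpers work on already-parsed responses, so they can be tried without AWS credentials.

```python
def summarize_change_progress(response):
    """Flatten a DescribeDomainChangeProgress response into a short report."""
    status = response.get("ChangeProgressStatus", {})
    return {
        "change_id": status.get("ChangeId"),
        "status": status.get("Status"),
        # Each stage is reported as a (name, status) pair, in API order.
        "stages": [
            (stage.get("Name"), stage.get("Status"))
            for stage in status.get("ChangeProgressStages", [])
        ],
    }


def active_recoveries(recovery_rows):
    """Keep _cat/recovery rows whose stage is not yet 'done'."""
    return [row for row in recovery_rows if row.get("stage") != "done"]


def fetch_change_progress(domain_name, change_id, region):
    """Call the OpenSearch Service API; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    client = boto3.client("opensearch", region_name=region)
    return client.describe_domain_change_progress(
        DomainName=domain_name, ChangeId=change_id
    )
```

An empty list from `active_recoveries` together with a change that still reports unfinished stages would be consistent with a migration that has finished while the cleanup has not.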

Since you've confirmed that all shards are STARTED and there are no pending tasks, the migration itself appears to have completed and the cleanup phase has failed to proceed. I cannot manually complete the deployment cleanup, because that requires access to internal service systems that only AWS Support has.

I strongly recommend opening an AWS Support case with the details you've provided (domain name, region, account ID, and Change ID). The support team can investigate the stuck deployment state and safely complete the cleanup process to remove the old nodes and finalize the configuration change.

In the meantime, avoid making additional configuration changes to the domain, as this could complicate the current stuck state.
Sources
Troubleshooting Amazon OpenSearch Service - Amazon OpenSearch Service
Troubleshoot stuck or failed upgrades in OpenSearch Service | AWS re:Post

answered 2 months ago
EXPERT
reviewed 2 months ago
  • The re:Post Agent's answer above is accurate. However, because the cluster state is green and all shards are reported as STARTED, this is clearly a workflow stuck in the AWS control plane rather than a problem inside the cluster. The suggested CloudWatch metrics are useful for verification, but as the re:Post Agent noted, only AWS Support can manually clear the Change ID and terminate the legacy nodes.