OpenSearch service software update stuck in "Applying changes" for more than 6 months

Summary

A service-initiated mandatory OpenSearch software update on one of our
domains has been stuck in DomainProcessingStatus: UpdatingServiceSoftware for an extended period (many months). The cluster itself has been
healthy throughout. I'd like to either force-complete or force-cancel the
deployment so I can run normal update-domain-config operations again.

Domain shape (general)

  • Engine: OpenSearch 2.15

What's stuck

describe-domain-change-progress for the current ChangeId shows the
blue/green deployment completed 3 of 5 stages and froze on stage 4:

Stage                          Status
Validation                     COMPLETED
Creating a new environment     COMPLETED
Provisioning new nodes         COMPLETED
Copying shards to new nodes    IN_PROGRESS
Deleting older resources       PENDING

Stage 4 is frozen part-way through: a small number of shards never copied
to the new nodes, and LastUpdatedTime hasn't advanced in a long time.
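
A stall like this can also be flagged from a script. The following is a minimal sketch, assuming the response shape of boto3's describe_domain_change_progress (a ChangeProgressStatus dict holding ChangeProgressStages entries with Name, Status and LastUpdated). The helper name find_stalled_stage and the 24-hour threshold are hypothetical, and the timestamps in the sample are invented for illustration.

```python
from datetime import datetime, timedelta, timezone


def find_stalled_stage(change_progress, now, max_idle=timedelta(hours=24)):
    """Return the name of the first IN_PROGRESS stage idle longer than max_idle.

    `change_progress` is assumed to be the ChangeProgressStatus dict from a
    DescribeDomainChangeProgress response. Returns None if nothing looks stalled.
    """
    for stage in change_progress.get("ChangeProgressStages", []):
        if stage.get("Status") != "IN_PROGRESS":
            continue
        last_updated = stage.get("LastUpdated")
        if last_updated is not None and now - last_updated > max_idle:
            return stage["Name"]
    return None


# Illustrative data only: stage names follow the post, timestamps are made up.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
old = now - timedelta(days=200)
sample = {
    "ChangeProgressStages": [
        {"Name": "Validation", "Status": "COMPLETED", "LastUpdated": old},
        {"Name": "Creating a new environment", "Status": "COMPLETED", "LastUpdated": old},
        {"Name": "Provisioning new nodes", "Status": "COMPLETED", "LastUpdated": old},
        {"Name": "Copying shards to new nodes", "Status": "IN_PROGRESS", "LastUpdated": old},
        {"Name": "Deleting older resources", "Status": "PENDING"},
    ]
}
stalled = find_stalled_stage(sample, now)
print(stalled)
```

Against a live domain, the input would be the ChangeProgressStatus value returned by the boto3 "opensearch" client's describe_domain_change_progress call, with `now` set to the current UTC time.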

ServiceSoftwareOptions:

UpdateStatus: IN_PROGRESS
Cancellable: false
OptionalDeployment: false
UpdateAvailable: false (because a deployment is already in progress)
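
Whether the customer can cancel the update follows directly from these fields. Here is a minimal sketch, assuming the ServiceSoftwareOptions dict from a DescribeDomain response and the documented rule that cancel-service-software-update only applies while UpdateStatus is PENDING_UPDATE. The helper name update_is_self_cancellable is hypothetical.

```python
def update_is_self_cancellable(options):
    """True only if the update is still scheduled and flagged as cancellable.

    `options` is assumed to be the ServiceSoftwareOptions dict from a
    DescribeDomain response.
    """
    still_scheduled = options.get("UpdateStatus") == "PENDING_UPDATE"
    return bool(options.get("Cancellable")) and still_scheduled


# Values as reported in the post.
stuck_options = {
    "UpdateStatus": "IN_PROGRESS",
    "Cancellable": False,
    "OptionalDeployment": False,
    "UpdateAvailable": False,
}
can_cancel = update_is_self_cancellable(stuck_options)
print(can_cancel)
```

For the values above the check comes back negative, which matches the observation that the deployment cannot be cancelled from the customer side.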

The CurrentVersion is a service software patch from a long time ago,
suggesting earlier service-initiated updates also stalled and retried —
this looks like a recurring stuck-deployment pattern on this domain
rather than a one-off.

What's actually happening on the cluster

The shard-copy stall self-resolved at some point: _cluster/health
right now reports a healthy cluster with 4 nodes.

Ingest is flowing, search is fine, ISM retention works, no ClusterIndexWritesBlocked. The remaining symptoms are:

  • DomainProcessingStatus permanently reads UpdatingServiceSoftware.
  • update-domain-config calls are rejected with "Domain is being processed".
  • The 4-node count (vs the configured InstanceCount of 2) means the
    blue/green deployment never tore down the old environment.
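
The node-count symptom can be checked mechanically by comparing _cluster/health with the configured InstanceCount. Below is a minimal sketch, assuming the standard _cluster/health fields (number_of_data_nodes, relocating_shards, initializing_shards, unassigned_shards) and a domain without dedicated master nodes, so that all 4 nodes are data nodes. The helper names are hypothetical, and the "green" status in the sample is an assumption, not a value from the post.

```python
def shard_copy_finished(cluster_health):
    """True if no shards are relocating, initializing, or unassigned."""
    keys = ("relocating_shards", "initializing_shards", "unassigned_shards")
    return all(cluster_health.get(key, 0) == 0 for key in keys)


def leftover_data_nodes(cluster_health, configured_instance_count):
    """Number of data nodes running beyond the configured count (0 if none)."""
    running = cluster_health["number_of_data_nodes"]
    return max(0, running - configured_instance_count)


# Illustrative _cluster/health excerpt: node counts follow the post,
# the status and shard counters are assumed.
health = {
    "status": "green",
    "number_of_nodes": 4,
    "number_of_data_nodes": 4,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0,
}
extra = leftover_data_nodes(health, configured_instance_count=2)
print(shard_copy_finished(health), extra)
```

A finished shard copy combined with a non-zero leftover count is consistent with a blue/green deployment whose old environment was never torn down.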

What I've tried

  • Waited a long time.

  • Verified the cluster is healthy via OpenSearch APIs.

  • Confirmed Cancellable: false, so I can't cancel the deployment myself.

  • Confirmed UpdateAvailable: false while a deployment is in progress,
    so I can't kick a fresh patch over the top.

  • Did not attempt any update-domain-config — would be rejected, and
    I don't want to confuse the state machine further.

Is there anything I can do from the customer side to nudge the
deployment to completion (or cancellation), or does this require AWS-side intervention?

asked 16 days ago, 41 views
1 Answer

Based on the situation you've described, this requires AWS-side intervention. Unfortunately, you cannot cancel an upgrade or service software update once it is in progress (your domain reports Cancellable: false); you must wait for AWS to resolve the underlying issue before the process can complete.

Your domain has been stuck in the "UpdatingServiceSoftware" state for an extended period (months) during the shard relocation phase of the blue/green deployment, and you've already verified that:

  • The cluster is healthy
  • Shards have actually completed copying
  • The update is marked as non-cancellable
  • You cannot perform any domain configuration updates

This points to the OpenSearch Service control plane failing to detect that the shard copy has finished, so the deployment never proceeds to the final stage of deleting the older resources.

The fact that you have 4 nodes instead of your configured 2 nodes confirms that the blue/green deployment never completed its cleanup phase, leaving both the old and new environments running.

You should contact AWS Support to resolve this issue. They will need to:

  1. Investigate why the deployment state machine failed to progress beyond the shard copying stage
  2. Manually complete or cancel the stuck deployment
  3. Clean up the orphaned resources from the incomplete blue/green deployment
  4. Restore your ability to perform domain configuration updates

Given that you mentioned this appears to be a recurring pattern on this domain, AWS Support should also investigate the root cause to prevent future occurrences of this issue during service software updates.
Sources
Troubleshoot stuck or failed upgrades in OpenSearch Service | AWS re:Post

answered 16 days ago
AWS
SUPPORT ENGINEER
reviewed 15 days ago
EXPERT
reviewed 16 days ago