OpenSearch cluster green but stuck in processing for 2 weeks

My OpenSearch cluster has been stuck in "processing" since the last Auto-Tune event, even though the cluster status is green across the board. The cluster is usable without issue (reading, writing, Kibana), but the processing state prevents me from performing an upgrade or applying other config changes.

Monitoring shows:

  • Cluster status green
  • Instance count is 9, as expected: 3 master and 6 data nodes
  • JVM memory pressure looks good: the expected "sawtooth" curve never exceeds 75% and drops as low as 45%
  • Most recent update was a service software update to R20211203-P2. It seems to have taken 5 days, but it looks like it completed successfully (judging by the instance count graph)
  • The cluster is usable without issue, Kibana is reachable and responsive, constantly writing to the cluster without error, nothing seems off

Rough timeline:

  • 19.12.2021 - update to R20211203-P2, instance count is doubled to 18 (expected blue/green deployment)
  • 24.12.2021 - instance count drops back to the expected 9, cluster status green
  • 26.12.2021 - Notification "Auto-Tune is applying new settings to your domain", instance count doesn't rise, still at 9
  • now - Cluster still stuck at "processing" even though everything is green

What I tried:

  • GET /_cluster/allocation/explain responds with "unable to find any unassigned shards to explain" which makes sense
  • GET /_cat/indices?v shows everything green, as expected
  • I also tried modifying the disk size to "kick" the cluster into a blue/green deployment, hoping it would get unstuck, but no deployment seemed to start
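
For reference, the "processing" state can also be read programmatically rather than from the console. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the domain name `my-domain` is a hypothetical placeholder. It relies on the `Processing` and `UpgradeProcessing` flags that `describe_domain` returns in `DomainStatus`.

```python
def is_domain_processing(domain_status: dict) -> bool:
    """Return True if the domain reports a config change or upgrade in progress."""
    return bool(
        domain_status.get("Processing") or domain_status.get("UpgradeProcessing")
    )


def fetch_domain_status(domain_name: str) -> dict:
    """Fetch DomainStatus for one domain (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("opensearch")
    return client.describe_domain(DomainName=domain_name)["DomainStatus"]


# Example call (hypothetical domain name, not run here):
#   status = fetch_domain_status("my-domain")
#   print(is_domain_processing(status))
```

A green cluster with `Processing` still set to true would match the situation described here: the cluster itself is healthy, but the service still considers a configuration change to be in flight.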

The only possible clue is in the CloudWatch error logs: a "master not discovered yet" message has been repeating since the last Auto-Tune event started on 26.12.2021. I'll try to pretty-print it below:

[2022-01-11T06:36:23,761][WARN ][o.o.c.c.ClusterFormationFailureHelper] [52cb02d8573b17516f7756d5fe05484d] master not discovered yet: have discovered [

{***}{***}{***}{__IP__}{__IP__}{dir}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__}, 

{***}{***}{***}{__IP__}{__IP__}{imr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__}, 

{***}{***}{***}{__IP__}{__IP__}{imr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__}, 

{***}{***}{***}{__IP__}{__IP__}{imr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__}, 

{***}{***}{***}{__IP__}{__IP__}{imr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__}, 

{***}{***}{***}{__IP__}{__IP__}{imr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__}, 

{***}{***}{***}{__IP__}{__IP__}{imr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__}

]; 

discovery will continue using [__IP__, __IP__, __IP__, __IP__, __IP__, [__IP__]:9301, [__IP__]:9302, [__IP__]:9303, [__IP__]:9304, [__IP__]:9305, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__] from hosts providers and [] from last-known cluster state; node term 36, last-accepted version 0 in term 0

I masked the node IDs and replaced them with ***. The log message lists 7 nodes above. I recognize only 3 of the IDs as my master nodes; the rest are unfamiliar (they are not my data nodes), and I'm not sure I understand what's going on here. Any help would be appreciated.
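
To make a message like this easier to read, the discovered nodes can be tallied by their role string (the sixth `{...}` group of each node entry, `dir` or `imr` in the log above). The following is a minimal sketch, not a diagnostic tool. The letter-to-role mapping is an assumption based on the usual node role abbreviations, and the sample message is a shortened stand-in for the real log line.

```python
import re
from collections import Counter

# Assumed mapping of role letters to role names (not confirmed by the log itself).
ROLE_LETTERS = {
    "d": "data",
    "i": "ingest",
    "m": "master",
    "r": "remote_cluster_client",
}

# A node entry is seven brace groups; the sixth one holds the role letters.
NODE_PATTERN = re.compile(r"(?:\{[^{}]*\}){5}\{([a-z]*)\}\{[^{}]*\}")


def count_roles(message: str) -> Counter:
    """Count discovered nodes per role string, e.g. Counter({'imr': 6, 'dir': 1})."""
    return Counter(NODE_PATTERN.findall(message))


def master_eligible(message: str) -> int:
    """Number of discovered nodes whose role string contains 'm'."""
    return sum(n for roles, n in count_roles(message).items() if "m" in roles)


def describe(roles: str) -> list:
    """Expand a role string such as 'imr' into role names."""
    return [ROLE_LETTERS.get(letter, letter) for letter in roles]


def make_node(roles: str) -> str:
    # Shortened node entry; the real attribute list is much longer.
    return "{***}{***}{***}{__IP__}{__IP__}{" + roles + "}{dp_version=20210501}"


# Hypothetical sample mirroring the log above: one 'dir' node and six 'imr' nodes.
SAMPLE = (
    "master not discovered yet: have discovered ["
    + ", ".join([make_node("dir")] + [make_node("imr")] * 6)
    + "]; discovery will continue"
)

if __name__ == "__main__":
    for roles, count in count_roles(SAMPLE).items():
        print(roles, describe(roles), count)
    print("master-eligible:", master_eligible(SAMPLE))
```

Under the assumed mapping, the message above lists one `dir` node and six `imr` nodes, so six of the seven discovered nodes would be master-eligible, while the domain has only 3 dedicated master nodes. The sketch only counts entries; it does not explain where the extra nodes come from.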

ggmabob
asked 3 years ago · 589 views
1 Answer
Thank you for posting your question. For the scenario you have described, please create a ticket with AWS Support so they can help you resolve the issue.

AWS
EXPERT
answered 3 years ago
  • Thanks for your response, Fabrizio! One small follow-up question: is there a specific ticket type/category I can choose to make sure it reaches the right team? I should mention that I only have the Basic support plan. Any hint is welcome, thanks again.

  • I only noticed it now, but it seems that the issue was actually resolved right around the same time you posted this answer, Fabrizio. Did you trigger something internally?

    As far as I can tell, the issue is completely gone. The "master not discovered yet" message is gone from the logs and the cluster is "Active" again, no longer in processing. If that was you, then thank you!

  • Hi, it was not me, just a coincidence :-). Happy it was resolved, and sorry I did not see your comment earlier.
