OpenSearch cluster green but stuck in processing for 2 weeks
My OpenSearch cluster has been stuck in "processing" since the last Auto-Tune event. The cluster status is green across the board, and the cluster is usable without issue (reading, writing, Kibana), but the processing state prevents me from performing an upgrade or applying other config changes.
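For context, this is what I mean by "stuck in processing": the domain's Processing flag never clears. A minimal sketch of how I check it with boto3 (the domain name is a placeholder):

```python
import boto3

# Placeholder domain name -- replace with your own.
DOMAIN_NAME = "my-domain"

client = boto3.client("opensearch")
status = client.describe_domain(DomainName=DOMAIN_NAME)["DomainStatus"]

# "Processing" stays True while a config change / blue-green deployment is in
# flight; in my case it has been True for ~2 weeks even though the cluster is green.
print("Processing:        ", status["Processing"])
print("UpgradeProcessing: ", status["UpgradeProcessing"])
```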
Monitoring shows:
- Cluster status green
- Instance count is 9, as expected: 3 master and 6 data nodes
- JVM memory pressure looks good: the expected "sawtooth" curve never exceeds 75% and drops as low as 45% (see the CloudWatch sketch after this list)
- Most recent update was a service software update to R20211203-P2. It seems to have taken 5 days, but it looks like it completed cleanly (judging by the instance count graph)
- The cluster is usable without issue: Kibana is reachable and responsive, I'm constantly writing to the cluster without errors, nothing seems off
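For reference, here is roughly how I pull those numbers out of CloudWatch; a minimal sketch, assuming the standard AWS/ES namespace for OpenSearch Service domain metrics (the domain name and account ID are placeholders):

```python
import boto3
from datetime import datetime, timedelta, timezone

# Placeholders -- replace with your own domain name and AWS account ID.
DOMAIN_NAME = "my-domain"
ACCOUNT_ID = "123456789012"

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

def daily_max(metric_name):
    """Daily maximum of a domain metric over the last two weeks."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",  # OpenSearch Service domains publish under AWS/ES
        MetricName=metric_name,
        Dimensions=[
            {"Name": "DomainName", "Value": DOMAIN_NAME},
            {"Name": "ClientId", "Value": ACCOUNT_ID},
        ],
        StartTime=now - timedelta(days=14),
        EndTime=now,
        Period=86400,
        Statistics=["Maximum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

# Instance count (should be 9; doubles to 18 during a blue/green deployment).
for point in daily_max("Nodes"):
    print(point["Timestamp"].date(), "Nodes:", point["Maximum"])

# JVM memory pressure (the sawtooth, peaking below 75% in my case).
for point in daily_max("JVMMemoryPressure"):
    print(point["Timestamp"].date(), "JVMMemoryPressure:", point["Maximum"])
```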
Rough timeline:
- 19.12.2021 - update to R20211203-P2, instance count is doubled to 18 (expected blue/green deployment)
- 24.12.2021 - instance count drops back to the expected 9, cluster status green
- 26.12.2021 - Notification "Auto-Tune is applying new settings to your domain", instance count doesn't rise, still at 9
- now - Cluster still stuck at "processing" even though everything is green
What I tried:
- GET /_cluster/allocation/explain responds with "unable to find any unassigned shards to explain", which makes sense (see the sketch after this list for how I run these calls)
- GET /_cat/indices?v shows everything green, as expected
- I also tried modifying the disk size to "kick" the cluster into a blue/green deployment and hopefully get it unstuck, but that didn't seem to happen
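In case it helps, this is roughly how I run those two calls; a minimal sketch, assuming a domain with fine-grained access control and basic auth (the endpoint and credentials are placeholders; SigV4 signing would work too):

```python
import requests

# Placeholders -- replace with your domain endpoint and credentials.
ENDPOINT = "https://search-my-domain-xxxxxxxx.eu-west-1.es.amazonaws.com"
AUTH = ("admin", "admin-password")

# With no unassigned shards this returns HTTP 400 and
# "unable to find any unassigned shards to explain".
explain = requests.get(f"{ENDPOINT}/_cluster/allocation/explain", auth=AUTH)
print(explain.status_code, explain.text)

# Every index shows up green.
indices = requests.get(f"{ENDPOINT}/_cat/indices", params={"v": "true"}, auth=AUTH)
print(indices.text)
```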
The only possible clue is in the CloudWatch error logs: a repeating "master not discovered yet" message has been appearing since the last Auto-Tune event started on 26.12.2021. I'll try to pretty-print it below (there's also a sketch of how I pull it out of CloudWatch Logs at the end of this question):
[2022-01-11T06:36:23,761][WARN ][o.o.c.c.ClusterFormationFailureHelper] [52cb02d8573b17516f7756d5fe05484d] master not discovered yet: have discovered [
{***}{***}{***}{__IP__}{__IP__}{dir}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__},
{***}{***}{***}{__IP__}{__IP__}{imr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__},
{***}{***}{***}{__IP__}{__IP__}{imr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__},
{***}{***}{***}{__IP__}{__IP__}{imr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__},
{***}{***}{***}{__IP__}{__IP__}{imr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__},
{***}{***}{***}{__IP__}{__IP__}{imr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__},
{***}{***}{***}{__IP__}{__IP__}{imr}{dp_version=20210501, distributed_snapshot_deletion_enabled=false, cold_enabled=false, adv_sec_enabled=false, __AMAZON_INTERNAL__, shard_indexing_pressure_enabled=true, __AMAZON_INTERNAL__}
];
discovery will continue using [__IP__, __IP__, __IP__, __IP__, __IP__, [__IP__]:9301, [__IP__]:9302, [__IP__]:9303, [__IP__]:9304, [__IP__]:9305, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__, __IP__] from hosts providers and [] from last-known cluster state; node term 36, last-accepted version 0 in term 0
I masked the node IDs and replaced them with ***. The log message lists 7 of them above; I can only recognize 3 of the IDs as my master nodes and don't recognize the rest (they are not my data nodes), so I'm not sure I understand what's going on here. Any help would be appreciated.
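For completeness, here is roughly how I pull that repeating message out of the error log group; a minimal sketch, where the log group name is a placeholder for whatever was configured when error log publishing was enabled:

```python
import time
import boto3

# Placeholder -- replace with the log group configured for error log publishing.
LOG_GROUP = "/aws/aes/domains/my-domain/application-logs"

logs = boto3.client("logs")
two_weeks_ago_ms = int((time.time() - 14 * 24 * 3600) * 1000)

# Count occurrences of the repeating warning since the Auto-Tune event.
paginator = logs.get_paginator("filter_log_events")
count = 0
for page in paginator.paginate(
    logGroupName=LOG_GROUP,
    filterPattern='"master not discovered yet"',
    startTime=two_weeks_ago_ms,
):
    count += len(page["events"])

print(f"'master not discovered yet' occurrences in the last 14 days: {count}")
```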
Thank you for posting your question. For the scenario you have described, please create a ticket with AWS Support so they can help you resolve the issue.
I'm only noticing it now, but it seems the issue was actually resolved right around the time you posted this answer, Fabrizio. Did you trigger something internally?
As far as I can tell, the issue is completely gone. The "master not discovered yet" message is gone from the logs and the cluster is "Active" again, no longer in processing. If that was you, then thank you!
Hi, it was not me, just a coincidence :-). Happy it was resolved, and sorry I did not see your comment before.
Thanks for your response, Fabrizio! One small follow-up question: is there a specific ticket type/category I can choose to make sure it reaches the right team? I should mention I only have the Basic support plan. Any hint is welcome, thanks again.