How can I troubleshoot cross-cluster replication failures on my Amazon OpenSearch Service cluster?
My cross-cluster replication isn’t working on my Amazon OpenSearch Service cluster.
Description
You can set up a cross-cluster connection to replicate indexes from one domain to another. Before you begin, make sure that you adhere to the limitations, prerequisites, and permissions requirements.
Note: Cross-cluster replication doesn't work with data streams. For more information, see data streams on the OpenSearch website.
Resolution
Follow these troubleshooting steps for your use case.
Note: If you activated OpenSearch Service error logs, you can get additional troubleshooting information. For more information, see Viewing OpenSearch Service error logs.
Check the replication task status
- Check the progress of the bootstrapping state with the following command:
GET _cat/recovery?active_only=true
Example output:
index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
test-follower-index 0 1.8s snapshot index n/a n/a x.x.x.x d76fd4d86d2307b6xxxxxx replication-remote-repo-connection1 replication-remote-snapshot 24 24 100.0% 24 1596356 1596356 100.0% 1596356 0 0 100.0%
test-follower-index 1 2.8s snapshot index n/a n/a x.x.x.x 9ab1495309b8e53axxxxxx replication-remote-repo-connection1 replication-remote-snapshot 24 24 100.0% 24 1596356 1596356 100.0% 1596356 0 0 100.0%
test-follower-index 2 1.8s snapshot index n/a n/a x.x.x.x d76fd4d86d2307b6xxxxxx replication-remote-repo-connection1 replication-remote-snapshot 24 24 100.0% 24 1596356 1596356 100.0% 1596356 0 0 100.0%
test-follower-index 3 2.9s snapshot index n/a n/a x.x.x.x d76fd4d86d2307b6xxxxxx replication-remote-repo-connection1 replication-remote-snapshot 24 24 100.0% 24 1596356 1596356 100.0% 1596356 0 0 100.0%
test-follower-index 4 2.7s snapshot index n/a n/a x.x.x.x 9ab1495309b8e53xxxxxxx replication-remote-repo-connection1 replication-remote-snapshot 24 24 100.0% 24 1596356 1596356 100.0% 1596456 0 0 100.0%
Note: Cross-cluster replication doesn't support replication of system indices. For more information, see Cross-cluster replication limitations.
- If the index replication state is syncing, then check the replication status:
GET _plugins/_replication/<index_name>/_status?pretty
Example output:
{
  "status" : "PAUSED",
  "reason" : "Paused by AWS due to burstable instance type",
  "leader_alias" : "connection1",
  "leader_index" : "test-leader-index",
  "follower_index" : "test-follower-index"
}
- In the replication status output, note the "reason" field. Take any required action to resolve the replication failure, and then resume replication.
- (Optional) To remediate issues or reduce load on the leader domain, you can temporarily pause replication and then resume it. To end replication for the index, use the stop command. The commands are as follows:
POST _plugins/_replication/<index_name>/_pause
{}
POST _plugins/_replication/<index_name>/_resume
{}
POST _plugins/_replication/<index_name>/_stop
{}
Note: You can't resume replication after it's been paused for more than 12 hours. In that case, you must stop replication, delete the follower index, and restart replication of the leader index.
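The decision described in the steps above (resolve the reason and resume, or stop and start over after a long pause) can be sketched in Python. This is a minimal illustration, not an official tool. It assumes that you have already fetched the _status response as a dictionary and that you track the time when the index was paused yourself. The paused_since argument and the next_step function name are hypothetical, because the example _status output doesn't include a pause timestamp.

```python
from datetime import datetime, timedelta, timezone

# The 12-hour limit is taken from the note above.
MAX_PAUSE = timedelta(hours=12)


def next_step(status, paused_since, now):
    """Suggest a follow-up action for a parsed _status response.

    status       -- dict parsed from GET _plugins/_replication/<index_name>/_status
    paused_since -- when the index was paused (tracked by the caller; an assumption)
    now          -- current time, passed in so that the function is easy to test
    """
    if status.get("status") != "PAUSED":
        return "no pause-related action needed"
    if now - paused_since > MAX_PAUSE:
        return ("stop replication, delete the follower index, "
                "and restart replication")
    reason = status.get("reason", "unknown")
    return f"resolve the reason ({reason}), then resume replication"


if __name__ == "__main__":
    example = {
        "status": "PAUSED",
        "reason": "Paused by AWS due to burstable instance type",
    }
    current = datetime.now(timezone.utc)
    print(next_step(example, current - timedelta(hours=2), current))
```

Passing the current time as an argument keeps the check deterministic, which helps if you run it from a scheduled job or a test.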
Auto-follow failures
Auto-follow replication rules check the leader domain for new indices and replicate the indices that match a specified pattern.
- Check the state of the auto-follow replication rules with the following command:
GET _plugins/_replication/autofollow_stats
Example output:
{
  "num_success_start_replication" : 1,
  "num_failed_start_replication" : 0,
  "num_failed_leader_calls" : 0,
  "failed_indices" : [
    ".kibana_2",
    ".opendistro-reports-definitions",
    ".opendistro-reports-instances",
    ".kibana_3"
  ],
  "autofollow_stats" : [
    {
      "name" : "rule1",
      "pattern" : "*",
      "num_success_start_replication" : 1,
      "num_failed_start_replication" : 0,
      "num_failed_leader_calls" : 0,
      "failed_indices" : [
        ".kibana_2",
        ".opendistro-reports-definitions",
        ".opendistro-reports-instances",
        ".kibana_3"
      ],
      "last_execution_time" : 1679381247239
    }
  ]
}
- Check how long the replication task has been running:
GET _cat/tasks?v&actions=cluster:admin/plugins/replication/autofollow[c]&detailed
- Check the status of individual replication indices with the following command:
GET _plugins/_replication/<index_name>/_status?pretty
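To read the autofollow_stats output more quickly, you can group the failed indices by rule. The following Python sketch works on an already parsed response and makes two assumptions that the article doesn't state: index names that start with a dot are treated as likely system indices (a naming heuristic only), and last_execution_time is interpreted as milliseconds since the Unix epoch. The summarize_autofollow function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def summarize_autofollow(stats):
    """Return one summary dict per auto-follow rule in a parsed response."""
    summaries = []
    for rule in stats.get("autofollow_stats", []):
        failed = rule.get("failed_indices", [])
        last_run_ms = rule.get("last_execution_time")
        summaries.append({
            "rule": rule.get("name"),
            "pattern": rule.get("pattern"),
            # Heuristic (assumption): a leading dot suggests a system index,
            # which cross-cluster replication doesn't support.
            "dot_prefixed": [name for name in failed if name.startswith(".")],
            "other_failed": [name for name in failed if not name.startswith(".")],
            # Assumption: the timestamp is in epoch milliseconds.
            "last_run_utc": None if last_run_ms is None else
                (EPOCH + timedelta(milliseconds=last_run_ms)).isoformat(),
        })
    return summaries


if __name__ == "__main__":
    example = {
        "autofollow_stats": [{
            "name": "rule1",
            "pattern": "*",
            "failed_indices": [".kibana_2", ".kibana_3"],
            "last_execution_time": 1679381247239,
        }]
    }
    for summary in summarize_autofollow(example):
        print(summary)
```

In the example output above, every failed index starts with a dot, so the "other_failed" list would be empty. That result suggests that a broad pattern such as "*" is matching indices that can't be replicated.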
Auto-follow restarts
After you resolve the replication failure, follow these steps to delete and re-create the auto-follow rule.
- Get the list of the failed indices:
GET _cluster/state?pretty&filter_path=metadata.replication_metadata
Note: The output for "REPLICATION_LAST_KNOWN_OVERALL_STATE" should be "FAILED".
- Stop replication:
POST _plugins/_replication/<failed_index_name>/_stop
{}
- Delete the failed indices:
DELETE <failed_index_name>
- Delete the auto-follow rule:
DELETE _plugins/_replication/_autofollow
{
  "leader_alias": "<connection_alias>",
  "name": "<rule_name>"
}
- Re-create the auto-follow rule with your index pattern:
POST _plugins/_replication/_autofollow
{
  "leader_alias": "<connection_alias>",
  "name": "<rule_name>",
  "pattern": "<index_pattern>",
  "use_roles": {
    "leader_cluster_role": "<leader_cluster_role>",
    "follower_cluster_role": "<follower_cluster_role>"
  }
}
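If several indices failed, it can help to generate the restart requests from the cluster state instead of typing each one. The following Python sketch only builds the list of requests in the order of the steps above. It doesn't send anything. The function names are hypothetical. The search for failed entries assumes that each index's replication metadata is a JSON object keyed by the index name, and the exact nesting of the replication_metadata section isn't shown in this article, so verify it against your own output.

```python
def find_failed_indices(node, name=None):
    """Walk parsed cluster state and collect keys whose value reports FAILED.

    Assumption: the object that holds REPLICATION_LAST_KNOWN_OVERALL_STATE
    is keyed by the index name.
    """
    found = []
    if isinstance(node, dict):
        state = node.get("REPLICATION_LAST_KNOWN_OVERALL_STATE")
        if state == "FAILED" and name is not None:
            found.append(name)
        for key, value in node.items():
            found.extend(find_failed_indices(value, key))
    return found


def restart_plan(failed_indices, leader_alias, rule_name, pattern,
                 leader_role, follower_role):
    """Return (method, path, body) tuples in the order of the steps above."""
    steps = []
    for index in failed_indices:
        steps.append(("POST", f"_plugins/_replication/{index}/_stop", {}))
        steps.append(("DELETE", index, None))
    rule = {"leader_alias": leader_alias, "name": rule_name}
    steps.append(("DELETE", "_plugins/_replication/_autofollow", rule))
    steps.append(("POST", "_plugins/_replication/_autofollow", {
        **rule,
        "pattern": pattern,
        "use_roles": {
            "leader_cluster_role": leader_role,
            "follower_cluster_role": follower_role,
        },
    }))
    return steps


if __name__ == "__main__":
    # Hypothetical cluster-state shape, used for illustration only.
    state = {"metadata": {"replication_metadata": {"replication_details": {
        "idx-a": {"REPLICATION_LAST_KNOWN_OVERALL_STATE": "FAILED"},
        "idx-b": {"REPLICATION_LAST_KNOWN_OVERALL_STATE": "RUNNING"},
    }}}}
    for method, path, body in restart_plan(
            find_failed_indices(state), "connection1", "rule1", "*",
            "leader_role", "follower_role"):
        print(method, path, "" if body is None else body)
```

Review the generated list before you run any of the requests, because the DELETE steps remove follower indices.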
Check replication latency
High JVM memory pressure on the leader or follower domain can cause high replication latency, so first check JVM memory pressure on both domains. If the domain status is healthy, then check the LeaderCheckPoint and FollowerCheckPoint replication metrics to determine whether latency is increasing or static.
If the LeaderCheckPoint and FollowerCheckPoint metrics are healthy, then the IndexingRate might be too high for the follower domain to keep up. In that case, you can stop replication and then start it again so that the follower index goes through the bootstrap phase again, which can bring it up to date faster than the sync phase.
- Check the index replication status for the follower and leader domains:
GET _plugins/_replication/follower_stats?pretty
GET _plugins/_replication/leader_stats?pretty
Example output:
{
  "num_syncing_indices" : 1,
  "num_bootstrapping_indices" : 0,
  "num_paused_indices" : 1,
  "num_failed_indices" : 0,
  "num_shard_tasks" : 5,
  "num_index_tasks" : 1,
  "operations_written" : 4,
  "operations_read" : 4,
  "failed_read_requests" : 0,
  "throttled_read_requests" : 0,
  "failed_write_requests" : 0,
  "throttled_write_requests" : 0,
  "follower_checkpoint" : -1,
  "leader_checkpoint" : 2,
  "total_write_time_millis" : 855,
  "index_stats" : {
    "test-follower-index" : {
      "operations_written" : 4,
      "operations_read" : 4,
      "failed_read_requests" : 0,
      "throttled_read_requests" : 0,
      "failed_write_requests" : 0,
      "throttled_write_requests" : 0,
      "follower_checkpoint" : -1,
      "leader_checkpoint" : 2,
      "total_write_time_millis" : 855
    }
  }
}
- Check the shard size for the leader and follower domains with the following command:
GET _cat/shards?v
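To judge whether latency is increasing or static, you can compare the checkpoints from two follower_stats responses taken some time apart. The following Python sketch assumes that the difference between leader_checkpoint and follower_checkpoint is a reasonable rough indicator of how far a follower index is behind. That interpretation, and the function names, are assumptions and not documented behavior.

```python
def checkpoint_lag(follower_stats):
    """Map each follower index to leader_checkpoint minus follower_checkpoint."""
    return {
        name: stats["leader_checkpoint"] - stats["follower_checkpoint"]
        for name, stats in follower_stats.get("index_stats", {}).items()
    }


def lag_trend(earlier, later):
    """Compare two parsed follower_stats responses, index by index."""
    before = checkpoint_lag(earlier)
    after = checkpoint_lag(later)
    trend = {}
    for name, lag in after.items():
        if name not in before:
            trend[name] = "new"
        elif lag > before[name]:
            trend[name] = "increasing"
        elif lag < before[name]:
            trend[name] = "decreasing"
        else:
            trend[name] = "static"
    return trend


if __name__ == "__main__":
    # Checkpoint values for the first sample are taken from the example output above.
    first = {"index_stats": {"test-follower-index": {
        "follower_checkpoint": -1, "leader_checkpoint": 2}}}
    # The second sample is invented for illustration.
    second = {"index_stats": {"test-follower-index": {
        "follower_checkpoint": 1, "leader_checkpoint": 2}}}
    print(checkpoint_lag(first))
    print(lag_trend(first, second))
```

A lag that keeps increasing across several samples points to a follower domain that can't keep up with the leader's indexing rate. A static or decreasing lag suggests that the follower is catching up.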
Related information
Troubleshooting OpenSearch Service
Monitoring OpenSearch cluster metrics with Amazon CloudWatch