How can I troubleshoot cross-cluster replication failures on my Amazon OpenSearch Service cluster?


My cross-cluster replication isn’t working on my Amazon OpenSearch Service cluster.

Description

You can set up a cross-cluster connection to replicate indexes from one domain to another. Before you begin, make sure that you adhere to the limitations, prerequisites, and permissions requirements.
Note: Cross-cluster replication doesn't work with data streams. For more information, see data streams on the OpenSearch website.

Resolution

Follow these troubleshooting steps for your use case.
Note: If you activated OpenSearch Service error logs, you can get additional troubleshooting information. For more information, see Viewing OpenSearch Service error logs.

Check the replication task status

  1. Check the progress of the bootstrapping state using the following command:

    GET _cat/recovery?active_only=true

    Example output:

    index               shard time type     stage source_host source_node target_host target_node            repository                          snapshot                    files files_recovered files_percent files_total bytes   bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
    test-follower-index 0     1.8s snapshot index n/a         n/a         x.x.x.x     d76fd4d86d2307b6xxxxxx replication-remote-repo-connection1 replication-remote-snapshot 24    24              100.0%        24          1596356 1596356         100.0%        1596356     0            0                      100.0%
    test-follower-index 1     2.8s snapshot index n/a         n/a         x.x.x.x     9ab1495309b8e53axxxxxx replication-remote-repo-connection1 replication-remote-snapshot 24    24              100.0%        24          1596356 1596356         100.0%        1596356     0            0                      100.0%
    test-follower-index 2     1.8s snapshot index n/a         n/a         x.x.x.x     d76fd4d86d2307b6xxxxxx replication-remote-repo-connection1 replication-remote-snapshot 24    24              100.0%        24          1596356 1596356         100.0%        1596356     0            0                      100.0%
    test-follower-index 3     2.9s snapshot index n/a         n/a         x.x.x.x     d76fd4d86d2307b6xxxxxx replication-remote-repo-connection1 replication-remote-snapshot 24    24              100.0%        24          1596356 1596356         100.0%        1596356     0            0                      100.0%
    test-follower-index 4     2.7s snapshot index n/a         n/a         x.x.x.x     9ab1495309b8e53xxxxxxx replication-remote-repo-connection1 replication-remote-snapshot 24    24              100.0%        24          1596356 1596356         100.0%        1596356     0            0                      100.0%
    

    Note: Cross-cluster replication doesn't support replication of system indices. For more information, see Cross-cluster replication limitations.

  2. If the index replication state is syncing, then check the replication status:

    GET _plugins/_replication/<index_name>/_status?pretty

    Example output:

    {
      "status" : "PAUSED",
      "reason" : "Paused by AWS due to burstable instance type",
      "leader_alias" : "connection1",
      "leader_index" : "test-leader-index",
      "follower_index" : "test-follower-index"
    }
  3. In the replication status output, note the "reason" field. Take the action that it calls for to resolve the replication failure, and then resume replication.

  4. (Optional) To remediate issues or reduce load on the leader domain, temporarily pause replication and resume it later. To end replication for the index, use the stop command:

    POST _plugins/_replication/<index_name>/_pause
    {}
    
    POST _plugins/_replication/<index_name>/_resume
    {}
    
    POST _plugins/_replication/<index_name>/_stop
    {}

    Note: You can't resume replication after it's been paused for more than 12 hours. Instead, you must stop replication, delete the follower index, and restart replication of the leader index.
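The decision in steps 2 through 4 can be sketched in Python. This is a minimal illustration, not an official tool: the function name recommend_action and the hours_paused argument are hypothetical (the _status response doesn't report how long an index has been paused, so you must track that yourself), and the sketch assumes that the status values SYNCING, BOOTSTRAPPING, and PAUSED are the ones you need to distinguish.

```python
def recommend_action(status_response, hours_paused=None):
    """Map a replication _status response (a dict) to a suggested next step.

    hours_paused is a hypothetical input that the caller tracks separately.
    """
    status = status_response.get("status", "")
    reason = status_response.get("reason", "no reason reported")
    index = status_response.get("follower_index", "<index_name>")

    if status == "SYNCING":
        return "No action: replication is running."
    if status == "BOOTSTRAPPING":
        return "Wait: check progress with GET _cat/recovery?active_only=true"
    if status == "PAUSED":
        # A pause longer than 12 hours can't be resumed (see the note in step 4).
        if hours_paused is not None and hours_paused > 12:
            return (f"Stop replication, delete {index}, and restart replication "
                    f"(reason: {reason})")
        return (f"Fix the cause ({reason}), then "
                f"POST _plugins/_replication/{index}/_resume")
    # Any other status is left for manual inspection.
    return f"Inspect manually: status={status!r}, reason={reason}"


# Example response copied from step 2.
example = {
    "status": "PAUSED",
    "reason": "Paused by AWS due to burstable instance type",
    "leader_alias": "connection1",
    "leader_index": "test-leader-index",
    "follower_index": "test-follower-index",
}
print(recommend_action(example))
```

Pass the parsed JSON body of the _status call as status_response. The function only suggests a step and sends no requests.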

Auto-follow failures

Auto-follow replication rules check the leader domain for new indices and replicate the indices that match a specified pattern.

  1. Check the state of the auto-follow replication rules with the following command:

    GET _plugins/_replication/autofollow_stats

    Example output:

    {
      "num_success_start_replication" : 1,
      "num_failed_start_replication" : 0,
      "num_failed_leader_calls" : 0,
      "failed_indices" : [
        ".kibana_2",
        ".opendistro-reports-definitions",
        ".opendistro-reports-instances",
        ".kibana_3"
      ],
      "autofollow_stats" : [
        {
          "name" : "rule1",
          "pattern" : "*",
          "num_success_start_replication" : 1,
          "num_failed_start_replication" : 0,
          "num_failed_leader_calls" : 0,
          "failed_indices" : [
            ".kibana_2",
            ".opendistro-reports-definitions",
            ".opendistro-reports-instances",
            ".kibana_3"
          ],
          "last_execution_time" : 1679381247239
        }
      ]
    }
  2. Check how long the replication task has been running:

    GET _cat/tasks?v&actions=cluster:admin/plugins/replication/autofollow[c]&detailed
  3. Check the status of individual replication indices with the following command:

    GET _plugins/_replication/<index_name>/_status?pretty
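To read the autofollow_stats output per rule, a short Python sketch such as the following might help. The function name summarize_autofollow is hypothetical, and treating every index name that starts with a dot as a system index is an assumption made for illustration. Because cross-cluster replication doesn't support system indices, those entries are expected failures, and any remaining names are the ones to investigate.

```python
def summarize_autofollow(stats):
    """Group failed indices per auto-follow rule from an autofollow_stats response."""
    summary = {}
    for rule in stats.get("autofollow_stats", []):
        failed = rule.get("failed_indices", [])
        summary[rule["name"]] = {
            "pattern": rule.get("pattern"),
            "failed_starts": rule.get("num_failed_start_replication", 0),
            # Assumption: a leading dot marks a system index.
            "system_indices": sorted(i for i in failed if i.startswith(".")),
            "other_failed": sorted(i for i in failed if not i.startswith(".")),
        }
    return summary


# Trimmed copy of the example output from step 1.
stats = {
    "autofollow_stats": [
        {
            "name": "rule1",
            "pattern": "*",
            "num_failed_start_replication": 0,
            "failed_indices": [
                ".kibana_2",
                ".opendistro-reports-definitions",
                ".opendistro-reports-instances",
                ".kibana_3",
            ],
        }
    ]
}
summary = summarize_autofollow(stats)
print(summary["rule1"]["other_failed"])
```

In the example from step 1, every failed index is dot-prefixed, so other_failed is empty. A broad pattern such as "*" matches system indices, which is why they appear in failed_indices.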

Auto-follow restarts

After you resolve the replication failure, follow these steps to delete and re-create the auto-follow rule.

  1. Get the list of the failed indices:

    GET _cluster/state?pretty&filter_path=metadata.replication_metadata

    Note: For each failed index, the value of "REPLICATION_LAST_KNOWN_OVERALL_STATE" in the output is "FAILED".

  2. Stop replication:

    POST _plugins/_replication/<failed_index_name>/_stop
    {}
  3. Delete the failed indices:

    DELETE <failed_index_name>
  4. Delete the auto-follow rule:

    DELETE _plugins/_replication/_autofollow
    {
        "leader_alias": "<connection_alias>",
        "name": "<rule_name>"
    }
  5. Re-create the auto-follow rule with your index pattern:

    POST _plugins/_replication/_autofollow
    {
        "leader_alias": "<connection_alias>",
        "name": "<rule_name>",
        "pattern": "<index_pattern>",
        "use_roles": {
            "leader_cluster_role": "<leader_cluster_role>",
            "follower_cluster_role": "<follower_cluster_role>"
        }
    }
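When several indices have failed, it can help to generate the requests for steps 2 through 5 in order. The following Python sketch only builds the request list and sends nothing. The function name restart_requests is hypothetical, the index name, alias, and rule name are taken from the examples in this article, and the remaining values are placeholders that you must replace.

```python
import json


def restart_requests(failed_indices, connection_alias, rule_name,
                     index_pattern, leader_role, follower_role):
    """Return (method, path, body) tuples in the order of steps 2 to 5."""
    rule_id = {"leader_alias": connection_alias, "name": rule_name}
    requests = []
    # Step 2: stop replication for every failed index.
    for index in failed_indices:
        requests.append(("POST", f"_plugins/_replication/{index}/_stop", {}))
    # Step 3: delete the failed indices.
    for index in failed_indices:
        requests.append(("DELETE", index, None))
    # Step 4: delete the auto-follow rule.
    requests.append(("DELETE", "_plugins/_replication/_autofollow", rule_id))
    # Step 5: re-create the rule with the index pattern and roles.
    new_rule = dict(rule_id, pattern=index_pattern, use_roles={
        "leader_cluster_role": leader_role,
        "follower_cluster_role": follower_role,
    })
    requests.append(("POST", "_plugins/_replication/_autofollow", new_rule))
    return requests


plan = restart_requests(
    failed_indices=["test-follower-index"],
    connection_alias="connection1",
    rule_name="rule1",
    index_pattern="<index_pattern>",         # placeholder
    leader_role="<leader_cluster_role>",     # placeholder
    follower_role="<follower_cluster_role>"  # placeholder
)
for method, path, body in plan:
    print(method, path)
    if body is not None:
        print(json.dumps(body, indent=2))
```

The printed output uses the same console format as the commands above, so you can review it before you run the requests against the follower domain.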

Check replication latency

To investigate high replication latency, first check for high JVM memory pressure on the leader and follower domains. If the domain status is healthy, then check the LeaderCheckPoint and FollowerCheckPoint replication metrics to determine whether latency is increasing or static.

If the LeaderCheckPoint and FollowerCheckPoint metrics are healthy, then the IndexingRate might be too high for the follower domain. In that case, you can stop and restart replication so that the follower index bootstraps again, which can be faster than catching up in the syncing phase.
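One way to judge whether latency is increasing or static is to compare the gap between the two checkpoint metrics over time. The following Python sketch assumes that you already exported equally spaced datapoints for LeaderCheckPoint and FollowerCheckPoint. The function name lag_trend, the tolerance parameter, and the sample numbers are all hypothetical and aren't real measurements.

```python
def lag_trend(leader_checkpoints, follower_checkpoints, tolerance=0):
    """Classify the checkpoint gap as increasing, decreasing, or static.

    Both inputs are lists of datapoints sampled at the same timestamps.
    """
    gaps = [leader - follower
            for leader, follower in zip(leader_checkpoints, follower_checkpoints)]
    if len(gaps) < 2:
        return "unknown"  # a trend needs at least two datapoints
    change = gaps[-1] - gaps[0]
    if change > tolerance:
        return "increasing"
    if change < -tolerance:
        return "decreasing"
    return "static"


# Made-up datapoints: the follower falls further behind at every sample.
print(lag_trend([100, 200, 300], [90, 150, 200]))
```

A gap that keeps growing suggests that the follower can't keep up with the leader's indexing rate. A constant gap points to a fixed delay instead.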

  1. Check the indices replication status for the follower and leader domains:

    GET _plugins/_replication/follower_stats?pretty
    
    GET _plugins/_replication/leader_stats?pretty
    

    Example output from follower_stats:

    {
      "num_syncing_indices" : 1,
      "num_bootstrapping_indices" : 0,
      "num_paused_indices" : 1,
      "num_failed_indices" : 0,
      "num_shard_tasks" : 5,
      "num_index_tasks" : 1,
      "operations_written" : 4,
      "operations_read" : 4,
      "failed_read_requests" : 0,
      "throttled_read_requests" : 0,
      "failed_write_requests" : 0,
      "throttled_write_requests" : 0,
      "follower_checkpoint" : -1,
      "leader_checkpoint" : 2,
      "total_write_time_millis" : 855,
      "index_stats" : {
        "test-follower-index" : {
          "operations_written" : 4,
          "operations_read" : 4,
          "failed_read_requests" : 0,
          "throttled_read_requests" : 0,
          "failed_write_requests" : 0,
          "throttled_write_requests" : 0,
          "follower_checkpoint" : -1,
          "leader_checkpoint" : 2,
          "total_write_time_millis" : 855
        }
      }
    }
  2. Check the shard size for the leader and follower domains with the following command:

    GET _cat/shards?v
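For a per-index view, the follower_stats response from step 1 can be reduced to a checkpoint gap and a count of failed or throttled requests. This Python sketch assumes that the difference between leader_checkpoint and follower_checkpoint is a reasonable proxy for how far each follower index is behind. The function name index_lag_report is hypothetical.

```python
def index_lag_report(follower_stats):
    """Return the checkpoint gap and request problems for each follower index."""
    report = {}
    for name, stats in follower_stats.get("index_stats", {}).items():
        report[name] = {
            "checkpoint_gap": stats["leader_checkpoint"] - stats["follower_checkpoint"],
            "failed_requests": (stats.get("failed_read_requests", 0)
                                + stats.get("failed_write_requests", 0)),
            "throttled_requests": (stats.get("throttled_read_requests", 0)
                                   + stats.get("throttled_write_requests", 0)),
        }
    return report


# Trimmed copy of the example follower_stats output from step 1.
follower_stats = {
    "index_stats": {
        "test-follower-index": {
            "failed_read_requests": 0,
            "throttled_read_requests": 0,
            "failed_write_requests": 0,
            "throttled_write_requests": 0,
            "follower_checkpoint": -1,
            "leader_checkpoint": 2,
        }
    }
}
report = index_lag_report(follower_stats)
print(report["test-follower-index"]["checkpoint_gap"])  # prints 3
```

Indices with a large gap or nonzero throttled counts are the first candidates to compare against the shard sizes from step 2.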

Related information

Troubleshooting OpenSearch Service

Monitoring OpenSearch cluster metrics with Amazon CloudWatch

AWS OFFICIAL
Updated 8 months ago