Inconsistent k-NN result counts between primary and replica shards in Amazon OpenSearch Service. Are there any mitigations besides segment replication?

We are observing inconsistent result counts for identical k-NN vector search queries in Amazon OpenSearch Service.

Issue

The same k-NN query, executed against the same index with the same query vector and filters, returns a different number of hits depending on whether the request is routed to the primary shard or a replica shard.

Using the preference parameter:

  • preference=_primary returns X results
  • preference=_replica returns Y results
  • without preference, the result count can be either X or Y depending on which shard copy serves the request

So the query is identical, but the hit count is not stable across shard copies.
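For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two requests can be built. The index name `my-vectors`, the field name `embedding`, and the helper `build_knn_search` are hypothetical; the sketch only constructs the request path and body and leaves sending (and request signing) to whatever HTTP client you already use.

```python
import json
from urllib.parse import urlencode


def build_knn_search(index, field, vector, k, preference=None):
    """Return (path, body) for a k-NN search, optionally pinned via `preference`."""
    path = f"/{index}/_search"
    if preference is not None:
        path += "?" + urlencode({"preference": preference})
    body = {
        "size": k,
        "_source": False,  # only the hit IDs are needed for the comparison
        "query": {"knn": {field: {"vector": list(vector), "k": k}}},
    }
    return path, json.dumps(body)


# Hypothetical index, field, and query vector, for illustration only.
for pref in ("_primary", "_replica"):
    path, body = build_knn_search(
        "my-vectors", "embedding", [0.1, 0.2, 0.3], k=10, preference=pref
    )
    print("GET", path)
    print(body)
```

Sending both requests with the same body and comparing the returned `_id` values shows whether the two shard copies disagree.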

Setup

  • Amazon OpenSearch Service (OpenSearch 3.3)
  • k-NN vector search using HNSW
  • same query executed repeatedly
  • same index and same data
  • no data changes between test runs
  • tested by explicitly routing to primary and replica with preference

What we found so far

Our current understanding is that this is related to how HNSW graphs are built and stored:

  • each shard copy builds its own HNSW graph independently
  • HNSW graph construction is non-deterministic
  • even with identical documents, the graph structure can differ slightly between primary and replicas
  • those differences can lead to different traversal paths during approximate nearest neighbor search
  • this can produce different result counts or slightly different top-k results

Why we think normal tuning may not fully solve it

We reviewed the usual HNSW parameters:

  • m
  • ef_search
  • ef_construction

Our understanding is that these settings can improve recall and reduce approximation error, but they do not make graph construction deterministic. So even with higher values, results may become closer, but not necessarily identical across primary and replica shards.
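For reference, this is roughly where those three parameters live. It is a sketch under assumptions: the field name `embedding`, the dimension, the `faiss` engine, and all numeric values are made up for illustration, and the query-level `method_parameters` option should be checked against the documentation for your engine version.

```python
import json

# Index-time parameters: m and ef_construction are part of the field mapping.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 3,
                "method": {
                    "name": "hnsw",
                    "engine": "faiss",
                    "space_type": "l2",
                    "parameters": {"m": 32, "ef_construction": 256},
                },
            }
        }
    },
}

# Query-time parameter: ef_search can be raised per request.
query_body = {
    "size": 10,
    "query": {
        "knn": {
            "embedding": {
                "vector": [0.1, 0.2, 0.3],
                "k": 10,
                "method_parameters": {"ef_search": 512},
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
print(json.dumps(query_body, indent=2))
```

Changing `m` or `ef_construction` only affects graphs built after the change, so an existing index would have to be reindexed to pick them up, while `ef_search` can be varied from one query to the next.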

Mitigation we are already aware of

The main solution we found is segment replication.

Our understanding is:

  • the primary builds the index segment, including the HNSW graph
  • replicas copy the same segment instead of building their own graph
  • this should make the graph structure identical across shard copies and produce consistent search results
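As a sketch of what enabling this looks like, the index would be created with the replication type set to `SEGMENT`. The replica count is a hypothetical value, and this assumes your domain and engine version allow the setting. As far as we know it is chosen when the index is created, so an existing index would need to be reindexed into a new one.

```python
import json

# Request body for creating an index (PUT /<index-name>); values are illustrative.
create_index_body = {
    "settings": {
        "index": {
            "knn": True,
            "number_of_replicas": 1,
            "replication.type": "SEGMENT",  # replicas copy segments from the primary
        }
    }
}

print(json.dumps(create_index_body, indent=2))
```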

For our use case, segment replication has trade-offs such as:

  • increased network usage
  • possible replication lag
  • higher load on the primary node

Because of that, we want to understand whether there are any other practical mitigations.

Question

We understand that segment replication can address this issue, but are there any other tuning options or recommended mitigations to reduce result inconsistency between primary and replica shards in Amazon OpenSearch Service?

asked 17 days ago, 47 views
1 Answer

Based on the current architecture of k-NN in Amazon OpenSearch Service, segment replication is the primary and most effective mitigation for ensuring consistent k-NN results between primary and replica shards.

Your analysis is correct: the inconsistency you're observing stems from the non-deterministic nature of HNSW graph construction. With document replication (the default), each shard copy indexes the same documents on its own, so its segments, and the HNSW graph built for each segment, can differ from those of the other copies. Even with identical data, the graph structure can therefore vary slightly, leading to different traversal paths and result counts during approximate nearest neighbor searches.

While tuning HNSW parameters like ef_search, ef_construction, and m can improve overall recall quality and reduce approximation error, these parameters do not make the graph construction process deterministic. Higher values may bring results closer together across shard copies, but they won't guarantee identical results when each copy builds its own independent graph.

Unfortunately, there are no other documented mitigations in Amazon OpenSearch Service that can guarantee result consistency across primary and replica shards for k-NN queries besides segment replication. The fundamental issue is that as long as each shard copy builds its own graph independently, some degree of variation is inherent to the approximate nature of HNSW.

If the trade-offs of segment replication (increased network usage, potential replication lag, and higher primary node load) are concerns for your use case, you would need to accept some level of result inconsistency as a characteristic of approximate nearest neighbor search with independent graph construction. However, segment replication remains the recommended approach when result consistency is a critical requirement for your application.
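If you do decide to live with some inconsistency, it helps to measure how large it actually is. Below is a minimal sketch that compares the hit IDs from two search responses, one obtained with preference=_primary and one with preference=_replica. The function names and the sample responses are made up for illustration; only the `hits.hits[]._id` layout of a search response is assumed.

```python
def hit_ids(response):
    """Extract document IDs from a search response, in ranked order."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


def compare_copies(primary_response, replica_response):
    """Summarize how the hits from two shard copies differ."""
    p_ids, r_ids = hit_ids(primary_response), hit_ids(replica_response)
    p_set, r_set = set(p_ids), set(r_ids)
    union = p_set | r_set
    return {
        "primary_count": len(p_ids),
        "replica_count": len(r_ids),
        "only_primary": sorted(p_set - r_set),
        "only_replica": sorted(r_set - p_set),
        # Jaccard overlap: 1.0 means both copies returned the same set of IDs.
        "jaccard": len(p_set & r_set) / len(union) if union else 1.0,
    }


# Fabricated sample responses, standing in for real search results.
primary = {"hits": {"hits": [{"_id": "a"}, {"_id": "b"}, {"_id": "c"}]}}
replica = {"hits": {"hits": [{"_id": "a"}, {"_id": "b"}, {"_id": "d"}, {"_id": "e"}]}}

report = compare_copies(primary, replica)
print(report)
```

Running this over a representative set of query vectors, before and after raising ef_search, gives a concrete number for how much the tuning narrows the gap.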
Sources
Amazon OpenSearch Service 101: How many shards do I need | AWS Big Data Blog
k-Nearest Neighbor (k-NN) search in Amazon OpenSearch Service - Amazon OpenSearch Service

answered 17 days ago
EXPERT
reviewed 17 days ago
