Large-scale Entity Resolution in Neptune: Feasibility of Deduplicating Company Records Using Vector Embeddings?

I'm working on deduplicating company records from multiple data sources using Neptune. The raw dataset is approximately 400M records. I'm considering an approach using vector embeddings to capture company attributes (like name, location, domain) for similarity matching, either as separate embeddings or combined. Key questions:

Can Neptune's graph + vector capabilities efficiently cluster/group similar companies at this scale? What advantages does Neptune's graph structure offer for entity resolution compared to a traditional vector database?

I know Neptune supports vector similarity search for individual queries, but I'm specifically interested in whether it can handle bulk clustering/matching operations across the entire dataset to identify groups of likely duplicate entities.

asked 10 months ago · 238 views
1 Answer

The use of vector embeddings for similarity search and entity resolution will largely depend on what models were used to create the embeddings. Neptune Analytics allows users to bring their own embeddings, so you will need to create the embeddings externally before storing them in your Neptune Analytics graph. Some users may choose to use a text embedding model like Amazon Titan to create embeddings from a node and the node's properties, as they deem items with similar properties a match. On the other end of the spectrum, users may want to train their own Graph Neural Network (GNN) and use the connections in their graph as a means to generate embeddings and determine similarity based on both attributes and connections in the graph.
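As an illustration of creating embeddings externally before loading them, here is a minimal Python sketch. The character-trigram hashing function is a toy stand-in for a real model such as Amazon Titan (an assumption for illustration only); it is just meant to show how name, location, and domain can be folded into the single vector that Neptune Analytics stores per node:

```python
import hashlib
import math

def toy_embedding(text, dim=64):
    """Toy character-trigram hashing embedding.

    A stand-in for a real embedding model (e.g. Amazon Titan);
    it only illustrates turning a property string into a
    fixed-length, unit-normalized vector.
    """
    vec = [0.0] * dim
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3]
        h = int(hashlib.md5(trigram.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def company_embedding(name, location, domain):
    # Concatenate the attributes into one string so a single vector
    # (Neptune Analytics stores one embedding per node) covers them all.
    return toy_embedding(f"{name} {location} {domain}")

def cosine(a, b):
    # Vectors are unit-normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))
```

With a real model the same shape applies: embed the concatenated (or templated) properties per node, then bulk-load the vectors alongside the nodes.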

Having the vector embeddings in the graph allows users to perform graph queries when they find items with similar embeddings. This would allow you to perform a traversal across connections to deterministically see how two or more items are actually connected in the graph. You can think of this as a means of explainability.
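As a sketch, once a vector search has surfaced two candidate duplicates, a path query along these lines (the node ids are hypothetical) makes their connection explicit:

```opencypher
// Inspect how two candidate duplicates are actually connected,
// up to three hops apart (ids here are placeholders):
MATCH p = (a)-[*1..3]-(b)
WHERE id(a) = 'company-123' AND id(b) = 'company-456'
RETURN p
LIMIT 5
```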

Besides vector embeddings, Neptune Analytics also offers a series of similarity algorithms that can also be used to determine similarity between nodes/entities in a graph: https://docs.aws.amazon.com/neptune-analytics/latest/userguide/similarity-algorithms.html

*** UPDATE Jan 15th ***

Given this scale, would you recommend using Neptune's built-in similarity algorithms instead of vector-based approaches? Or perhaps a hybrid approach?

It depends on what connections you've made in your dataset and how those translate into edges stored in the graph. For example, if you look at how the Jaccard algorithm works in Neptune Analytics, you'll note that the score is driven by how nodes are connected, specifically the neighbors each node has and which of those neighbors are unique to it. This could be a good complement to your current use of text embeddings, since those embeddings (at least, from your description) don't take into account how entities are connected/related in your dataset.
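The neighbor-set scoring idea behind Jaccard is easy to state in a few lines of Python. Here the neighbor identifiers (shared domain and address nodes) are illustrative assumptions, not anything from the Neptune API:

```python
def jaccard(neighbors_a, neighbors_b):
    """Jaccard similarity over two nodes' neighbor sets:
    |intersection| / |union|. The same idea underlies the
    Neptune Analytics Jaccard algorithm's scoring."""
    a, b = set(neighbors_a), set(neighbors_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Two company records that both link to the same domain node and the same address node, but come from different sources, would score 2/4 = 0.5, regardless of how their name strings compare.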

Is it true that Neptune only supports one vector index per graph? This could be limiting as we'd ideally want separate indices for different attributes (company name, location, domain).

Yes, Neptune Analytics only supports one embedding per node. If you only wanted to look at similar nodes based on a label of "Company", you could execute a topK vector similarity search query [1] of:

CALL neptune.algo.vectors.topKByEmbedding(
  [0.1, 0.2, 0.3, ...],
  {
    topK: 100,
    concurrency: 1
  }
)
YIELD embedding, node, score
WHERE 'Company' IN labels(node)
RETURN embedding, node, score
LIMIT 10

You may have to experiment with the topK value to determine how many nodes must be returned before enough of them carry the Company label. We know this isn't ideal and hope to offer a better method for this in the future.
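That over-fetch-and-filter experiment can be driven client-side. In the sketch below, `topk_by_embedding_stub` is a hypothetical stand-in (not a Neptune API) for executing the topKByEmbedding query through your client; the retry loop is the part that carries over:

```python
def topk_by_embedding_stub(embedding, top_k):
    """Hypothetical stand-in for running the topKByEmbedding query
    via a Neptune client; returns (node_id, labels, score) tuples
    from a canned corpus so the loop below is runnable."""
    corpus = [
        ("c1", ["Company"], 0.97), ("p1", ["Person"], 0.95),
        ("c2", ["Company"], 0.91), ("p2", ["Person"], 0.88),
        ("c3", ["Company"], 0.83), ("c4", ["Company"], 0.80),
    ]
    return corpus[:top_k]

def top_companies(embedding, want=3, top_k=4, max_k=64):
    """Grow topK until enough 'Company'-labelled hits come back,
    since the vector index itself is not filtered by label."""
    while True:
        hits = [r for r in topk_by_embedding_stub(embedding, top_k)
                if "Company" in r[1]]
        if len(hits) >= want or top_k >= max_k:
            return hits[:want]
        top_k *= 2  # over-fetch and retry with a larger topK
```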

How would you approach the clustering/grouping aspect at this scale?

Hard to say, as entity resolution workflows are heavily dependent on the heuristics of your data and use case. You have to know which attributes will provide the best matches. It takes experimentation: take records you already know are matches, see what scores the algorithms or embeddings produce for them, and use those test cases as ground truth to develop the best method for your use case.
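To make the grouping step concrete, here is a minimal client-side sketch of turning pairwise match decisions into clusters with union-find. The `similar` function and threshold are placeholders for whatever scoring you settle on, and at 400M records you would block/bucket candidates first (e.g. by domain, or via the vector index) rather than compare all pairs:

```python
import itertools

def cluster_pairs(records, similar, threshold):
    """Group records whose pairwise similarity clears a threshold,
    using union-find so that matches chain into clusters
    (A~B and B~C puts A, B, C in one group)."""
    parent = {k: k for k in records}

    def find(x):
        # Find the cluster root, with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Toy all-pairs loop; a real pipeline iterates only within blocks.
    for a, b in itertools.combinations(records, 2):
        if similar(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for k in records:
        clusters.setdefault(find(k), set()).add(k)
    return list(clusters.values())
```

Each resulting cluster is a set of record ids considered one entity; the ground-truth test cases mentioned above are what you would use to tune `similar` and `threshold`.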

[1] https://docs.aws.amazon.com/neptune-analytics/latest/userguide/vectors-topKByEmbedding.html

AWS
answered 10 months ago
  • Thanks for the detailed explanation. Let me provide more context about our approach: We've been testing with generic models like all-MiniLM and text-embedding-ada for initial proof of concept. These worked well for simple similarity searches (e.g., finding similar company names), suggesting that even basic embeddings can capture enough semantic meaning for initial matching.

    However, our scale is significant - we need to process over 400M firms and resolve them to a much smaller number of unique entities. I have a few questions:

    Given this scale, would you recommend using Neptune's built-in similarity algorithms instead of vector-based approaches? Or perhaps a hybrid approach? Is it true that Neptune only supports one vector index per graph? This could be limiting as we'd ideally want separate indices for different attributes (company name, location, domain). How would you approach the clustering/grouping aspect at this scale?
