The use of vector embeddings for similarity search and entity resolution will largely depend on what models were used to create the embeddings. Neptune Analytics allows users to bring their own embeddings, so you will need to create the embeddings externally before storing them in your Neptune Analytics graph. Some users choose a text embedding model like Amazon Titan to create embeddings from a node and its properties, on the premise that items with similar properties are likely matches. On the other end of the spectrum, users may want to train their own Graph Neural Network (GNN) and use the connections in their graph to generate embeddings, so that similarity reflects both attributes and connections in the graph.
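As a minimal sketch of the first approach, assuming the boto3 bedrock-runtime and neptune-graph clients, a hypothetical model ID and graph identifier, and the neptune.algo.vectors.upsert procedure (check the Neptune Analytics vector documentation for the exact signature), the external create-then-store flow could look roughly like this:

import json
import boto3

# Hypothetical identifiers; replace with your own Bedrock model ID and Neptune Analytics graph ID.
MODEL_ID = "amazon.titan-embed-text-v2:0"
GRAPH_ID = "g-xxxxxxxxxx"

bedrock = boto3.client("bedrock-runtime")
neptune = boto3.client("neptune-graph")

def embed_company(name, city, domain):
    # Concatenate the node's properties into one string and embed it with Titan.
    resp = bedrock.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps({"inputText": f"{name} | {city} | {domain}"}),
    )
    return json.loads(resp["body"].read())["embedding"]

def upsert_embedding(node_id, embedding):
    # Attach the externally created embedding to an existing node in the graph.
    neptune.execute_query(
        graphIdentifier=GRAPH_ID,
        queryString="CALL neptune.algo.vectors.upsert($nodeId, $embedding)",
        parameters={"nodeId": node_id, "embedding": embedding},
        language="OPEN_CYPHER",
    )

upsert_embedding("company-42", embed_company("Acme Corp", "Seattle", "acme.example"))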
Having the vector embeddings in the graph allows users to perform graph queries when they find items with similar embeddings. This would allow you to perform a traversal across connections to deterministically see how two or more items are actually connected in the graph. You can think of this as a means of explainability.
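As a rough illustration of that follow-up traversal, once a vector search flags two candidates as similar you can run a bounded variable-length match between them (hypothetical graph identifier and node IDs; the query itself is plain openCypher):

import boto3

GRAPH_ID = "g-xxxxxxxxxx"  # hypothetical graph identifier
neptune = boto3.client("neptune-graph")

# Given two candidates that a vector search flagged as similar, return a few
# connecting paths (up to 3 hops) so a reviewer can see why the graph relates them.
query = """
MATCH p = (a)-[*1..3]-(b)
WHERE id(a) = $idA AND id(b) = $idB
RETURN p
LIMIT 5
"""

resp = neptune.execute_query(
    graphIdentifier=GRAPH_ID,
    queryString=query,
    parameters={"idA": "company-42", "idB": "company-317"},
    language="OPEN_CYPHER",
)
print(resp["payload"].read().decode())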
Besides vector embeddings, Neptune Analytics offers a set of similarity algorithms that can also be used to determine similarity between nodes/entities in a graph: https://docs.aws.amazon.com/neptune-analytics/latest/userguide/similarity-algorithms.html
*** UPDATE Jan 15th ***
Given this scale, would you recommend using Neptune's built-in similarity algorithms instead of vector-based approaches? Or perhaps a hybrid approach?
It depends on what connections you've made in your dataset and how those translate into edges that are stored in the graph. For example, if you look at how the Jaccard algo works in NA, you'll note that how things are connected and the neighbors of a node (including which ones are unique neighbors) drive the scoring created by the algo. This could be a good complement to your current use of text embeddings as those embeddings are not necessarily (at least, from your description) taking into account how entities are connected/related in your dataset.
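To illustrate the intuition only (this is not the exact Neptune Analytics procedure), Jaccard scoring over neighbor sets works like this:

# Jaccard over neighbor sets: |shared neighbors| / |all distinct neighbors|.
def jaccard(neighbors_a: set, neighbors_b: set) -> float:
    union = neighbors_a | neighbors_b
    return len(neighbors_a & neighbors_b) / len(union) if union else 0.0

# Two firm nodes that share most of their neighbors (address, officer, domain nodes)
# score high even if their name embeddings are only loosely similar.
acme_inc = {"addr:100-main-st", "officer:j-doe", "domain:acme.example"}
acme_llc = {"addr:100-main-st", "officer:j-doe", "domain:acme-corp.example"}
print(jaccard(acme_inc, acme_llc))  # 0.5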
Is it true that Neptune only supports one vector index per graph? This could be limiting as we'd ideally want separate indices for different attributes (company name, location, domain).
Yes, Neptune Analytics supports a single vector index per graph, which means one embedding per node. If you only wanted to look at similar nodes based on a label of "Company", you could execute a topK vector similarity search query [1] of:
CALL neptune.algo.vectors.topKByEmbedding(
  [0.1, 0.2, 0.3, ...],
  {
    topK: 100,
    concurrency: 1
  }
)
YIELD embedding, node, score
WHERE 'Company' IN labels(node)
RETURN embedding, node, score
LIMIT 10
You may have to experiment with the topK value to determine how many nodes need to be returned before you find enough nodes with a Company label. We know this isn't ideal and hope to have a better method for this in the future.
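One way to run that experiment, sketched below under the assumption that you post-filter on the label as in the query above (hypothetical graph identifier; verify the YIELD fields and the "results" key in the response payload against the documentation), is to keep widening topK until enough Company nodes survive the filter:

import json
import boto3

GRAPH_ID = "g-xxxxxxxxxx"  # hypothetical graph identifier
neptune = boto3.client("neptune-graph")

def top_companies(embedding, wanted=10, start_k=100, max_k=6400):
    # Widen topK until enough Company-labeled nodes survive the post-filter.
    k, rows = start_k, []
    while k <= max_k:
        query = f"""
        CALL neptune.algo.vectors.topKByEmbedding($embedding, {{topK: {k}, concurrency: 1}})
        YIELD node, score
        WHERE 'Company' IN labels(node)
        RETURN id(node) AS id, score
        LIMIT {wanted}
        """
        resp = neptune.execute_query(
            graphIdentifier=GRAPH_ID,
            queryString=query,
            parameters={"embedding": embedding},
            language="OPEN_CYPHER",
        )
        rows = json.loads(resp["payload"].read())["results"]
        if len(rows) >= wanted:
            break
        k *= 2
    return rows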
How would you approach the clustering/grouping aspect at this scale?
Hard to say, as entity resolution workflows depend heavily on the heuristics of your data and use case. You have to know which attributes will provide the best matches. It takes experimentation: take data that you know are matches, see what scores the algorithms or embeddings produce for them, and use those test cases as ground truth to develop the best method for your use case.
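As a sketch of that evaluation loop, with made-up pair scores, you could measure precision and recall of any scoring method (embedding distance, Jaccard, etc.) against a hand-labeled set of known matches:

# Score candidate pairs with whatever method you're testing (embedding cosine,
# Jaccard, etc.), then measure precision/recall against hand-labeled matches.
def evaluate(scored_pairs, ground_truth, threshold):
    predicted = {pair for pair, score in scored_pairs.items() if score >= threshold}
    tp = len(predicted & ground_truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

scores = {("acme-inc", "acme-llc"): 0.92, ("acme-inc", "apex-ltd"): 0.71}
truth = {("acme-inc", "acme-llc")}
print(evaluate(scores, truth, threshold=0.8))  # (1.0, 1.0)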
[1] https://docs.aws.amazon.com/neptune-analytics/latest/userguide/vectors-topKByEmbedding.html

Thanks for the detailed explanation. Let me provide more context about our approach: We've been testing with generic models like all-MiniLM and text-embedding-ada for initial proof of concept. These worked well for simple similarity searches (e.g., finding similar company names), suggesting that even basic embeddings can capture enough semantic meaning for initial matching.
However, our scale is significant - we need to process over 400M firms and resolve them to a much smaller number of unique entities. I have a few questions:
- Given this scale, would you recommend using Neptune's built-in similarity algorithms instead of vector-based approaches? Or perhaps a hybrid approach?
- Is it true that Neptune only supports one vector index per graph? This could be limiting as we'd ideally want separate indices for different attributes (company name, location, domain).
- How would you approach the clustering/grouping aspect at this scale?