The biggest immediately useful difference I see is that Annoy uses read-only index files (from the docs: "you can not add more items once the tree has been created" [0]), while Voyager lets you call `.add_item` at any time (I just pip installed it to double check and yes -- it's great).
The great thing about Annoy is that you can write the index to disk and thus do big data work on tiny workers at the edge. I've never really seen anything else do the same.
Oh yeah, Annoy is definitely mmap'ed, i.e. you can use the index without loading the index file into memory. And that's very useful.
As far as I can see, Voyager requires you to load the index into memory and doesn't (yet?) do mmap. Which... would make sense, since you can change the data after loading the index. So, Voyager index files are fully loaded into memory...? Do I have this right?
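For reference, the Annoy side of that comparison looks roughly like this (a minimal sketch following the Annoy docs; the dimensionality and tree count are arbitrary):

```python
from annoy import AnnoyIndex
import random

f = 40                              # vector dimensionality
t = AnnoyIndex(f, 'angular')
for i in range(1000):
    t.add_item(i, [random.gauss(0, 1) for _ in range(f)])
t.build(10)                         # after build() the index is read-only
t.save('vectors.ann')               # written to disk

u = AnnoyIndex(f, 'angular')
u.load('vectors.ann')               # mmaps the file rather than reading it all into RAM
print(u.get_nns_by_item(0, 10))     # 10 approximate nearest neighbours of item 0
```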
That sounds pretty much like binary space partitioning [1], which is probably best known for being used in Doom to accelerate rendering. Also, if you only search the leaf subspace that your query point lies in, you can in principle end up arbitrarily far off from the true nearest neighbor.
k-D trees and BSPs don't work well in high dimensions (say, above 20-30 dims), since your true nearest neighbor is quite likely to end up on the far side of some dividing hyperplane.
If each dividing hyperplane separates a set of vectors into two roughly equal parts, then for N vectors you need on the order of log2(N) separating planes to get down to a single vector per terminal node. But take, say, a billion vectors in a 100-dimensional space, with something like a k-D tree that partitions along one dimension at each level: log2(10^9) ~= 30, so you have only partitioned along about 30 dimensions, and there are 70 other dimensions left through which your true nearest neighbor can be reached, dimensions you never check if you discard subspaces along the way. That is the curse of dimensionality: in high dimensions everything is "near" everything else.
This is a problem for other approximate methods as well, though. IVF methods still do a form of partitioning (via Voronoi cells), but the partition comes from clustering the data, so it is attuned to the global structure of the data rather than produced by greedy divide and conquer. It is still possible that you need to check every IVF cell to find your true nearest neighbor; it is just less likely to happen in practice because the partitioning follows the overall structure of the data.
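For concreteness, this is roughly what the IVF flow looks like with Faiss (a sketch; the nlist and nprobe values are just illustrative knobs). The training step runs k-means on the data, which is what makes the partition data-dependent, and nprobe controls how many cells you are willing to check at query time:

```python
import numpy as np
import faiss

d, nlist = 128, 1024                        # dimensionality, number of Voronoi cells
xb = np.random.rand(100_000, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)            # assigns vectors and queries to cells
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                             # k-means on the data -> data-dependent partition
index.add(xb)

index.nprobe = 16                           # search 16 cells instead of 1: better recall, slower
D, I = index.search(xb[:5], 10)             # distances and ids of the 10 nearest neighbours
```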
I wonder if nearest neighbor is actually what you want to begin with. Imagine points distributed in a U-shape: the nearest neighbor might require jumping across the gap, while a better match could be a bit further away but somewhere along the shape. Especially if the dimensions do not have a meaningful intrinsic scale, I could just scale the x dimension and pull the two sides of the U apart. So maybe it would actually be more useful to look at the topology of the points, something like a Vietoris–Rips complex [1] but invariant under [local] rescaling of dimensions. Not sure if something like that exists or is possible in a robust and efficient way.
I think you would prefer something like a support vector machine over k-NN for better learned point-wise local rescaling (you could also use e.g. ISOMAP or MDS to try to linearize the manifold), which you could also combine with various TDA methods (see the giotto-tda GitHub repo's Jupyter notebooks for some similar examples).
Funny, I got interested in nearest-neighbor search for this kind of vector in the early 2000s, and there were a lot of papers on the subject that were depressing (they had to make a tradeoff that wasn't really attractive) compared to the literature on low-dimensional indexes. Back then I had 250,000 or so vectors and a full scan wasn't that bad.
In the 2010s I worked on a dense vector search engine for patents; we had maybe 10 million vectors and still used a scan. I mean, you couldn't ask for a more memory-hierarchy- and SIMD-friendly algorithm.
Today I am looking at 1M (larger) vectors and the full scan is still possible but I am using FAISS because it is a bird in the hand and I decided I can live with the tradeoff.
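For anyone wondering, the full scan really is only a few lines of NumPy (a sketch, assuming the vectors fit in memory as a float32 matrix):

```python
import numpy as np

def full_scan(xb, q, k=10):
    """Exact k-NN: xb is an (N, d) float32 matrix, q a (d,) query vector."""
    d2 = ((xb - q) ** 2).sum(axis=1)       # squared L2 distance to every vector, one linear pass
    top = np.argpartition(d2, k)[:k]       # indices of the k smallest, unordered
    return top[np.argsort(d2[top])]        # ...sorted by distance
```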
If you want an easy way to evaluate Faiss, Hnswlib and Annoy vector backends, check out txtai - https://github.com/neuml/txtai. txtai also supports NumPy and PyTorch vector storage.
Does anyone have any recommendations for a decent crash course on using vector DBs in conjunction with LLMs? I wanna do some experimentation with getting a model to comment on the similarity of data vectors etc., and I don't really know where to start.
If you want to experiment with vector stores, you can do that locally with something like faiss which has good multiplatform support and sufficient tutorials: https://github.com/facebookresearch/faiss
Doing full retrieval-augmented generation (RAG) and getting LLMs to interpret the results has more steps, but you get a lot of flexibility, and despite what AI influencers say there's no standard best practice. When you query a vector DB you get the most similar texts back (or index integers, in the case of faiss); you then feed those results to an LLM like a normal prompt, which can be optimized with prompt engineering.
The codifier for the RAG workflow is LangChain, but their demo is substantially more complex and harder to use than even a homegrown implementation: https://minimaxir.com/2023/07/langchain-problem/
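For a sense of scale, a homegrown version is roughly this (a sketch using random stand-in embeddings; in practice you'd embed your texts and the question with the same embedding model and send the final prompt to whatever LLM you use):

```python
import numpy as np
import faiss

texts = ["Annoy builds static trees.", "Voyager is based on HNSW.", "IVF clusters the data."]
emb = np.random.rand(len(texts), 384).astype("float32")   # stand-in for real embeddings

index = faiss.IndexFlatL2(384)
index.add(emb)

def retrieve(query_emb, k=2):
    _, ids = index.search(query_emb.reshape(1, -1).astype("float32"), k)
    return [texts[i] for i in ids[0]]      # faiss returns row indices; map them back to texts

question = "How does Voyager index vectors?"
query_emb = np.random.rand(384)            # stand-in for embedding the question
prompt = ("Answer the question using only this context:\n\n"
          + "\n".join(retrieve(query_emb))
          + f"\n\nQuestion: {question}")
# feed `prompt` to the LLM like any other prompt
```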
Also, if what you look up has no semantic meaning, like part numbers, you might be better off with an inverted index in addition to ANN lookups. Especially if the embedding model has been trained on a dataset that is not similar to what you use it for, which is a common situation right now with embedding models based on LLMs.
I recommend pgvector; it's simple and well featured, with good example code. Once you have a dataset of vectors loaded in, the next step is RAG (retrieval-augmented generation).
I was so saddened by this title... I thought there was a way to find annoying neighbors (at homes I’m interested in acquiring) and assumed this was a service for literally finding nearby annoy[ing] neighbors. To then go through the comments and find that this is a service with a minimum input of addresses and census data which are both freely available as is this service. Bravo HN!
This is so overcomplicated. You don't need an actual tree; that's just a mental model. All you need to do is generate N random hyperplanes (each defined by its normal, i.e. generate N normalized vectors in the dimension of your choosing). Then you compute a hash by going normal by normal and checking which side of each hyperplane your query vector `q` falls on, via the sign of the dot product, i.e. `hash = [sign(dot(q, n1)), sign(dot(q, n2)), ...]`.
So given a query you compute that hash, grab all the vectors with the same hash (using your favorite hash-to-array data structure, i.e. a dict), and refine the search within that bucket. It's like 10 rows of code.
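Something like this, as a rough sketch of that scheme (the number of hyperplanes is arbitrary; in practice you'd tune it or keep several hash tables):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
dim, n_planes = 128, 8

# N random hyperplanes, each defined by a unit normal
normals = rng.normal(size=(n_planes, dim))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)

def lsh_hash(v):
    # one bit per hyperplane: the sign of the dot product with its normal
    return tuple((normals @ v > 0).astype(int))

# bucket the database vectors by hash
vectors = rng.normal(size=(10_000, dim))
buckets = defaultdict(list)
for i, v in enumerate(vectors):
    buckets[lsh_hash(v)].append(i)

# query: hash it, then refine with an exact search within the matching bucket
q = rng.normal(size=dim)
candidates = buckets[lsh_hash(q)]
best = min(candidates, key=lambda i: np.sum((vectors[i] - q) ** 2), default=None)
```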
Annoy author here. What you're describing is Locality Sensitive Hashing (LSH) which is something I spent a lot of time trying back in 2009-2012 but never got it working. It has some elegant theoretical properties, but empirically it always has terrible performance. The reason I don't think it works well is that data often lies near a lower-dimensionality manifold that may have some sort of shape. LSH would "waste" most splits (because it doesn't understand the data distribution) but using trees (and finding splits such that the point set is partitioned well) ends up "discovering" the data distribution better.
(but HNSW is generally much better than this tree partitioning scheme)
Faiss GPU author here. Vector binarization via LSH for dense, high-dimensional vectors (say 30 to 2000-ish dimensions) tends to have poor recall and result quality. As the Annoy author implied here as well, there's a lot of literature on LSH, but that's because it is easier to analyze mathematically and prove theorems about, not because it produces results whose quality compares with graph-based (HNSW etc.) or geometric cell-probe (IVF etc.) methods. LSH is not typically used in industrial-scale settings for these reasons.
However, for extremely high-dimensional sparse vectors (think tens of thousands to millions of dimensions), LSH does remain the only practical approximate technique to my knowledge. Online settings (where you need to insert vectors quickly into the database) may also prefer LSH, as the insertion overhead is smaller than updating a graph or finding the IVF cell in which to place the vector.
Below ~30 dimensions, exact non-brute-force methods (like k-D trees or binary space partitioning) work OK, but above that they have high overhead because your true nearest neighbor is still quite likely to lie on the far side of some dividing hyperplane (the partitions don't separate the dataset well; in practice everything is usually "near" everything else in high dimensions).
It seems to me that your approach makes some assumptions about how the data is distributed that aren't necessarily guaranteed to hold in practice, and a tree has a chance of being robust to at least some of those. For example, whatever random hyperplane you choose has to be informed by your data. Suppose you make a particularly bad choice for one of those hyperplanes: with your approach you'll be stuck with it for all time and have a bit in your hash that's biased to heck, but with the tree approach you only suffer when you hash your way down to the lousy node, and all other paths will be unaffected.
I implemented this recently in C as a numpy extension[1], for fun. Even had a vectorized solution going.
You'll get diminishing returns on recall pretty fast. There's actually a theorem that tracks this - the Johnson-Lindenstrauss lemma[2], if you're interested.
As I mention in a talk I gave[3], it can work if you're going to rerank anyway, the vector search isn't the main ranking signal, and you don't have that much data. It's also easy to update, as the hashes are non-parametric (they don't depend on the data). Constantly updating vector search systems is something benchmarks often neglect.
The lack of data dependency, however, is the main problem. Vector spaces are lumpy. You can see this in the distribution of human beings on the surface of the earth: postal codes and area codes range from tiny to huge, and random hashes, like a uniform grid, wouldn't let you accurately map out the distribution of all the people or clump them close to their actual nearest neighbors. Manhattan is not rural Alaska... and the more dimensions and data points you have, the more ways the space has to be lumpy.
Annoy actually builds on these hashes: it creates trees of such splits, finding a data-informed split at each node and recursing into the left and right sides, and then it builds a forest of such trees. So it's essentially a forest of random hash trees with data dependency.
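For contrast with the pure hash version above, here's a stripped-down sketch of that data-dependent split (roughly the "hyperplane equidistant from two sampled points" split described in Annoy's docs; real Annoy does considerably more at build and query time):

```python
import numpy as np

def build_tree(points, ids, leaf_size=16, rng=None):
    # Simplified Annoy-style split: sample two points from this node's data and
    # split on the hyperplane equidistant between them, then recurse on each side.
    rng = rng or np.random.default_rng()
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    a, b = points[rng.choice(ids, size=2, replace=False)]
    normal, midpoint = a - b, (a + b) / 2
    side = (points[ids] - midpoint) @ normal > 0
    if side.all() or not side.any():       # degenerate split; stop here
        return {"leaf": ids}
    return {"normal": normal, "midpoint": midpoint,
            "left":  build_tree(points, ids[~side], leaf_size, rng),
            "right": build_tree(points, ids[side],  leaf_size, rng)}

points = np.random.rand(1000, 64).astype("float32")
ids = np.arange(1000)
forest = [build_tree(points, ids, rng=np.random.default_rng(seed)) for seed in range(10)]
```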
I've implemented the same approach (non-parametric), and it worked well to speed up the search on my experiment with 384-dimensional embeddings of 30 million Hacker News comments.
[1]: https://github.com/spotify/voyager
[2]: https://engineering.atspotify.com/2023/10/introducing-voyage...