I've built a few vector search engines too, so this was an immediate red flag: "Brute force takes 12 seconds / query on 1 million vectors of 768 dim".
No, a sane brute-force search (via BLAS) at that size should take ~200 ms / query. I.e. SIXTY TIMES faster!
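For reference, brute force at that scale is one BLAS-backed matrix-vector product plus a top-k selection. A minimal NumPy sketch (the dataset here is random and scaled down for illustration; the same code handles 1M × 768, it just needs ~3 GB of RAM in float32):

```python
import numpy as np

# Random stand-in for a real embedding database (scaled down from 1M rows).
rng = np.random.default_rng(0)
db = rng.standard_normal((100_000, 768), dtype=np.float32)
query = rng.standard_normal(768, dtype=np.float32)

# One BLAS matrix-vector product = dot-product scores against every row.
scores = db @ query

# Top-k via argpartition (O(n)), then sort only those k candidates.
k = 10
top_k = np.argpartition(-scores, k)[:k]
top_k = top_k[np.argsort(-scores[top_k])]
```

The whole thing is a handful of lines with fully sequential memory access, which is what makes the "speed vs accuracy vs complexity" trade-off worth spelling out.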
If they (Faiss?) got this wrong, what else did they get wrong?
I understand researchers want to showcase their "best and fastest" approach, so they fudge the baselines. Approximate search can be genuinely useful – orders of magnitude faster than (even non-fudged) brute force, and using less RAM too.
But as a user, tech stack complexity is also a consideration. Because the trade-off is not only "speed vs accuracy". Brute force is a trivial algorithm, easy to implement and maintain with no corner cases. It has completely predictable data access patterns (linear, sequential, fixed response time, 100% accuracy). It supports operations (update, range, dynamic k-NN) that complex indexes struggle with.
So if your dataset is tiny – and anything under 1 million counts as tiny – do you really need to maintain an external dependency of fancy data structures and approximate algorithms?
I like how you tested "query multiple vectors at once". Super useful when documents can be batched, for increased throughput. If I'm reading your benchmark correctly, Faiss brute-force can do a batch query of 10,000 vectors in ~19 seconds => 1.9 ms per vector.
That's pretty cool – and more than 130x faster than querying those 10,000 vectors individually, one by one.
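The speedup comes from the batch collapsing into a single matrix-matrix multiply (GEMM), which BLAS executes far more efficiently than thousands of separate matrix-vector calls. A NumPy sketch with illustrative sizes (random data, not the benchmark's actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((50_000, 768), dtype=np.float32)
queries = rng.standard_normal((1_000, 768), dtype=np.float32)  # the batch

# One GEMM scores every query against every database vector at once.
scores = queries @ db.T            # shape (1000, 50000)

# Per-row top-k: partition each row, then sort just the k candidates.
k = 5
idx = np.argpartition(-scores, k, axis=1)[:, :k]
row_scores = np.take_along_axis(scores, idx, axis=1)
order = np.argsort(-row_scores, axis=1)
top_k = np.take_along_axis(idx, order, axis=1)
```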
There is a lot being done in vector search technology right now.
I was less fortunate when looking at ways to store the vectors in databases.
I already looked at Pinecone and Weaviate, but they are all paid products.
A lot of VCs & founders trying to commercialize other people's SW are wondering this too :)
AFAICT: Most of those are basically UIs, data management, & integrations around the same set of vector index libs like FAISS + same set of models like HF, and even the same set of inference server libs (triton, aws/gcp, fastapi, ...). So you'd be evaluating different commercializations of the same core OSS tech. There are useful evaluations to do at that level, but more like licensing, business model, UI, model management, etc.
Another commenter below noted that regular DBs (ES, Postgres, ..) are starting to add vector indexes. As someone doing a lot of architecting for log/event/graph correlation/investigation systems, I've been tracking whether a managed neural search db is a feature vs a new category, and how big. Ex: those corporate sites definitely feel like an Algolia competitor, but for the 99% case we normally do, maybe it's just an OSS feature/lib of whatever DB/framework you're already using? Not obvious!
Can someone tell me if these vector search things have anything in common with postgres's text search vectors, which have been implemented in postgres for quite a while now?
Not much in common. A "vector" in Postgres is a tokenized and normalized array of words. So 'a fat cat sat on a mat and ate a fat rat' becomes 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'. This just makes keyword searches a bit easier. (Source: https://www.postgresql.org/docs/9.4/datatype-textsearch.html)
A "vector" in vector search solutions is a dense vector generated by a transformer model. A sentence like 'a fat cat sat on a mat and ate a fat rat' put through a model becomes an array of floating-point numbers like [0.183, -0.774, ...], with hundreds of values (often 768). The point is that this vector is positioned in a 768-dimensional space in close proximity to semantically similar sentences. So then searching by semantic meaning is a simple (well...) exercise in measuring the distance between your query and the surrounding vectors.
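That "measuring the distance" step is usually cosine similarity (or plain Euclidean distance). A quick sketch, using random vectors as stand-ins for actual model embeddings:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction,
    0.0 = orthogonal (unrelated), -1.0 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for two 768-dim sentence embeddings.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(768)
emb_b = rng.standard_normal(768)

score = cosine_sim(emb_a, emb_b)
```

The "well..." in the parent is doing real work, though: the simple part is the distance function, the hard part is doing it fast over millions of vectors.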
>A "vector" in vector search solutions is a dense vector generated by a transformer model.
Just a clarification that vectors are much broader than text vectors generated by transformer models. A more common application has been recommender models built on matrix factorization and similar approaches. Word2Vec was another popular way to generate vectors, dating back to 2013. Vectors are a very general approach with many benefits regardless of how they were generated. That is what makes these vector search libraries so exciting.
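For instance, item vectors from a matrix-factorization recommender drop straight into the same nearest-neighbor machinery as text embeddings. A toy sketch via truncated SVD on a made-up ratings matrix (all numbers invented for illustration):

```python
import numpy as np

# Toy user-item ratings matrix (rows = users, cols = items, 0 = unrated).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=np.float32)

# Truncated SVD: keep 2 latent dimensions -> one 2-d vector per item.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
item_vecs = Vt[:2].T * s[:2]

# Nearest neighbor to item 0 by dot product -- the exact same search
# you'd run on transformer embeddings.
scores = item_vecs @ item_vecs[0]
scores[0] = -np.inf                 # exclude the item itself
most_similar = int(np.argmax(scores))
```

Here item 1 comes out most similar to item 0, since the same users rated both highly; the search code has no idea (and doesn't care) that the vectors came from ratings rather than text.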
At what point does it make sense to switch to using vector search solutions (compared to keyword searches)? Obviously Google et al need it, but for regular apps, maybe academic repositories and so on, is there a threshold in which we can start the discussion to switch?
Or phrased another way, when do we know we should be looking at vector search solutions to enhance the search in our application?
When your users start complaining that your search sucks. :)
Either because your catalog has grown to the point where a traditional keyword search makes it harder to find relevant items, or because users are increasingly expecting their apps to just "know what they mean" (like Google, Spotify, Amazon, and Netflix do).
I run a super niche academic database - there is often that expectation that "search should know what I mean", yet there isn't enough data (in my opinion) to make any semantic search meaningful. There are about 23k data objects that can interlink, 50% of those belong to one specific data type, the rest are split.
So we've stuck with simple keyword search with filters to drill into specific categories within results, all pretty vanilla on a relational database structure.
Just wondering if there is a "volume" heuristic to this - I'd like to explore this more but realistically sometimes the academic user-base has big dreams with severe practical limitations.
It comes from the dimensionality of the hidden state of popular NLP models. The number 768 in particular comes from (if I recall correctly) the largest of the original BERT models.
This happens a lot. Usually starts with one popular post about a topic. Then someone explores deeper and finds another interesting article on the same topic, and submits it.
It has happened that I have read about one article and find something related (following links or looking things up) that I find even more interesting and then submit that.
Google announced their new vector search service today, and that prompted someone on HN to post a related library that they knew about. It’s certainly possible that that person works at FB, but that doesn’t mean it’s some kind of orchestrated PR maneuver.
*Facebook is a bad company that puts profit over people and they should be boycotted. People who work at Facebook are complicit in its policies, and so their publications should be boycotted as well.
The official documentation on Faiss is rather light, so we made “The Missing Manual to Faiss”: https://www.pinecone.io/learn/faiss-tutorial/
Previous discussion: https://news.ycombinator.com/item?id=29291047