Faiss: Facebook's open source vector search library (github.com/facebookresearch)
175 points by ai_ja_nai on Dec 14, 2021 | 38 comments



Fun seeing vector search all over the front page today. :)

The official documentation on Faiss is rather light, so we made “The Missing Manual to Faiss”: https://www.pinecone.io/learn/faiss-tutorial/

Previous discussion: https://news.ycombinator.com/item?id=29291047


I've built a few vector search engines too, so this was an immediate red flag: "Brute force takes 12 seconds / query on 1 million vectors of 768 dim".

No, a sane brute-force search (via BLAS) at that size should take ~200 ms per query. I.e. SIXTY TIMES faster!

If they (Faiss?) got this wrong, what else did they get wrong?

I understand researchers want to showcase their "best and fastest" approach, so they fudge the baselines. Approximate search can be genuinely useful – orders of magnitude faster than (even non-fudged) brute force, and using less RAM too.

But as a user, tech-stack complexity is also a consideration, because the trade-off is not only "speed vs accuracy". Brute force is a trivial algorithm, easy to implement and maintain, with no corner cases. It has completely predictable data access patterns (linear, sequential, fixed response time, 100% accuracy). It supports operations (update, range, dynamic k-NN) that complex indexes struggle with.

So if your dataset is tiny – and anything under 1 million counts as tiny – do you really need to maintain an external dependency of fancy data structures and approximate algorithms?
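
To be concrete, that kind of BLAS-backed brute force fits in a few lines of NumPy (a sketch; the shapes and k are assumptions). The trick is expanding the squared L2 distance so the dominant cost is a single matrix multiply, which NumPy hands off to whatever BLAS it is linked against:

    import numpy as np

    def brute_force_knn(queries, corpus, k=10):
        # ||q - c||^2 = ||q||^2 + ||c||^2 - 2 q.c
        # the queries @ corpus.T term is one big GEMM -- that's the BLAS part
        q_norms = (queries ** 2).sum(axis=1, keepdims=True)
        c_norms = (corpus ** 2).sum(axis=1)
        dists = q_norms + c_norms - 2.0 * (queries @ corpus.T)
        idx = np.argpartition(dists, k, axis=1)[:, :k]   # k smallest, unsorted
        order = np.take_along_axis(dists, idx, axis=1).argsort(axis=1)
        return np.take_along_axis(idx, order, axis=1)

    corpus = np.random.rand(1_000_000, 768).astype('float32')  # ~3 GB of RAM
    query = np.random.rand(1, 768).astype('float32')
    print(brute_force_knn(query, corpus))  # indices of the 10 nearest vectors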


1 vector against 1 million vectors in 768 dims at k = 10 takes 259 ms for me using Faiss CPU IndexFlatL2 with Intel MKL:

https://gist.github.com/wickedfoo/165b69075cfcceba872aec1c46...
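
The setup, roughly (a sketch with random data standing in for real embeddings; see the gist for the actual benchmark code and timings):

    import numpy as np
    import faiss

    d = 768
    xb = np.random.rand(1_000_000, d).astype('float32')   # corpus
    xq = np.random.rand(10_000, d).astype('float32')      # queries

    index = faiss.IndexFlatL2(d)   # exact (brute-force) L2 search
    index.add(xb)

    D, I = index.search(xq[:1], 10)   # one query, k = 10
    D, I = index.search(xq, 10)       # batched queries amortize far better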


Sounds about right.

I like how you tested "query multiple vectors at once". Super useful when documents can be batched, for increased throughput. If I'm reading your benchmark correctly, Faiss brute-force can do a batch query of 10,000 vectors in ~19 seconds => 1.9 ms per vector.

That's pretty cool – and more than 130x faster than querying those 10,000 vectors individually, one by one.


There is a lot being done in vector search technology right now. I was less fortunate when looking for ways to store the vectors in databases. I already looked at Pinecone and Weaviate, but they are all paid products.

Does anyone have feedback on this?


A lot of VCs & founders trying to commercialize other people's SW are wondering this too :)

AFAICT: Most of those are basically UIs, data management, & integrations around the same set of vector index libs like FAISS + same set of models like HF, and even the same set of inference server libs (triton, aws/gcp, fastapi, ...). So you'd be evaluating different commercializations of the same core OSS tech. There are useful evaluations to do at that level, but more like licensing, business model, UI, model management, etc.

Another commenter below noted that regular DBs (ES, Postgres, ..) are starting to add vector indexes. As someone doing a lot of architecting for log/event/graph correlation/investigation systems, I've been tracking whether a managed neural search DB is a feature vs. a new category, and how big. Ex: the corporate offerings definitely feel like an Algolia competitor, but for the 99% case we normally handle, maybe it's just an OSS feature/lib of whatever DB/framework you're already using? Not obvious!


Another option is Milvus: https://github.com/milvus-io. It's an open-source vector database.


Milvus is a layer on top of FAISS.


I don’t know when you last looked but as of a few months ago Pinecone has a free tier that fits 1M items with <100ms latency.




Can someone tell me if these vector search things have anything in common with postgres's text search vectors, which have been implemented in postgres for quite a while now?


Not much in common. A "vector" in Postgres is a tokenized and normalized array of words. So 'a fat cat sat on a mat and ate a fat rat' becomes 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'. This just makes keyword searches a bit easier. (Source: https://www.postgresql.org/docs/9.4/datatype-textsearch.html)

A "vector" in vector search solutions is a dense vector generated by a transformer model. A sentence like 'a fat cat sat on a mat and ate a fat rat' put through a model becomes an array of floating-point numbers like [0.183, -0.774, ...], with hundreds of values (often 768). The point is that this vector is positioned in a 768-dimensional space in close proximity to semantically similar sentences. So then searching by semantic meaning is a simple (well...) exercise in measuring the distance between your query and the surrounding vectors.

We have a whole course on the topic, coincidentally also on the front page of HN today: https://www.pinecone.io/learn/nlp


>A "vector" in vector search solutions is a dense vector generated by a transformer model.

Just a clarification that vectors are much broader than text vectors generated by transformer models. A more common application has been recommender models built on matrix factorization and other similar approaches. Word2Vec was another popular way to generate vectors, dating back to 2013. Vectors are a very general approach with many benefits regardless of how they were generated. That is what makes these vector search libraries so exciting.
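
A toy sketch of the matrix factorization flavor (random interactions stand in for real data; production systems use weighted/implicit variants like ALS): factorize the user-item matrix, keep the item factors as vectors, and nearest neighbors in that latent space are "similar" items:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import svds

    # toy user x item interaction matrix (e.g. play counts)
    plays = sp.random(10_000, 2_000, density=0.01, format='csr', random_state=0)

    # truncated SVD: plays ~= U @ diag(s) @ Vt
    U, s, Vt = svds(plays, k=64)
    item_vectors = Vt.T * s   # one 64-dim vector per item

    # items closest to item 0 in the latent space
    v = item_vectors[0]
    sims = item_vectors @ v / (np.linalg.norm(item_vectors, axis=1) * np.linalg.norm(v))
    print(np.argsort(-sims)[:10])   # top 10, includes item 0 itself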

I wrote an article about the power of matrix factorization vectors for music recommendations back in 2016: https://tech.iheart.com/mapping-the-world-of-music-using-mac...

We also discussed how we used convolutional neural networks (deep networks, but not transformers) to build vectors on the acoustic content of music: https://tech.iheart.com/mapping-the-world-of-music-using-mac...


Thanks for the nice and clear summary!

At what point does it make sense to switch to using vector search solutions (compared to keyword searches)? Obviously Google et al need it, but for regular apps, maybe academic repositories and so on, is there a threshold in which we can start the discussion to switch?

Or phrased another way, when do we know we should be looking at vector search solutions to enhance the search in our application?


When your users start complaining that your search sucks. :)

Either because your catalog has grown to the point where a traditional keyword search makes it harder to find relevant items, or because users are increasingly expecting their apps to just "know what they mean" (like Google, Spotify, Amazon, and Netflix do).


I run a super-niche academic database - there is often that expectation that "search should know what I mean", yet there isn't enough data (in my opinion) to make any semantic search meaningful. There are about 23k data objects that can interlink; 50% of those belong to one specific data type, and the rest are split.

So we've stuck with simple keyword search with filters to drill into specific categories within results, all pretty vanilla on a relational database structure.

Just wondering if there is a "volume" heuristic to this - I'd like to explore this more but realistically sometimes the academic user-base has big dreams with severe practical limitations.


You can train a good sentence transformer on ~10K sentences using TSDAE (an unsupervised training approach), covered in Chapter 7 here: https://www.pinecone.io/learn/unsupervised-training-sentence...
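
Roughly like this (a sketch following the sentence-transformers TSDAE example; the base model, batch size, and file name are placeholders):

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, models, datasets, losses

    # ~10K unlabeled sentences from your own domain
    sentences = open('domain_sentences.txt').read().splitlines()

    # encoder: any pretrained transformer + CLS pooling
    word_emb = models.Transformer('bert-base-uncased')
    pooling = models.Pooling(word_emb.get_word_embedding_dimension(), 'cls')
    model = SentenceTransformer(modules=[word_emb, pooling])

    # TSDAE: corrupt each sentence, train encoder+decoder to reconstruct it
    train_data = datasets.DenoisingAutoEncoderDataset(sentences)
    loader = DataLoader(train_data, batch_size=8, shuffle=True)
    loss = losses.DenoisingAutoEncoderLoss(
        model, decoder_name_or_path='bert-base-uncased', tie_encoder_decoder=True)

    model.fit(train_objectives=[(loader, loss)], epochs=1, show_progress_bar=True)
    model.save('tsdae-domain-model')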


Why 768? What are these dimensions?


It comes from the dimensionality of the hidden state of popular NLP models. The number 768 in particular is the hidden size of the original BERT-base model (BERT-large uses 1024).


So it's arbitrary then


No, those are standard full-text indexing data structures: vectors representing word positions.

This is for finding nearby N-dimensional points, where N is typically greater than 50 and the point is the output of an ML or NN process.


Suddenly vector search is relevant. Was this orchestrated?


This happens a lot. Usually starts with one popular post about a topic. Then someone explores deeper and finds another interesting article on the same topic, and submits it.


It's an old, mutually beneficial PR arrangement known as the Newton-Leibniz calculus


It has happened to me: I read one article, follow links or look things up, find something related that I find even more interesting, and then submit that.


I feel the same. It is not by chance, I would think.


Not everything is a conspiracy.

Google announced their new vector search service today, and that prompted someone on HN to post a related library that they knew about. It’s certainly possible that that person works at FB, but that doesn’t mean it’s some kind of orchestrated PR maneuver.


Another HN post from today on vector search.

https://news.ycombinator.com/item?id=29554986


And I'm wondering if these are all making the front page because of this Show HN from earlier.

https://news.ycombinator.com/item?id=29551947


How does one update the index? All I see at a quick glance is how to create/re-create/train/re-train the index from scratch.


Funny to see this thread, as I recall Google just recently posted a similar vector search library.


But nobody is posting benchmarks.


Are Facebook's look-alike audiences implemented using faiss?


What is vector search?



Show me the benchmark results!


fuck facebook*

*Facebook is a bad company that puts profit over people, and they should be boycotted. People who work at Facebook are complicit in its policies, so their publications should also be boycotted.



