I've built a few vector search engines too, so this was an immediate red flag: "Brute force takes 12 seconds / query on 1 million vectors of 768 dim".
No, a sane brute-force search (via BLAS) at that size should take ~200 ms / query. I.e. SIXTY TIMES faster!
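For reference, brute force at that scale is one BLAS-backed matrix-vector product plus a top-k selection. A minimal NumPy sketch (the dataset here is random and scaled down for illustration; the same code handles 1M × 768, it just needs ~3 GB of RAM in float32):

```python
import numpy as np

# Random stand-in for a real embedding database (scaled down from 1M rows).
rng = np.random.default_rng(0)
db = rng.standard_normal((100_000, 768), dtype=np.float32)
query = rng.standard_normal(768, dtype=np.float32)

# One BLAS matrix-vector product = dot-product scores against every row.
scores = db @ query

# Top-k via argpartition (O(n)), then sort only those k candidates.
k = 10
top_k = np.argpartition(-scores, k)[:k]
top_k = top_k[np.argsort(-scores[top_k])]
```

The whole thing is a handful of lines with fully sequential memory access, which is what makes the "speed vs accuracy vs complexity" trade-off worth spelling out.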
If they (Faiss?) got this wrong, what else did they get wrong?
I understand researchers want to showcase their "best and fastest" approach, so they fudge the baselines. Approximate search can be genuinely useful – orders of magnitude faster than (even non-fudged) brute force, and using less RAM too.
But as a user, tech stack complexity is also a consideration. Because the trade-off is not only "speed vs accuracy". Brute force is a trivial algorithm, easy to implement and maintain with no corner cases. It has completely predictable data access patterns (linear, sequential, fixed response time, 100% accuracy). It supports operations (update, range, dynamic k-NN) that complex indexes struggle with.
So if your dataset is tiny – and anything under 1 million counts as tiny – do you really need to maintain an external dependency of fancy data structures and approximate algorithms?
I like how you tested "query multiple vectors at once". Super useful when documents can be batched, for increased throughput. If I'm reading your benchmark correctly, Faiss brute-force can do a batch query of 10,000 vectors in ~19 seconds => 1.9 ms per vector.
That's pretty cool – and more than 130x faster than querying those 10,000 vectors individually, one by one.
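The speedup comes from the batch collapsing into a single matrix-matrix multiply (GEMM), which BLAS executes far more efficiently than thousands of separate matrix-vector calls. A NumPy sketch with illustrative sizes (random data, not the benchmark's actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((50_000, 768), dtype=np.float32)
queries = rng.standard_normal((1_000, 768), dtype=np.float32)  # the batch

# One GEMM scores every query against every database vector at once.
scores = queries @ db.T            # shape (1000, 50000)

# Per-row top-k: partition each row, then sort just the k candidates.
k = 5
idx = np.argpartition(-scores, k, axis=1)[:, :k]
row_scores = np.take_along_axis(scores, idx, axis=1)
order = np.argsort(-row_scores, axis=1)
top_k = np.take_along_axis(idx, order, axis=1)
```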
There is a lot being done in vector search technology right now.
I was less fortunate when looking at ways to store the vectors in databases.
I already looked at Pinecone and Weaviate, but they are all paid products.
A lot of VCs & founders trying to commercialize other people's SW are wondering this too :)
AFAICT: Most of those are basically UIs, data management, & integrations around the same set of vector index libs like FAISS + same set of models like HF, and even the same set of inference server libs (triton, aws/gcp, fastapi, ...). So you'd be evaluating different commercializations of the same core OSS tech. There are useful evaluations to do at that level, but more like licensing, business model, UI, model management, etc.
Another commenter below noted that regular DBs (ES, Postgres, ..) are starting to add vector indexes. As someone doing a lot of architecting for log/event/graph correlation/investigation systems, I've been tracking whether a managed neural search db is a feature vs a new category, and how big. Ex: those corporate sites definitely feel like an Algolia competitor, but for the 99% case we normally do, maybe it's just an OSS feature/lib of whatever DB/framework you're already using? Not obvious!
Can someone tell me if these vector search things have anything in common with postgres's text search vectors, which have been implemented in postgres for quite a while now?
Not much in common. A "vector" in Postgres is a tokenized and normalized array of words. So 'a fat cat sat on a mat and ate a fat rat' becomes 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'. This just makes keyword searches a bit easier. (Source: https://www.postgresql.org/docs/9.4/datatype-textsearch.html)
A "vector" in vector search solutions is a dense vector generated by a transformer model. A sentence like 'a fat cat sat on a mat and ate a fat rat' put through a model becomes an array of floating-point numbers like [0.183, -0.774, ...], with hundreds of values (often 768). The point is that this vector is positioned in a 768-dimensional space in close proximity to semantically similar sentences. So then searching by semantic meaning is a simple (well...) exercise in measuring the distance between your query and the surrounding vectors.
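That "measuring the distance" step is usually cosine similarity (or plain Euclidean distance). A quick sketch, using random vectors as stand-ins for actual model embeddings:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction,
    0.0 = orthogonal (unrelated), -1.0 = opposite."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for two 768-dim sentence embeddings.
rng = np.random.default_rng(0)
emb_a = rng.standard_normal(768)
emb_b = rng.standard_normal(768)

score = cosine_sim(emb_a, emb_b)
```

The "well..." in the parent is doing real work, though: the simple part is the distance function, the hard part is doing it fast over millions of vectors.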
>A "vector" in vector search solutions is a dense vector generated by a transformer model.
Just a clarification that vectors are much broader than text vectors generated by transformer models. A more common application has been recommender models built on matrix factorization and similar approaches. Word2Vec was another popular way to generate vectors, dating back to 2013. Vectors are a very general approach with many benefits regardless of how they were generated. That is what makes these vector search libraries so exciting.
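For instance, item vectors from a matrix-factorization recommender drop straight into the same nearest-neighbor machinery as text embeddings. A toy sketch via truncated SVD on a made-up ratings matrix (all numbers invented for illustration):

```python
import numpy as np

# Toy user-item ratings matrix (rows = users, cols = items, 0 = unrated).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=np.float32)

# Truncated SVD: keep 2 latent dimensions -> one 2-d vector per item.
U, s, Vt = np.linalg.svd(ratings, full_matrices=False)
item_vecs = Vt[:2].T * s[:2]

# Nearest neighbor to item 0 by dot product -- the exact same search
# you'd run on transformer embeddings.
scores = item_vecs @ item_vecs[0]
scores[0] = -np.inf                 # exclude the item itself
most_similar = int(np.argmax(scores))
```

Here item 1 comes out most similar to item 0, since the same users rated both highly; the search code has no idea (and doesn't care) that the vectors came from ratings rather than text.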
At what point does it make sense to switch to using vector search solutions (compared to keyword searches)? Obviously Google et al need it, but for regular apps, maybe academic repositories and so on, is there a threshold in which we can start the discussion to switch?
Or phrased another way, when do we know we should be looking at vector search solutions to enhance the search in our application?
When your users start complaining that your search sucks. :)
Either because your catalog has grown to the point where a traditional keyword search makes it harder to find relevant items, or because users are increasingly expecting their apps to just "know what they mean" (like Google, Spotify, Amazon, and Netflix do).
I run a super niche academic database - there is often that expectation that "search should know what I mean", yet there isn't enough data (in my opinion) to make any semantic search meaningful. There are about 23k data objects that can interlink, 50% of those belong to one specific data type, the rest are split.
So we've stuck with simple keyword search with filters to drill into specific categories within results, all pretty vanilla on a relational database structure.
Just wondering if there is a "volume" heuristic to this - I'd like to explore this more but realistically sometimes the academic user-base has big dreams with severe practical limitations.
It comes from the dimensionality of the hidden state of popular NLP models. The number 768 in particular comes from (if I recall correctly) the largest of the original BERT models.
This happens a lot. Usually starts with one popular post about a topic. Then someone explores deeper and finds another interesting article on the same topic, and submits it.
It has happened that I have read about one article and find something related (following links or looking things up) that I find even more interesting and then submit that.
Google announced their new vector search service today, and that prompted someone on HN to post a related library that they knew about. It’s certainly possible that that person works at FB, but that doesn’t mean it’s some kind of orchestrated PR maneuver.
*Facebook is a bad company that puts profit over people and they should be boycotted. People who work at Facebook are complicit in its policies, and so their publications should be boycotted as well.
The official documentation on Faiss is rather light, so we made “The Missing Manual to Faiss”: https://www.pinecone.io/learn/faiss-tutorial/
Previous discussion: https://news.ycombinator.com/item?id=29291047