Not that easy, the founder was a director at AWS. This is just devops/obfuscatio...

gk1 · on Aug 16, 2022

Pinecone doesn’t use Faiss, nor ScaNN. We love Faiss and even teach people to use it[1]. There happens to be a sizable population of engineers who need more than what Faiss provides (like live index updates and metadata filtering, for example), and can’t be bothered or aren’t being paid to customize and manage open-source libraries all day.

[1] https://www.pinecone.io/learn/faiss/

fdgsdfogijq · on Aug 16, 2022

So you guys developed and implemented state-of-the-art neural network vector search from scratch? In a year? And it's better than libraries with tens of contributors and years of research behind them?

pbadams · on Aug 16, 2022

Most vector search research teams are a lot smaller than you suggest, and haven't been around that long (e.g. the FAISS paper was published in 2017).

From public info, you can see they have at least one researcher working there. It's believable to me that they could have some new innovations, especially since the product space they're focusing on is different from other teams working on vector search. State-of-the-art for a specific set of constraints is still state-of-the-art.

However, considering how much of their edu-marketing content is posted to HN, it would be great if they could share more details about the internals of their index with the community. One of the great things about vector search is how many techniques are open sourced or documented in papers :).

Disclaimer: I work on vector search at a different company

nl · on Aug 17, 2022

Many very competitive vector search libraries are done by small teams.

HNSW in NMSLIB[1] is mostly 3 people's work and it's very competitive[2].

[1] https://github.com/nmslib/nmslib

[2] http://ann-benchmarks.com/glove-100-angular_10_angular.html

noogle · on Aug 16, 2022

I actually built a similar solution supporting similar operations (including filtering by meta-data) using open-source libraries. Took me about 2 weeks net.

I can see a clientele for such a database (people who want a turnkey solution), but honestly it looks like an attempt to use a devops solution to address deeper issues with problem formulation, e.g.:

1. Is there really a need to search all items in the database? Can subsampling make simple similarity comparison feasible?

2. Do the embeddings really need to have that many dimensions? Can we reduce their dimensionality and fit them in RAM?

3. Is embedding accurate enough compared to pairwise comparison? Can we formulate the problem to make the latter feasible?
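To make questions 1 and 2 concrete, here is a minimal sketch assuming NumPy only. The data is random and made up, the sizes (10,000 items, 256 dimensions, a 2,000-item subsample, 32 reduced dimensions) are arbitrary, and `pca_reduce` and `top_k_cosine` are hypothetical helper names. It illustrates the idea and is not anyone's production setup.

```python
import numpy as np

def pca_reduce(vectors, n_components):
    """Project vectors onto their top principal components (plain SVD-based PCA)."""
    mean = vectors.mean(axis=0)
    centered = vectors - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, mean, components

def top_k_cosine(query, vectors, k):
    """Exact (brute-force) top-k by cosine similarity."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 256)).astype(np.float32)  # made-up data

# Question 1: compare against a random subsample instead of every item.
sample_ids = rng.choice(len(embeddings), size=2_000, replace=False)

# Question 2: shrink 256 dimensions to 32 before comparing.
reduced, mean, components = pca_reduce(embeddings[sample_ids], 32)

# Use one of the sampled items as the query, projected the same way.
query = embeddings[sample_ids[0]]
query_reduced = (query - mean) @ components.T

ids, scores = top_k_cosine(query_reduced, reduced, k=5)
result_ids = sample_ids[ids]  # map positions in the subsample back to original ids
```

Whether the subsample and the reduced dimensionality keep enough accuracy depends entirely on the data, so recall would have to be measured against a full-dimension, full-dataset search before relying on either shortcut.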

I also could not find any explanation of the underlying algorithms, especially around metadata filtering (which FAISS does not solve), or any details on their accuracy. (Happy to hear otherwise.)
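For reference, the simplest way to combine metadata filtering with vector search is to pre-filter: keep only the items whose metadata matches, then run an exact search over what is left. The sketch below assumes NumPy only, and `filtered_search`, the `lang` field, and the toy vectors are all hypothetical. It is not a description of how Pinecone or any particular library handles filtering.

```python
import numpy as np

def filtered_search(query, vectors, metadata, predicate, k):
    """Pre-filter rows by metadata, then do an exact L2 search over the survivors."""
    candidate_ids = np.array(
        [i for i, m in enumerate(metadata) if predicate(m)], dtype=int
    )
    if candidate_ids.size == 0:
        return candidate_ids, np.array([])
    dists = np.linalg.norm(vectors[candidate_ids] - query, axis=1)
    order = np.argsort(dists)[:k]
    return candidate_ids[order], dists[order]

# Toy data: four 2-d vectors, each with a metadata dict.
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
metadata = [{"lang": "en"}, {"lang": "de"}, {"lang": "en"}, {"lang": "en"}]

# The nearest vector overall is id 1, but it is filtered out because lang == "de".
ids, dists = filtered_search(
    np.array([0.9, 0.1]), vectors, metadata, lambda m: m["lang"] == "en", k=2
)
```

Pre-filtering stays exact but scans every candidate, so it gets slow when the filter matches a large share of the data. Filtering after an approximate search is faster but can return fewer than k results when the filter is selective, which is why the interaction between filtering and the index is worth documenting.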