I'm skeptical about some vector databases these days, but your article misses a few important points when it comes to LLMs.
1. To use LLMs effectively, you often need to generate and store more than 1 vector per document. 10 million vectors may only be 100,000 documents. This may still be enough for a lot of small problems.
2. Pgvector currently has great limitations on recall/latency because its underlying ANN index uses IVF (I'm currently working on adding HNSW-IVF and HNSW support to pgvector). In some cases, even Elasticsearch can have issues with scale (the problem comes from the constraint of one ANN index per index segment, and immutability).
3. "Pre-calculate" seems like the wrong word to describe HNSW graph construction.
I think a point you miss that is important to consider for LLM + vector DBs is the fact that so much of the complexity of these use cases cannot be captured by the vector DB (e.g. Pinecone, Chroma, Qdrant, etc.). I think there are some more end-to-end systems, at least in search, attempting to solve this (e.g. Marqo, maybe Weaviate). Overall, I like the article. It makes a worthwhile claim and counterpoint to all the vector DB hype.
I'd love to hear more about your thoughts on the complexity that, in your opinion, cannot be captured by the vector DB. I probably didn't get your point.
Disclaimer: I work for Qdrant, and we believe a database should be just a database. I remember past attempts to move application logic into the database layer, and coupling neural encoders into the vector database sounds like the same thing.
> 1. To use LLMs effectively, you often need to generate and store more than 1 vector per document.
Could you elaborate on this point for me? What would cause the 1 document -> ~100 vectors blowup: do you store vector embeddings for sections of the document, or use multiple models to create several types of vectors?
If you look at something like LangChain[0], it supports/recommends splitting larger documents into smaller chunks. In this way, when doing something like semantic search you can get the specific paragraph/section that holds the closest relevance, rather than having to read the entire document again (think of a 100 page PDF).
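For illustration, a minimal chunking sketch along those lines with LangChain's text splitter (chunk sizes, file names, and the metadata layout here are just placeholders to show the shape of it):

```python
# Hypothetical sketch: split one long document into overlapping chunks,
# so each chunk gets its own embedding (one document -> many vectors).
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # characters per chunk; tune per use case
    chunk_overlap=50,  # overlap so sentences aren't cut mid-thought
)

document_text = open("big_report.txt").read()  # e.g. a 100-page PDF already extracted to text
chunks = splitter.split_text(document_text)

# Each chunk is embedded and stored separately, with a pointer back to the parent doc,
# so the relevant paragraph/section can be returned instead of the whole document.
records = [{"doc_id": "big_report", "chunk_id": i, "text": c} for i, c in enumerate(chunks)]
```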
This is generally very context/use-case specific. In general, if a document is a `Dict[str, Any]`, then you have to have one (or multiple) vector(s) per field, unless you want to combine vectors across fields (it's not self-evident how you'd best do that). That said, here are some specific reasons to do this (or why I've done it in the past):
1. Chunking long text fields in documents so as to get a better semantic vector for them (also you can only fit so much into an LLM).
2. Separately from 1., chunking long text fields (or even chunking images, audio, etc.) is one way to perform highlighting. It helps to answer, for example, why a given document was returned: you can then point to the area in the image/text/audio that was most relevant.
3. You may want to run different LLMs on different fields (perhaps a separate multi-modal LLM vs a standard text LLM), or like another comment said have different transforms/representations of the same field.
Perhaps 100 vectors is non-standard, but definitely not unseen.
Only Vespa allows you to index multiple vectors per schema field, which avoids duplicating all the metadata of the document into each "chunk" and avoids maintaining the document-to-chunk fan-out. See https://blog.vespa.ai/semantic-search-with-multi-vector-inde...
I’m not a data scientist but I think I know why one document could lead to many vectors.
(Happy to be corrected and/or schooled.)
A vector is a list of numbers, each of which represents the weight accorded to a certain word along a certain dimension.
Let’s take an example.
Is an “apple” a “positive” or a “negative” thing? Most people would associate positivity with apples. So, for the general population, the vector for “apple” along the 0-1 continuum where 0 represents negative sentiment and 1 represents positive sentiment would be something like [0.8].
Let’s add one more dimension. Is an apple associated with computers (1) or not (0)? For the majority of the world where Windows has a massive market share, “apple” would recall a fruit, not a sleek laptop. Therefore, the vector for apple along the computer/non-computer dimension is probably [0.3].
Taking this together, apple = [0.8, 0.3] where positionally, 0.8 is the value for positive/negative sentiment while 0.3 is computer/non-computer.
Agree?
(Hoping you do)
But that [0.8, 0.3] vector is for the general population.
Would a bible literalist who publishes blogs on bible stories feel the same way?
For someone like that, the notion of the original sin could taint their sentiments towards the apple. So they might weight an apple at 0.2 on the positive/negative line. Since they’re bloggers, it’s more likely they associate apple with computers so they might call it 0.5. Therefore, their apple vector is [0.2, 0.5].
Extend this to more content and you'll see why there can be more than one vector.
At least that’s how I understood it. Happy to be corrected and/or schooled.
In my opinion, you could represent "apple" as a vector, for example, [0.99, 0.3, 0.7] in relation to [fruits, computers, religion]. Then, you can create different user vectors that describe the interests of various groups. For instance, the general population might have a vector like [0.8, 0.2, 0.1], geeks as [0.6, 0.95, 0.05], and religious people as [0.7, 0.1, 0.95].
By creating these user vectors, you can compare them with the "apple" vector and find the best match using ANN. This approach allows you to determine which group is most interested in a given context or aspect of the word "apple." The ANN will help you identify similarities or patterns in the user vectors and the "apple" vector to find the most relevant matches.
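To make that concrete, here's a toy version using the numbers above (exact search, since three vectors don't need ANN; the group names and values are just the example's):

```python
import numpy as np

# Toy vectors over the dimensions [fruits, computers, religion] from the example above.
apple = np.array([0.99, 0.3, 0.7])
groups = {
    "general population": np.array([0.8, 0.2, 0.1]),
    "geeks":              np.array([0.6, 0.95, 0.05]),
    "religious people":   np.array([0.7, 0.1, 0.95]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank the user groups by cosine similarity to the "apple" vector.
for name, vec in sorted(groups.items(), key=lambda kv: -cosine(apple, kv[1])):
    print(f"{name}: {cosine(apple, vec):.3f}")
```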
I don't know what ANN is, but your comment raises two questions in my mind:
1. Where did your first vector of [0.99, 0.3, 0.7] come from? You later present the concept of user vectors which are vectors for different cohorts of users but don’t name the first vector as a user vector.
2. I feel my example of vectors for the “general population” users and the “bible literalist blogger” user aligns with your “user vector” concept.
One thing others didn't mention is that "document" is a general term, but in some cases (e.g., question answering) the typical document can be a very short paragraph and take much less memory than the vector. Also note that with some ML architectures the vector is very large (e.g., an entire layer's output).
1. You make a great point about longer documents requiring multiple vectors which I should've mentioned in the post. Depending on your use case, this can certainly explode your dataset size!
2. Good to know about the pgvector limitations -- I haven't used it yet.
3. I guess "index" would be the more database-y term. That said, one thing I'll call out is that you have to re-index if you ever change your embedding model, and indexing can be slow. It took me ~20-30 minutes to index the 10 million embeddings in my benchmark.
I'm interested if anyone has some hard data on the "best" size of the document "fragments" that are used for embedding into a dense vector.
Obviously, embedding single words probably isn't particularly useful for reassembling portions of a document for submission to an LLM in the prompt. I'm currently pondering what size of string is best for embedding, and considering that a variable size might be one option.
Testing with strings around 512 characters seems to do pretty well, but it may be that storing multiple lengths of similar runs of the document is a better way to do it.
Yeah, depending on the model, calculating the 10 million embeddings could take longer sequentially, but, as you mention, it's also an embarrassingly parallel operation. I don't think that indexing can be performed in parallel, but I may be wrong on that one.
I think you can, and it has some benefits. One interesting thing that can help is to store representations from transformations over the document and then "fuse" the vectors (i.e. average them) at indexing time. You are effectively able to do run-time augmentation, but without any extra inference overhead at query time and without increased memory. An easy way to think of this: for similarity measures that are linear (i.e. the dot product), you are now scoring the document over a weighted sum of the transformations of the document. Test-time augmentation is a very well known method in ML generally for improving performance and is applicable here. You can do the same for queries as well - akin to query expansion.
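A rough sketch of that fusion step; the `embed` function and the particular transformations below are stand-ins for whatever encoder and augmentations you actually use:

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 384) -> np.ndarray:
    # Stand-in encoder (seeded random projection) so the sketch runs end to end;
    # in practice this would be your real model.
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def fused_document_vector(doc_text: str) -> np.ndarray:
    # A few cheap "transformations" of the document, applied at indexing time only.
    views = [
        doc_text,                          # the raw text
        doc_text.lower(),                  # a normalized view
        " ".join(doc_text.split()[:128]),  # a truncated "lead" view
    ]
    vecs = np.stack([embed(v) for v in views])
    fused = vecs.mean(axis=0)              # fuse by averaging
    return fused / np.linalg.norm(fused)

# Because the dot product is linear, scoring a query against the fused vector equals
# the (scaled) sum of its scores against each view: augmentation with no extra
# inference or memory cost at query time.
```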
I'm a hobbyist in the field, but I concur and fell into that trap.
I wanted to index my photos with some CLIP model, and my complexity-minded brain said I wanted a sub-linear database, because that's what databases are for.
I spent a bit of time with Spotify's Annoy for that. Then the first results were meh, and I was like "oh frack, how do I know whether the issue is in the approximate NN or the CLIP model?".
Then I realized I was worried about linear complexity for a 5MB database, laughed, rewrote into a dumb for loop in a jiffy, and could conclude within the hour.
I think it is really underrated how fast a plain numpy approach is for low-key workloads like personal images and videos, where you would generally have a few thousand embeddings. It is also possible to hide some latency by sharding and streaming results as soon as a shard is done. We work on a similar project[0] to make it easy to index personal images and videos. Before using numpy, we also looked at libraries like Annoy to index the embeddings, but being approximate, such libraries would sometimes leave out the most-similar image/frame. For personal data, we found it is better to depend on exact search rather than an approximate one, even if it comes with a speed tradeoff.
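For reference, once the embeddings are precomputed, exact search really is just a few lines of numpy (a sketch; it assumes unit-normalized embeddings, and the names are made up):

```python
import numpy as np

def exact_top_k(query: np.ndarray, embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    """query: (d,) and embeddings: (N, d), both unit-normalized (e.g. CLIP outputs)."""
    scores = embeddings @ query              # cosine similarity via dot product
    top = np.argpartition(-scores, k)[:k]    # the k best indices, unordered
    return top[np.argsort(-scores[top])]     # ordered best-first

# Exact, so it never drops the closest match the way an approximate index can;
# for a few thousand personal photos this is effectively instant.
```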
I've been playing with Hachi this afternoon, as it's a more developed version of something I've been hacking around with on and off for a while, and there's little point in duplicating the effort. Thanks for the release - first impressions are very positive!
There are a couple of things I'd like to discuss/suggest, and also FWIW I get "Face Recognition not available" when I try that feature.
Should I use Github, or is there a better way to contact you?
Hey, thanks for checking it out.
We also started working on it for personal reasons. Any suggestion or feedback is welcome, you can contact me directly at anubhav@ramanlabs.in .
Some features like face recognition and improved video search are premium features for now, but we're working on letting users preview those features through a demo.
For smaller documents (but also entire books) you can even perform everything in your browser with JS without any DB or backend using transformers.js. It's surprisingly fast.
For context injection, don’t we just want the best tool for the job to find relevant snippets from the source material?
This isn’t exactly a new problem, just about any content management system will have this capability already. If a vector database is better for search for context injection, shouldn’t we start seeing cms products adopting it for their native search features?
It's possible that vector databases may end up like "feature stores": never really caught on in industry, and didn't do anything that existing databases couldn't do. Ultimately you still have the storage layer and the query engine, probably with an analytical optimization layer in between to speed things up.
Thoughts on Vespa.ai? It's been around for several years, is maintained by Yahoo, and is a hybrid search engine (vector + metadata search). I don't see it mentioned nearly as much as Pinecone, Weaviate, Milvus, etc., and I'm not entirely sure why.
Disclaimer, I'm a developer working on the Vespa.ai project. One reason is that we simply don't have DevRel teams or marketing teams, but we still have decent interest, from large companies like Spotify using Vespa in production for semantic search (at scale) https://engineering.atspotify.com/2022/03/introducing-natura...
On topic: IMHO a better title of the blog post would be:
Do you actually need to enable approximate vector search?
Which is an excellent question, because introducing approximation also introduces accuracy degradation, and you need an ANN algorithm, which has its own tradeoffs.
In Vespa, you can choose between exact and approximate, with the related tradeoffs of recall (accuracy), search speed, indexing throughput and resource footprint. Just start easy with exact search; then it's only a change to the schema definition of the tensor field to include `index`, and Vespa will build the necessary data structures for enabling approximation.
I quickly read this article. Two things stood out: the big-O complexity discussion and the fact that embeddings get really complicated for large datasets like 100M Spotify songs.
I'm more curious about embeddings for small datasets. For example, if you have a very small set of information, say the transcript from a meeting, can you then run a very context specific set of generated embeddings against it and generally not worry about the performance issues? It seems like these things are important only when doing massive scale embeddings and massive scale of data.
Am I wrong about the implications of all this? Is the takeaway that Postgres is probably fine for all usage like that, rather than a specialist database?
I think the real question is whether you need a hosted solution like Pinecone, etc, or whether you can run a library yourself (like FAISS). The latter won't cost as much, and is pretty simple to get up and running, but may not scale the same way.
There are some other options in between as well. FAISS is a library, so it's not well suited for production usage unless a single machine is enough. The variety is wider than SaaS vs. library. Tools such as Qdrant or Weaviate are open source.
And Qdrant can be launched without spinning up a server, as long as you use the Python client. So you can actually start locally with in-memory mode, run your dev environment on-premise, and then prod on Qdrant Cloud, if you prefer.
This flexibility is a big draw. I can experiment with in memory, launch with cloud, and move to my own infra if I’m lucky enough to need that kind of scale.
So from what I've seen, Pinecone can't handle more than 700 dimensions, but OpenAI's embedding model has around 1500 dimensions. Are people expected to perform dimensionality reduction to use popular vector databases?
Hi, I'm from Pinecone. We support up to 20,000-dimensional vectors.[1] A meaningful percentage of our customers use OpenAI models. If you got that 700 number from our site or docs I'd love to know where so we could correct it.
Each pod has a limit to how much data it can store. We want to give an example of how many vector embeddings you can store in a single pod. Since the data size is affected by the number of dimensions, we have to choose a sample dimensionality for the example. We chose 768 since that was the most common last year (SBERT), although it might make sense to change it to 1536 (OpenAI).
I think it’s confusing because as an end user I don’t yet know what a pod is or why they’re needed. I’m guessing this is related to space partitioning?
It would probably be helpful to link to some docs on pods (are pods a leaky abstraction?) and first list the max size in number of floats or bytes and then say how many embeddings that would be with different models.
I have experimented with both pgvector (through supabase) and elasticsearch.
Both are great from a dev UX perspective and allow you to get the ball rolling quickly. You still have to generate the vectors yourself, so your LLM+VDB is only as good as your embedding algorithm. (I tried both OpenAI embeddings and using a random BERT-based model from huggingface).
I would say if you already use either PG or ES in your stack, then, as the article says, they're very decent solutions. Better than adding a new piece of infra to your stack.
It really boils down to which stack you are familiar with / which stack integrates better with the rest of your infrastructure / use case.
If you're already ingesting a lot of data into ES and you want to vectorize it, it has good support for cosine similarity in its indexes.
If you store data in a PG database and you want to make it searchable by similarity, then pgvector is a good choice. It's especially powerful coupled with the ease of use of the supabase platform. You can make a document-based chatbot very very quickly.
In both cases it is more of a datatype and a lot of your logic will still reside in your application layer.
In my case I was already ingesting data into Elastic, so I just added a dense_vector property to my index, and a vectorization step in my external code by calling the OpenAI API and saving the result into the dense_vector field.
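Roughly, the setup looks something like this (a simplified sketch: index/field names are made up, and the OpenAI/Elasticsearch client calls are the ones current at the time of writing, so adjust for your versions):

```python
# Sketch: Elasticsearch dense_vector field + OpenAI embeddings (illustrative names).
import openai
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# dense_vector field alongside the normal document fields
es.indices.create(index="docs", mappings={"properties": {
    "text": {"type": "text"},
    "embedding": {"type": "dense_vector", "dims": 1536},  # 1536 = OpenAI ada-002
}})

def embed(text: str) -> list:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return resp["data"][0]["embedding"]

chunk = "some chunk of text from the corpus"
es.index(index="docs", document={"text": chunk, "embedding": embed(chunk)})

# Query by cosine similarity using a script_score query
results = es.search(index="docs", query={
    "script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.qv, 'embedding') + 1.0",
            "params": {"qv": embed("user question")},
        },
    },
})
```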
In the future, I'm planning to build an AI powered webapp and my stack of choice will be supabase + pgvector because it's a better option as a public app backend.
> It's especially powerful coupled with the ease of use of the supabase platform
Just as you said, I followed their Clippy tutorial (which utilizes pgvector and OpenAI embeddings) and was able to spin up a document-based chatbot quickly. In my specific use case, I stored portions of my knowledge base as embeddings in a normal Postgres table. The next step would be to implement Row-Level Security for extra safety (in case I screwed up somewhere). Thankfully, all my data and auth info are integrated with Supabase, so it's straightforward to do.
If you are doing LLM work, of course you do. But an amazing OSS vector DB has yet to arrive. The number of vectors it could handle needs to be in the billions, at least.
If I wanted to do that I'd use PgVector, since I use Postgres for just about everything. There'd need to be a really good reason to go with a specialized DB.
The IVF ANN index implemented in pgvector has very poor performance, with only around 50% recall. Is that a good enough reason, or do you not care about result accuracy in favor of the comfort of using a multitool?
You can do everything with Postgres:
Full-text search, but there are better engines for it: Elastic, Meilisearch, etc. right?
You can also store JSON into Postgres, but you should better use MongoDB for NoSQL purposes, right?
The reason for this is: dedicated tools are always better, faster, and more feature-rich.
> You can also store JSON into Postgres, but you should better use MongoDB for NoSQL purposes, right?
It's common folklore at this point that Postgres is sometimes a better document store than most NoSQL databases (including Mongo); see for example this post, which is also on the front page of HN today: https://news.ycombinator.com/item?id=35544499
> The reason for this is: dedicated tools are always better, faster, and more feature-rich.
Depends. Polyglot persistence has the benefit of letting you use the "best tool for the job" for each job but that benefit can fall apart if you have cross-cutting concerns. If you need to query across different storages you often end up compromising on several of the initial benefits (e.g. performance from passing data between storages or resource use and consistency from duplicating critical data).
For example, you could store your graph data in Neo4J and your document data in MongoDB but good luck doing graph queries that need to access the document data. OR you could use something like Tiger or Arango that's a graph database that can also store data other than pure edges.
“Dedicated tools are always better, faster, and more feature-rich.”
Yes, most people are perfectly OK with just one car.
Specialists need special cars, but the general public is, most of the time, OK with a family car.
pgvector might not be production-ready, but that doesn't mean that, for most situations, Postgres's full-text search, JSON, GIS, graph-walking, queues, etc. aren't good enough, with the advantage of using just one database. There's a whole new category of problems when you let your data live in multiple places. On top of that, when you go all-in on Pg, with things like PostgREST, you can sometimes end up with a very minimalistic backend.
It is very common for production data to live in multiple places; one for transactions, one for analytics, one for serving, one for ETL. The company I worked at rolled its own similarity search engine before these vector databases were a thing.
> The ANN index IVF implemented in pgvector has very poor performance, with only around 50% recall.
My understanding is that this is mostly due to the default settings that pgvector uses (nprobes = 3) and not due to the use of IVF itself. The recall would improve significantly with better defaults. This would of course also increase the latency of vector searches, but that is the trade-off of using IVF instead of HNSW (worse latency at high recall, but much lower storage/memory costs).
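For anyone trying this: the knob is a session setting, so you can trade latency for recall per query. A hedged sketch (table/column names and the probes value are illustrative, and defaults may differ across pgvector versions):

```python
# Assumes a table items(id bigint, embedding vector(1536)) with an IVFFlat index, e.g.:
#   CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
import psycopg2

query_embedding = "[0.01, 0.02, 0.03]"  # stand-in for a real embedding, serialized as text

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
with conn, conn.cursor() as cur:
    # More probes => more IVF lists scanned per query => better recall, higher latency.
    cur.execute("SET ivfflat.probes = 10;")
    cur.execute(
        "SELECT id FROM items ORDER BY embedding <=> %s::vector LIMIT 10;",
        (query_embedding,),
    )
    top_ids = [row[0] for row in cur.fetchall()]
```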
It could be that pgvector isn't good enough for serious use. It would be great to see some benchmarks. The one you cite sounds like enough reason to try something more specialized.
Having many point solutions is problematic from a cost and complexity standpoint. General purpose solutions that would work for 80% of your use cases would be better long term. Having said that, I don’t think Postgres can be used as a general purpose solution to cover vector search, full text search and NoSQL use cases. The best general purpose solution would expose a unified API but under the hood use different storage engines to support these diverse vector search and full text search use cases.
Tools like ChatGPT have a limit on how many tokens can fit into the context. So, let's say you want to converse with it about a specific book – and you want to minimize how much it hallucinates, or 'makes up' stuff about the book. You can't fit the entire book into the chat context. But, if you embed, say, each paragraph of the book as a vector then you can do the following:
+ Take the user message and embed it
+ Do cosine similarity between that message and all of the vectorized passages of the book to pull up the most relevant passages to the user message/question
+ Put just those passages into the context
+ Get a response from ChatGPT using that specific context.
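A bare-bones sketch of that loop (OpenAI client calls as they are at the time of writing; the prompt wording, paragraph list, and variable names are just placeholders):

```python
import numpy as np
import openai

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

# One-time: embed each paragraph of the book (placeholder list).
paragraphs = ["paragraph 1 ...", "paragraph 2 ...", "paragraph 3 ..."]
para_vecs = embed(paragraphs)
para_vecs /= np.linalg.norm(para_vecs, axis=1, keepdims=True)

def answer(question: str, k: int = 3) -> str:
    q = embed([question])[0]
    q /= np.linalg.norm(q)
    top = np.argsort(-(para_vecs @ q))[:k]          # most relevant passages by cosine similarity
    context = "\n\n".join(paragraphs[i] for i in top)
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer using only this context:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return resp["choices"][0]["message"]["content"]
```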
If you happened to see the "Generative Agents" paper that made a splash this week, this is the technique they used as part of the long-term memory retrieval and 'reflection' loop for their 'sims'.
The use case is creating embeddings for paragraphs from documents. Then, when the user issues a query, you create an embedding for the query (you can also do more complicated stuff), you find the most similar embeddings from the documents, and you insert those as context into ChatGPT so that it can rephrase the documents to answer the user's question.
This is fascinating, thanks for explaining this use case. It sounds like it's a fancy way to get a generic LLM to understand your domain by including the relevant context in a prompt.
What about specializing the LLM in the first place? Could you fine-tune the model by training it on those internal documents instead? Then you don't need to mess about with the prompt, right, since the LLM has "learned" the data?
With fine-tuning you are training on a couple hundred thousand tokens, compared to the billions of tokens the original dataset the LLM was trained on. It won't have as much of an effect. Maybe the original dataset includes a company policy from 5 years ago and now you fine-tune on the new policy. It's hard to guarantee what the model will spit out.
By giving the in-context prompt, it's more likely the model will choose the information you just gave it.
Of course, there are use cases where fine-tuning is worth it. For example, someone recently finetuned a model on their Messenger group chat that was going on for years, about 500k messages, and now they have a model that can imitate all the members of the chat.
In-context learning (i.e., conditioning the LLM by adding all relevant info to the prompt) seems to be preferred over fine-tuning. I'm not sure exactly what the reason is; perhaps because fine-tuning often doesn't make sense in a dynamic setting, since it's fairly expensive? On the other hand, the only reason the entire vector DB pipeline exists is that context size is relatively limited.
I'd love to see a comparison between the two in terms of accuracy of the outputs and the degree of hallucinations.
Most LLMs are, at a very high level, very similar to Word2Vec in terms of inference on the encoder/non-generative side. Both convert a word/token into a vector representation. GPT does it contextually with all the other words in the input, while Word2Vec does it independently for each word. One difference is that an LLM can also create a "summary" embedding of all the words that's more than just a mean/max of the individual word embeddings.
So you can, for example, use an LLM to convert a document into a vector and then also use it to convert a search query into a vector. Then you can find the closest document vectors to the search vector.
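As a concrete (if simplified) illustration with an off-the-shelf bi-encoder - the model name and toy documents here are just examples, not a recommendation:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence/document encoder works

docs = ["the cat sat on the mat",
        "quarterly revenue grew 12%",
        "how to bake sourdough bread"]
doc_vecs = model.encode(docs, normalize_embeddings=True)    # one vector per document

query_vec = model.encode("financial results", normalize_embeddings=True)
scores = util.cos_sim(query_vec, doc_vecs)                  # 1 x len(docs) similarities
best = int(scores.argmax())
print(docs[best], float(scores[0][best]))                   # closest document to the query
```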
Current LLMs can only handle so much text per generation, are generally slow, and you often pay per token, so there's an incentive to use them sparingly.
If you have a large text corpus to search, that means restricting the search field with a query first, then feeding the result to the LLM for extracting facts.
So everything hinges on having an excellent search index to reduce the search space, and the best tool we have as of today is running a semantic search over representations of a text's topics in embedded format.
The typical use case is getting around prompt size limits for information retrieval tasks over large corpora. If you want to ask a question of a large corpus, you first use nearest-neighbor search in embedding space to retrieve snippets likely to contain the answer, and then stuff those snippets into the prompt.
For example, you can look up semantically relevant ( to a user query ) paragraphs from some internal document. Then, include them in the LLM context so it would know how to answer the query. Basically, that's the idea behind many ChatGPT plugins.
This is key. Sure, you do not need a specialized DB to iterate over <10M vectors at a time; as the results show, this is the order at which the latency (for this implementation) starts getting too big for a real-time system: O(100ms). You can get clever with sharding and such to scale out - calculating cosine similarity and finding the top K is classic MapReduce - but that's just spreading the compute over more computers, i.e. for >100M 256-dimensional vectors like the author chose, you'd be at several seconds of CPU time for each similarity search.
I don't think the vector database companies are targeting, at least with the expectation of making money now (maybe later if some grow a lot), people with that few embeddings. They are targeting use cases with WAY more vectors - think about how many vectors ChatGPT must be generating for however many conversations it's handling just today, and, if the technology continues to improve and use cases grow, how many other businesses may be generating even more embeddings. In the same way that it doesn't make sense to use Spark on a 10,000-cell CSV, or to buy a tank to haul chips out of a grocery store, it doesn't make sense to use a dedicated database under a certain scale.
> Additionally, you get to save yourself the complication of standing up a vector database and waiting the ~100 seconds to index those million embeddings.
Isn't that a one time cost with the subsequent upserts amortized? Or they could even be done outside the hotpath in batches: if the db is just computing similarities to stored vectors, you could add a ton of new vectors to a new db instance/replica and then swap them.
I'm sorry if I'm being overly critical, but to me this seems like exactly what happens any time someone releases a product or technology that improves performance for really big workloads (so, almost every big data system or task-specialized db), and everybody says the technology is stupid because they don't need it at their startup with 1% of the data the technology begins to be useful for. Sure, but others have 100x your data and would benefit from it.
Let me just point out one example of why 10M vectors is actually quite small for some use cases: English Wikipedia has 6M articles, and it would take much more than one 256-dim vector to accurately encode some of those pages. Even with a single such vector per article you're at 0.3 cpu-s for each nearest-K calculation on an arbitrary page. And that's one admittedly large portion of an admittedly large website, but still a fraction of the data size you'd get from crawling the web, looking at Discord/FB/Instagram/Snapchat messages, Reddit comments, etc.
Hi, author here. I totally agree with you that, for large scale, you're going to need a vector database. My hope is more to help people avoid scenarios like the one in this comment: https://news.ycombinator.com/item?id=35552303 Tangentially, I really like the approach that haystack has taken, where they allow you to slot in whichever document store you want, and that document store can scale from in-memory, to sqlite, to postgres, to pinecone https://docs.haystack.deepset.ai/docs/document_store
In terms of the one-time cost of indexing, you're totally right! Although, one thing to call out is that you will have to re-index every time you change your embedding model, such as for fine-tuning. I don't have a good handle on how prevalent this is, though.
I fully agree with you - but also with the article.
If I were to index my personal knowledge base, where I have on the order of 1000 documents, I'd need at most 100k embeddings (generously assuming 100 per document, but that's an overestimate).
I'm literally talking to a company at the moment who's interested in a chatbot to talk to their internal knowledge base. They have... 100 documents, around 40 pages long. Again, this will easily fit within 1 million embeddings.
And using vector databases comes with tradeoffs: you will get worse recall (aka you will sometimes miss the most relevant document).