Can anyone please suggest a good stack for the following:
- calculating text embeddings using open-source/local methods (not OpenAI)
- storing them in a vector database. I'm confused by the myriad of options like Chromadb, Pinecone, etc.
- running vector similarity search using open-source/local methods.
Also, how granular should the text chunks be? Too short and we'll end up with a huge database, too long and we'll probably miss some relevant information in some chunks.
Has anyone been able to achieve reliable results from these? Preferably w/o using Langchain.
The `ankane/pgvector` Docker image is a drop-in replacement for the postgres image, so you can fire this up with Docker very quickly.
It's a normal postgres db with a vector datatype. It can index the vectors and allows efficient retrieval. Both AWS RDS and Google Cloud now support this in their managed Postgres offerings, so postgres+pgvector is a viable managed production vectordb solution.
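For a rough idea of what that looks like in practice, here's a minimal sketch; the table and column names are made up, and it assumes psycopg2 plus a 384-dim embedding model:

```python
# Minimal pgvector sketch (assumes the ankane/pgvector image is running and psycopg2 is installed).
import psycopg2

def to_vec(xs):
    # pgvector accepts vectors as '[x1,x2,...]' text literals
    return "[" + ",".join(str(x) for x in xs) + "]"

conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5432/postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("CREATE TABLE IF NOT EXISTS chunks (id bigserial PRIMARY KEY, body text, embedding vector(384));")

embedding = [0.1] * 384  # would come from your embedding model
cur.execute("INSERT INTO chunks (body, embedding) VALUES (%s, %s::vector)",
            ("some chunk of text", to_vec(embedding)))
conn.commit()

# <=> is pgvector's cosine distance operator, <-> is L2; an ivfflat/hnsw index makes this fast at scale.
cur.execute("SELECT body FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5", (to_vec(embedding),))
print(cur.fetchall())
```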
> Also, how granular should the text chunks be?
That depends on the use case, the size of your corpus, the context window of the model you are using, and how much money you are willing to spend.
> Has anyone been able to achieve reliable results from these? Preferably w/o using Langchain.
I think I’d need to fine-tune the model to see better results with some domain-specific terms, but I couldn’t find much information about how to actually do that - what sort of input data you need, how much of it, etc.
Would be interested to hear if anyone had more to share about fine-tuning these models for semantic search.
It's open source, and the way it works is that you give an instruction describing the type of task or even the domain you want the embedding tailored to, and the embeddings change depending on that instruction.
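A minimal sketch of how that looks with the InstructorEmbedding package and the instructor-large checkpoint; the instruction strings here are only illustrative:

```python
# Instruction-conditioned embeddings with the instructor models (InstructorEmbedding package).
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-large")

# Same style of text, different instructions -> embeddings tailored to the task/domain.
doc_emb = model.encode([["Represent the Medicine document for retrieval:",
                         "Metformin lowers blood glucose by reducing hepatic glucose production."]])
query_emb = model.encode([["Represent the Medicine question for retrieving supporting documents:",
                           "What does metformin do?"]])
```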
It has to do with the architecture of the network used to create the embeddings. The embedding is actually the output of the final layer of the model, so its dimensionality is determined by the number of units in that layer.
Different models/architectures will produce different dimension embeddings.
Calculating the embeddings is probably going to be an application-specific thing. Either your application has reasonable pre-trained encoders or you train one off a mountain of matching pairs of data.
Once you have the embeddings in some space, for PoC I’ve mostly seen people shove them into faiss, which handles most of the rest very well for small/medium datasets:
https://github.com/facebookresearch/faiss
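Something like this is usually all you need for a PoC; this is a sketch assuming 384-dim embeddings and exact search (swap in an IVF/HNSW index for bigger datasets):

```python
# Tiny FAISS sketch: exact inner-product search over L2-normalised vectors (= cosine similarity).
import faiss
import numpy as np

d = 384                                            # embedding dimension (depends on your model)
xb = np.random.rand(10_000, d).astype("float32")   # stand-in for your document embeddings
xq = np.random.rand(5, d).astype("float32")        # stand-in for your query embeddings

faiss.normalize_L2(xb)
faiss.normalize_L2(xq)

index = faiss.IndexFlatIP(d)                       # brute-force index; fine for small/medium datasets
index.add(xb)
scores, ids = index.search(xq, 10)                 # top-10 neighbours per query
```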
You don’t need to train anything if you just need embeddings. The data is text. You apply the pretrained model to your text and it returns the embedding. You save it in a vector database if you’re fancy, or a big numpy array if you’re like me. Then run your similarity search (cosine, Euclidean, etc).
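The "big numpy array" version of that workflow, as a sketch; sentence-transformers with the all-MiniLM-L6-v2 model stands in for whatever pretrained model you pick:

```python
# Encode with a pretrained model, keep everything in a numpy array, brute-force cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "the cat sat on the mat",
    "postgres supports a vector type via pgvector",
    "faiss does similarity search",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)     # shape: (n_docs, dim)
query_emb = model.encode(["open source vector search"], normalize_embeddings=True)

scores = query_emb @ corpus_emb.T        # cosine similarity, since everything is unit-norm
best = np.argsort(-scores[0])[:2]
print([corpus[i] for i in best])
```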
A lot of embedding models have poor performance on domain-specific data, which is only mitigated with fine-tuning. Alternatively, the Instructor series mitigates this by fine-tuning the model on instructions and letting you give specific instructions for targeted domains.
You might want to give Haystack a try (disclaimer: I work at deepset, the company behind Haystack).
Haystack allows you to pre-process your documents into smaller chunks, calculate embeddings and index them into a document store. You can wrap all of that in a modular pipeline if you want.
Next, you can query your documents using a retrieval pipeline.
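As a hedged sketch of that flow (Haystack 1.x-style imports, which have moved around between versions, and an example embedding model):

```python
# Haystack sketch: write documents to an in-memory store, embed them, then retrieve by query.
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever

document_store = InMemoryDocumentStore(embedding_dim=384)
document_store.write_documents([
    {"content": "pgvector adds a vector column type to Postgres."},
    {"content": "FAISS is a library for similarity search."},
])

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-MiniLM-L6-cos-v1",  # example model
)
document_store.update_embeddings(retriever)

print(retriever.retrieve(query="how do I store embeddings in postgres?", top_k=2))
```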
Regarding document store selection: Replacing your document store is easy, so I would start with the simplest one, probably an InMemoryDocumentStore. When you want to move from experimentation to production, you'll want to tailor your selection to your use case. Here are a few things that I've observed.
- You don't want to manage anything and are fine with SaaS -> Pinecone
- You have a very large dataset (500M+ vectors) and want something you can run locally -> maybe Qdrant
- You have metadata that you want to incorporate into retrieval, or you want to do hybrid search -> OpenSearch/Elasticsearch
Regarding model selection:
We've seen https://huggingface.co/sentence-transformers/multi-qa-distil... work well for a good semantic search baseline with fast indexing times. If you feel like the performance is lacking, you could look at the E5 models. What also works fairly well for us is a multi-step retrieval process where we retrieve ~100 documents with BM25 first and then use a cross-encoder to rank these by semantic relevance. Very fast indexing times are a benefit and you also don't need a beefy vector db to store your documents. Latency at query time will be slightly higher though and you might need a GPU machine to run your query pipeline.
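Roughly what that retrieve-then-rerank setup looks like as a sketch; rank_bm25 stands in for whatever BM25 backend you use, and the cross-encoder model name is just one example:

```python
# Two-step retrieval: cheap lexical BM25 candidates, re-ranked by a sentence-transformers cross-encoder.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

docs = [
    "pgvector adds a vector type to postgres",
    "faiss is a similarity search library",
    "bm25 is a lexical ranking function",
]
bm25 = BM25Okapi([d.split() for d in docs])

query = "library for nearest neighbour search"
candidates = bm25.get_top_n(query.split(), docs, n=100)          # step 1: lexical retrieval

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model; fastest on GPU
scores = reranker.predict([(query, d) for d in candidates])      # step 2: semantic re-ranking
ranked = [d for _, d in sorted(zip(scores, candidates), reverse=True)]
print(ranked)
```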
This is what we use: BERT sentence transformers to generate the embeddings (we used Universal Sentence Encoder before that and it was good too), and ElasticSearch for storage, which has a dense vector data type. It also has a cosineSimilarity function to run searches.
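For anyone curious, this is roughly what that looks like via the Python client: a dense_vector field plus a script_score query using cosineSimilarity. The index and field names are made up and the exact syntax varies a bit between Elasticsearch versions:

```python
# Elasticsearch dense_vector + cosineSimilarity sketch (8.x-style client calls).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="docs",
    mappings={"properties": {
        "body": {"type": "text"},
        "embedding": {"type": "dense_vector", "dims": 384},
    }},
)

query_vector = [0.1] * 384  # produced by the same sentence-transformers model as the documents
resp = es.search(
    index="docs",
    query={"script_score": {
        "query": {"match_all": {}},
        "script": {
            "source": "cosineSimilarity(params.qv, 'embedding') + 1.0",  # +1 keeps scores non-negative
            "params": {"qv": query_vector},
        },
    }},
)
```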
I would not use Elastic for vector search due to its architectural limitations and poor performance when conducting vector search. https://zilliz.com/benchmark
I should’ve said that we were already using it for actual search where embeddings similarity is just one component of the overall score. For pure vector stuff a dedicated solution will be faster.
Hi, if you're looking into vector storage and querying, there are several things to consider. For lightweight usage, you can directly use FAISS without any database overhead. For heavy usage, Milvus/Zilliz is the most production-ready solution.
Also, here's a benchmark that allows you to easily test their performance differences through a user-friendly interface. This includes both cloud solutions and open-source options. If you prefer to view pre-tested results, there are standard ones available as well. Check it out here: VectorDBBench. https://github.com/zilliztech/VectorDBBench
Depending on your use case (particularly if it is research-oriented), "scipy.spatial.distance.cdist" and "scipy.spatial.distance.pdist" are your friends. If you are doing something in production, the PG extension seems like a good bet.
One way to potentially answer your question about text-chunk-granularity is to take a random sample of 500 pieces of chunked text and look at several "most similar pairs." Do this for a few different chunk-lengths and you'll see how much information is lost...
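A quick sketch of that sampling idea, with random vectors standing in for the embeddings of 500 sampled chunks:

```python
# Embed a random sample of chunks and inspect the most similar pair; repeat for a few chunk lengths.
import numpy as np
from scipy.spatial.distance import pdist, squareform

sample_emb = np.random.rand(500, 384)          # stand-in for embeddings of 500 sampled chunks

dist = squareform(pdist(sample_emb, metric="cosine"))
np.fill_diagonal(dist, np.inf)                 # ignore self-matches

i, j = np.unravel_index(np.argmin(dist), dist.shape)
print(f"most similar pair: chunk {i} and chunk {j}, cosine distance {dist[i, j]:.3f}")
```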
And when you query, you generate embeddings for your query and run a kNN vector similarity search.
It uses some embeddings I generated with OpenAI. You could use something like easybert or one of the many OSS embedding models instead. Basically you need some code that converts your text/images/whatever into lists of numbers using such a model.
So:
1) use some magical tool that given a thing returns embeddings. You use this to extract embeddings at index time from your content and at query time for your queries.
2) put your embeddings along with your things in an Elasticsearch index (or a vector db of your choice; Opensearch works similarly to Elasticsearch for this)
3) when querying, create embeddings for your queries and find the nearest match (a rough sketch of steps 2 and 3 follows below).
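Sketch of steps 2 and 3 using Elasticsearch's approximate kNN search (8.x syntax; OpenSearch's knn query is similar but not identical, and the index/field names here are made up):

```python
# Index a dense_vector field with an ANN index, then query it with a kNN search.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="things",
    mappings={"properties": {
        "title": {"type": "text"},
        "embedding": {"type": "dense_vector", "dims": 384, "index": True, "similarity": "cosine"},
    }},
)

# Embed the query with the same model you used at index time, then ask for the nearest matches.
resp = es.search(
    index="things",
    knn={"field": "embedding", "query_vector": [0.1] * 384, "k": 10, "num_candidates": 100},
)
```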
I built this tutorial as a quick PoC to figure out how easy it is with my own library. I'm not an expert. Mission accomplished, and it only took me a few hours. The results are not impressive, as this model is probably not very appropriate for the demo content, but it vaguely works. There are a bunch of people smarter than me who suggest that most OSS models struggle to outperform BM25, which just does simple text searches.
Btw. the embeddings are the hard part. The rest is just plumbing. And of course world + dog just glosses over that. There's an interesting article that I came across recently that goes a bit more in depth on this: https://blog.metarank.ai/from-zero-to-semantic-search-embedd...
If you're just starting out, I'd use sentence-transformers for calculating embeddings. You'll want a bi-encoder model since they produce embeddings. As the author of the blog, I'm partial towards Milvus (https://github.com/milvus-io/milvus) due to its enterprise features and scalability, but FAISS is a great option too if you're just looking for something more local and contained.
Milvus will perform vector similarity search for you - all you need to do is give it a query vector.
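A hedged sketch based on the Milvus quickstart (pymilvus's MilvusClient with the local, file-backed Milvus Lite mode); exact parameters may differ between versions, so treat this as illustrative:

```python
# Create a collection, insert one vector, and search it with a query vector.
from pymilvus import MilvusClient

client = MilvusClient("demo.db")                       # local, file-backed Milvus Lite
client.create_collection(collection_name="docs", dimension=384)

client.insert(
    collection_name="docs",
    data=[{"id": 0, "vector": [0.1] * 384, "text": "some chunk of text"}],
)

# All you need to give it is a query vector (from the same bi-encoder as the documents).
hits = client.search(collection_name="docs", data=[[0.1] * 384], limit=3, output_fields=["text"])
print(hits)
```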
You could use Marqo; it's a vector search engine that includes text chunking, inference for calculating embeddings, vector storage, and vector search. You can pick from a heap of open-source models or bring your own fine-tuned ones. It all runs locally in Docker: https://github.com/marqo-ai/marqo
This is not always good advice. Many people are required not to use off-premises models due to data ownership issues.
I would therefore suggest a better default for this, such as BERT+Qdrant.
It would be wonderful if there were a simpler (single file, SQLite or DuckDB like) database for vectors than the complex (and in some cases, unfortunately cloud-based) ones available now.
> It would be wonderful if there were a simpler (single file, SQLite or DuckDB like) database for vectors than the complex (and in some cases, unfortunately cloud-based) ones available now.
Admittedly, I don't know much about chroma, but it seems similar to Qdrant to me. Perhaps I'm missing something.
It doesn't appear to store everything in a single DB file, but rather a plethora of files in some directory. It does appear to run locally, though, which is a huge plus.
70% of the way through this article they drop "Since we have unit norm vectors ..." and later "Always remember to normalize your embeddings." I found this strange and surprising.
It seems to me that in a semantic embedding, "big", "huge", "enormous", and "gargantuan" should roughly point in the same direction but have different magnitudes. For instance, I might assume that the nearest neighbor to "big * 10" is "huge" or "enormous". But if embeddings must be normalized, I either can't tell the difference between these four terms, or they must point in different directions. I can't even talk about "big * 10" at all. I mean, perhaps there are "emphasis" dimensions roughly corresponding to words like "very" and "extremely", but it still seems like "big" and "enormous" would be hard to distinguish without scalars. It seems to me that the only significant difference between "big" and "huge" is in fact their magnitude, and forcing "huge" to point away from "big" so that it can be non-zero in the "very" dimension must cause it to be less "big-like" in some other small way, right?
I'm surprised that throwing away all the power of scalars is worth it. I'm not a professional in the field, so maybe there's good reason for this that I just haven't read. Can any professional comment on why vectors of magnitude != 1 are not helpful in embeddings?
* The problem with large vectors is that they have large dot products with every other vector, which would imply that they are more similar to everything which doesn't make sense.
* Adding the requirement that "length==1" doesn't matter much in high-dimensional spaces, since that only removes one degree of freedom. Don't try to use too much 3D intuition here.
* It might be intuitive to think that "large" should have implications for the size of the vectors, but that really only applies to a couple of examples. We want vectors to represent thousands of unrelated concepts, so this one case is really not that relevant or important.
* In reality what ends up happening is partially the "very" dimension you're suggesting, but also just a "largeness" dimension. Individual dimensions can still have a scale!
Very good points here, especially about the fact that the single "length" degree of freedom is much less to lose in very high-dimensional spaces. However, I don't agree that large vectors would end up being "more similar to everything" -- really what's happening is that the dot product stops being a good measure of similarity, but we already knew that using it that way relied on everything being normalized anyway! L1 and L2 still work just fine.
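A quick numeric illustration of both points: the raw dot product rewards sheer vector length, while the dot product of unit-norm vectors (cosine similarity) does not.

```python
import numpy as np

big = np.array([1.0, 0.2, 0.0])
huge = 10 * big                         # same direction, 10x the magnitude
other = np.array([0.0, 1.0, 0.5])

print(huge @ other)                     # large, purely because 'huge' is long
print(big @ other)

unit = lambda v: v / np.linalg.norm(v)
print(unit(huge) @ unit(other))         # identical to the next line: direction is all that matters
print(unit(big) @ unit(other))
```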
How would I decide on which granularity level I create my vectors? For example, if I create word-level embeddings with word2vec and store one vector per word, I probably will only get good results for keyword-based queries, right?
If I create more fancy vectors, for example on the sentence level, it is not really clear to me how that is supposed to work.
Why is the vector embedding of a query, which typically has the sentence structure of a question, near the vector embedding of a propositional sentence that answers that query? Sure, it will probably not be completely off, just for the fact that query and answer will contain similar words, but how can that be better than just a fuzzy word search?
And finally: should I decide on one particular level (like sentences) and store that, or should I store word2vec, sentence, and paragraph vectors in the same collection?
I would not use word2vec in any application today - I used it in the blog post because it's a well-known model and because the embeddings it generates are static, i.e. one token always maps to the same embedding regardless of context.
If you want to create embeddings at the sentence level, a good place to start is SBERT: https://www.sbert.net/
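A small sketch of how that looks in practice, using one of the multi-qa SBERT models (the model name is just one example) to check that a question lands near the passage that answers it:

```python
# SBERT question-vs-passage similarity: multi-qa models are trained so questions land near answers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

question = "How do I store embeddings in Postgres?"
passages = [
    "pgvector adds a vector column type and nearest-neighbour operators to Postgres.",
    "The Eiffel Tower is 330 metres tall.",
]

q_emb = model.encode(question)
p_emb = model.encode(passages)
print(util.cos_sim(q_emb, p_emb))   # the pgvector passage should score clearly higher
```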
> I would not use word2vec in any application today
SBERT needs to be fine-tuned to get anywhere near good results if the training objective deviates from the original one. word2vec can be estimated from collections of tokens in an unsupervised fashion, so it definitely has its place even today.
I understood that it was just an example and a good choice for an introductory text. My question is more about how to go on from there. For example, is SBERT a good choice if most of my queries are in fact multi-sentence paragraphs?
I guess what still doesn't add up in my mental model is how people seem to assume that the query embedding vector must somehow automatically be near good answer embedding vectors. I would have assumed that some conditions have to be met to make this true, and I would be interested in advice in that regard, or alternatively an explanation of why my assumption is wrong.
You're correct. There are lots of ways to project a string of token embeddings back onto a single embedding - often, however, these simpler methods just pool the word embeddings by averaging them and hope that the answer embedding has a small angle with that projection (cosine similarity). You could try adding a way to focus on individual tokens, pairwise interactions, and more sophisticated ways of pooling and projecting embeddings, and then boom, you've basically just created transformers.
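A toy version of that mean-pooling idea, with random vectors standing in for per-token embeddings:

```python
# Mean-pool token vectors into one sentence vector, then compare with cosine similarity.
import numpy as np

def mean_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """token_embeddings: (n_tokens, dim) -> (dim,)"""
    return token_embeddings.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_tokens = np.random.rand(6, 384)     # stand-ins for per-token embeddings of the query
answer_tokens = np.random.rand(12, 384)   # stand-ins for per-token embeddings of the answer

print(cosine(mean_pool(query_tokens), mean_pool(answer_tokens)))
```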
Oh boy, this looks like alchemy to me. On one hand, one can't deny the success of LLMs; on the other, we are shifting the responsibilities to non-deterministic, fuzzy, duct-taped functions.
My impression is similar and that's why I asked the question.
Another thing that is not clear to me: is the query directly fed into e.g. SBERT, or should I ask an LLM to transform the query into something more suitable, like turning the question into a proposition?
Asked more abstractly: In a vector space like SBERT's, can I expect questions and answers about the same topic to lie near each other? Especially will the correct answers lie near their question?
> Another thing that is not clear to me: is the query directly fed into e.g. SBERT, or should I ask an LLM to transform the query into something more suitable, like turning the question into a proposition?
This is not how 99% of embedding models work (though you can train for specific tasks), but as it turns out such a thing is possible and is beneficial.
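One concrete flavour of this is the E5 family mentioned elsewhere in the thread, which is trained with asymmetric "query:" / "passage:" prefixes. A hedged sketch, assuming the model loads via sentence-transformers:

```python
# E5-style asymmetric encoding: queries and passages get different prefixes the model was trained on.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-base-v2")  # one example from the E5 family

q = model.encode("query: how do I store embeddings in postgres?", normalize_embeddings=True)
p = model.encode("passage: pgvector adds a vector column type to Postgres.", normalize_embeddings=True)
print(util.cos_sim(q, p))
```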
This kind of thing is going to seem so obvious in hindsight in six months, once all the SOTA methods start to converge on similar improvements, lol. It's still so early, I keep telling myself.
I’m confused by the choice of word2vec for the embeddings, for “simplicity”. Using a transformer based model like Universal Sentence Encoder or SBERT is just as easy and the results will be considerably better.
word2vec is seminal and outputs static embeddings - fairly easy for beginners to understand. The embeddings for each token change relative to context in attention-based MLMs and I figured it might be confusing for an introductory blog.
How do vector databases respond to changes in embeddings? Does one need to reindex all documents in the DB when that happens (i.e. retraining a model, switching to a model with different embeddings etc.)?
Yes, any change in embedding models will make the embeddings incompatible (they may still work somewhat if the model architecture is similar), but unless you have massive amounts of data, the embedding itself generally doesn't take too long: 10k documents/minute running sentence-transformers locally on my M1.
Can someone please answer: Which layer is best used for similarity search? The initial input embedding? Input embedding after positional encoding is added? The output layer? Some hidden layer? It's not clear to me why one would be better than the other.