That's not so much a comparison as it is a collection of bland facts about each solution. Those facts may not even be a good basis for making a choice, and the article doesn't give any guidance on why each of them may be important.
It also loses out on qualitative attributes that distinguish some of them from the others. E.g. Weaviate has a lot better DX (in my opinion) than any of the others, as it handles integration of different vectorizers etc. a lot better, which makes it stand out.
Completely agree. A bunch of "facts" copied from the providers' websites. Funny they have a conclusion section. You could probably write something better with AI.
As an aside, I’m not sure if I’ve just made a low value comment. If someone comes to the comments first, I hope they’re informed of my conclusion and take that into consideration before clicking through. I wonder how dang feels about these sorts of comments.
I would suggest that anyone trying a real comparison of vector DBs consider the following:
- necessary functions / use cases (eg prefiltering, dense search)
- embeddings version management
- anticipated embedding size (the article only considers glove-100 on ANN-benchmarks, which is quite different from openai-ada-002 1536 - both in terms of their output distribution and the vector size)
- required precision / recall
- required ingestion speed
- required ingestion throughput / time to ingest periodic updates
- required query speed (percentiles, not average!)
- required query throughput
- required RBAC, data privacy, active-active, etc
...and so much more. ANN-benchmarks is a good start for thinking about this but remember that actual throughput is quite different from whatever you see in the algorithms benchmarking!
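To make the "percentiles, not average" point concrete, here is a minimal sketch using only the Python standard library. The latency samples are made up for illustration (90 fast queries and 10 slow ones), and `latency_summary` is a hypothetical helper name, not part of any benchmark tool.

```python
import statistics

def latency_summary(latencies_ms):
    """Return the mean and p50/p95/p99 of query latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }

# Hypothetical measurements: 90 fast queries and 10 slow ones.
samples = [10.0] * 90 + [500.0] * 10
summary = latency_summary(samples)
print(summary)
```

With this made-up data the mean lands far below what the slowest 10% of queries actually see, which is why a requirement is usually stated as a percentile.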
Wouldn't "ease of putting into production" also factor in?
For many use cases, being able to put a proof of concept out of the door in hours vs days vs weeks is the top selection criterion if everything else is "good enough".
dev experience is very very important, but I think it's so subjective -- if you are really pushing the boundaries, maybe you want a super powerful platform which can use all the bells and whistles, but if you want to hack a weekend project together, maybe you just want some API calls.
Pure vector databases are a dead end. Almost every search engine (Vespa, Elastic, etc) and every database (Postgres, SQLite, Redis, etc) already has a solution for searching vectors in addition to everything else you need to query or search. If any of these vector databases become anything they will have to also implement either a full search engine or a full database.
MS desperately needs to get on this train with SQL. Maintaining and keeping a second system in sync to do vector search is painful. I've never been more jealous of people using Postgres.
As much as I like pg_vector, I think right now what we need the most is a pre-packaged version of sqlite-vss and a Pythonic wrapper for bootstrapping projects. This would lower barriers to entry even more for those using LLMs solely via APIs, and save people the trouble of setting up a database server or risking getting locked in to yet another prickly SaaS while iterating on a concept.
Scaling can come later, after the solution has proven its worth.
I have prepared this comparison table to help me choose a vector database. I am sharing it here, hoping it may assist you in your projects as well. Main comparison points: cost at scale, compliance, and queries per second (QPS).
+1, I've been using OpenSearch (basically a fork of Elasticsearch 7.10), and have been pretty happy with the setup so far.
OpenSearch specifically has an edge over Elasticsearch because it supports vectors up to 10k dimensions, whereas ES maxes out at indexing 1024 dimensions, which isn't enough to support OpenAI's 1536 dimension vectors.
And then there's the benefit of it being well documented / Q&A'd, and able to support regular searching, faceting, etc. as well.
Also, if you want to do hybrid retrieval with a legacy system in place, Elasticsearch is a good option. I would like to see some comparison of hybrid retrieval as well.
You aren't supposed to index vectors larger than ~128 dimensions. Because of concentration of measure, which is an aspect of the curse of dimensionality, the distances between high-dimensional vectors tend to become identical.
You need to do dimensionality reduction before indexing. Basically it's fine to just pick n first components if you don't want anything fancy.
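The concentration effect described above can be observed with a small experiment. This is only a sketch under a strong assumption: the points are i.i.d. uniform random vectors, which real embeddings are not (they usually have much more structure), so it illustrates the phenomenon without settling whether the ~128-dimension rule of thumb applies to a given model. `relative_contrast` is a hypothetical helper name.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of the distances from one random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# The contrast between nearest and farthest point shrinks as dimension grows.
for dim in (2, 16, 128, 1536):
    print(dim, round(relative_contrast(dim), 3))
```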
I can also add one more data point in favor of Elastic / OpenSearch. They benefit from a long history of providing search-specific features, including the ability to write custom re-ranking functions that combine traditional TF/IDF-style search with modern vector search techniques. And you can easily use OpenSearch with state-of-the-art open embedding models like SGPT that use 2048-dimensional vectors. Plus, it is designed to be highly scalable and distributed.
Given how well OpenSearch works and scales, I would find it hard to justify a specialized vector-specific database unless it brought A LOT of new benefits to the table. And I am not currently aware how any of them would actually do that.
Also, OpenSearch provides all of that out-of-the-box. You just configure a vector field mapping and start inserting your data. No need for an add-on plugin/extension. It just works.
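As an illustration of combining keyword-style and vector-style results, here is a sketch of reciprocal rank fusion (RRF), one common generic way to merge ranked lists. This is not OpenSearch's API or its re-ranking mechanism; the document IDs are hypothetical, and `k=60` is just the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by document ID so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

keyword_hits = ["doc_a", "doc_b", "doc_c"]  # e.g. from a TF/IDF-style query
vector_hits = ["doc_c", "doc_a", "doc_d"]   # e.g. from a nearest-neighbor query
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists float to the top, without needing the two scoring scales to be comparable.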
You should test yourself on your own use case (eg vector dimension, prefiltering, throughput, target latency). In my testing, using identical HNSW configs between OS and a purpose built vector DB, I saw 10x+ better performance with the vector DB, despite much smaller CPU usage for vector DB and even including internet latency for the vector DB (but not OS).
This may not matter if you are not doing high throughput / have tight latency requirements, but in my case, it did. Of course you should weigh that versus the convenience of preexisting ES/OS clusters and so on. You can also use ES/OS together with a separate vector DB. (these tradeoffs are, of course, what make a static benchmarking post like this one so hard to think about).
Follow that link - Elastic has vector features now.
I find vector search more convincing as a feature of an existing database than as justification to design an entirely new database - it's basically a new type of index.
I was looking at Pinecone, but if I'm reading this correctly, several open source vector DBs can pull off the same or better QPS.
I really hope Pinecone doesn't become the de facto vector DB. They're getting all the attention, but they're closed and crazily venture funded. That's going to turn into an Oracle situation fast.
I understand wanting to keep Amazon out of your business, but licences exist that allow that.
Hey, I'm from Pinecone. What we've found in the past two years is people need a lot more from a vector database than raw speed on a single-node index.
In practice, as long as search latency meets requirements (like, say, 100ms p95), the deciding factors tend to be things like cost for a given scale, amount of engineering overhead required or saved, reliability, features that affect search quality such as filtering and hybrid search, and so on.
Everyone has different workloads and different needs. For example, I wouldn't recommend Pinecone to someone who just needs a pure ANN index like Faiss or HNSW on a single machine. Try out a few options and see what works for you... We make Pinecone easy + free to try for exactly this purpose, so you don't have to rely on a barebones "comparison" table from a third party.
I like that this comparison shows pg_vector in a positive light. After playing around with a few of the options at my company (graphite.dev), I’m a big fan of the simplicity of Postgres for everything. I understand there are scaling costs, but being able to treat vectors as simply one more column type is fantastic.
People are just automatically assuming that because we had this big leap in LLMs for chat responses, we would have an equivalent jump in LLMs for embedding based retrieval. And to my knowledge there is no evidence for that.
Quite to the contrary, the recent gzip paper (even if it was badly done) still shows that retrieval is a very different problem and LLMs are much less extraordinary than expected.
In my mind the whole embedding / vector DB craze will come crashing down.
I think a lot of the recent interest in embedding comes from the fact that it's so much more useful now. Now there is a way to usefully process natural language queries, and embedding is the way to retrieve related information in the course of processing the query.
Are you aware that the gzip paper fudged their accuracy numbers by assuming an oracle could correctly pick from the nearest 2 neighbors with 100% accuracy?
In other words, they published top-1 accuracy from top-2 accuracy calculations.
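The gap between the two metrics is easy to see in a toy calculation. This is a sketch with made-up labels, not the paper's code or data; `top_k_accuracy` is a hypothetical helper that counts an example as correct if its true label appears among the first k neighbor labels, which is what an oracle picking from the nearest 2 neighbors amounts to when k=2.

```python
def top_k_accuracy(neighbor_labels, true_labels, k):
    """Fraction of examples whose true label is among the first k neighbor labels."""
    hits = sum(
        1
        for neighbors, truth in zip(neighbor_labels, true_labels)
        if truth in neighbors[:k]
    )
    return hits / len(true_labels)

# Hypothetical labels of the two nearest neighbors for four test examples.
neighbor_labels = [
    ["sports", "tech"],
    ["tech", "sports"],
    ["world", "tech"],
    ["tech", "world"],
]
true_labels = ["sports", "sports", "tech", "business"]

top1 = top_k_accuracy(neighbor_labels, true_labels, k=1)
top2 = top_k_accuracy(neighbor_labels, true_labels, k=2)
print(top1, top2)
```

Reporting the second number under the name of the first makes a classifier look much better than it is.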
I would not over-index on that paper. However, I would err in favor of simpler methods.
Pinecone makes it super easy to get up and running with RAG asap. Those prices are ridiculous though and any project with legitimate scale will move on to a more affordable solution.
Hey, I'm from Pinecone. What scale are we talking about? Many of our customers come to us with 500M–10B embeddings precisely because other managed solutions either ground to a halt at that scale or cost even more.
Even so, driving the cost down for large workloads like that is a priority for us. We recognize the GenAI / RAG stack is a completely new line item in most companies' budgets so anything to keep that low can help these projects move forward.
Coming at this from a different angle, does anyone have any links to tutorials for use cases? I’d love to see what the vector DB hype is about, but as a regular engineer I’m unable to even grasp how to use a vector DB.
I'll give you an example of something I did with a vector database.
I was playing around with making my own UI for interfacing with ChatGPT. I saved the chat transcripts in a normal Postgres DB, along with the OpenAI embeddings for each message in a vector DB, with a pointer to the message ID in Postgres in the vector DB metadata.
Then, as you chatted, I had ChatGPT continuously creating a summary of the current conversation in the background and searching the vector DB for previous messages about whatever we were talking about, and it would inject those into the chat context invisibly. So you could say something like "Hey, do you remember when we talked about baseball?" and it would pull a previous conversation where you talked about so-and-so hitting a home run into the context, and the bot would have access to that, even though you never mentioned the word "baseball" in the previous conversation -- "home run" is semantically similar enough that it finds it.
If you're using openai embeddings as your vectors, it's _extremely_ impressive how well it finds similar topics, even when the actual words used are completely different.
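The layout described above (text in a relational table, embeddings in a vector store whose metadata points back at the message ID) can be sketched in a few lines. Everything here is a stand-in: the dictionaries replace Postgres and the vector DB, the 3-dimensional vectors are made up by hand rather than produced by an embedding model, and `recall` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Stand-in for the relational table: message ID -> message text.
messages = {
    1: "He hit a home run in the ninth inning.",
    2: "My sourdough starter finally rose.",
}

# Stand-in for the vector DB: an embedding plus metadata pointing at the message ID.
# These vectors are invented for illustration only.
vector_index = [
    {"embedding": [0.9, 0.1, 0.0], "metadata": {"message_id": 1}},
    {"embedding": [0.0, 0.2, 0.9], "metadata": {"message_id": 2}},
]

def recall(query_embedding, top_k=1):
    """Find the most similar stored embeddings, then look up their text by ID."""
    ranked = sorted(
        vector_index,
        key=lambda entry: cosine(query_embedding, entry["embedding"]),
        reverse=True,
    )
    return [messages[entry["metadata"]["message_id"]] for entry in ranked[:top_k]]

# Invented embedding standing in for "do you remember when we talked about baseball?"
baseball_query = [0.8, 0.3, 0.1]
print(recall(baseball_query))
```

The query never contains the words "home run"; the match comes purely from the vectors being close, which is the behavior the comment describes.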
Not a tutorial, but TLDR vector DBs are specialized DBs that store embeddings. Embeddings are vector representations of data (E.g. text or images), which means you can compare them in a quantifiable way.
This enables use cases like semantic search and Retrieval-Augmented Generation (RAG) as mentioned in the article.
Semantic search is: I search for "royal" and I get results that mention "king" or "queen" because they are semantically similar.
RAG is: I make a query asking, "tell me about the English royal family", semantically similar information is fetched using semantic search and provided as context to an LLM to generate an answer.
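Put together, the two steps look roughly like this. It is a sketch only: the 2-dimensional "embeddings" are invented so that royalty-related sentences sit close together, the prompt wording is arbitrary, and the final call to an LLM is left out. `semantic_search` and `build_rag_prompt` are hypothetical names.

```python
import math

# Made-up 2-d "embeddings"; a real system would call an embedding model.
DOCS = {
    "The king opened parliament today.": [0.9, 0.1],
    "The queen attended the ceremony.": [0.8, 0.2],
    "The recipe calls for two eggs.": [0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def semantic_search(query_vec, top_k=2):
    """Return the top_k document texts most similar to the query vector."""
    ranked = sorted(DOCS, key=lambda text: cosine(query_vec, DOCS[text]), reverse=True)
    return ranked[:top_k]

def build_rag_prompt(question, query_vec):
    """Fetch similar documents and place them in the prompt as context."""
    context = "\n".join(f"- {text}" for text in semantic_search(query_vec))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

royal_vec = [1.0, 0.0]  # invented embedding for a query about royalty
prompt = build_rag_prompt("Tell me about the English royal family.", royal_vec)
print(prompt)
```

The resulting string is what would be sent to the LLM; the unrelated recipe sentence is filtered out before generation.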
I think we need vectordb bench on 100M level.
If you don't have 100M data, and you don't care about things like filtering and streaming insertion, I vote for PGVector since SQL is convenient enough.
However, for large dataset deployments, cost becomes more critical since vector search is computation intensive. Options like ES, MongoDB and Redis do not even share their results in the benchmark.
Also, if you are looking for fancier features than plain ANN, purpose-built vector databases iterate faster than traditional databases.
Interesting graphic, bland and unvoiced conclusion
You're also missing a lot of details. For example, Milvus and Zilliz are actually a little different, check this out for more details: https://github.com/zilliztech/VectorDBBench (of course run it on your own stuff, don't blindly trust companies just because their product is open source)
Also, if you want to throw some more comparisons in there, check out Elasticsearch.
I think users have to test for themselves. VectorDBBench lets you run the tests yourself on any cloud service or open source deployment.
One of my guesses is that Qdrant tuned their parameters heavily for their benchmark.
There needs to be a standard for benchmarking performance of these solutions. Milvus qps seems to be in a completely different tier of performance than the rest.
It lets you run the benchmarks using your own API keys. Although it is made by Zilliz (maintainers of Milvus), you can take a look, see what is going on, and judge if it's fair.
txtai (https://github.com/neuml/txtai) is another option to consider. It has vector search with SQL, topic modeling and LLM prompt-driven search (retrieval augmented generation).
I have used RediSearch with chatgpt-retrieval-plugin and several megabytes of documents. It works well. And setting it up is just a single docker run command away ... so I don't see myself using anything else for local development. LangChain also has support for it.
Answering my own question with a pull-quote from a post written and linked by @shreyans:
"Redis can be a simple store, either with the embedding as the entire value, or as a value in a hash along with other metadata, or their newer vector search functions. Overall this works, but is more work than necessary, and not ideal for this use case."
You put the vector in a vector database that gives you the ability to search based on the vectors. So when you create a new vector based on some input (question/etc...) you can use the vector search to find semantically similar topics in your vector database.
When thinking about a managed vs. unmanaged database, it's helpful to consider all the capabilities you'd have to take care of yourself (vs. have them managed). For a complete list consider: https://www.pinecone.io/learn/vector-database/
Disclaimer: I'm the author (and work at Pinecone).
There are so many options for vector databases that it's so confusing. But those are just a piece of the puzzle when you create applications using large language models.
As mentioned in the comments, you have to choose an embeddings model, the LLM, and manage all the interaction in between.
With Vectara (full disclosure: I work there; https://vectara.com) we provide a simple API to implement applications with Grounded Generation (aka retrieval augmented generation). The embeddings model, the vector store, the retrieval engine and all the other functionality - implemented by the Vectara platform, so you don't have to choose which vector DB to use, which embeddings model to use, and so on. Makes life easy and simple, and you can focus on developing your application.
I found it amusing that clicking the “chat” icon in the corner of your website doesn’t demonstrate any of the “grounded generation” capabilities the site is referring to.
Etienne Dilocker, the co-founder/CTO of Weaviate, and Ram Sriharsha, the VP of R&D at Pinecone, are both presenting at The AI Conference.
Lots of other smart people are presenting, including Nazneen from Hugging Face, Harrison from LangChain, Jerry from LlamaIndex, Ben the co-founder of Anthropic, and many more.
A hackathon is happening in the evening at the event as well.
If you can't make the event, we'll put up all the talks on YouTube post-event.