Chroma doesn't seem to be a real DB; it's more of a wrapper around tools like hnswlib, DuckDB, or ClickHouse. Qdrant is way more mature: it has its own HNSW implementation with some tweaks to apply filtering directly during the vector search phase, supports horizontal and vertical scaling, and provides its own managed cloud offering.
In general, Qdrant is a real DB, not a library, and that's a huge difference.
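To show why filtering during the search phase matters, here is a minimal brute-force sketch. It is not Qdrant's HNSW implementation, and every name in it (`post_filter_search`, `filtered_search`, the `payload` layout) is made up for illustration. It only contrasts filtering after the top-k is chosen with filtering while candidates are being scored.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def post_filter_search(points, query, k, predicate):
    # Pick the global top-k first, then drop points that fail the filter.
    # This can return fewer than k results, or none at all.
    top = heapq.nlargest(k, points, key=lambda p: cosine(p["vector"], query))
    return [p["id"] for p in top if predicate(p["payload"])]


def filtered_search(points, query, k, predicate):
    # Check the filter while scanning, so only matching points
    # compete for the k result slots.
    candidates = (p for p in points if predicate(p["payload"]))
    top = heapq.nlargest(k, candidates, key=lambda p: cosine(p["vector"], query))
    return [p["id"] for p in top]


# Hypothetical toy data: 2-d vectors with a language tag as payload.
POINTS = [
    {"id": 1, "vector": [1.0, 0.0], "payload": {"lang": "en"}},
    {"id": 2, "vector": [0.9, 0.1], "payload": {"lang": "en"}},
    {"id": 3, "vector": [0.8, 0.2], "payload": {"lang": "de"}},
    {"id": 4, "vector": [0.0, 1.0], "payload": {"lang": "de"}},
]


def is_german(payload):
    return payload["lang"] == "de"


print(post_filter_search(POINTS, [1.0, 0.0], 2, is_german))  # []
print(filtered_search(POINTS, [1.0, 0.0], 2, is_german))     # [3, 4]
```

With a selective filter, the post-filter variant comes back empty because both of the two nearest neighbours are English, while the in-search variant still fills both slots.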
I've tried both Chroma and Qdrant. I don't think Chroma lacks that much. It's definitely newer, but it's also a great product. I think cloud support is coming in Q3 2023.
I tried out Milvus. The developer experience is crap. The documentation is missing some major core concepts. I experimented with it for hours. Eventually I turned my back on it and said: why not use pgvector and scale the fuck out of the cluster? That should bring roughly equal performance, as the pgvector implementation is written in C and the comparison algorithms wouldn't differ too much from Milvus's.
IMO vector databases should not mess with Elasticsearch.
The real focus should be on improving the recall of vector search. Pity that nobody is doing real AI research here. Money is being wasted on marketing and branding.
Totally agree. The thing is that Elasticsearch does not meet our requirements for vector search.
I am currently running Milvus + Elasticsearch, and it works perfectly. The latest Milvus version is super fast and scalable (>50M vectors). Haven't tried Zilliz Cloud. Have to find out what the cost is.
I am old school. IMO Elasticsearch is only good for keyword search, and these so-called "vector database" products are only good for vector search.
Others have made other suggestions, but Vespa has two unique features. First, it is battle-tested at large scale; second, it supports combining the keyword and vector scores in several ways. The latter is something that other hybrid systems don't do very well in my experience.
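One common way to combine keyword and vector results is reciprocal rank fusion (RRF). This is a minimal sketch, not Vespa's API or its ranking implementation. The function name and the `k=60` default are choices made here for illustration. RRF uses only each document's rank in each list, so it sidesteps the problem of the two score scales being different.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranked list.

    Each document gets 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Higher total means a better fused position.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


# Hypothetical result lists from a keyword engine and a vector engine.
keyword_hits = ["a", "b", "c"]
vector_hits = ["b", "c", "d"]

print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['b', 'c', 'a', 'd']
```

Documents that show up in both lists ("b" and "c") rise above documents that only one engine found, even if that engine ranked them first.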
Didn't even realise Milvus was so lacking. https://github.com/marqo-ai/marqo also has a hybrid approach. It's just a more complete/end-to-end platform than Pinecone, so it really just depends on what you're building.
Could you please elaborate on how you utilize both of them together, and for which specific use case? I'm attempting to gain a better understanding of the hybrid approach.
The thing is to make Elasticsearch scores "comparable" to Milvus scores. There are lots of ways to do this, but there's no single good solution. For example, you could calculate the BM25 score offline, or use the TF-IDF score to do some kind of filtering. Again, there's no single perfect answer. You'd have to do a lot of experiments with your own use case and your own data to get the best results.
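As one illustration of making the two score scales comparable, here is a minimal sketch that min-max normalizes each result set to [0, 1] and then takes a weighted sum. It assumes higher is better for both inputs (a distance metric would need to be inverted first), and it treats a document missing from one result set as scoring 0 there. Both are assumptions for the example, as are the names `min_max`, `blend`, and `alpha`. It does not call Elasticsearch or Milvus.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as a full match.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def blend(keyword_scores, vector_scores, alpha=0.5):
    """Weighted sum of normalized scores; alpha weights the keyword side."""
    kw = min_max(keyword_scores)
    vec = min_max(vector_scores)
    fused = {
        doc_id: alpha * kw.get(doc_id, 0.0) + (1 - alpha) * vec.get(doc_id, 0.0)
        for doc_id in set(kw) | set(vec)
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical raw scores: BM25 is unbounded, cosine similarity is not.
bm25 = {"a": 12.0, "b": 8.0, "c": 4.0}
cosine = {"b": 0.9, "c": 0.8, "d": 0.5}

print([doc_id for doc_id, _ in blend(bm25, cosine)])
# ['b', 'a', 'c', 'd']
```

The weight `alpha` is exactly the kind of knob that has to be tuned on your own data. Min-max is also sensitive to outliers in either result set.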
Also a lot of tuning needs to be done during all phases:
1) query pre-processing
2) query tokenizing
3) retrieval
4) ranking and reranking
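The four phases above can be sketched as a toy pipeline. This is only a skeleton with made-up function names and deliberately naive logic (regex cleanup, whitespace tokens, token-overlap ranking), meant to show where each phase sits and where tuning would happen. It is not a recommendation for how to implement any of them.

```python
import re


def preprocess(query):
    # 1) Query pre-processing: lowercase and replace punctuation with spaces.
    return re.sub(r"[^\w\s]", " ", query.lower()).strip()


def tokenize(text):
    # 2) Query tokenizing: plain whitespace split.
    return text.split()


def doc_tokens(corpus, doc_id):
    return set(tokenize(preprocess(corpus[doc_id])))


def retrieve(tokens, corpus):
    # 3) Retrieval: keep documents sharing at least one token with the query.
    wanted = set(tokens)
    return [doc_id for doc_id in corpus if wanted & doc_tokens(corpus, doc_id)]


def rank(tokens, candidates, corpus):
    # 4) Ranking: order candidates by how many distinct query tokens they match.
    wanted = set(tokens)

    def overlap(doc_id):
        return len(wanted & doc_tokens(corpus, doc_id))

    return sorted(candidates, key=lambda d: (-overlap(d), d))


def search(query, corpus):
    tokens = tokenize(preprocess(query))
    return rank(tokens, retrieve(tokens, corpus), corpus)


# Hypothetical three-document corpus.
CORPUS = {
    "d1": "Vector search with HNSW",
    "d2": "Keyword search with BM25",
    "d3": "Cooking pasta at home",
}

print(search("Vector search!", CORPUS))  # ['d1', 'd2']
```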
I personally would not trust any universal "hybrid-search" solutions. They're all toy demos.
It usually takes 5-10 good engineers to build a decent search engine/system for any real use case. It also requires a lot of tuning, tricks, and hand-written rules to make things work.