Hey HN! We are building Epsilla (
https://github.com/epsilla-cloud/vectordb), an open-source, self-hostable vector database for semantic similarity search that specializes in low query latency.
When do you need a vector database? For example, GPT-3.5 has a 16k-token context window. If we want it to answer a question about a 300-page book, we cannot put the whole book into the context; we have to choose the sections of the book that are most relevant to the question. A vector database specializes in ranking and retrieving the most relevant content from a large pool of documents based on semantic similarity.
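The retrieval step described above can be sketched in a few lines of plain Python: embed the question and the book sections, then rank sections by cosine similarity and keep only the top few. (Toy hand-made vectors here, not Epsilla's API or a real embedding model.)

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-d "embeddings" standing in for a real embedding model's output.
sections = {
    "chapter 1: whales": [0.9, 0.1, 0.0],
    "chapter 2: ships":  [0.1, 0.9, 0.2],
    "chapter 3: storms": [0.0, 0.2, 0.9],
}
question = [0.8, 0.2, 0.1]  # pretend embedding of "what do whales eat?"

# Rank sections by similarity; keep the top k that fit the context window.
top_k = sorted(sections, key=lambda s: cosine(question, sections[s]),
               reverse=True)[:2]
print(top_k)  # → ['chapter 1: whales', 'chapter 2: ships']
```

A vector database does exactly this ranking, but over millions of vectors with an index instead of a linear scan.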
Most vector databases use hierarchical navigable small world (HNSW) graphs to index vectors for high-precision vector search, but HNSW latency degrades significantly when the precision target rises above 95%.
At a previous company, we worked on a parallel graph traversal engine. We realized that the bottleneck in HNSW performance is the large number of sequential traversal steps, which leave multi-core CPUs underutilized. After some research, we found algorithms such as SpeedANN that target this problem but have not yet been adopted by industry, so we built the Epsilla vector database to turn that research into a production system.
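To make the sequential bottleneck concrete, here is a toy greedy best-first walk over a proximity graph, a simplified version of the core loop an HNSW-style search runs at each layer (an illustrative sketch, not Epsilla's engine): the set of neighbors to inspect at each hop is only known after the previous hop finishes, so the steps form a dependency chain that a single query cannot spread across cores.

```python
def greedy_search(graph, vectors, query, entry):
    """Greedy walk: hop to the neighbor closest to the query until stuck.

    Each iteration must complete before the next can start -- this
    sequential dependency is what limits multi-core utilization.
    """
    def dist(u):
        return sum((a - b) ** 2 for a, b in zip(vectors[u], query))

    current = entry
    while True:
        # Candidates for the next hop depend on where the last hop landed.
        best = min(graph[current], key=dist, default=current)
        if dist(best) >= dist(current):
            return current  # local minimum: no neighbor is closer
        current = best

# Tiny proximity graph over 1-d points on a line.
vectors = {0: [0.0], 1: [1.0], 2: [2.0], 3: [3.0], 4: [4.0]}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_search(graph, vectors, query=[3.2], entry=0))  # → 3 (walks 0→1→2→3)
```

Parallel approaches like the one Epsilla builds on evaluate multiple candidates per step instead of one, shortening this chain.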
With Epsilla, we aim for 10x lower vector search latency than HNSW-based vector databases. We ran an initial benchmark against the top open-source vector databases: https://medium.com/@richard_50832/benchmarking-epsilla-with-...
We provide a Docker image so you can run the Epsilla backend locally, plus a Python client and a JavaScript client to connect and interact with it.
Quickstart:
docker pull epsilla/vectordb
docker run --pull=always -d -p 8888:8888 epsilla/vectordb
pip install pyepsilla
git clone https://github.com/epsilla-cloud/epsilla-python-client.git
cd epsilla-python-client/examples
python hello_epsilla.py
We just started a month ago. We'd love to hear what you think, and more importantly, what you wish to see in the future. We are thinking about a serverless vector database in the cloud with a consumption-based pricing model, and we are eager to get your feedback.
There are a bunch of possible areas to circle or ignore when making an ML-capable database of some sort. In rough order of data complexity:
1. Embeddings (context-free vectors, just an ID and the vector)
2. Metadata + Embedding (source data, JSON)
3. Binary Data + Metadata + Embedding (add documents)
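One way to picture those three tiers is as record shapes, each layering more context onto the bare vector (hypothetical schemas, purely for illustration):

```python
# Tier 1: embedding only -- just an ID and the vector.
tier1 = {"id": "doc-42", "vector": [0.12, -0.33, 0.58]}

# Tier 2: metadata + embedding -- the source data rides along as JSON.
tier2 = {**tier1, "metadata": {"title": "Moby-Dick", "page": 87}}

# Tier 3: binary data + metadata + embedding -- the document itself is stored.
tier3 = {**tier2, "blob": b"%PDF-1.7 ..."}

for name, record in [("tier1", tier1), ("tier2", tier2), ("tier3", tier3)]:
    print(name, sorted(record))
```

Each tier pulls the database further toward being a document store, which is exactly the design tension the rest of this comment is about.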
Then there are tooling questions: in this matrix you'd want to decide whether you're going to allow inference against the documents, and if so, whether it's arbitrary, service-based, etc., and how you'll store the results.
I'm curious how you're thinking about the design space. The embedding-only route is conceptually appealing because it's simple. In a larger engineering project, there's a tension between "where do I keep all this data," "how do I process and reprocess all this data", and "where do I keep the results of all the processing", and to me there aren't clear bright-line architectures that seem "best of".
Put another way, 15 years ago, we went memcached -> redis 1 -> redis (whatever it is now), and at the same time, we went mysql/postgres/oracle -> nosql json stores; today all of these have relatively well-defined use cases, (and for most of them sqlite is the best choice, obviously).
How are you seeing the ML db scene playing out, and where do you think the sqlite of this space will land on architecture?