Show HN: Epsilla – Open-source vector database with low query latency (github.com/epsilla-cloud)
111 points by songrenchu on Aug 14, 2023 | 24 comments
Hey HN! We are building Epsilla (https://github.com/epsilla-cloud/vectordb), an open-source, self-hostable vector database for semantic similarity search that specializes in low query latency.

When do we need a vector database? For example, GPT-3.5 has a 16k context window limit. If we want it to answer a question about a 300-page book, we cannot put the whole book into the context. We have to choose the sections of the book that are most relevant to the question. A vector database specializes in ranking and picking the most relevant content from a large pool of documents based on semantic similarity.
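
To make that retrieval step concrete, here is a rough sketch in plain NumPy (toy data and a placeholder embedding dimension, independent of any particular vector database):

      import numpy as np

      # Pretend each book section has already been embedded into a vector
      # by some embedding model; toy random data here.
      section_vectors = np.random.rand(1000, 384)   # 1000 sections, 384-dim embeddings
      question_vector = np.random.rand(384)          # embedding of the user's question

      # Cosine similarity between the question and every section.
      norms = np.linalg.norm(section_vectors, axis=1) * np.linalg.norm(question_vector)
      scores = section_vectors @ question_vector / norms

      # Pick the 5 most semantically similar sections to put into the LLM context.
      top_ids = np.argsort(scores)[::-1][:5]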

Most vector databases use Hierarchical Navigable Small World (HNSW) graphs to index vectors for high-precision vector search, but HNSW latency degrades significantly when the precision target rises above 95%.

At a previous company, we worked on building a parallel graph traversal engine. We realized that the bottleneck of HNSW performance is that it takes too many sequential traversal steps, which don't fully leverage multi-core CPUs. After some research, we found algorithms such as SpeedANN that target this problem but haven't been adopted by industry yet. So we built the Epsilla vector database to turn that research into a production system.
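
To make the "sequential traversal" point concrete, here is a toy sketch of the greedy search loop that graph indexes like HNSW are built around (simplified to a single layer, not Epsilla's implementation). Each hop depends on where the previous hop landed, so a single query can't easily be spread across cores:

      import numpy as np

      def greedy_graph_search(vectors, neighbors, query, entry_point):
          # Toy single-layer greedy search over a nearest-neighbor graph.
          # vectors: array of shape (n, d); neighbors: dict of node id -> neighbor ids.
          current = entry_point
          current_dist = np.linalg.norm(vectors[current] - query)
          while True:
              # Each iteration depends on the node chosen in the previous one,
              # so the hops form a sequential chain for a single query.
              best, best_dist = current, current_dist
              for cand in neighbors[current]:
                  dist = np.linalg.norm(vectors[cand] - query)
                  if dist < best_dist:
                      best, best_dist = cand, dist
              if best == current:        # no closer neighbor: local minimum reached
                  return current
              current, current_dist = best, best_dist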

With Epsilla, we shoot for 10x lower vector search latency compared to HNSW-based vector databases. We did an initial benchmark against the top open-source vector databases: https://medium.com/@richard_50832/benchmarking-epsilla-with-...

We provide a Docker image so you can run the Epsilla backend locally, plus a Python client and a JavaScript client to connect to and interact with it.

Quickstart:

      docker pull epsilla/vectordb

      docker run --pull=always -d -p 8888:8888 epsilla/vectordb

      pip install pyepsilla

      git clone https://github.com/epsilla-cloud/epsilla-python-client.git

      cd epsilla-python-client/examples

      python hello_epsilla.py
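
Roughly, using the Python client looks like this (illustrative table, field names, and values; treat it as a sketch and see hello_epsilla.py in the client repo for the exact, up-to-date API):

      from pyepsilla import vectordb

      # Connect to the Epsilla backend started via Docker above.
      client = vectordb.Client(host='localhost', port='8888')
      client.load_db(db_name="MyDB", db_path="/tmp/epsilla")
      client.use_db(db_name="MyDB")

      # Vector is just another field type in the table schema.
      client.create_table(table_name="MyTable", table_fields=[
          {"name": "ID", "dataType": "INT"},
          {"name": "Doc", "dataType": "STRING"},
          {"name": "Embedding", "dataType": "VECTOR_FLOAT", "dimensions": 4},
      ])

      client.insert(table_name="MyTable", records=[
          {"ID": 1, "Doc": "Berlin", "Embedding": [0.05, 0.61, 0.76, 0.74]},
          {"ID": 2, "Doc": "London", "Embedding": [0.18, 0.01, 0.85, 0.80]},
      ])

      # Retrieve the 2 nearest records to a query vector.
      status_code, response = client.query(
          table_name="MyTable",
          query_field="Embedding",
          query_vector=[0.35, 0.55, 0.47, 0.94],
          limit=2,
      )
      print(response)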

We just started a month ago. We'd love to hear what you think, and more importantly, what you wish to see in the future. We are thinking about a serverless vector database in the cloud with a consumption-based pricing model, and we are eager to get your feedback.



I'm curious about your approach to where you draw the line on database features; I don't have a perspective on what's right, just trying to get informed.

There are a bunch of possible areas to circle or ignore when making an ML-capable database of some sort. In rough order of data complexity:

1. Embeddings (context-free vectors, just an ID and the vector)

2. Metadata + Embedding (source data, JSON)

3. Binary Data + Metadata + Embedding (add documents)

Then there are tooling questions: within this matrix you'd want to decide whether you're going to allow inference against the documents, and if so, whether it will be arbitrary, service-based, etc., and how you will store the results.

I'm curious how you're thinking about the design space. The embedding-only route is conceptually appealing because it's simple. In a larger engineering project, there's a tension between "where do I keep all this data," "how do I process and reprocess all this data", and "where do I keep the results of all the processing", and to me there aren't clear bright-line architectures that seem "best of".

Put another way, 15 years ago, we went memcached -> redis 1 -> redis (whatever it is now), and at the same time, we went mysql/postgres/oracle -> nosql json stores; today all of these have relatively well-defined use cases, (and for most of them sqlite is the best choice, obviously).

How are you seeing the ML db scene playing out, and where do you think the sqlite of this space will land on architecture?


Thank you for the insightful question! Reading it made me think a lot.

From the database perspective, instead of dividing the table schema into 3 parts (id, metadata, embedding), we designed it in a way closer to SQL: vector is treated as just another data type, and users can define any number of fields in a table. ID is just an annotation on a field (a composite key might be overkill for now). There will be another debate on whether schemaful or schemaless is the right approach; we can leave that aside for now.

With this foundation, we already cover 1 and 2. On our roadmap we also plan to cover 3, with multi-modal data type support. We think the real advantage of embeddings is on unstructured data (documents, images, video, audio, etc.); storing embeddings of multi-modal data and connecting them through semantic relevance will open up big opportunities. This also fits the table-and-fields design, where a cross-table embedding index can connect data of different shapes.

The multi-modal data perspective raises the question of where we store that data. One way is to provide a generic binary data type that lets users put anything in. Another way, which most enterprises will prefer, is integrating us with a larger data warehouse / data lake system. That in turn creates the requirement to support data streaming in/out with a Kafka connector, Spark connector, etc.

Totally agree that SQLite works so well in a huge number of scenarios, and now there is DuckDB. We also see other players like LanceDB taking this approach to become the SQLite of the vector DB space. We are also pretty close to announcing our in-process Python package support, so Docker / a separate server will no longer be a must-have.

Inference is a broader direction for us for now. We are open to exploring this space and seeing whether a serverless architecture in the cloud can provide extra efficiency benefits to the market.


Thanks for the thoughts. I agree that you're not going to disintermediate existing datalakes, no matter how successful, so integration makes sense.

Every few months I run into a use case where I'm like "I want to get a whole bunch of data, analyze it, then search for it later with embeddings, and probably keep running different sorts of analysis on it, and store the embeddings of those analyses in a related way." This still feels fairly difficult to do, or at least there aren't canonical "right" architectures yet.

My instinct is if you nail the ml+dev+data ops needs with good architecture and api you could really have something -- good luck!


Vector databases seem to be a dime a dozen now, as well as being built into Elasticsearch and available as Postgres extensions.

As far as I know they’re all relatively undifferentiated in performance and features.

Is there a viable long-term business here?


This one has more reason to exist than most vector DBs since it's not just a wrapper around hnswlib.


Well, hnswlib is actually faster than Epsilla according to benchmarks (compare their own vs. ANN-Benchmarks), especially in terms of throughput.


They claim that they are 10x faster given a high accuracy target (no clue what that means in practice for the AI use case, probably fewer tokens for the LLM). Can you elaborate on why you think hnswlib is still faster? Can you link the benchmark you mention?


Sure. The benchmark from Epsilla: https://miro.medium.com/v2/resize:fit:1400/format:webp/1*dDy...; the benchmark for the same dataset and same K (10) from ANN-Benchmarks: https://ann-benchmarks.com/gist-960-euclidean_10_euclidean.h... At a fixed recall of 0.95 (which is the accuracy mentioned), Epsilla gets 200 QPS using multiple intra-query threads and a single inter-query thread. hnswlib gets more than 370 QPS at a higher recall of 0.97 with a single thread both intra- and inter-query, which is much faster and uses less CPU.

Because hnswlib does not use intra-query threads, it will scale much better in terms of total throughput, probably close to 7x-8x with 16 threads on 16 vCPUs (compared to Epsilla, which saturates at a 2.2x improvement from multiple threads). The main premise of Epsilla's solution is trading throughput for latency, which is probably legitimate but would not work for everyone.
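
Back-of-the-envelope, taking those scaling numbers at face value: ~370 QPS x ~7.5 ≈ 2,800 QPS aggregate for hnswlib on 16 threads, versus ~200 QPS x 2.2 ≈ 440 QPS for Epsilla, roughly a 6x gap in total throughput.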

Note that even though the hardware between the benchmarks is not controlled (Epsilla only says it is some AWS EC2 16C32G, ANN-Benchmarks uses AWS r6i.16xlarge), it does not matter that much, since single-threaded CPU speeds have been pretty stagnant over the years, so the ANN-Benchmarks single-thread results can be transferred (unless Epsilla is using non-x64 hardware, which would be a weird choice). There is a constant overhead from communication between the nodes in Epsilla, but it is constant and should not affect the speed at high recalls (for which hnswlib is also faster).


Interesting project. How does it fare on scaling? We were evaluating vector DBs, and since we are in B2B SaaS, we are keen on sharding, scaling, and multi-tenancy features. Currently we are leaning towards Milvus. https://zilliz.com/comparison/qdrant-vs-milvus


I know this isn't your focus, but could this be added as an index type to pgvector? They recently added HNSW and have shipped IVFFlat for a while, so I wonder if that would be possible.


Regarding the embedding vectors: is there a maximum limit on their dimensionality? Also, can you share insights into how precision remains consistent at 99.9% even with high-dimensional vectors?


For now we don't put a limit on the dimensionality of the vectors, so the machine can hold as much as #vectors * #dimensions * sizeof(float) in memory. For now we only support dense vectors; in the future we will work on sparse vector support for much higher dimensionality. I think you are referring to the "curse of dimensionality" problem. Here are my thoughts: in a graph-based index such as SpeedANN or HNSW, each vector is a node in the graph, and the index is a nearest neighbor graph. Unlike spatial partition-based indices, the topology quality of the nearest neighbor graph is independent of the dimensionality of the vectors. Our benchmark uses 960-dimensional vectors, but we will do more experiments with sparse vectors in the future.
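
To put rough numbers on that formula (purely illustrative corpus size; index overhead not counted):

      # Memory footprint of the raw dense vectors alone.
      num_vectors = 1_000_000          # example: one million vectors
      dimensions = 960                 # e.g. the 960-dim benchmark dataset
      bytes_per_float = 4
      print(num_vectors * dimensions * bytes_per_float / 2**30, "GiB")   # ~3.6 GiB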


Why did you choose SpeedANN instead of other new indexes such as DiskANN? Also, you changed the color of Epsilla in every benchmark figure, which is quite confusing.


Thank you for sharing! DiskANN was published in 2019 and SpeedANN in 2022. DiskANN is a disk-based ANNS solution, focused on the scenario where the vectors don't fit into memory. SpeedANN is an in-memory solution that specializes in low-latency queries, which is the scenario we want to tackle for now. We can further extend our engine to support DiskANN and other index algorithms based on our customers' requirements. Thanks for pointing out the benchmark figures; we just fixed them to use consistent colors.


How long until you sell it to BigCompany? I just don’t get why there are numerous vector databases all with similar functionality.


You are right, there are numerous vector databases on the market. Most of them (including us) are still pretty early and have a lot of enterprise-readiness features to build: role-based / privilege-based access control, authN/authZ integration, data versioning and backup/restore, fault tolerance, data streaming in/out, etc. We have first-hand experience with enterprise-level product development and sales from our previous jobs at a Series D graph database startup, and we will apply what we learned there to make Epsilla enterprise-ready over the next few months.


Because wrapping HNSW is just that easy. Same with all the ChatGPT-based tools popping up a dozen a week; it's just easy to throw something together and see if it sticks.


IMHO, vector DBs need to scale horizontally. Simply running on a single host doesn't cut it anymore.


You are right. We designed our storage in a segment-based way, with configurable segment size, so it can scale horizontally in the future, across multiple workers on one machine and across a multi-machine cluster. Search then becomes a two-stage process: find the top K in each segment, then a global merger (which can also be horizontally scaled) merges the results from all segments.
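
A minimal sketch of that two-stage idea (hypothetical helper names, not our actual engine code):

      import heapq

      def search_segment(segment, query_vector, k):
          # Stage 1 (per segment, would run in parallel in practice):
          # each segment returns its local top-K as (distance, record_id) pairs.
          return segment.top_k(query_vector, k)   # hypothetical per-segment API

      def two_stage_search(segments, query_vector, k):
          # Stage 2: the global merger combines all local top-K lists
          # and keeps the K results with the smallest distances overall.
          candidates = []
          for segment in segments:
              candidates.extend(search_segment(segment, query_vector, k))
          return heapq.nsmallest(k, candidates, key=lambda pair: pair[0])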


Congrats on the launch!


“Hippocampus of AI” in a readme is a yellow flag


Thank you for pointing it out. We just removed it from the README.


Why is it a yellow flag?


I don't see how you compete against the 50 other providers. You just implemented a speedup on the graph search (maybe credit the authors of the paper you are ripping off in your GitHub repo?). However, this speedup trades overall throughput for latency. And even in the paper, it's not exact.

Maybe drop the disingenuous marketing and find something else to work on. The 50 other vector dbs will implement this trivial addition and you'll be left with nothing to show.

Sources: https://arxiv.org/abs/2201.13007 https://dl.acm.org/doi/pdf/10.1145/3572848.3577527



