Show HN: Epsilla – Open-source vector database with low query latency (github.com/epsilla-cloud)
111 points by songrenchu on Aug 14, 2023 | 24 comments
Hey HN! We are building Epsilla (https://github.com/epsilla-cloud/vectordb), an open-source, self-hostable vector database for semantic similarity search that specializes in low query latency.

When do we need a vector database? For example, GPT-3.5 has a 16k context window limit. If we want it to answer a question about a 300-page book, we cannot put the whole book into the context. We have to choose the sections of the book that are most relevant to the question. A vector database specializes in ranking and picking the most relevant content from a large pool of documents based on semantic similarity.
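
To make that retrieval step concrete, here is a rough sketch in plain NumPy (toy data and a placeholder embedding dimension, independent of any particular vector database):

      import numpy as np

      # Pretend each book section has already been embedded into a vector
      # by some embedding model; toy random data here.
      section_vectors = np.random.rand(1000, 384)   # 1000 sections, 384-dim embeddings
      question_vector = np.random.rand(384)          # embedding of the user's question

      # Cosine similarity between the question and every section.
      norms = np.linalg.norm(section_vectors, axis=1) * np.linalg.norm(question_vector)
      scores = section_vectors @ question_vector / norms

      # Pick the 5 most semantically similar sections to put into the LLM context.
      top_ids = np.argsort(scores)[::-1][:5]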

Most vector databases use Hierarchical Navigable Small World (HNSW) graphs to index vectors for high-precision vector search, but HNSW latency degrades significantly when the precision target rises above 95%.

At a previous company, we worked on building a parallel graph traversal engine. We realized that the bottleneck of HNSW performance is that it takes too many sequential traversal steps, which don't fully leverage multi-core CPUs. After some research, we found algorithms such as SpeedANN that target this problem but haven't been adopted by industry yet. So we built the Epsilla vector database to turn that research into a production system.
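
To make the "sequential traversal" point concrete, here is a toy sketch of the greedy search loop that graph indexes like HNSW are built around (simplified to a single layer, not Epsilla's implementation). Each hop depends on where the previous hop landed, so a single query can't easily be spread across cores:

      import numpy as np

      def greedy_graph_search(vectors, neighbors, query, entry_point):
          # Toy single-layer greedy search over a nearest-neighbor graph.
          # vectors: array of shape (n, d); neighbors: dict of node id -> neighbor ids.
          current = entry_point
          current_dist = np.linalg.norm(vectors[current] - query)
          while True:
              # Each iteration depends on the node chosen in the previous one,
              # so the hops form a sequential chain for a single query.
              best, best_dist = current, current_dist
              for cand in neighbors[current]:
                  dist = np.linalg.norm(vectors[cand] - query)
                  if dist < best_dist:
                      best, best_dist = cand, dist
              if best == current:        # no closer neighbor: local minimum reached
                  return current
              current, current_dist = best, best_dist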

With Epsilla, we shoot for 10x lower vector search latency compared to HNSW-based vector databases. We did an initial benchmark against the top open-source vector databases: https://medium.com/@richard_50832/benchmarking-epsilla-with-...

We provide a Docker image so you can run the Epsilla backend locally, plus a Python client and a JavaScript client to connect to and interact with it.

Quickstart:

      docker pull epsilla/vectordb

      docker run --pull=always -d -p 8888:8888 epsilla/vectordb

      pip install pyepsilla

      git clone https://github.com/epsilla-cloud/epsilla-python-client.git

      cd epsilla-python-client/examples

      python hello_epsilla.py
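
Roughly, using the Python client looks like this (illustrative table, field names, and values; treat it as a sketch and see hello_epsilla.py in the client repo for the exact, up-to-date API):

      from pyepsilla import vectordb

      # Connect to the Epsilla backend started via Docker above.
      client = vectordb.Client(host='localhost', port='8888')
      client.load_db(db_name="MyDB", db_path="/tmp/epsilla")
      client.use_db(db_name="MyDB")

      # Vector is just another field type in the table schema.
      client.create_table(table_name="MyTable", table_fields=[
          {"name": "ID", "dataType": "INT"},
          {"name": "Doc", "dataType": "STRING"},
          {"name": "Embedding", "dataType": "VECTOR_FLOAT", "dimensions": 4},
      ])

      client.insert(table_name="MyTable", records=[
          {"ID": 1, "Doc": "Berlin", "Embedding": [0.05, 0.61, 0.76, 0.74]},
          {"ID": 2, "Doc": "London", "Embedding": [0.18, 0.01, 0.85, 0.80]},
      ])

      # Retrieve the 2 nearest records to a query vector.
      status_code, response = client.query(
          table_name="MyTable",
          query_field="Embedding",
          query_vector=[0.35, 0.55, 0.47, 0.94],
          limit=2,
      )
      print(response)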

We just started a month ago. We'd love to hear what you think, and more importantly, what you wish to see in the future. We are thinking about a serverless vector database in the cloud with a consumption-based pricing model, and we are eager to get your feedback.



I'm curious about your approach to where you draw the line on database features; I don't have a perspective on what's right, just trying to get informed.

There are a bunch of possible areas to circle or ignore when making an ML-capable database of some sort. In rough order of data complexity:

1. Embeddings (context-free vectors, just an ID and the vector)

2. Metadata + Embedding (source data, JSON)

3. Binary Data + Metadata + Embedding (add documents)

Then there are tooling questions: within this matrix you'd want to decide whether you're going to allow inference against the documents, and if so, whether it will be arbitrary, service-based, etc., and how you will store the results.

I'm curious how you're thinking about the design space. The embedding-only route is conceptually appealing because it's simple. In a larger engineering project, there's a tension between "where do I keep all this data," "how do I process and reprocess all this data", and "where do I keep the results of all the processing", and to me there aren't clear bright-line architectures that seem "best of".

Put another way, 15 years ago, we went memcached -> redis 1 -> redis (whatever it is now), and at the same time, we went mysql/postgres/oracle -> nosql json stores; today all of these have relatively well-defined use cases, (and for most of them sqlite is the best choice, obviously).

How are you seeing the ML db scene playing out, and where do you think the sqlite of this space will land on architecture?


Thank you for the insightful question! Reading it made me think a lot.

From the database perspective, instead of dividing the table schema into 3 parts (id, metadata, embedding), we designed it in a way closer to SQL: vector is treated as just another data type, and users can define any number of fields in a table. ID is just an annotation on a field (a composite key might be overkill for now). There will be another debate on whether schemaful or schemaless is the right approach; we can leave that aside for now.

With this foundation, we already cover 1 and 2. On our roadmap we also plan to cover 3, with multi-modal data type support. We think the real advantage of embeddings is on unstructured data (documents, images, video, audio, etc.); storing embeddings of multi-modal data and connecting them through semantic relevance will open up big opportunities. This also fits the table-and-fields design, where a cross-table embedding index can connect data of different shapes.

The multi-modal data perspective raises the question of where we store that data. One way is to provide a generic binary data type that lets users put anything in. Another way, which most enterprises will prefer, is integrating us with a larger data warehouse / data lake system. That in turn creates the requirement to support data streaming in/out with a Kafka connector, Spark connector, etc.

Totally agree that SQLite works so well in a huge number of scenarios, and now there is DuckDB. We also see other players like LanceDB taking this approach to become the SQLite of the vector DB space. We are also pretty close to announcing our in-process Python package support, so Docker / a separate server will no longer be a must-have.

Inference is a broader direction for us for now. We are open to exploring this space and seeing whether a serverless architecture in the cloud can provide extra efficiency benefits to the market.


Thanks for the thoughts. I agree that you're not going to disintermediate existing datalakes, no matter how successful, so integration makes sense.

Every few months I run into a use case where I'm like "I want to get a whole bunch of data, analyze it, then search for it later with embeddings, and probably keep running different sorts of analysis on it, and store the embeddings of those analyses in a related way." This still feels fairly difficult to do, or at least there aren't canonical "right" architectures yet.

My instinct is if you nail the ml+dev+data ops needs with good architecture and api you could really have something -- good luck!


Vector databases seem to be a dime a dozen now, as well as being built into Elasticsearch and available as Postgres extensions.

As far as I know they’re all relatively undifferentiated in performance and features.

Is there a viable long-term business here?


This one has more reason to exist than most vector DBs since it's not just a wrapper around hnswlib.


Well, hnswlib is actually faster than Epsilla according to benchmarks (compare their own vs. ANN-Benchmarks), especially in terms of throughput.


They claim that they are 10x faster given a high accuracy target (no clue what that means in practice for the AI use case, probably fewer tokens for the LLM). Can you elaborate on why you think hnswlib is still faster? Can you link the benchmark you mention?


Sure. The benchmark from Epsilla: https://miro.medium.com/v2/resize:fit:1400/format:webp/1*dDy...; the benchmark for the same dataset and same K (10) from ANN-Benchmarks: https://ann-benchmarks.com/gist-960-euclidean_10_euclidean.h... At a fixed recall of 0.95 (which is the accuracy mentioned), Epsilla gets 200 QPS using multiple intra-query threads and a single inter-query thread. hnswlib gets more than 370 QPS at a higher recall of 0.97 with a single thread both intra- and inter-query, which is much faster and uses less CPU.

Because hnswlib does not use intra-query threads, it will scale much better in terms of total throughput, probably close to 7x-8x with 16 threads on 16 vCPUs (compared to Epsilla, which saturates at a 2.2x improvement from multiple threads). The main premise of Epsilla's solution is trading throughput for latency, which is probably legitimate but would not work for everyone.
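
Back-of-the-envelope, taking those scaling numbers at face value: ~370 QPS x ~7.5 ≈ 2,800 QPS aggregate for hnswlib on 16 threads, versus ~200 QPS x 2.2 ≈ 440 QPS for Epsilla, roughly a 6x gap in total throughput.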

Note that even though the hardware between the benchmarks is not controlled (Epsilla only says it is some AWS EC2 16C32G, ANN-Benchmarks uses AWS r6i.16xlarge), it does not matter that much, since single-threaded CPU speeds have been pretty stagnant over the years, so the ANN-Benchmarks single-thread results can be transferred (unless Epsilla is using non-x64 hardware, which would be a weird choice). There is a constant overhead from communication between the nodes in Epsilla, but it is constant and should not affect the speed at high recalls (for which hnswlib is also faster).


Interesting project. How does it fare on scaling? We were evaluating vector DBs, and since we are in B2B SaaS, we are keen on sharding, scaling, and multi-tenancy features. Currently we are leaning towards Milvus. https://zilliz.com/comparison/qdrant-vs-milvus


I know this isn't your focus, but could this be added as an index type to pgvector? They recently added HNSW and have shipped IVFFlat for a while, so I wonder if that would be possible.


Regarding the embedding vectors: is there a maximum limit on their dimensionality? Also, can you share insights into how precision remains consistent at 99.9% even with high-dimensional vectors?


For now we don't put a limit on the dimensionality of the vectors, so the machine can hold as much as #vectors * #dimensions * sizeof(float) in memory. For now we only support dense vectors; in the future we will work on sparse vector support for much higher dimensionality. I think you are referring to the "curse of dimensionality" problem. Here are my thoughts: in a graph-based index such as SpeedANN or HNSW, each vector is a node in the graph, and the index is a nearest neighbor graph. Unlike spatial partition-based indices, the topology quality of the nearest neighbor graph is independent of the dimensionality of the vectors. Our benchmark uses 960-dimensional vectors, but we will do more experiments with sparse vectors in the future.
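
To put rough numbers on that formula (purely illustrative corpus size; index overhead not counted):

      # Memory footprint of the raw dense vectors alone.
      num_vectors = 1_000_000          # example: one million vectors
      dimensions = 960                 # e.g. the 960-dim benchmark dataset
      bytes_per_float = 4
      print(num_vectors * dimensions * bytes_per_float / 2**30, "GiB")   # ~3.6 GiB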


Why did you choose SpeedANN instead of other new indexes such as DiskANN? Also, you changed the color of Epsilla in every benchmark figure, which is quite confusing.


Thank you for sharing! DiskANN was published in 2019 and SpeedANN in 2022. DiskANN is a disk-based ANNS solution, focused on the scenario where the vectors don't fit into memory. SpeedANN is an in-memory solution that specializes in low-latency queries, which is the scenario we want to tackle for now. We can further extend our engine to support DiskANN and other index algorithms based on our customers' requirements. Thanks for pointing out the benchmark figures; we just fixed them to use consistent colors.


How long until you sell it to BigCompany? I just don’t get why there are numerous vector databases all with similar functionality.


You are right, there are numerous vector databases on the market. Most of them (including us) are still pretty early and have a lot of enterprise-readiness features to build: role-based / privilege-based access control, authN/authZ integration, data versioning and backup/restore, fault tolerance, data streaming in/out, etc. We have first-hand experience with enterprise-level product development and sales from our previous jobs at a Series D graph database startup, and we will apply what we learned there to make Epsilla enterprise-ready over the next few months.


Because wrapping HNSW is just that easy. Same with all the ChatGPT-based tools popping up a dozen a week; it's just easy to throw something together and see if it sticks.


IMHO, vector DBs need to scale horizontally. Simply running on a single host doesn't cut it anymore.


You are right. We designed our storage in a segment-based way, with configurable segment size, so it can scale horizontally in the future, across multiple workers on one machine and across a multi-machine cluster. Search then becomes a two-stage process: find the top K in each segment, then a global merger (which can also be horizontally scaled) merges the results from all segments.
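
A minimal sketch of that two-stage idea (hypothetical helper names, not our actual engine code):

      import heapq

      def search_segment(segment, query_vector, k):
          # Stage 1 (per segment, would run in parallel in practice):
          # each segment returns its local top-K as (distance, record_id) pairs.
          return segment.top_k(query_vector, k)   # hypothetical per-segment API

      def two_stage_search(segments, query_vector, k):
          # Stage 2: the global merger combines all local top-K lists
          # and keeps the K results with the smallest distances overall.
          candidates = []
          for segment in segments:
              candidates.extend(search_segment(segment, query_vector, k))
          return heapq.nsmallest(k, candidates, key=lambda pair: pair[0])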


Congrats on the launch!


“Hippocampus of AI” in a readme is a yellow flag


Thank you for pointing it out. We just removed it from the README.


Why is it a yellow flag?


I don't see how you compete against the 50 other providers. You just implemented a speedup on the graph search (maybe credit the authors of the paper you are ripping off in your GitHub repo?). However, this speedup trades overall throughput for latency. And even in the paper, it's not exact.

Maybe drop the disingenuous marketing and find something else to work on. The 50 other vector dbs will implement this trivial addition and you'll be left with nothing to show.

Sources: https://arxiv.org/abs/2201.13007 https://dl.acm.org/doi/pdf/10.1145/3572848.3577527



