VectorDB: Vector Database Built by Kagi Search (vectordb.com)
317 points by promiseofbeans on Nov 26, 2023 | 93 comments



Dev here. Thanks for submitting. To be fair, this is not really a database, but a wrapper around a few primitives such as locally-run embeddings and FAISS/mrpt, with a ton of benchmarking behind the scenes to offer sane defaults that minimize latency.

Here is an example Colab notebook [1] where this is used to filter the content of the massive Kagi Small Web [2] RSS feed based on stated user interests:

[1] https://colab.research.google.com/drive/1pecKGCCru_Jvx7v0WRN...

[2] https://kagi.com/smallweb


This is a wrapper around FAISS, a vector search library. FAISS has a simple API, so it might be a better fit for your use case if you don't need the heavyweight libraries that VectorDB requires, which include PyTorch, TensorFlow, and Transformers.
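
For comparison, a minimal sketch of using FAISS directly (assuming you already have your embeddings as a float32 NumPy array):

    import faiss
    import numpy as np

    d = 384                                         # embedding dimension
    xb = np.random.rand(1000, d).astype("float32")  # corpus vectors (stand-in)
    xq = np.random.rand(5, d).astype("float32")     # query vectors (stand-in)

    index = faiss.IndexFlatL2(d)  # exact L2 search, no training step
    index.add(xb)                 # add corpus vectors
    D, I = index.search(xq, 4)    # distances + indices of 4 nearest neighbors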


This is possible, but how would you encode your vectors?


Bring your own embeddings. PyTorch and TensorFlow packages are 2GB+ each (don't quote me on that), which is unnecessary if you're making a network call to your favorite embedding service.
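
That pattern looks roughly like this — the endpoint and response shape below are placeholders for whatever embedding service you actually use:

    import faiss
    import numpy as np
    import requests

    EMBED_URL = "https://api.example.com/v1/embeddings"  # hypothetical service

    def embed(texts):
        resp = requests.post(EMBED_URL, json={"input": texts}, timeout=30)
        resp.raise_for_status()
        return np.array(resp.json()["embeddings"], dtype="float32")

    vectors = embed(["first document", "second document"])
    faiss.normalize_L2(vectors)                  # normalize in place for cosine
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine now
    index.add(vectors)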



For my use case, I wanted whole-paragraph embeddings (iirc these are trained on max-256-token sentence pairs).

Their simple suggestion for extending to longer texts is to pool/average sentence embeddings - but I'm not so sure I want that; for instance, it implies the order between sentences doesn't matter. If I were forced to use sentence transformers for my use case, the real fix would be to train an actual pooling model atop the sentence embedder, but I didn't want to do that either. At that point I stopped looking into it, but I'm certain there are newer models out nowadays that have both better encoders and handle much longer texts. The one nice thing about the sentence transformer models, though, is that they are much more lightweight than, e.g., a 7B-param language model.
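
For reference, the pooling approach in question is only a few lines (a sketch using sentence-transformers; the model name is just a common small default, and the naive sentence split stands in for a real segmenter):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def embed_paragraph(paragraph: str) -> np.ndarray:
        sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
        vecs = model.encode(sentences)  # (n_sentences, dim)
        return vecs.mean(axis=0)        # averaging discards sentence order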


For English-language embeddings, the bge models from the Beijing Academy of Artificial Intelligence outperform SBERT. There is a leaderboard somewhere on Hugging Face, but I can’t find it at the moment.


Agreed, and that's the question. If you take out the vector encoding part and just use FAISS, then what would be the proposal for encoding vectors?


... Whatever you want? Even with a vector DB, embeddings are typically BYO to begin with - dependencies not in your DB, computed in pipelines well before it. It's handy for small apps to do it in the DB and have in-DB support for some queries, but as things get big, doing everything in the DB gets weirder...

Edit: I see you run a vectordb company, so your question makes more sense


True, but the comment suggested removing the "heavyweight libraries". Unless you use an API service, those libraries will need to be imported somewhere; it doesn't necessarily have to be the same server the database runs on.


Vector embedding at the app tier & orchestration/compute tier makes more sense for managing the dependency than the vector DB tier for the bulk of ML/AI projects I've worked on. Round-tripping through the vector DB would be an architectural, code, and perf headache. Ex: just one of the many ways we use embeddings is as part of prefiltering what goes into the DB, running in a bulk pipeline in a diff VPC, in a way we don't want to interfere with DB utilization.

We generally avoid using embedding services to begin with... outside calls are the special case. Imagine something heavy like video transcription via Google APIs, not the typical case of 'just' text embedding. The actual embedding is generally one step of broader data wrangling, so there needs to be a good reason for doing something heavy and outside our control... which has been rare.

Doing it in the DB tier is nice for tiny projects, simplifying occasional business logic, etc., but generally it's not a big deal for us to run encode(str) when building a DB query.

Where DB embedding support gets more interesting to me here is layering additional representation changes on top, like IVF+PQ... But that can be done after, afaict? (And supporting raw vectors generically here, vs. having to align our Python & model deps to our DB's, is a big feature.)
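
(For anyone unfamiliar, that IVF+PQ layering looks roughly like this in raw FAISS — the parameters here are purely illustrative:)

    import faiss
    import numpy as np

    d = 384
    xb = np.random.rand(100_000, d).astype("float32")  # stand-in corpus

    nlist, m = 1024, 48               # IVF cells; PQ sub-quantizers (m must divide d)
    quantizer = faiss.IndexFlatL2(d)  # coarse quantizer for the IVF layer
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)  # 8 bits per code

    index.train(xb)                   # learn centroids + PQ codebooks
    index.add(xb)
    index.nprobe = 16                 # IVF cells to probe per query
    D, I = index.search(xb[:5], 10)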


If a team has an operational store already in Postgres, wouldn't it be best to just use the pgvector extension? Then the data and the vector search functionality sit together, and there's one less moving part in the tech stack to manage.
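
For concreteness, a sketch of what that looks like from Python (connection string and table are made up; assumes the pgvector extension is available):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical connection
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("CREATE TABLE IF NOT EXISTS items"
                " (id bigserial PRIMARY KEY, embedding vector(384))")
    conn.commit()

    # <-> is pgvector's L2 distance operator (<=> is cosine distance)
    qvec = "[" + ",".join(["0.1"] * 384) + "]"
    cur.execute("SELECT id FROM items ORDER BY embedding <-> %s::vector LIMIT 5",
                (qvec,))
    print(cur.fetchall())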


This recent thread discussed this - https://news.ycombinator.com/item?id=38416994


Kagi Search is amazing, I've used it for the last few months. If this is what they are using to power it, I'm optimistic.


From the GitHub repo:

> Thanks to its low latency and small memory footprint, VectorDB is used to power AI features inside Kagi Search.


Where does it save the data to? How is it persisted?

Is there any limitation to this? Does it work with texts of 500-1000 words? Does it work well with text that isn't full sentences, i.e. just a collection of phrases?


The README.md example just has it in memory. Looking at the source, the Storage class uses a plain file and Python's pickle module:

https://github.com/kagisearch/vectordb/blob/main/vectordb/st...
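
The pattern is roughly this (a simplified sketch of file-backed pickle persistence, not the actual source — see the link above):

    import pickle

    class Storage:
        def __init__(self, path: str):
            self.path = path

        def save(self, chunks, vectors):
            with open(self.path, "wb") as f:
                pickle.dump({"chunks": chunks, "vectors": vectors}, f)

        def load(self):
            with open(self.path, "rb") as f:
                data = pickle.load(f)
            return data["chunks"], data["vectors"]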


The two most interesting things to me in a "minimal" framework would be eliminating dependency on HF transformers and helping customize chunking.

Not a knock against this project, I see where it can be helpful.


My assumption is that this gives you the ability to locally encode vectors. This is useful for those not using API services to build their vectors.


Transformer inference is ~60 lines of NumPy [0] (closer to 500 once you add tokenization etc.). It would be nice to just have this and not all of PyTorch and Transformers.

[0] https://jaykmody.com/blog/gpt-from-scratch/
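
To give a flavor, the core attention operation from that approach is just a few lines of NumPy (only the central computation, not the full model from the linked post):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))  # shift for stability
        return e / e.sum(axis=-1, keepdims=True)

    def attention(q, k, v):
        # q, k, v: (seq_len, d_head) — scaled dot-product attention
        scores = q @ k.T / np.sqrt(q.shape[-1])
        return softmax(scores) @ v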


It's 60 lines for CPU-only inference, which'll be slow. If you want GPU acceleration, it'll be a lot more than 60 lines.


What about models besides GPT? Most of the popular vector encoding models aren't using this architecture.

If you really didn't want PyTorch/Transformers, you could consider exporting your models to ONNX (https://github.com/microsoft/onnxruntime).
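
Once exported, inference needs only the onnxruntime package — a sketch (the model path and input names depend on how you exported; these are typical for a BERT-style encoder):

    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession("model.onnx")  # placeholder path

    inputs = {
        "input_ids": np.array([[101, 7592, 102]], dtype=np.int64),  # example ids
        "attention_mask": np.array([[1, 1, 1]], dtype=np.int64),
    }
    outputs = sess.run(None, inputs)  # None = return all model outputs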


Yeah, I just tried this out (props to the devs, super easy to set up) and my main gripe is that the chunking algorithms aren't great - it could be a lot more useful with a context option that returns the surrounding text alongside search results. The sliding-window chunking method always cuts off the start of sentences.


I've found it works better to chunk by logical sections in the document, e.g. headers (h2, h3, h4, etc.) or numbered sections (1.1, 1.1.1, ...), plus to be able to ignore some stuff (header and footer), plus other customizations.

At least for use cases where there are clusters of many similarly formatted documents, it would be cool to have a way of easily customizing chunking.
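
Something like this gets you most of the way for markdown-ish documents (a rough sketch; real corpora would need format-specific logic):

    import re

    def chunk_by_headers(text: str):
        # Split before lines starting with ##, ###, or #### headers.
        parts = re.split(r"(?m)^(?=#{2,4} )", text)
        return [p.strip() for p in parts if p.strip()]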


Wonder why Crystal was not used.


Crystal is a super cool and underrated language. But why do you mention it here? Does Kagi use Crystal in other places?


Based on their job postings, they use it for most of their back-end: https://help.kagi.com/kagi/company/hiring-kagi.html#full-tim...


They use it for search afaik.


Yep [0], 70k lines of Crystal code for Kagi Search [1].

[0] https://help.kagi.com/orion/company/hiring-kagi.html#full-ti...

[1] CrystalConf talk from a Kagi tech lead: https://www.youtube.com/watch?v=r7t9xPajjTM


I can’t decide if I hate architecture-by-drawio or not.


Yes, it is made with Crystal.


I read somewhere that Crystal is typically adopted by Ruby developers.

Not sure if there is something similar for Python.


Nim maybe?


Nim is probably closest, but it's probably more influenced by Pascal, Modula-2, Oberon... something like that, iirc.


It is, but it's still fairly complex.

I wish there were something simpler but statically compiled with Python syntax - something like Crystal, but for Python developers.


Scala 3 with Braceless Syntax (scala-native for x86)

https://docs.scala-lang.org/scala3/reference/other-new-featu...


Kotlin seems to be the Java trend now; anyway, I don't know much about Java.


Not sure if it helps, but as someone who is a big fan of scripting languages like Python and someone who loves playing with all kinds of programming languages...I found Scala, Clojure, and Kotlin to be a much more frustrating experience than Python in a lot of ways. I wouldn't say there is enough in common.


What is being used to actually create the embeddings?


https://github.com/kagisearch/vectordb/blob/453bb658bb710838...

Looks like it uses one of these, depending on your settings:

Fast model: google/universal-sentence-encoder/4

Multilingual model: universal-sentence-encoder-multilingual-large/3

Normal model (Alternative): BAAI/bge-small-en-v1.5

Best model: BAAI/bge-base-en-v1.5


The library is open source. Here's where you can see how they're creating the embeddings. https://github.com/kagisearch/vectordb/blob/main/vectordb/em...


Is there any kind of comparison of the different vector DBs? What would you choose for different use cases? How do they differ?


This thread from a few months ago is a good read - https://news.ycombinator.com/item?id=36943318


I always run into the issue of loading existing embeddings. For instance, I want to embed a folder which has five files, two of which are new; is there a way to only add the two new files to the stored embeddings using this or ChromaDB?


Keep track of the files, and when you see new ones, only add those.
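
A minimal version of that bookkeeping (a sketch — the manifest path is arbitrary, and whatever call adds vectors to your store would consume the returned paths):

    import hashlib, json, os

    MANIFEST = "embedded_files.json"

    def new_files(folder):
        seen = {}
        if os.path.exists(MANIFEST):
            with open(MANIFEST) as f:
                seen = json.load(f)
        todo = []
        for name in os.listdir(folder):
            path = os.path.join(folder, name)
            if not os.path.isfile(path):
                continue
            with open(path, "rb") as f:
                h = hashlib.sha256(f.read()).hexdigest()
            if seen.get(name) != h:  # new or changed file
                todo.append(path)
                seen[name] = h
        with open(MANIFEST, "w") as f:
            json.dump(seen, f)
        return todo  # embed and store only these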



Super cool! Looks like I can make a local search engine out of the massive datahoard of PDF books and articles I have.


This, or can anyone suggest some other DB/lib for local Q&A-style testing on a Mac?


Just posted this thread that gives an in-depth look at LLM frameworks (local ones included) - https://news.ycombinator.com/item?id=38422264


It shouldn’t be legal to call your vectordb vectordb


Someone decided to call a piece of software "Informatica".

A British recruiter was quite insistent on having a call… it turns out it was because she read "informatica" on my profile (as in "laurea in informatica", an Italian computer science degree). She asked me how many years of experience I had with the tool, and I replied "I just heard of it 2 minutes ago when you first mentioned it". Then she got mad at me for having written the word "informatica" on my profile.


Well, they do have the .com domain, so maybe an exception applies here.


Sometimes I wonder if Microsoft was actually on to something with their naming scheme.

Something like “Kagi vector search for databases” at least doesn’t leave anything up for misinterpretation.


I prefer it to, say, "vectr" (such contractions used to be popular), or "barbershop" or a similar single irrelevant-word name, as is popular now.


At least with "barbershop" it becomes much more searchable. "barbershop db" will very likely succeed.


Good point. It trades off searchability with understanding (or getting an indication of) what it is when you first hear of it.


I think there are two ways to name things, and the cattle-or-pet metaphor is useful here.

If the typical user will use the thing once and then forget about it, or if they are not going to keep it in mind very often, make sure to give it a descriptive name.

But for some things, a descriptive name might not be the most important aspect. Having a findable or memorable name might matter more. And sometimes a descriptive name might actually become a problem, because the thing's function might change over time.


I fail to understand why I should use it over a different embedded vector DB like LanceDB or Chroma. Both are written in more performant languages and have a simple API with a lot of integrations, and power if one needs it.


To be fair, Chroma is also written in Python. And while LanceDB and others are written in Rust, that doesn't automatically give it super powers.


Python programmer for 15 years, and I picked up Rust to write an OAuth gateway not long ago; I wrote it in Python beforehand. Rust DOES give you superpowers, especially if you compare it to something like Python, which isn't anywhere near as fast and has no typing.


There are plenty of examples of Python libraries that can be performant, such as NumPy and PyTorch (which both rely on C/C++). Some libraries, such as Hugging Face's tokenizers, even use Rust.

I referenced this article below but will reference it again here too. https://neuml.hashnode.dev/building-an-efficient-sparse-keyw....

You can write performant code in any language if you try.


NumPy is a C library with a Python frontend; moreover, lots of its functionality is based on other existing C libraries like BLAS, etc.

PyTorch, quoting themselves, is a Python binding into a monolithic C++ framework, also optionally depending on existing libs like MKL, etc.

> You can write performant code in any language if you try.

Unfortunately, only to a certain extent. Sure, if you just need to multiply a handful of matrices and you want your BLAS ops to be BLAS'ed, where the sheer size of the data outweighs any of your actual code, it doesn't really matter. Once you need to implement lower-level logic, i.e. traversing and processing the data in some custom way, especially without eating extra memory, you're out of luck with Python/NumPy and the rest.


> NumPy is a C library with Python frontend

I guess this is a pretty legitimate take, but in that case VectorDB looks like (from the git repo) it makes heavy use of libraries like PyTorch and NumPy.

If NumPy is fast but "doesn't count" because the operations aren't happening in Python, then I guess VectorDB isn't in Python either by that logic?

On the other hand, if it is in Python despite shipping operations out to C/C++ code, then I guess NumPy shows that can be an effective approach?


BLAS can be implemented in any language. In terms of LOC, most BLAS might be C libraries, but the best open source BLAS, BLIS, is totally structured around the idea of writing custom, likely assembly, kernels for a platform. So, FLOPs-wise it is probably more accurate to call it an assembly library.

LAPACK and other ancillary stuff could be Fortran or C.

Anyway, every language calls out to functions and runtimes, and compiles (or jits or whatever) down to lower level languages. I think it is just not that productive to attribute performance to particular languages. Numpy calls BLAS and LAPACK code, sure, but the flexibility of Python also provides a lot of value.

How does Numba fit into this hierarchy?


This is unfortunately not correct once you start pushing the boundaries, requiring careful management of memory, the CPU cache, and the CPU itself; see this table:

https://stratoflow.com/efficient-and-environment-friendly-pr...


I don't accept that. In the referenced article you're pulling in stuff which I believe is written in a different language (probably C). If you use native Python, I'm sure you would accept it would be much slower and take up much more memory. So we have to disagree here.


Where do you draw the line? Most of CPython is written in C including the arrays package (https://docs.python.org/3/library/array.html) mentioned in that article.

Yes, pure Python is slower and takes up more memory. But that doesn't mean it can't be productive and performant using these types of strategies to speed up where necessary.


With respect, I think you're clouding things by trying to defend what is really indefensible. Okay then.

> Where do you draw the line?

Drawing the line at native Python, not pulling in packages that are written in another language. Packages written only in Python are acceptable in this argument.

> But that doesn't mean it can't be productive and performant using these types of strategies to speed up where necessary.

No one said it couldn't. What we're saying is that pure Python is 'slow' and you need to escape from pure Python to get the speedups.


I agree that pure Python isn't as fast as other options. Just comes down to a productivity tradeoff for developers. And it doesn't have to be one or the other.


Agreed, then!


So to make Python fast you just need to write a library in another language. Brilliant.


If you read the article referenced, I discussed a number of ways to write performant Python such as using this package (https://docs.python.org/3/library/array.html).


> Python programmer for 15 years [...] [Python] has no typing

Ok, I have to call this statement out. Mypy was released 15 years ago, so Python has had optional static typing for as long as you've been programming in it, and you don't know about it?

I guess it's going to take another fifteen years for this 2008 trope to die.
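
For anyone who hasn't seen it, optional static typing in Python looks like this; running mypy over the file flags the bad call without executing anything (error text approximate):

    def greet(name: str) -> str:
        return "hello " + name

    greet(42)  # mypy: Argument 1 to "greet" has incompatible type "int"; expected "str"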


I'm primarily a Python programmer, I love mypy and the general typing experience in Python (I think it's better than TypeScript - fight me), but are you seriously comparing it to something - anything - with proper types like Rust?


> I think it's better than TypeScript - fight me

I used Python type hints and MyPy since long before I used TypeScript, and I have to say that TypeScript's take on types is just plain better (that doesn't mean it's good though).

1. More TypeScript packages are properly typed, thanks to DefinitelyTyped. Some Python packages such as NumPy could not be properly typed last I checked, though I think that might change with 3.11. Packages such as OpenCV didn't have any types last I checked.

2. TypeScript's type system is more complete, with better support for generics, this might change with 3.11/3.12 though.

3. TypeScript has more powerful type system than most languages, as it is Turing-complete and similar in functionality to a purely functional language (this could also be a con)


> but are you seriously comparing it to something - anything - with proper types like Rust?

Re-reading my comment, no, I did not. I said it has static typing.


> I have to call this statement out.

Why? That was just mean for no reason!


Is that mean? Sorry, English is not my native language. I just meant that I have to express my doubt of the veracity of the statement.


Your language is fine (I’ve enjoyed your blog posts too, never gave it a thought that English wasn’t your first language), I just thought it was unnecessarily hurtful to say they must be a phony because they didn’t know something.

But, everybody else seems to agree so maybe I’ve been had.


I didn't mean to say they are a phony, just that that statement is inaccurate/poorly thought out.


English is my native language. "I have to call out" is a perfectly fine (and polite) way to express doubt of veracity.


Python does have typing. Although it doesn't feel as "first class" as in Rust or Go, it gets the job done.


Yea everyone should just rewrite EVERYTHING in Rust! /s


Fair point; then you could claim it's similar to this DB with its reliance on FAISS. Despite that, Chroma at this point is more feature-rich. I was mostly referring to this: https://thedataquarry.com/posts/vector-db-1/

You are not wrong about the performance from Rust, but LanceDB is inherently written with performance in mind: SIMD support for both x86 and ARM, and an underlying vector storage format that's built for speed (Lance).


I've seen a number of projects come out over the last couple of years. I'm the author of txtai (https://github.com/neuml/txtai), which I started in 2020. How you approach performance is the key point.

You can write performant code in any language. For example, for standard keyword search, I wrote a component to make sparse/keyword search just as efficient as Apache Lucene in Python. https://neuml.hashnode.dev/building-an-efficient-sparse-keyw....
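
To give a flavor of why that's feasible, the hot loop in sparse keyword search reduces to array math that NumPy executes natively (a toy BM25 scorer, not the txtai implementation):

    import numpy as np

    def bm25_scores(tf, df, doc_lens, n_docs, k1=1.2, b=0.75):
        # tf: term frequency per doc; df: number of docs containing the term
        idf = np.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        norm = k1 * (1 - b + b * doc_lens / doc_lens.mean())
        return idf * tf * (k1 + 1) / (tf + norm)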


>more feature rich

Not necessarily a good thing when the product is made by a VC-backed startup that may die or pivot in six months, leaving you to maintain it yourself.


It is faster!

We needed a low-latency, on-premise solution that we can run on edge nodes, with sane defaults that anyone on the team can use on a whim in a sec. Also worth noting is that our use case is end-to-end retrieval of usually a few hundred to a few thousand chunks of text (for example in Kagi Assistant research mode) that need to be processed once at run time with minimal latency.

Result is this. We periodically benchmark the performance of different embeddings to ensure best defaults:

https://github.com/kagisearch/vectordb#embeddings-performanc...


I thought the API here was quite neat. It's fairly simple to implement a LanceDB backend for it instead of sklearn/FAISS/mrpt, since the source code is so simple.

This repo is basically just a nice API plus the needed chunking and batching logic. Using LanceDB, you'd still have to write that yourself, as exemplified here: https://github.com/prrao87/lancedb-study/blob/main/lancedb/i...
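
E.g., the LanceDB side of such a backend is only a few calls (a sketch from memory — verify the exact API against the current lancedb docs):

    import lancedb

    db = lancedb.connect("./vectors.lance")  # local, embedded storage
    table = db.create_table(
        "chunks",
        data=[{"vector": [0.1, 0.2], "text": "hello"}],
    )
    results = table.search([0.1, 0.2]).limit(5).to_list()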


Same for me. I started using Chroma about a year ago, I am used to it, and if I am using Python I look no further.

When I use Common Lisp or Racket I roll my own simple vector embeddings data store, but that is just me having fun.


https://github.com/wallabag/wallabag

No one has mentioned Wallabag yet, so I wanted to. It's been working well for me - has apps and extensions. If you're not excited to self-host, https://www.wallabag.it/en has been flawless with the exorbitant price of… 11 euro a year.


This isn't even slightly related to a vector database. I like Wallabag but this comes off as a shameless plug.


Wrong article I think; there was another post about a bookmark manager on here.


It happens. Rereading my comment, I hope it didn't come across as too rude.



