I fail to understand why I should use it over a different embedded vector DB like LanceDB or Chroma. Both are written in more performant languages and have a simple API with plenty of integrations, plus extra power if one needs it.
Python programmer for 15 years, and I picked up Rust not long ago to write an OAuth gateway; I had written it in Python beforehand. Rust DOES give you superpowers, especially if you compare it to something like Python, which is nowhere near as fast and has no typing.
There are plenty of examples of Python libraries that can be performant such as NumPy and PyTorch (which both rely on C/C++). Some libraries such as Hugging Face's tokenizers even use Rust.
NumPy is a C library with a Python frontend; moreover, a lot of its functionality is based on other existing C libraries like BLAS.
PyTorch, by its own description, is a Python binding into a monolithic C++ framework, and it also optionally depends on existing libraries like MKL.
> You can write performant code in any language if you try.
Unfortunately, only to a certain extent. Sure, if you just need to multiply a handful of matrices and you want your BLAS ops to be BLAS'ed, where the sheer size of the data outweighs any of your actual code, it doesn't really matter. Once you need to implement lower-level logic, i.e. traversing and processing the data in some custom way, especially without eating extra memory, you're out of luck with Python/NumPy and the rest.
I guess this is a pretty legitimate take, but in that case VectorDB looks like (from the git repo) it makes heavy use of libraries like PyTorch and NumPy.
If numpy is fast but "doesn't count" because the operations aren't happening in python, then I guess VectorDB isn't in python either by that logic?
On the other hand, if it is in Python despite shipping operations out to C/C++ code, then I guess numpy shows that can be an effective approach?
BLAS can be implemented in any language. In terms of LOC, most BLAS might be C libraries, but the best open source BLAS, BLIS, is totally structured around the idea of writing custom, likely assembly, kernels for a platform. So, FLOPs-wise it is probably more accurate to call it an assembly library.
LAPACK and other ancillary stuff could be Fortran or C.
Anyway, every language calls out to functions and runtimes, and compiles (or jits or whatever) down to lower level languages. I think it is just not that productive to attribute performance to particular languages. Numpy calls BLAS and LAPACK code, sure, but the flexibility of Python also provides a lot of value.
This is unfortunately not correct once you start pushing the boundaries and need careful management of memory, the CPU cache, and the CPU itself; see this table:
I don't accept that. In the referenced article you're pulling in stuff which I believe is written in a different language (probably C). If you use native python, I'm sure you would accept it would be much slower and take up much more memory. So we have to disagree here.
Yes, pure Python is slower and takes up more memory. But that doesn't mean it can't be productive and performant using these types of strategies to speed up where necessary.
With respect, I think you're clouding things by trying to defend what is really indefensible. Okay then.
> Where do you draw the line?
Drawing the line at native python, not pulling in packages that are written in another language. Packages written in python only are acceptable in this argument.
> But that doesn't mean it can't be productive and performant using these types of strategies to speed up where necessary.
No one said it couldn't. What we're saying is that pure Python is 'slow' and you need to escape from pure Python to get the speedups.
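A minimal sketch of what "escaping pure Python" looks like, using only the standard library so it stays self-contained. Both functions compute the same dot product; the second pushes the per-element loop into C-implemented builtins (`map`, `operator.mul`, `sum`), which is the same strategy NumPy applies on a much larger scale. The function names are made up for illustration, and how much faster the second version is depends on the interpreter and the data size.

```python
import operator


def dot_pure_python(a, b):
    """Dot product with an explicit loop run by the bytecode interpreter."""
    total = 0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_builtin_loop(a, b):
    """Same result, but the per-element loop runs inside C-implemented builtins."""
    return sum(map(operator.mul, a, b))


if __name__ == "__main__":
    a = list(range(1_000))
    b = list(range(1_000, 2_000))
    # Integer inputs keep the comparison exact.
    assert dot_pure_python(a, b) == dot_builtin_loop(a, b)
```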
I agree that pure Python isn't as fast as other options. Just comes down to a productivity tradeoff for developers. And it doesn't have to be one or the other.
> Python programmer for 15 years [...] [Python] has no typing
Ok, I have to call this statement out. Mypy was released 15 years ago, so Python has had optional static typing for as long as you've been programming in it, and you don't know about it?
I guess it's going to take another fifteen years for this 2008 trope to die.
I'm primarily a Python programmer, I love mypy and the general typing experience in Python (I think it's better than TypeScript - fight me), but are you seriously comparing it to something - anything - with proper types like Rust?
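To make the "optional" part concrete, here is a small sketch (the `scale` function is a made-up example). The annotations are there for a static checker such as mypy to read; the interpreter itself does not enforce them, which is one way this differs from a language where the compiler rejects the program outright.

```python
def scale(vector: list[float], factor: float) -> list[float]:
    """Multiply every component of `vector` by `factor`."""
    return [x * factor for x in vector]


# Type-correct call: fine for a static checker and fine at runtime.
print(scale([1.0, 2.0], 2.0))

# A static checker is expected to flag this call (a str is not a list[float]),
# yet the interpreter still runs it because hints are not checked at runtime.
print(scale("ab", 2))
```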
I've used Python type hints and mypy since long before I used TypeScript, and I have to say that TypeScript's take on types is just plain better (that doesn't mean it's good though).
1. More TypeScript packages are properly typed thanks to DefinitelyTyped. Some Python packages such as NumPy could not be properly typed last I checked, though I think that might change with 3.11. Packages such as OpenCV didn't have any types last I checked.
2. TypeScript's type system is more complete, with better support for generics, though this might change with 3.11/3.12.
3. TypeScript has a more powerful type system than most languages, as it is Turing-complete and similar in functionality to a purely functional language (this could also be a con).
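For reference on the generics point, this is roughly what a generic function looks like on the Python side with the long-standing `TypeVar` spelling (the `first` helper is a made-up example). Python 3.12 added a shorter `def first[T](...)` syntax, but the version below also runs on older interpreters.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def first(items: Sequence[T]) -> T:
    """Return the first element; a checker infers T from the argument type."""
    return items[0]


if __name__ == "__main__":
    number = first([1, 2, 3])      # a checker would infer int here
    letter = first(["a", "b"])     # and str here
    print(number, letter)
```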
Your language is fine (I’ve enjoyed your blog posts too, never gave it a thought that English wasn’t your first language), I just thought it was unnecessarily hurtful to say they must be a phony because they didn’t know something.
But, everybody else seems to agree so maybe I’ve been had.
Fair point, then you could claim it's similar to this DB with its reliance on Faiss. Despite that, Chroma at this point is more feature-rich.
I was mostly referring to this https://thedataquarry.com/posts/vector-db-1/
You are not wrong about the performance from Rust, but LanceDB is inherently written with performance in mind: SIMD support for both x86 and ARM, and an underlying vector storage format that's built for speed (Lance).
I've seen a number of projects come along over the last couple of years. I'm the author of txtai (https://github.com/neuml/txtai), which I started in 2020. How you approach performance is the key point.
Not necessarily a good thing when the product is made by a VC-backed startup that may die or pivot in six months, leaving you to maintain it yourself.
We needed a low-latency, on-premises solution that we can run on edge nodes, with sane defaults that anyone on the team can whip up in a sec. Also worth noting is that our use case is end-to-end retrieval of usually a few hundred to a few thousand chunks of text (for example in Kagi Assistant research mode) that need to be processed once at run time with minimal latency.
The result is this. We periodically benchmark the performance of different embeddings to ensure the best defaults:
I thought the API here was quite neat. It's fairly simple to implement a lancedb backend for it instead of sklearn/faiss/mrpt as the source code is really simple.
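A rough sketch of what a swappable search backend could look like. To be clear, `SearchBackend`, `BruteForceBackend`, and `retrieve` are hypothetical names I made up, not VectorDB's actual API, and the brute-force cosine search is a stand-in for whatever sklearn/faiss/mrpt/lancedb would do. A LanceDB-backed class would only need to provide the same `search` method.

```python
import heapq
import math
from typing import Protocol, Sequence


class SearchBackend(Protocol):
    """Hypothetical interface: anything with a matching `search` method fits."""

    def search(self, query: Sequence[float], k: int) -> list[tuple[int, float]]:
        ...


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class BruteForceBackend:
    """Stand-in backend: scores every stored vector against the query."""

    def __init__(self, vectors: Sequence[Sequence[float]]) -> None:
        self.vectors = list(vectors)

    def search(self, query: Sequence[float], k: int) -> list[tuple[int, float]]:
        scored = ((i, cosine(query, v)) for i, v in enumerate(self.vectors))
        # Keep the k highest-scoring (index, similarity) pairs.
        return heapq.nlargest(k, scored, key=lambda pair: pair[1])


def retrieve(backend: SearchBackend, query: Sequence[float], k: int = 2) -> list[int]:
    """Return indices of the k nearest chunks, whichever backend is plugged in."""
    return [index for index, _ in backend.search(query, k)]


if __name__ == "__main__":
    backend = BruteForceBackend([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(retrieve(backend, [1.0, 0.1], k=2))
```

For a few hundred to a few thousand chunks, a linear scan like this is often good enough, which is part of why keeping the backend behind one small method makes swapping cheap.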