* Which distance functions does this support? Looks like it supports binary vectors already -- is Hamming distance supported?
* How does the performance compare with sqlite-vss? I'm curious about the profiling numbers -- both in terms of query speed, as well as memory usage.
Overall, this looks absolutely fantastic, and I love the direction you're heading with all of this.
> Though initially, sqlite-vec will only support exhaustive full-scan vector search. There will be no "approximate nearest neighbors" (ANN) options. But I hope to add IVF + HNSW in the future!
I think this is 1000% the correct approach -- kudos for not over-complicating things initially! I've shipped on-device vector search (128-bit binary vectors, Hamming distance) and even with a database size of 200k+ entries, it was still fast enough to do full brute-force distance search on every camera frame -- even running on crappy phones it was fast enough to get 10+ fps, and nicer phones were buttery-smooth. It's amazing how frequently brute-force is good enough.
That said, for implementing ANN algorithms like HNSW and whatnot, my first thought is that it would be slick if these could be accomplished with a table index paradigm -- so that switching from brute-force to ANN would be as simple as creating an index on your table. Experimenting with different ANN algorithms and parameters would be accomplished by adjusting the index creation parameters, and that would let developers smoothly evaluate and iterate between the various options. Maybe that's where your mind is going with it already, but I figured I would mention it just in case.
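Something along these lines, as purely hypothetical syntax (SQLite doesn't let extensions define custom index types like this today, so this is just to illustrate the idea):

-- plain table: brute-force full scans by default
CREATE VIRTUAL TABLE vec_items USING vec0(title_embeddings float[768]);

-- hypothetical: opt into ANN (and tune it) by creating or recreating an index
CREATE INDEX idx_title_hnsw ON vec_items(title_embeddings) USING hnsw(m = 32);
DROP INDEX idx_title_hnsw;  -- back to exact brute-force search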
re distance functions: currently L2 + cosine for float/int8 vectors, and Hamming for bit vectors. There are explicit vec_distance_l2()/vec_distance_cosine()/vec_distance_hamming() SQL functions, and the vec0 table will implicitly call the configured distance function on KNN queries.
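Rough sketch of what usage looks like (illustrative only; constructors like vec_bit() for bit vectors and the exact KNN query shape may differ a bit in the released API):

-- explicit distance functions, with small float vectors given as JSON text
SELECT vec_distance_l2('[1, 2, 3, 4]', '[5, 6, 7, 8]');
SELECT vec_distance_cosine('[1, 2, 3, 4]', '[5, 6, 7, 8]');
SELECT vec_distance_hamming(vec_bit(X'F0'), vec_bit(X'0F'));

-- KNN through a vec0 table: MATCH runs a full scan with the configured distance
CREATE VIRTUAL TABLE vec_demo USING vec0(embedding float[4]);
INSERT INTO vec_demo(rowid, embedding) VALUES (1, '[1, 2, 3, 4]'), (2, '[2, 3, 4, 5]');
SELECT rowid, distance
FROM vec_demo
WHERE embedding MATCH '[1, 2, 3, 3]'
ORDER BY distance
LIMIT 5;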
re comparison to sqlite-vss: In general, since sqlite-vss uses Faiss, it's much faster at KNN queries, and probably faster at full scans. Faiss stores everything in memory and uses multithreading for >10k vectors, so that's hard to beat. sqlite-vec, on the other hand, doesn't use as much memory (vectors are read chunk-by-chunk), but it's still relatively fast. There are SQLite settings like page_size/mmap_size that can make things go faster as well.
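For example (these are standard SQLite pragmas, nothing sqlite-vec-specific, and the values are just starting points to tune):

PRAGMA mmap_size = 268435456;   -- memory-map up to 256MB of the database file for reads
PRAGMA page_size = 8192;        -- larger pages can mean fewer page reads per vector chunk
VACUUM;                         -- page_size only takes effect on an existing, non-empty db after a VACUUM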
For writes (ie INSERT/UPDATE/DELETE), sqlite-vec is much, much faster. sqlite-vss requires a full index re-write on every write, even for a single vector. sqlite-vec, on the other hand, only writes to the affected vectors, so it's MUCH more performant than sqlite-vss in that workflow specifically. I view it as: sqlite-vss is more OLAP-focused (many fast reads, slow writes), and sqlite-vec is more OLTP-focused (fast-enough reads and fast writes).
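Concretely, writes are just normal SQL against the virtual table (illustrative, reusing the vec_demo table from above):

-- only the storage for the touched rows gets rewritten, no full index rebuild
INSERT INTO vec_demo(rowid, embedding) VALUES (3, '[7, 7, 7, 7]');
UPDATE vec_demo SET embedding = '[8, 8, 8, 8]' WHERE rowid = 3;
DELETE FROM vec_demo WHERE rowid = 3;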
I agree, brute-force Hamming distance is surprisingly fast, especially in resource-constrained environments! Really looking forward to more embedding models properly supporting binary vectors.
And yea, I'm still mulling over how ANN queries will work. My initial thought would be to add it to the vector column definition, like:
CREATE VIRTUAL TABLE vec_items USING vec0(
  title_embeddings float[768] indexed by HNSW(m=32)
)
Or something like that. But that means you'd need to recreate the table from scratch if you wanted to change the index. SQLite doesn't support custom index types, so this is tricky to do for virtual tables. Then there's the question of how you'd train the index in the first place, so lots to explore there.
Overall, awesome writeup, and fantastic project!!