
Faiss GPU author here. Vector binarization via LSH for dense, high-dimensional vectors (say 30 to 2000ish dimensions) tends to have poor recall and result quality. As the Annoy author implied here as well, there is a lot of literature on LSH, but that's because it is easier to analyze mathematically and prove theorems about, not because it produces good-quality results compared to graph-based (HNSW etc) or geometric cell probe-based (IVF etc) methods. LSH is not typically used in industrial-scale settings for these reasons.
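
To make "binarization via LSH" concrete, here is a minimal random-hyperplane (SimHash-style) sketch in plain NumPy. This is an illustration of the general technique, not Faiss's implementation; the dimensions, bit count, and function names are all made up for the example.

```python
# Minimal random-hyperplane LSH binarization sketch (illustrative only).
# Each vector is reduced to nbits sign bits; search then happens in Hamming space.
import numpy as np

rng = np.random.default_rng(0)
d, nbits, n = 128, 64, 10_000            # illustrative sizes

data = rng.standard_normal((n, d)).astype(np.float32)
planes = rng.standard_normal((d, nbits)).astype(np.float32)  # random hyperplanes

def binarize(x):
    # Project onto the hyperplanes and keep only the sign per bit.
    return x @ planes > 0

codes = binarize(data)                    # (n, nbits) binary codes

def hamming_search(query, k=10):
    qcode = binarize(query[None, :])
    dists = np.count_nonzero(codes != qcode, axis=1)   # Hamming distances
    return np.argsort(dists)[:k]

query = rng.standard_normal(d).astype(np.float32)
print(hamming_search(query))
```

The lossy step is exactly the sign quantization: two vectors that are close in the original space can still land on opposite sides of several hyperplanes, which is where the recall loss described above comes from.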

However, for extremely high-dimensional sparse vectors (think tens of thousands to millions of dimensions), LSH does remain the only practical approximate technique to my knowledge. Online settings (where you need to insert vectors into the database quickly) may also prefer LSH, since the insertion overhead is smaller than updating a graph or finding the IVF cell in which to place the vector.
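
One common LSH family for such sparse data (my example, not something specified above) is MinHash, which treats a vector as the set of its nonzero feature indices and estimates Jaccard similarity. A minimal sketch, with made-up hash parameters:

```python
# MinHash-style LSH signatures for very sparse vectors (illustrative sketch).
# A vector is represented only by its set of nonzero dimensions.
import numpy as np

rng = np.random.default_rng(0)
num_hashes = 64
prime = 2_147_483_647                     # a large prime for universal hashing
a = rng.integers(1, prime, size=num_hashes, dtype=np.int64)
b = rng.integers(0, prime, size=num_hashes, dtype=np.int64)

def minhash(active_indices):
    # active_indices: the nonzero dimensions of one sparse vector.
    idx = np.asarray(active_indices, dtype=np.int64)
    # One universal hash per signature slot; keep the minimum over the set.
    hashed = (a[:, None] * idx[None, :] + b[:, None]) % prime
    return hashed.min(axis=1)             # (num_hashes,) signature

x = minhash([3, 17, 99_000, 250_000])
y = minhash([3, 17, 99_000, 700_000])
# Fraction of matching slots estimates Jaccard similarity (here 3/5 = 0.6).
print((x == y).mean())
```

Insertion is just computing a fixed-size signature per vector, which is why the online overhead is low compared to a graph or IVF update.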

Below ~30 dimensions, exact non-brute-force methods (like k-d trees or other binary space partitioning methods) work OK, but above that they have high overheads, because the true nearest neighbor is still quite likely to lie on the far side of any dividing hyperplane (the split doesn't partition the dataset well; in high dimensions, everything is usually "near" everything else in practice).
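
A quick way to see this (a sketch of my own, using scipy's k-d tree and arbitrary sizes) is to compare the nearest-neighbor distance to the average distance as the dimension grows:

```python
# Exact k-d tree search works well in low dimensions; in high dimensions
# distances concentrate, so hyperplane splits prune very little.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
n = 20_000                                 # illustrative dataset size

for d in (3, 256):
    data = rng.standard_normal((n, d))
    query = rng.standard_normal(d)
    tree = cKDTree(data)
    dist, idx = tree.query(query, k=1)     # exact nearest neighbor
    mean_dist = np.linalg.norm(data - query, axis=1).mean()
    # In low d the nearest point is much closer than a typical point;
    # in high d the ratio approaches 1, so partitioning buys little.
    print(d, dist / mean_dist)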


