Product Quantization: Compressing high-dimensional vectors (pinecone.io)
104 points by gk1 on Sept 2, 2021 | 22 comments


Nice article that explains product quantization very well, thanks!

PQ is really a nice compression technique. I implemented PQ and Optimized PQ [1] a while back in our word embedding package for Rust:

https://github.com/finalfusion/finalfusion-rust/

https://github.com/finalfusion/reductive/

Optimized PQ in particular was effective, reducing vector sizes ~10x with virtually no reconstruction loss. This made it much easier to ship models (no more 3GB embedding matrix alongside a neural net that is only a few megabytes in size).

[1] http://kaiminghe.com/publications/pami13opq.pdf
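
For anyone curious what the mechanics look like, here is a rough numpy/scikit-learn sketch of plain PQ (illustration only, not the API of the crates above): split each vector into m sub-vectors, learn a small codebook per subspace with k-means, and store one centroid index per sub-vector.

    # Toy PQ: each 128-dim float32 vector becomes m=8 one-byte codes.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 128)).astype(np.float32)   # toy embedding matrix
    m, k = 8, 256                                            # 8 subspaces, 256 centroids each
    d_sub = X.shape[1] // m

    codebooks, codes = [], []
    for i in range(m):
        sub = X[:, i * d_sub:(i + 1) * d_sub]
        km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(sub)
        codebooks.append(km.cluster_centers_)
        codes.append(km.labels_.astype(np.uint8))            # 1 byte per subspace
    codes = np.stack(codes, axis=1)                          # (n, m): 8 bytes per vector

    # Reconstruction: look up each sub-vector's centroid and concatenate.
    X_hat = np.hstack([codebooks[i][codes[:, i]] for i in range(m)])
    print("bytes/vector:", codes.shape[1], "vs", X.shape[1] * 4)
    print("mean squared reconstruction error:", float(np.mean((X - X_hat) ** 2)))

With these (illustrative) settings each vector shrinks from 512 bytes to 8; OPQ additionally learns a rotation of the space before splitting, which generally lowers the reconstruction error further.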


> with virtually no reconstruction loss

I suppose that depends on the particular application.


Having a passing knowledge of how ML works, I always wondered how these algorithms can use such dense data structures -- it makes a lot more sense now that I know they can be compressed and still searchable.


High-dimensional spaces are common in deep learning nowadays, and there is a lot of room in them: because points are so far apart in the original space, any quantization errors are probably insulated from causing incorrect recall.


I'm not sure what it means to have a recall of 50%... For nearest neighbour search, you would only find 5 of the top 10 nearest entries? What would be the algorithm of choice if I need to find 9 out of the top 10? Would it be LSH? (In our case, we have 1024 features, and plan to move to 2048; this is for image similarity search; a few million images)


Recall here means: given the top 100 results produced by a flat index, how many of those are returned in the top 100 of our PQ/IVFPQ index. It's worth noting that this is at the lower end of performance for PQ, though; using the same Sift1M dataset we can hit 76% with IVFPQ using nlist=256, m=32, and nbits=8 (rough sketch after the list below), and I'm sure this could be increased with a little more fine-tuning. A couple of options for getting 9 out of the top 10:

* IVFPQ seems reasonable but unlikely to hit 90%

* HNSW could be a fast and accurate alternative

* a Flat IVF index could definitely reach 90%, but it may be slow
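
For reference, a rough Faiss sketch of that configuration (nlist=256, m=32, nbits=8), measuring recall@100 against a flat index; random data stands in for Sift1M here, so the numbers won't match the 76% quoted.

    import numpy as np
    import faiss

    d, nb, nq, topk = 128, 100_000, 100, 100
    rng = np.random.default_rng(0)
    xb = rng.normal(size=(nb, d)).astype("float32")
    xq = rng.normal(size=(nq, d)).astype("float32")

    flat = faiss.IndexFlatL2(d)
    flat.add(xb)
    _, gt = flat.search(xq, topk)                       # exact ("ground truth") top-100

    quantizer = faiss.IndexFlatL2(d)
    ivfpq = faiss.IndexIVFPQ(quantizer, d, 256, 32, 8)  # nlist=256, m=32, nbits=8
    ivfpq.train(xb)
    ivfpq.add(xb)
    ivfpq.nprobe = 16                                   # probe 16 of the 256 cells
    _, approx = ivfpq.search(xq, topk)

    recall = np.mean([len(set(gt[i]) & set(approx[i])) / topk for i in range(nq)])
    print(f"recall@{topk}: {recall:.2%}")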


My understanding: it normally sits at the bottom of the ranking stack. You are looking at whether the final top 10 is in the top 10,000 (out of tens of millions) most of the time. After getting the top 10,000, you can do finer-grained ranking to get the final top 10.


Yes, so you get the top 10,000 using e.g. an IVF+PQ index, then from those get the top 10 using brute force.
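
Roughly like this (illustrative numpy sketch; in practice the candidate IDs come out of the IVF+PQ search rather than being sampled):

    import numpy as np

    def rerank(query, candidate_ids, vectors, final_k=10):
        # Exact L2 distances, but only over the candidate set, not the whole corpus.
        cands = vectors[candidate_ids]
        dists = np.linalg.norm(cands - query, axis=1)
        order = np.argsort(dists)[:final_k]
        return candidate_ids[order], dists[order]

    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(100_000, 128)).astype(np.float32)     # full-precision vectors
    query = rng.normal(size=128).astype(np.float32)
    candidate_ids = rng.choice(len(vectors), 10_000, replace=False)  # stand-in for IVF+PQ output
    top_ids, top_dists = rerank(query, candidate_ids, vectors)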


In addition to these other answers, a commonly used metric in the literature is k-Recall@R: the proportion of the top-k ground-truth vectors that are found in the top R ANN results. This parametrization can make comparisons between different approaches clearer.
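
Concretely, something like this (illustrative helper, not from any particular library):

    import numpy as np

    def k_recall_at_r(ground_truth_ids, approx_ids, k, r):
        # Fraction of the true top-k neighbours that appear in the approximate top-R.
        return len(set(ground_truth_ids[:k]) & set(approx_ids[:r])) / k

    # e.g. 10-Recall@100: did the true top 10 make it into the approximate top 100?
    print(k_recall_at_r(np.arange(100), np.arange(5, 205), k=10, r=100))  # -> 0.5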

For your case, you might find it helpful to look at the excellent ann-benchmarks.com, and in particular the benchmarks for GIST, which is an image embedding.


Can something like this be used in training as a way of temporarily 'compressing' parameter count?


Doing some form of PQ at train time is possible but typically the goal is to make the model’s embedding layers more robust to quantisation [1][2]. I did some work on this in the recommender systems space [3].

[1] https://arxiv.org/abs/1807.04629

[2] http://proceedings.mlr.press/v119/chen20l/chen20l.pdf

[3] http://ceur-ws.org/Vol-2431/paper10.pdf
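
For a rough flavour of the general idea (quantise in the forward pass, let gradients pass straight through so the embeddings learn to tolerate the rounding); this is a generic PyTorch sketch, not the exact method from the papers above:

    import torch

    def fake_quantize(x, num_levels=256):
        lo, hi = x.min(), x.max()
        scale = (hi - lo).clamp(min=1e-8) / (num_levels - 1)
        x_q = torch.round((x - lo) / scale) * scale + lo
        # Forward pass sees quantized values; backward pass treats this as identity.
        return x + (x_q - x).detach()

    embedding = torch.nn.Embedding(10_000, 64)
    ids = torch.randint(0, 10_000, (32,))
    vecs = fake_quantize(embedding(ids))   # downstream model consumes quantized embeddings
    loss = vecs.pow(2).mean()              # placeholder loss
    loss.backward()                        # gradients still flow into embedding.weight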


Really impressive and intriguing work, thanks for sharing!

I'd be specifically curious about applying PQ to transformers. It's quite depressing to me that ultra-large-scale model training is inaccessible to the average poor person like me. The dream would be to figure out some method of compressing the parameter count significantly (or rather making models more efficient) such that it is possible to train and/or run 100-billion or even 1/10/100-trillion+ parameter models on Colab, say, as 'crazy' as that probably sounds to most.


Vector quantisation is discussed briefly in the Russell & Norvig section on speech recognition. I'll paraphrase:

In speech recognition a sub-problem is to figure out how to encode the conditional distribution P(features|phone), where features is an n-dimensional vector, say with n in the hundreds, and there are about 50 possible values of phone. The simplest way to encode P(features|phone) would be as a lookup table -- but that is impractical for n > 2. The authors describe two ways to proceed:

Option A: discretise the feature space -- apply vector quantisation to partition the feature space into e.g. 256 regions. Replace each feature vector with a region label 1...256. Then the distribution P(features|phone) can be approximated as P(label|phone) and stored as a lookup table.

Option B: Don't discretise the feature space. Instead, approximate the distribution P(features|phone) by a mixture of k n-dimensional Gaussians, at the cost of storing k(n + n^2) parameters for the mean and covariance matrix of each Gaussian in the mixture.
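
To make the trade-off concrete, a back-of-the-envelope count using the numbers above (n = 100 dimensions, 50 phones, 256 VQ regions; k = 8 Gaussians is my own illustrative choice):

    # Option A: one probability per (region, phone) pair.
    n, phones, regions, k = 100, 50, 256, 8
    vq_table_entries = regions * phones           # 12,800 numbers in total

    # Option B: per phone, k Gaussians, each with an n-dim mean and n x n covariance.
    gmm_params_per_phone = k * (n + n ** 2)       # 80,800 numbers per phone
    print(vq_table_entries, gmm_params_per_phone, phones * gmm_params_per_phone)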

Russell and Norvig state that "vector quantization is no longer popular in large-scale systems" (this is from my ~2003 second edition of the book, in the context of speech recognition).

Can anyone who has more familiarity with the two applications explain why vector quantisation fell out of favour in speech recognition vs mixtures of Gaussians, and why vector quantisation is used in document/image similarity search? Are there approaches to document/image similarity search using continuous distributions that are competitive with vector quantisation approaches?

Or are these two problems fundamentally quite different, in that the speech subproblem is trying to infer a state from a very small state space (50 phones), while document search may be trying to infer a state from a much larger state space (millions of documents)?


Very interesting. Can anyone tell me what the relationship between PQ and k-d trees is? I'm too tired to fully work out whether they're two sides of the same coin or just related.


Related, in that they both create partitions of the space. Different in that the "leaves" (clusters induced by learned centroids) have arbitrary shapes, and the query relies on using the same tree to localize the point in the different (and disjoint) subspaces it occupies.


Fascinating, thanks!


I wonder what tool was used to create such beautiful diagrams.


They seem to be done using Excalidraw [0]. I'm not 100% sure about the animated one; probably excalidraw-animate [1].

[0] https://excalidraw.com/

[1] https://github.com/dai-shi/excalidraw-animate


Excalidraw + Adobe Animate - I had never seen excalidraw-animate before though, so that could be a new option!


Yep, you got it. We're big fans of Excalidraw!


How does this compare to Facebook AI Similarity Search?


Faiss is a vector search library that contains different index types, including one with product quantization (IVF-PQ), as shown in the article. It makes it easier to work with these indexes.

See this intro to Faiss: https://www.pinecone.io/learn/faiss-tutorial/

In case you meant how Pinecone compares to Faiss, the answer is similar to the above. Pinecone is a managed vector search service that contains vector search libraries, with added features, an API, and managed + distributed infrastructure to make it easy to run vector search at scale.

See comparison here: https://www.pinecone.io/managed-faiss/




