One classic old estimate of the amount of information per English word, from Claude Shannon, is 11.82 bits per word. So very roughly, a 32-token text – the length they highlight to demonstrate their technique – should, treating tokens as roughly word-sized, be representable in about 378 bits.
One of the embeddings they demonstrate the use of their technique against is the `text-embedding-ada-002` OpenAI offering, which gives back a 1,536-dimension representation, where every dimension is a floating-point number.
So as a matter of theory, just in the sign-bits of those vectors, there's enough raw state to potentially encode 32-word texts, and more.
If those float dimensions are 4-byte floats, as is common, a single `text-embedding-ada-002` text vector takes (1536 * 4 bytes =) 6,144 bytes of storage (49,152 bits). While the dense/continuous nature of these values, and all the desirable constraints/uses packed into them, mean you won't be getting that much precise/lossless text representation out of them, it's plenty spacious for capturing a pretty good compression of some pretty long texts.
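As a quick back-of-envelope check of those numbers (my own illustration, not from the paper):

```python
# Back-of-envelope capacity estimates (illustrative only).

DIMS = 1536            # text-embedding-ada-002 output dimensions
BYTES_PER_FLOAT = 4    # assuming float32 storage

BITS_PER_WORD = 11.82  # Shannon's classic estimate for English
TEXT_LEN_WORDS = 32    # the short-text length highlighted in the paper

text_bits = BITS_PER_WORD * TEXT_LEN_WORDS  # ~378 bits of "information" in the text
sign_bits = DIMS                            # 1 sign bit per dimension = 1,536 bits
vector_bits = DIMS * BYTES_PER_FLOAT * 8    # 49,152 bits of raw vector storage

print(f"text:   ~{text_bits:.0f} bits")
print(f"signs:   {sign_bits} bits")
print(f"vector:  {vector_bits} bits")
```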
The interesting thing here is how often short texts can be perfectly or nearly perfectly recovered, via the authors' iterative method – even without that being an intended, designed-in capability of the text embedding.
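For intuition about the loop structure only – this is a toy sketch with hypothetical `embed` and `propose_correction` stand-ins, not the authors' actual trained-corrector method – the guess-and-correct idea looks roughly like:

```python
import numpy as np

def invert_embedding(target_vec, embed, propose_correction, steps=50):
    """Toy guess-and-correct loop; the real method trains a corrector model."""
    hypothesis = ""                      # start from an empty (or seed) guess
    for _ in range(steps):
        guess_vec = embed(hypothesis)    # re-embed the current hypothesis
        # cosine similarity between the guess's embedding and the target embedding
        sim = np.dot(guess_vec, target_vec) / (
            np.linalg.norm(guess_vec) * np.linalg.norm(target_vec)
        )
        if sim > 0.999:                  # close enough: stop iterating
            break
        # propose a new text conditioned on the target, the guess, and its embedding
        hypothesis = propose_correction(target_vec, hypothesis, guess_vec)
    return hypothesis
```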
You could also see the result as an example of the old saw, "compression is intelligence" (or vice-versa) – an insight which has motivated, among other things, the Hutter Prize.
Some interesting questions for future work on the same codebase could be:
• at what text length does accuracy peak, and at what length does it precipitously decline?
• are even the failures still reasonable summaries of the original text?
• if for some applications such recovery is undesirable – in many it's no problem – are there ways to change the models, with different/extra training/ablation/whatever, that retain other aspects of the embeddings' usefulness but impair verbatim text recovery? (One crude way to probe this is sketched just below.)
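On that last point, a very naive probe – my own sketch, not anything from the paper – would be to add calibrated noise to stored embeddings and measure how much nearest-neighbor search survives as the vectors get fuzzier:

```python
import numpy as np

def noised(embeddings: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add isotropic Gaussian noise and re-normalize, trading fidelity for fuzziness."""
    rng = np.random.default_rng(seed)
    noisy = embeddings + rng.normal(scale=sigma, size=embeddings.shape)
    return noisy / np.linalg.norm(noisy, axis=1, keepdims=True)

def top1_agreement(clean: np.ndarray, noisy: np.ndarray, queries: np.ndarray) -> float:
    """Fraction of queries whose nearest neighbor is unchanged after noising."""
    clean_nn = (queries @ clean.T).argmax(axis=1)
    noisy_nn = (queries @ noisy.T).argmax(axis=1)
    return float((clean_nn == noisy_nn).mean())
```

Sweeping the noise scale gives a crude fuzziness-vs-search-quality curve; whether any point on that curve also defeats a trained inverter is exactly the open question.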
Really insightful comment. You make a good point that if the embeddings were optimized for compression, we could probably store much longer texts; the interesting thing is that these embeddings are designed & trained to do something else (search / semantic similarity) yet we discover that they're still encoding the input pretty well.
I really like your follow-up research questions; here are my thoughts, in order:
- I'm also curious at what point text becomes difficult to reconstruct. Ada claims to embed text up to 8192 tokens, which seems quite long to me. However, training a decoder that can decode text that long would take a lot of compute, so we haven't tried it yet.
- This one we actually tested, and it's buried in the analysis section of our paper. The TL;DR is that the minimum BLEU score for reconstruction is, I think, around 20, which is actually pretty high. So for all reconstructed inputs we get pretty close, but we miss some nouns, swap word order, and get some syntax slightly wrong, which drops BLEU from 100 to 20 on that specific input and kills our exact match for some long out-of-domain texts. (A toy BLEU illustration follows after this list.)
- The question of whether designing privacy-preserving embeddings is even possible is also very compelling to me, and I don't know the answer. My instinct (and what some folks have indicated on Twitter in the last 24hr) is that it's really hard to preserve privacy and search performance at the same time, and there has to be some kind of tradeoff depending on how fuzzy you're willing to make the embeddings. But I'd love to be proved wrong on this.
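To make the BLEU point concrete, here's a toy example (my own, not from the paper's evaluation) using NLTK's `sentence_bleu`: a reconstruction that misses one noun and swaps some word order still overlaps heavily with the reference, yet scores far below 100.

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the quick brown fox jumps over the lazy dog near the old barn".split()
# Reconstruction with a missed noun ("barn") and shuffled word order.
reconstruction = "the quick brown fox jumps near the lazy dog over the old fence".split()

score = sentence_bleu(
    [reference], reconstruction,
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {100 * score:.1f}")  # far below 100 despite heavy n-gram overlap
```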
With regard to privacy preservation, there's also potential, from the oblivious-retrieval/homomorphic-encryption/zk-proof/etc. worlds, that one might be able to offer a frozen model/embedding set and prove to clients that…
(1) their text was faithfully converted to an embedding according to a disclosed technique & stable (but private) set of model-weights; and
(2) here are the top-N neighbor doc-IDs/docs by similarity
…without the client receiving the full model weights or even their text's precise embedding, and without the server seeing a cleartext version of the client's text or any usable info about what it contains.
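Purely as a sketch of the interface such a scheme would need – hypothetical message types, no actual cryptography; real constructions would have to come from the PIR/FHE/zk literature:

```python
from dataclasses import dataclass

@dataclass
class EmbedRequest:
    encrypted_text: bytes     # client's text, never seen in cleartext by the server

@dataclass
class RetrievalResponse:
    top_doc_ids: list[str]    # the top-N neighbor doc IDs by similarity
    embedding_proof: bytes    # proof the disclosed embedding technique + committed
                              # (but private) weights were applied faithfully
    ranking_proof: bytes      # proof the returned IDs really are the nearest neighbors

# The client verifies both proofs against a public commitment to the model weights,
# without ever receiving the weights themselves or its text's precise embedding.
```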