One classic estimate of the information content of English text, from Claude Shannon, is 11.82 bits per word. So, very roughly, a 32-token text – the length they highlight to demonstrate their technique – should be representable in about 379 bits.
One of the embeddings they demonstrate the use of their technique against is the `text-embedding-ada-002` OpenAI offering, which gives back a 1,536-dimension representation, where every dimension is a floating-point number.
So as a matter of theory, just in the sign-bits of those vectors, there's enough raw state to potentially encode 32-word texts, and more.
If those float-dimensions are 4-byte floats, as are common, a single `text-embedding-ada-002` text vector takes (1536 * 4 bytes =) 6,144 bytes of storage (49,152 bits). While the dense/continuous nature of these values, and all the desirable constraints/uses packed into them, means you won't be getting that much precise/lossless text-representation from the values, it's plenty spacious for capturing a pretty-good compression of some pretty-long texts.
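For the arithmetic-minded, here's that back-of-the-envelope comparison spelled out; the only outside figure is Shannon's 11.82 bits/word estimate, the rest follows from the ada-002 dimensions:

```python
# Rough information budget: Shannon's English-entropy estimate vs. the raw
# state in one text-embedding-ada-002 vector.
BITS_PER_WORD = 11.82            # Shannon's classic estimate
WORDS = 32                       # length highlighted in the paper
text_bits = BITS_PER_WORD * WORDS        # ~378 bits for the text itself

DIMS = 1536                      # ada-002 output dimensions
sign_bits = DIMS                 # one sign bit per dimension: 1,536 bits
vector_bits = DIMS * 4 * 8       # 4-byte floats: 49,152 bits (6,144 bytes)

print(f"~{text_bits:.0f} bits of text vs. {sign_bits} sign bits "
      f"and {vector_bits} total bits in the embedding")
```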
The interesting thing here is how often short texts can be perfectly or nearly-perfectly recovered via the authors' iterative method – even without that being an intended, designed-in capability of the text embedding.
You could also see the result as an example of the old saw, "compression is intelligence" (or vice-versa) – an insight which has motivated, among other things, the Hutter Prize.
Some interesting questions for future work on the same codebase could be:
• at what text length does accuracy peak, and at what length does it precipitously decline?
• are even the failures still reasonable summaries of the original text?
• if for some applications such recovery is undesirable – in many it's no problem – are there ways to change the models, with different/extra training/ablation/whatever, that retain other aspects of the embeddings' usefulness but impair verbatim text recovery?
Really insightful comment. You make a good point that if the embeddings were optimized for compression, we could probably store much longer texts; the interesting thing is that these embeddings are designed & trained to do something else (search / semantic similarity) yet we discover that they're still encoding the input pretty well.
I really like your follow-up research questions, here are my thoughts in order:
- I'm also curious at what point text becomes difficult to reconstruct. Ada claims to embed text up to 8192 tokens, which seems quite long to me. However, training a decoder that can decode text that long would take a lot of compute, so we haven't tried it yet.
- This one we actually tested, and it's buried in the analysis section of our paper. The TL;DR is that the minimum BLEU score for reconstruction is, I think, around 20, which is actually pretty high. So for all reconstructed inputs we get pretty close, but missing some nouns, swapping word order, and getting some syntax slightly wrong drops BLEU from 100 to 20 on that specific input and kills our exact match for some long out-of-domain texts (there's a small BLEU illustration after this list).
- The question of whether designing privacy-preserving embeddings is even possible is also very compelling to me, and I don't know the answer. My instinct (and what some folks have indicated on Twitter in the last 24hr) is that it's really hard to preserve privacy and search performance at the same time, and there has to be some kind of tradeoff depending on how fuzzy you're willing to make the embeddings. But I'd love to be proved wrong on this.
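To make the BLEU point concrete, here's a toy illustration of how missing a noun and swapping word order pulls sentence-level BLEU far below 100, using NLTK; the example sentences are invented, not taken from the paper's evaluation:

```python
# Illustrative only: how near-miss reconstructions lose BLEU.
# The example sentences are invented, not from the paper's data.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the hospital discharged the patient on the fourth of july".split()
candidate = "the clinic discharged the patient on july fourth".split()

smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {100 * score:.1f}")  # far below 100 despite similar meaning
```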
With regard to privacy-preservation, there's also potential from the oblivious-retrieval/homomorphic-encryption/zk-proof/etc worlds that one might be able to offer a frozen model/embedding-set, and prove to clients that…
(1) their text was faithfully converted to an embedding according to a disclosed technique and a stable (but private) set of model-weights; and
(2) these are the top-N nearest-neighbor doc-IDs/docs by similarity
…without the client receiving the full model weights or even their text's precise embedding, and without the server seeing a cleartext version of the client's text or any usable info about what it contains.
First author here! I thought it was interesting that so many people thought this was obvious after-the-fact. What we're trying to convey is that although inversion turns out to be possible, it's not trivial.
Here are the three levels of inversion model described in the paper:
(1) Training a decoder to invert embeddings gives you some similar text, but only with some overlapping words.
(2) Training an encoder-decoder with the embedding projected up to a 'sequence' for the encoder works better, and gets more words back, but almost never gets the right text back exactly, especially when sequences are long.
(3) Training this fancy architecture as a 'correction' model that recursively corrects and re-embeds text works really well. It turns out that this kind of search is necessary, and it can get many texts back exactly (see the sketch below).
A fourth option would be to simply scale the inversion model up by 10x or 100x, which would give us predictably better performance. We didn't try this because it's just not that interesting.
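For readers who want the shape of level (3), here's a minimal sketch of the correct-and-re-embed loop; `embed` and `corrector` are hypothetical stand-ins, not the paper's actual code, just the idea:

```python
# Hypothetical sketch of the iterative "correct and re-embed" loop from
# level (3). embed() and corrector are placeholder stand-ins, not the
# paper's actual implementation.
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def invert(target_emb: np.ndarray, embed, corrector, steps: int = 20) -> str:
    # Start from an initial guess produced by a plain embedding-to-text decoder.
    hypothesis = corrector.initial_guess(target_emb)
    for _ in range(steps):
        hyp_emb = embed(hypothesis)                  # re-embed the current guess
        if cosine_sim(hyp_emb, target_emb) > 0.999:  # close enough: stop early
            return hypothesis
        # Propose a corrected text conditioned on the target embedding,
        # the current hypothesis text, and the hypothesis's embedding.
        hypothesis = corrector.correct(target_emb, hypothesis, hyp_emb)
    return hypothesis
```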
Hey man this is going to sound really weird but when I just saw your name, it struck me as super familiar. Like I know I've seen your name on GitHub somewhere. I started going through your public Repos and I figured it out!
The first commit I ever made was on a fork of one of your projects over 5 years ago. I wasn't a software engineer then, I was just some guy learning to code in his free time.
Anyways I just wanted to say thank you. I know it may seem silly. But your code gave me the scaffolding I needed to take some important first steps towards where I am today. I'm now a senior software engineer/team lead!
Wow, that’s really wild! Yeah, I remember this — 8 years ago Pokémon Go came out and I was working on some projects to reverse engineer the API. Now I’m reverse engineering neural networks. Go figure.
I had an encounter like this in real life. Turns out Tom Scott was a really early contributor to POPFile. He was a teenager and I’d never met him. Had been enjoying his YouTube channel for a long time. One day he interviewed me at some conference and thanked me for accepting his commits when he was a teen.
> A fourth option would be to simply scale the inversion model up by 10x or 100x, which would give us predictably better performance. We didn't try this because it's just not that interesting.
To train a larger inversion model, I think we'll just need 16 A100s for a month or two. We can circle back in December, once the bigger model finishes training, to see that we've gotten better reconstruction performance. Fascinating!
Great paper! I have a question on how much this can be attributed to portions of text which GPT has already seen. Like if the same approach was followed on a new sliver of Wikipedia which GPT had not seen would the results be the same?
Why wouldn't they? If you think of embeddings as a learned hash, and your hash space is wide enough, the embedding would simply be another, lossless representation of the input. The challenge, of course, is that inverting hashes is difficult. Except in machine learning the hashes typically aren't intended to be hard to invert; they're intended to preserve semantic and syntactic relationships, as word2vec famously demonstrated. And there are even text embeddings that use sub-word information like character n-grams, which can trivially represent rare words, like the kind embodied by personal information.
edit: Given the author agrees, I suppose the research question is how well and cheaply you can do it, across different embedding algorithms. For practitioners, the lesson is to treat embeddings like personally identifiable information, as the authors state in the conclusion.
Agreed. Embeddings are pretty big (> 1024 * 4 bits), and language is really small (~1 bit per character). So it's not at all crazy that embeddings can be lossless. The paper gives a practical method to recover the text and shows how it applies generally. Here's a demo of how it works: https://twitter.com/srush_nlp/status/1712559472491811221
Diffusion based image generation is a kind of ‘reversing the embedding’, right?
The iterative error correction applied here almost feels like repeated noise reduction, too.
So does this mean you could use this sort of text recovery to do the same kinds of things image diffusers do, traversing the latent space and recovering images along a path - but instead recovering text that morphs?
Could we finally use this to discover the text that lies exactly halfway between the Declaration of Independence and the lyrics of ‘Baby Got Back’?
I guess not until it scales to more than 32 tokens. But we can dream.
It demonstrates a neat model that's specifically designed to let you embed text, manipulate the embeddings (combine them, average them or whatever) and then turn them back into text again. It's fascinating.
Just want to note that the difference in this paper is that it works without direct access to the embedding models (encoder). So it can't design the embedding space.
Thanks for sharing Simon! I will note that by training an adapter layer between this autoencoder's embedding space and OpenAI's, it's possible to recover a significant amount of detail from text-embedding-ada-002's embeddings with this model too[0]. But as the paper author's reply in a different thread points out, their iterative refinement approach is able to recover much more detail in their research with a smaller model.
During the Word2Vec era, I used to average a few word embeddings to get centroid embeddings. My observation was that the average embed was still close to all the original embeddings up to 5 concepts. I tested with similarity search. Can't pack too many distinct meanings into a single embed, but you can pack a few.
My only gripe with word embeds was that they were mixing synonymy and relatedness. Even worse, mixing up synonymy with antonymy - hot and cold are similar in a way, but also completely opposite.
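A minimal sketch of that centroid check, assuming gensim and some pretrained word2vec-format vectors (the file name below is just a common example, not the original setup):

```python
# Sketch of the centroid-embedding observation above. The vector file is an
# assumption; any word2vec-format file loadable by gensim will do.
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

words = ["river", "mountain", "forest", "valley", "lake"]
centroid = np.mean([kv[w] for w in words], axis=0)

# With only a handful of concepts averaged, the nearest neighbors of the
# centroid should still include most of the original words.
print(kv.similar_by_vector(centroid, topn=10))
```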
A vector representation of a text based on its meaning, generated by an ML model. It's the data format used as input for LLMs and can be used to compare the meaning of texts by comparing the vectors with each other.
Is it as simple as the euclidean distances of the vectors in N-dimensional space is intended to approximate the difference in meaning between two texts?
It's a machine learning method for turning a sample of text into a fixed-length vector. We usually try to find embeddings with the property that samples of text with similar meanings are mapped to vectors that are close to one another.
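A minimal illustration of that definition, using the sentence-transformers library (the model name is just a common small embedding model, not one from the paper):

```python
# "Similar meanings map to nearby vectors" in a few lines.
# The model name is an arbitrary small choice, not one used in the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "The cat sat on the mat.",
    "A kitten is resting on the rug.",
    "Quarterly revenue exceeded expectations.",
]
embeddings = model.encode(texts, convert_to_tensor=True)  # fixed-length vectors

# Cosine similarity is higher for the two sentences about a cat/kitten.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
print(util.cos_sim(embeddings[0], embeddings[2]).item())
```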
The claim of the paper is that you can store it losslessly! If you assume you have access to an LLM for free, then text is extremely compressible. Storing it in an embedding would be plenty of bits.
This is how you would implement a "medium term" memory. Folks in the "sentence-transformer" world have known this forever, yet the wider NLP world ignores it in the context of "chatbots" despite how powerful that and related concepts like "soft-prompts/textual inversion" are.
It's a wonderful technique and the fact that it's not used in ChatGPT and other tools like it is a shocking shame.