One classic old estimate of the amount of information per English word, from Claude Shannon, is 11.82 bits per word. So very roughly, a 32-token text – the length they highlight to demonstrate their technique – should, treating tokens as roughly word-sized, be representable in about 378 bits.
One of the embeddings they demonstrate the use of their technique against is the `text-embedding-ada-002` OpenAI offering, which gives back a 1,536-dimension representation, where every dimension is a floating-point number.
So as a matter of theory, just in the sign-bits of those vectors, there's enough raw state to potentially encode 32-word texts, and more.
If those float dimensions are 4-byte floats, as is common, a single `text-embedding-ada-002` text vector takes (1536 * 4 bytes =) 6,144 bytes of storage (49,152 bits). While the dense/continuous nature of these values, and all the desirable constraints/uses packed into them, mean you won't be getting that much precise/lossless text representation out of them, it's plenty spacious for capturing a pretty good compression of some pretty long texts.
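As a quick back-of-envelope check of those numbers (my own illustration, not from the paper):

```python
# Back-of-envelope capacity estimates (illustrative only).

DIMS = 1536            # text-embedding-ada-002 output dimensions
BYTES_PER_FLOAT = 4    # assuming float32 storage

BITS_PER_WORD = 11.82  # Shannon's classic estimate for English
TEXT_LEN_WORDS = 32    # the short-text length highlighted in the paper

text_bits = BITS_PER_WORD * TEXT_LEN_WORDS  # ~378 bits of "information" in the text
sign_bits = DIMS                            # 1 sign bit per dimension = 1,536 bits
vector_bits = DIMS * BYTES_PER_FLOAT * 8    # 49,152 bits of raw vector storage

print(f"text:   ~{text_bits:.0f} bits")
print(f"signs:   {sign_bits} bits")
print(f"vector:  {vector_bits} bits")
```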
The interesting thing here is how often short texts can be perfectly or nearly perfectly recovered, via the authors' iterative method – even without that being an intended, designed-in capability of the text embedding.
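For intuition about the loop structure only – this is a toy sketch with hypothetical `embed` and `propose_correction` stand-ins, not the authors' actual trained-corrector method – the guess-and-correct idea looks roughly like:

```python
import numpy as np

def invert_embedding(target_vec, embed, propose_correction, steps=50):
    """Toy guess-and-correct loop; the real method trains a corrector model."""
    hypothesis = ""                      # start from an empty (or seed) guess
    for _ in range(steps):
        guess_vec = embed(hypothesis)    # re-embed the current hypothesis
        # cosine similarity between the guess's embedding and the target embedding
        sim = np.dot(guess_vec, target_vec) / (
            np.linalg.norm(guess_vec) * np.linalg.norm(target_vec)
        )
        if sim > 0.999:                  # close enough: stop iterating
            break
        # propose a new text conditioned on the target, the guess, and its embedding
        hypothesis = propose_correction(target_vec, hypothesis, guess_vec)
    return hypothesis
```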
You could also see the result as an example of the old saw, "compression is intelligence" (or vice-versa) – an insight which has motivated, among other things, the Hutter Prize.
Some interesting questions for future work on the same codebase could be:
• at what text length does accuracy peak, and at what length does it precipitously decline?
• are even the failures still reasonable summaries of the original text?
• if for some applications such recovery is undesirable – in many it's no problem – are there ways to change the models, with different/extra training/ablation/whatever, that retain other aspects of the embeddings' usefulness but impair verbatim text recovery? (One crude way to probe this is sketched just below.)
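On that last point, a very naive probe – my own sketch, not anything from the paper – would be to add calibrated noise to stored embeddings and measure how much nearest-neighbor search survives as the vectors get fuzzier:

```python
import numpy as np

def noised(embeddings: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add isotropic Gaussian noise and re-normalize, trading fidelity for fuzziness."""
    rng = np.random.default_rng(seed)
    noisy = embeddings + rng.normal(scale=sigma, size=embeddings.shape)
    return noisy / np.linalg.norm(noisy, axis=1, keepdims=True)

def top1_agreement(clean: np.ndarray, noisy: np.ndarray, queries: np.ndarray) -> float:
    """Fraction of queries whose nearest neighbor is unchanged after noising."""
    clean_nn = (queries @ clean.T).argmax(axis=1)
    noisy_nn = (queries @ noisy.T).argmax(axis=1)
    return float((clean_nn == noisy_nn).mean())
```

Sweeping the noise scale gives a crude fuzziness-vs-search-quality curve; whether any point on that curve also defeats a trained inverter is exactly the open question.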
Really insightful comment. You make a good point that if the embeddings were optimized for compression, we could probably store much longer texts; the interesting thing is that these embeddings are designed & trained to do something else (search / semantic similarity) yet we discover that they're still encoding the input pretty well.
I really like your follow-up research questions; here are my thoughts, in order:
- I'm also curious at what point text becomes difficult to reconstruct. Ada claims to embed text up to 8192 tokens, which seems quite long to me. However, training a decoder that can decode text that long would take a lot of compute, so we haven't tried it yet.
- This one we actually tested, and it's buried in the analysis section of our paper. The TL;DR is that the minimum BLEU score for reconstruction is, I think, around 20, which is actually pretty high. So for all reconstructed inputs we get pretty close, but we miss some nouns, swap word order, and get some syntax slightly wrong, which drops BLEU from 100 to 20 on that specific input and kills our exact match for some long out-of-domain texts. (A toy BLEU illustration follows after this list.)
- The question of whether designing privacy-preserving embeddings is even possible is also very compelling to me, and I don't know the answer. My instinct (and what some folks have indicated on Twitter in the last 24hr) is that it's really hard to preserve privacy and search performance at the same time, and there has to be some kind of tradeoff depending on how fuzzy you're willing to make the embeddings. But I'd love to be proved wrong on this.
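To make the BLEU point concrete, here's a toy example (my own, not from the paper's evaluation) using NLTK's `sentence_bleu`: a reconstruction that misses one noun and swaps some word order still overlaps heavily with the reference, yet scores far below 100.

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the quick brown fox jumps over the lazy dog near the old barn".split()
# Reconstruction with a missed noun ("barn") and shuffled word order.
reconstruction = "the quick brown fox jumps near the lazy dog over the old fence".split()

score = sentence_bleu(
    [reference], reconstruction,
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {100 * score:.1f}")  # far below 100 despite heavy n-gram overlap
```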
With regard to privacy preservation, there's also potential, from the oblivious-retrieval/homomorphic-encryption/zk-proof/etc. worlds, that one might be able to offer a frozen model/embedding set and prove to clients that…
(1) their text was faithfully converted to an embedding according to a disclosed technique & stable (but private) set of model-weights; and
(2) here are the top-N neighbor doc-IDs/docs by similarity
…without the client receiving the full model weights or even their text's precise embedding, and without the server seeing a cleartext version of the client's text or any usable info about what it contains.
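Purely as a sketch of the interface such a scheme would need – hypothetical message types, no actual cryptography; real constructions would have to come from the PIR/FHE/zk literature:

```python
from dataclasses import dataclass

@dataclass
class EmbedRequest:
    encrypted_text: bytes     # client's text, never seen in cleartext by the server

@dataclass
class RetrievalResponse:
    top_doc_ids: list[str]    # the top-N neighbor doc IDs by similarity
    embedding_proof: bytes    # proof the disclosed embedding technique + committed
                              # (but private) weights were applied faithfully
    ranking_proof: bytes      # proof the returned IDs really are the nearest neighbors

# The client verifies both proofs against a public commitment to the model weights,
# without ever receiving the weights themselves or its text's precise embedding.
```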