
When using BERT (and the many things like it, such as the earlier ELMo and ULMFiT, and the later RoBERTa/ERNIE/ALBERT/etc) as the 'embeddings', you provide as input all the tokens in a sequence. You don't get an "embedding for word foobar in position 123" on its own; you get an embedding for the whole sequence at once, so whatever corresponds to that token is a 768-dimensional "embedding for word foobar in position 123 conditional on all the particular other words that were before and after it". Including very long-distance relations.

One of the simpler ways to try that out in your code seems to be running bert-as-service https://github.com/hanxiao/bert-as-service , or alternatively the huggingface libraries that are discussed in the original article.
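
For the huggingface route, a minimal sketch of pulling per-token contextual vectors might look roughly like this (assuming the `transformers` and `torch` packages and the stock bert-base-uncased checkpoint; not the only way to do it):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("The mouse ran across the floor", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, sequence_length, 768): one
    # 768-dimensional vector per input token, conditioned on the whole sentence.
    token_vectors = outputs.last_hidden_state[0]
    print(token_vectors.shape)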

It's kind of the other way around compared to word2vec-style systems; before that you used to have a 'thin' embedding layer that's essentially just a lookup table, followed by a bunch of complex layers of neural networks (e.g. multiple Bi-LSTMs followed by a CRF); in the 'current style' you have "thick embeddings", which means running through all the many transformer layers of a pretrained BERT-like system, followed by a thin custom layer that's often just glorified linear regression.
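
In code, that "thick embeddings plus thin head" shape is roughly the following (a sketch only; the class name and NUM_LABELS are made up for illustration, assuming huggingface `transformers` and PyTorch):

    import torch
    from torch import nn
    from transformers import AutoModel

    NUM_LABELS = 5  # e.g. tag set size for a sequence-labeling task

    class BertWithLinearHead(nn.Module):
        def __init__(self):
            super().__init__()
            # "thick embeddings": the full pretrained transformer stack
            self.encoder = AutoModel.from_pretrained("bert-base-uncased")
            # "thin custom layer": essentially a per-token linear classifier
            self.head = nn.Linear(self.encoder.config.hidden_size, NUM_LABELS)

        def forward(self, input_ids, attention_mask):
            hidden = self.encoder(input_ids=input_ids,
                                  attention_mask=attention_mask).last_hidden_state
            return self.head(hidden)  # (batch, seq_len, NUM_LABELS) logits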




> in the 'current style' you have "thick embeddings", which means running through all the many transformer layers of a pretrained BERT-like system, followed by a thin custom layer that's often just glorified linear regression.

Would you say they are still usually called "embeddings" when using this new style? This sounds more like just a pretrained network which includes both some embedding scheme and a lot of learning on top of it, but maybe the word "embedding" stuck anyway?


They do seem to still be called "embeddings", although yes, that's become a bit of a misnomer.

However, the analogy is still somewhat meaningful: if you want to look at the properties of a particular word or token, it's not just a general pretrained network, it still preserves the one-to-one mapping between each input token and the output vector corresponding to it, which is very important for all kinds of sequence labeling or span/boundary detection tasks. So you can use them much like word2vec embeddings - for example, word similarity or word difference metrics computed on 'transformer-stack embeddings' would work just as well as with word2vec (though you'd have to aggregate up to word-level measurements instead of wordpiece or BPE subword tokens), with the added bonus of having done contextual disambiguation; you could probably build a decent word sense disambiguation system just by directly clustering these embeddings - the mouse-as-animal and mouse-as-computer-peripheral occurrences should have clearly different embeddings.
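
As a rough illustration of that point (the sentences and the helper function are made up, assuming the same `transformers`/`torch` setup as above; the two senses of "mouse" should come out noticeably less similar than same-sense pairs):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def vector_for(sentence, word):
        # contextual vector for `word` inside `sentence`, averaging its wordpieces
        enc = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]
        word_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
        ids = enc["input_ids"][0].tolist()
        for i in range(len(ids) - len(word_ids) + 1):
            if ids[i:i + len(word_ids)] == word_ids:
                return hidden[i:i + len(word_ids)].mean(dim=0)
        raise ValueError("word not found in sentence")

    mouse_animal = vector_for("The mouse hid from the cat in the barn", "mouse")
    mouse_device = vector_for("I plugged a new mouse into my laptop", "mouse")
    print(torch.cosine_similarity(mouse_animal, mouse_device, dim=0))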



