Yup. Most common words have several tokens: the word itself, the word capitalised, the word with a leading space, and sometimes the all-caps form too.
I wonder if the embeddings could be explicitly configured to account for these “symmetries”. E.g.: instead of storing separate full copies of the “variants”, maybe keep a reduced representation with a common prefix and only a small subset of the embedding vector that is allowed to be learned?
This could force the model to correctly learn how to capitalise, make all-caps, etc…
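A minimal sketch of what that parameter sharing might look like (toy dimensions, random vectors standing in for learned weights; the `FREE_DIMS` split and the sharing scheme are my assumptions, not any model's actual layout):

```python
import random

DIM = 64        # full embedding dimension (toy size; real models use thousands)
FREE_DIMS = 8   # dimensions each variant may learn independently (assumption)

random.seed(0)

variants = ["hello", " hello", "Hello", " Hello", "HELLO"]

# Baseline: every surface variant stores its own full vector.
full = {v: [random.gauss(0, 1) for _ in range(DIM)] for v in variants}

# Proposed scheme: one shared base vector for the word, plus a small
# learnable slice per variant. Capitalisation, leading space, etc. must
# then be encoded entirely within those FREE_DIMS dimensions.
base = [random.gauss(0, 1) for _ in range(DIM)]
factored = {}
for v in variants:
    delta = [random.gauss(0, 0.1) for _ in range(FREE_DIMS)]
    padded = delta + [0.0] * (DIM - FREE_DIMS)  # untouched dims stay shared
    factored[v] = [b + d for b, d in zip(base, padded)]

full_params = len(variants) * DIM
factored_params = DIM + len(variants) * FREE_DIMS
print(f"full table: {full_params} params, factored: {factored_params} params")
```

For this toy vocabulary the factored table needs 104 parameters instead of 320, and the saving grows with the number of variants per word.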
There was some discussion of doing this for RWKV, but I don't think it has actually been implemented yet.
The goal is simply to speed up training slightly; it wouldn't actually make a difference to the final performance of a model as big as GPT-4 (except maybe by decreasing the prevalence of glitch tokens).
> wouldn't actually make a difference to the final performance
Doesn't that assume that the embeddings learned are in some sense "perfect"? Is that actually the case in practice?
I would expect the learned embeddings to have some errors, especially for the rarer ones that have few examples available for the model to learn from.
I also thought that explicitly accounting for symmetries generally improves model performance, because the model then doesn't waste parameters learning things that aren't unique and interesting pieces of information.
Thing is, when you consider the tasks you actually want to optimize the models for, quite a few things mentioned in this discussion - e.g. learning to capitalise correctly, make all-caps, count syllables, act on specific counts of letters - fall into the category of uninteresting things you don't want to waste parameters on. Sure, they'd help with some trick questions that hinge on the peculiarities of how exactly we encode stuff in letters, but that's the whole thing we want to abstract away: going beyond textual encoding (or verbal encoding, or pictures as rectangles of pixels) towards what the utterance means. Not only do we want to abstract away spelling mistakes or variations, but also much larger changes to the text, like different grammar structures for saying the same thing, or even saying the same thing in a different language in a different alphabet.
Try searching for different words using the search box here: https://observablehq.com/@simonw/gpt-tokenizer#cell-135