Thing is, when you consider the tasks you actually want to optimize the models for, quite a few things mentioned in this discussion - e.g. learning to capitalise correctly, write in all caps, count syllables, or act on specific letter counts - fall into the category of uninteresting things you don't want to waste parameters on. Sure, they'd help with some trick questions that hinge on the peculiarities of how exactly we encode stuff in letters, but that encoding is precisely what we want to abstract away from, going beyond the textual encoding (or verbal encoding, or pictures as rectangles of pixels) towards what the utterance means. Not only do we want to abstract away from spelling mistakes or variations, but also from much larger changes to the text, like different grammatical structures that say the same thing, or even the same thing said in a different language with a different alphabet.
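To make the point concrete, here's a minimal sketch of why letter counting is a question about the encoding rather than the meaning (it assumes the tiktoken package and its cl100k_base encoding, purely as an illustration - the same idea applies to any subword tokenizer):

```python
# Minimal sketch: the model never sees individual letters, only token ids,
# so letter-level questions are really questions about this encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "strawberry"
token_ids = enc.encode(word)
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]

# The word arrives as a handful of multi-character chunks, not letters.
print(token_ids)        # a few integer ids
print(pieces)           # e.g. ['str', 'aw', 'berry'] (exact split depends on the encoding)
print(word.count("r"))  # the answer a human gets directly from the letters
```

Any answer about the r's has to be reconstructed from those opaque chunks, which is exactly the kind of encoding peculiarity I'd argue isn't worth spending parameters on.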