The thing with humans is they will say “I don’t remember how many syllables a haiku has” and “what the hell is kubernetes?” No LLM can reliably produce a haiku because their lexing process deprives them of reliable information about syllable counts. They should all say “I’m sorry, I can’t count syllables, but I’ll try my best anyway.” But the current models don’t do that because they were trained on texts by humans, who can do haiku, and not properly taught their own limits by reinforcement learning. It’s Dunning Kruger gone berserk.
Eh, it's not D&K gone berserk, it's what happens when you attempt to compress reality down to a single dimension (text). If you're doing a haiku, you will likely subvocalize it to ensure you're saying it correctly. It will be interesting when we get multimodal AI that can speak and listen to itself to detect things like this.
The problem isn’t just that everything is text. It’s that everything is a Fourier transform of text in such a way that it’s not actually possible for an LLM to learn to count syllables.
Imagine you have a lot more computing resources in a multimodal LLM. It sees your request of count the syllables and realizes it can't do them from text alone (hell I can't and have to vocalize it). It then sends your request to a audio module and 'says' the sentence, then another listening module that understand syllables 'hears' the sentence.
This is how it works in most humans, now if you do this every day you'll likely make some kind of mental shortcut to reduce the effort needed, but at the end of the day there is no unsolvable problem on the AI side.