Interesting! So what I am hearing from this is that some day we will overcome this and have programs that can rap. And then companies with customer support bots will have a little checkbox for whether you want the bot to rap all its responses to you.
Technically, we know exactly what we have to do to GPT-3 to give it a chance to learn how to rhyme: get rid of BPEs and just use raw UTF-8 (or even just ASCII) as the encoding. Enlarge the model a bit, as necessary.
At least that's what I am getting from Gwern's write-up. I might have misunderstood Gwern, or Gwern might be wrong, of course.
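To make that concrete, here is a toy sketch of the difference (the byte_encode function and TOY_BPE_VOCAB below are invented for illustration; this is not GPT-3's actual tokenizer):

```python
# Illustrative only: contrast a raw byte encoding with a BPE-style vocabulary.
# Under BPE, frequent character sequences are merged into single opaque token
# IDs, so the model never directly sees the letters inside a word.

def byte_encode(text: str) -> list[int]:
    """Hypothetical byte-level encoding: one token per UTF-8 byte."""
    return list(text.encode("utf-8"))

# Toy stand-in for a learned BPE vocabulary.
TOY_BPE_VOCAB = {"rhyme": 1001, " rhyme": 1002, "rh": 2001, "yme": 2002}

print(byte_encode("rhyme"))    # [114, 104, 121, 109, 101] - every letter visible
print(TOY_BPE_VOCAB["rhyme"])  # 1001 - one opaque ID; the spelling is hidden
```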
Replacing BPEs not with characters but with a syllabary (is that a word? a vocabulary made of all possible syllables) would be even more powerful, and you could also run a preprocessing step to convert the text to a phonetic transcription, to build in the assumption that spelling is irrelevant and only pronunciation matters.
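Something like this toy preprocessing step, say (the mini-dictionary is hand-written in ARPAbet-style notation purely for illustration; a real pipeline would need full dictionary coverage plus a grapheme-to-phoneme model for unknown words):

```python
# Toy sketch of "train on a phonetic transcription instead of spelling".
# TOY_PRON is a hand-written, approximate dictionary for illustration only.

TOY_PRON = {
    "there":    ["DH", "EH1", "R"],
    "their":    ["DH", "EH1", "R"],
    "great":    ["G", "R", "EY1", "T"],
    "straight": ["S", "T", "R", "EY1", "T"],
}

def to_phones(text: str) -> list[str]:
    """Replace each known word with its phones; fall back to spelling for
    unknown words (a real system would use a grapheme-to-phoneme model)."""
    phones = []
    for word in text.lower().split():
        phones.extend(TOY_PRON.get(word, list(word)))
    return phones

print(to_phones("great straight"))
# ['G', 'R', 'EY1', 'T', 'S', 'T', 'R', 'EY1', 'T'] - the shared rhyme is now explicit
```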
A syllabary or a phonetic encoding like IPA would patch the rhyme problem, but it would sabotage you in still other ways. People expect a language model to be able to solve anagrams or reverse strings, for example. And there are a lot of things in written language which are connected to the exact spelling but not necessarily the phonetics; there is no list of what those things are or what not learning them would do to a model, and you would have to discover (or more likely, not discover, in the way that people keep not discovering the BPE sabotage) the drawbacks the hard way. So you have a Bitter Lesson tradeoff here: yeah, you can build in that concept instead of learning it from raw Unicode/bytes, and it will initially work better, but you are handicapping yourself in the long run. So I always say go for raw Unicode for the foreseeable future.
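For a concrete illustration of the kind of information that gets thrown away (toy transcriptions again, invented for illustration):

```python
# Character-level input keeps spelling tasks trivially representable; a
# phonetic encoding discards exactly that information. Toy transcriptions only.

TOY_PHONES = {"there": ("DH", "EH1", "R"), "their": ("DH", "EH1", "R")}

def reverse_spelling(word: str) -> str:
    # Reversal (and anagram-checking) is well-defined on characters...
    return word[::-1]

print(reverse_spelling("there"))                   # 'ereht'

# ...but 'there' and 'their' collapse to the same phone sequence, so a model
# that only ever sees phones cannot recover their spellings, reverse them,
# or solve anagrams involving them.
print(TOY_PHONES["there"] == TOY_PHONES["their"])  # True
```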
That would make the system more complicated by baking in extra assumptions.
And eg it would probably perform worse at translating from English to French than a naive system. (Unless you preprocess your corpus extensively to figure out when you get English and when you get a snippet of French.) GPT-3 is surprisingly good at translating English to French.
Another problem is that English does not have a single text-to-phonetic mapping: different accents pronounce words differently. For just one example:
> In most non-rhotic accents, if a word ending in written "r" is followed immediately by a word beginning with a vowel, the /r/ is pronounced, as in water ice. That phenomenon is referred to as "linking R". Many non-rhotic speakers also insert an epenthetic /r/ between vowels when the first vowel is one that can occur before syllable-final r (drawring for drawing). The so-called "intrusive R" has been stigmatized, but many speakers of Received Pronunciation (RP) now frequently "intrude" an epenthetic /r/ at word boundaries, especially if one or both vowels is schwa. For example, the idea of it becomes the idea-r-of it, Australia and New Zealand becomes Australia-r-and New Zealand, the formerly well-known India-r-Office and "Laura Norder" (Law and Order). The typical alternative used by RP speakers (and some rhotic speakers as well) is to insert a glottal stop wherever an intrusive R would otherwise have been placed.
Btw, even without any explicit pronunciation data, I would expect the system to get a pretty good idea of how pronunciation works by essentially using the same technique human linguists use to reconstruct old pronunciations:
They observe what kinds of mistakes people make.
Eg when you see people mixing up "there" and "their" and "they're" in the corpus, that tells you that in modern English these three are pronounced almost the same.
From spelling mistakes in words like carburetor you can figure out that unstressed vowels in English are pretty much all pronounced the same: as a schwa. https://en.wikipedia.org/wiki/Schwa
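A crude sketch of that inference (the list of observed swaps is invented; the point is just that confusion patterns cluster words that sound alike):

```python
# Treat each observed swap (one word written where another was intended) as
# evidence that the two sound alike, then read off connected components as
# candidate homophone sets. The error list below is invented for illustration.

from collections import defaultdict

observed_swaps = [
    ("there", "their"), ("their", "they're"),
    ("carburetor", "carburator"),   # unstressed vowel reduced to a schwa
    ("definitely", "definately"),
]

neighbors = defaultdict(set)
for a, b in observed_swaps:
    neighbors[a].add(b)
    neighbors[b].add(a)

def sounds_alike_group(word: str) -> set[str]:
    """All words reachable from `word` through observed confusions."""
    seen, stack = set(), [word]
    while stack:
        w = stack.pop()
        if w not in seen:
            seen.add(w)
            stack.extend(neighbors[w] - seen)
    return seen

print(sounds_alike_group("there"))  # {'there', 'their', "they're"}
```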
But you have to make those assumptions if you want to make a model for rhyming, because large quantities of raw prose contain no information whatsoever about which words rhyme, how they are pronounced, or where the stress falls. And of course that would be worse for other tasks - that's kind of the whole point. Discarding irrelevant information is the key thing that distinguishes learning from memorizing; for most use cases you do want to discard pronunciation and other nuances of representation to focus on the semantics, but for some (like this one) you may want to discard other parts instead, in order to focus on how the language sounds. The 'no free lunch' theorem is conceptually relevant even if it doesn't literally apply here.
Your example of their/they're/there provides some data because whole words can be mistaken for each other, but even with a billion pages of prose you won't get data to deduce that 'there' rhymes with 'pear' but not 'here', or that 'great' rhymes with 'straight' while 'cough', 'dough' and 'through' do not rhyme with each other. A model can't learn something that's simply not represented in the training data.
So you either have to bring in external information (e.g. pronunciation models, dictionary data with pronunciation guides, or audio recordings) or you have to have enough rhyming text inside your training data - i.e. train it on corpora of poetry instead of prose.
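To illustrate the dictionary route, a minimal sketch (the transcriptions below are hand-copied approximations of CMU Pronouncing Dictionary entries; the usual heuristic is that two words rhyme when their phones match from the last stressed vowel onward):

```python
# Minimal dictionary-based rhyme check. TOY_DICT holds approximate ARPAbet
# transcriptions, written out by hand for illustration; a real system would
# load the full CMU Pronouncing Dictionary or similar.

TOY_DICT = {
    "there":    ["DH", "EH1", "R"],
    "pear":     ["P", "EH1", "R"],
    "here":     ["HH", "IY1", "R"],
    "great":    ["G", "R", "EY1", "T"],
    "straight": ["S", "T", "R", "EY1", "T"],
    "dough":    ["D", "OW1"],
    "through":  ["TH", "R", "UW1"],
}

def rhyming_part(phones: list[str]) -> tuple[str, ...]:
    """Phones from the last stressed vowel onward, stress digits stripped."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i][-1] in "12":   # primary or secondary stress marker
            return tuple(p.rstrip("012") for p in phones[i:])
    return tuple(phones)

def rhymes(a: str, b: str) -> bool:
    return rhyming_part(TOY_DICT[a]) == rhyming_part(TOY_DICT[b])

print(rhymes("there", "pear"))      # True
print(rhymes("there", "here"))      # False
print(rhymes("great", "straight"))  # True
print(rhymes("dough", "through"))   # False - despite the shared spelling
```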
Also, I'm not sure that variation in pronunciation is critical - any accent variation that affects every instance of a given sound in the same way would preserve rhyming. There are exceptions; e.g. I recall a discussion about some passages of Shakespeare that rhymed perfectly in the pronunciation of that period but do not in modern English. But I think the variation among modern English accents should be mostly OK from that perspective.
> So you either have to bring in external information (e.g. pronunciation models, dictionary data with pronunciation guides, or audio recordings) or you have to have enough rhyming text inside your training data - i.e. train it on corpora of poetry instead of prose.
Yes, the latter. Just mix some rhyming text into your corpus. If you take eg Wikipedia, there's already plenty of rhyming stuff in there. Nursery rhymes, song lyrics, etc. Similar for other corpora.
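And if you wanted to check how much of a given corpus actually rhymes, or to upweight the parts that do, something as crude as this would get you started (rhymes here is assumed to be a dictionary-based check like the sketch above; the scoring rule and threshold are arbitrary illustrative choices):

```python
# Crude heuristic for spotting rhyming passages in a corpus: score a passage
# by the fraction of consecutive line endings that rhyme. `rhymes` is assumed
# to be a dictionary-based check like the sketch above.

def rhyme_density(passage: str, rhymes) -> float:
    """Fraction of consecutive line pairs whose final words rhyme."""
    endings = [line.split()[-1].strip(".,!?;:").lower()
               for line in passage.splitlines() if line.split()]
    if len(endings) < 2:
        return 0.0
    pairs = list(zip(endings, endings[1:]))
    return sum(rhymes(a, b) for a, b in pairs) / len(pairs)

def looks_like_verse(passage: str, rhymes, threshold: float = 0.5) -> bool:
    """Arbitrary cutoff for deciding a passage is verse-like."""
    return rhyme_density(passage, rhymes) >= threshold
```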