It might have memorized a summary of the book from the data it's been trained on. GPT-3 is impressive, but it's kind of hard to control exactly what it's going to do. It recognized that what you input was from Pride and Prejudice, remembered a summary of the book from somewhere, and turned that into a rap, which is both impressive and not really what you asked for.
Deep learning theory has shown that these huge neural networks can memorize basically any dataset they're given. The useful ones find ways to interpolate from everything they've memorized to new contexts.
But they seem terrible at anticipating what people want or find entertaining, which is what a lot of human interaction and entertainment is about. GP wanted a rap but got a shitty poem.
That's because it's just a language model. It's been trained to find a probable completion to a piece of text, predict a likely next word. It's not trained for human interaction. It's not an agent. It has no motives or goals. It might seem like it does, but that's more of a side-effect of language modelling.
It's beyond me that Facebook halved the value of its data by not including even a privately tracked 'dislike' button.
The value in what people both like and dislike is going to be massive in the next few years, as it can help train the evaluation of content at both general and personalized scales.
I'd wager that with where the tech is today, Tinder could generate synthetic matches that would be far preferred over any natural matches, even at equal or lower physical-attractiveness scores in the generated photos.
Pandora might be relevant again for music soon as their like/dislike data could allow for the generation of new music in line with preferred tastes and not just selection.
Transformer models like GPT-3 are only part of the equation, and as we continue to improve not only the generation of new data but the evaluation of data too -- the application of this technology is likely going to move faster over the next few years than anything we've seen before.
Codex generated 35% of the code written by people with access to it, and Microsoft and OpenAI have the data on what was and wasn't selected. That data alone could produce an AI linter with significant value in catching the subtle errors programmers naturally make - and what about in combination with Codex itself?
(Keep in mind it was only summer 2020 when the tweet showcasing normal GPT-3 writing HTML was blowing minds, and only last summer Copilot was announced.)
We really haven't seen anything yet.
Things are moving so fast that anyone building products that rely on third-party AI at a foundational level right now should absolutely be planning, in their product design and implementation, for whatever they are building on top of to be obsolete within ~3 years.
Dislike buttons create a feeling of negativity that would have turned users off engaging with Facebook, which would have meant their overall data collection ability dropped. The loss of the button itself is not so big, because they have other ways to measure engagement and can probably predict quite well which items someone would have disliked anyway.
Obligatory mention: GPT-3 cannot learn to rhyme, it can only brute-force memorize rhyme-pairs, because the BPE encoding erases phonetic information from the encoding of words that a GPT-3 (and similar models like Gopher or GPT-J or Jurassic-1) is trained on. https://www.gwern.net/GPT-3#bpes
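To make "erases phonetic information" concrete, here is a minimal sketch (assuming the `tiktoken` package and its GPT-2 vocabulary; the word list is arbitrary). The tokenizer maps each word to opaque integer IDs, and nothing in those IDs reflects how the word sounds:

    # Illustrative only: peek at the BPE pieces a GPT-2-style tokenizer produces.
    # Assumes `pip install tiktoken`; the word list is arbitrary.
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")  # same BPE vocabulary family GPT-2/GPT-3 use

    for word in ["great", "straight", "cough", "dough", "through", "outgrabe"]:
        ids = enc.encode(word)
        pieces = [enc.decode([i]) for i in ids]
        # The IDs are arbitrary integers: rhyming words like "great" and
        # "straight" share nothing here, and an invented word like "outgrabe"
        # gets split into pieces that needn't line up with its syllables.
        print(word, ids, pieces)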
This is why that rap looks so weird: so high-quality in general, yet so totally lacking a critical element we expect from rap. GPT-3 has a pretty good idea how to write rap, but it just doesn't know what words sound like. It is blind to that. It can't see that difference between rap and poems. To it, rap is just a sort of poetry, and some words are chosen for reasons it can't understand, no matter how many rap lyrics or poems it looks at.
The better the models get at writing, the more this difference throws us for a loop: "how can it be so good at ABC, and yet fail so totally at D?"
Totally disagree. The encoding scheme could give the model an inductive bias which would make rhyming easier. But the lack of such an inductive bias in no way prevents it from understanding rhymes. This model has 10^11 parameters! The embeddings for the encoding are maybe 0.1% of that - probably less I’d guess without looking it up. Converting to a different encoding would use a very small fraction of the model’s computational power.
I'm not talking about inductive bias. I am talking about it being erased from the data before inductive biases or parameters ever have a chance to matter. What you are proposing is tantamount to expecting a model to be able to generate colorful Impressionist masterpieces while trained on solely monochrome bitmaps. Yeah, doesn't work like that.
Doesn't matter how many parameters it has; and as a matter of fact, the GPT models have not improved their rhyming noticeably even as they go from a hundred million parameters to over a hundred billion parameters.
(And I have extensive evidence consistent with my interpretation at the link, ranging from several years of failure by many people to prove me wrong by eliciting rhymes even a fraction as good as its non-rhyme poetry, to fixing GPT-3's performance on tasks by working around BPEs, to different models from different groups trained with the same data encoding showing the same utter failure to rhyme.)
I appreciate that you've analyzed GPT-3's inability to rhyme well. And maybe some of that comes from BPE. But saying that "GPT-3 _cannot_ rhyme" is a strong and unsupportable statement. What, really, would the difference be between memorizing a rhyming dictionary and being able to apply it, versus "actually" learning to rhyme? Because GPT-3 can certainly do the former, so why can't it do the latter?
Now if you ran an experiment comparing MLM (or any LM) on rhyming tasks with different encodings, then you could certainly make a statement like "BPE is worse at rhyming than other encodings" and it would be scientifically supportable. And that very well might be true. But your extreme conclusion is not supportable.
What would the difference be? That's very obvious - just think about any kind of comic or light verse!
A rhyme dictionary would still not replicate human rhyming capabilities. Think about neologisms or misspellings. A model can memorize every single entry in the rhyming dictionary (and let's say this somehow cashes out as apparent rhyming proficiency, in being able to recall, for every word, an entry of valid other words from the rhyme dictionary), but it would not be able to write something like "Jabberwocky", inventing a bunch of new words or phrases or names which rhyme. (How would it know to rhyme "wabe" and "outgrabe" when they appear in no dictionaries - because they were just invented?) A model which has "actually" learned to rhyme would be able to take new words (not necessarily invented by it, but possibly invented by humans after it was trained, or invented on the spot for a prompt, or part of a new fictional work like worldbuilding) and rhyme them appropriately. A model which has memorized a rhyming dictionary would not.
I'm genuinely confused why you take such an extreme position on this issue. You seem like you understand some things about how neural networks operate. So I'd assume you understand their ability to interpolate between examples to new situations they've never literally seen before - what is commonly referred to as "generalization" in ML, which is really the key concept in the entire field of ML. But for some reason you've decided this simply can't apply to rhyming for the world's most advanced language model. Your choice buddy.
Interesting! So what I am hearing from this is that some day we will overcome this and have programs that can rap. And then companies with customer support bots will have a little check box if you want the bot to rap all responses to you.
Technically, we know exactly what we have to do to GPT-3 to give it a chance to learn how to rhyme: get rid of BPEs and just use raw UTF-8 (or even just ASCII) as the encoding. Enlarge the model a bit, as necessary.
At least that's what I am getting from Gwern's write-up. I might have misunderstood Gwern, or Gwern might be wrong, of course.
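As a toy illustration of what the raw-UTF-8 alternative would look like (just my reading of the idea): the encoding becomes one small integer per byte, lossless, with every letter of every word visible to the model instead of being hidden inside BPE tokens.

    # Toy illustration of a byte-level encoding: one integer per byte, lossless,
    # so spelling (and thus letter-level structure) is fully visible to the model.
    text = "The mome raths outgrabe"
    ids = list(text.encode("utf-8"))
    print(ids)                         # e.g. 84, 104, 101, 32, ...
    print(bytes(ids).decode("utf-8"))  # round-trips back to the original text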
Replacing BPEs not with characters but with a syllabary (is that a word? a vocabulary made of possible syllables) would be even more powerful, and you could also run preprocessing to convert text to a phonetic transcription, to enforce the idea that spelling is irrelevant but pronunciation matters.
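Roughly like this, as a sketch of that preprocessing step (assuming the `pronouncing` package, which wraps the CMU Pronouncing Dictionary; out-of-dictionary words just pass through unchanged):

    # Rough sketch of the phonetic-preprocessing idea: replace each word with its
    # ARPAbet phones from the CMU Pronouncing Dictionary, so the model trains on
    # pronunciation rather than spelling. Assumes `pip install pronouncing`.
    import pronouncing

    def to_phones(text):
        out = []
        for word in text.lower().split():
            phones = pronouncing.phones_for_word(word)  # "great" -> ["G R EY1 T"]
            out.append(phones[0] if phones else word)   # keep unknown words as-is
        return " | ".join(out)

    print(to_phones("the grey cat sat on a great straight gate"))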
A syllabary or a phonetic encoding like IPA would patch the rhyme problem, but it would sabotage you in still other ways. People expect a language model to be able to solve anagrams or reverse strings, for example. And there are a lot of things in written language which are connected to the exact spelling but not necessarily the phonetics; there is no list of what those things are or what not learning them would do to a model, and you would have to discover (or more likely, not discover, in the way that people keep not discovering the BPE sabotage) the drawbacks the hard way. So you have a Bitter Lesson tradeoff here: yeah, you can build in that concept instead of learning from raw Unicode/bytes, and it will work initially better, but you are handicapping yourself in the long run. So, I always say go for raw Unicode for the foreseeable future.
That would make the system more complicated by baking in extra assumptions.
And eg it would probably perform worse at translating from English to French than a naive system. (Unless you preprocess your corpus extensively to figure out when you get English and when you get a snippet of French.) GPT-3 is surprisingly good at translating English to French.
Another problem is that English has not just one text-to-phonetic transcription: different accents pronounce words differently. For just one example:
> In most non-rhotic accents, if a word ending in written "r" is followed immediately by a word beginning with a vowel, the /r/ is pronounced, as in water ice. That phenomenon is referred to as "linking R". Many non-rhotic speakers also insert an epenthetic /r/ between vowels when the first vowel is one that can occur before syllable-final r (drawring for drawing). The so-called "intrusive R" has been stigmatized, but many speakers of Received Pronunciation (RP) now frequently "intrude" an epenthetic /r/ at word boundaries, especially if one or both vowels is schwa. For example, the idea of it becomes the idea-r-of it, Australia and New Zealand becomes Australia-r-and New Zealand, the formerly well-known India-r-Office and "Laura Norder" (Law and Order). The typical alternative used by RP speakers (and some rhotic speakers as well) is to insert a glottal stop wherever an intrusive R would otherwise have been placed.
Btw, even without any explicit pronunciation data, I would expect the system to get a pretty good idea of how pronunciation works by essentially using the same technique human linguists use to reconstruct old pronunciations:
They observe what kinds of mistakes people make.
Eg when you see people mixing up "there" and "their" and "they're" in the corpus, that tells you that in modern English these three are pronounced almost the same.
From spelling mistakes in words like carburetor you can figure out that unstressed vowels in English are pretty much all pronounced the same: as a schwa. https://en.wikipedia.org/wiki/Schwa
But you have to make those assumptions if you want to make a model for rhyming, because large quantities of raw prose do not contain any information whatsoever about which words rhyme, how they are pronounced or where the accents lie. And of course that would be worse for other tasks - that's kind of the whole point. Discarding irrelevant information is the key part that distinguishes learning from memorizing; for most use cases you do want to discard the pronunciation and other nuances of representation to focus on the semantics, but for some (like this) you may want to discard other parts in order to focus on how the language sounds. The 'no free lunch theorem' is conceptually relevant even if it doesn't literally apply here.
Your example of their/they're/there has some data because the whole words can be mistaken for each other, but even if you take a billion pages of prose, you won't get data to deduce that 'there' rhymes with 'pear' but not 'here', or that 'great' rhymes with 'straight' while 'cough', 'dough' and 'through' don't rhyme with each other. A model can't learn something that's simply not represented in the training data.
So you either have to bring in external information (i.e. pronunciation models, or dictionary data with pronunciation guides, or audio recordings) or you have to have sufficient rhyming text inside your training data - i.e. train it on corpora of poetry instead of prose.
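As a sketch of what that external dictionary data buys you (again assuming the `pronouncing` package and the CMU Pronouncing Dictionary): compare the phones from the last stressed vowel onward, and the rhyme facts above fall right out.

    # Sketch: use dictionary pronunciation data to test the rhyme claims above.
    # Assumes `pip install pronouncing`; words with no dictionary entry return None.
    import pronouncing

    def rhyme_part(word):
        phones = pronouncing.phones_for_word(word)
        # rhyming_part() keeps everything from the last stressed vowel onward.
        return pronouncing.rhyming_part(phones[0]) if phones else None

    for a, b in [("there", "pear"), ("there", "here"),
                 ("great", "straight"), ("great", "cough"),
                 ("dough", "through")]:
        print(a, b, "rhyme" if rhyme_part(a) == rhyme_part(b) else "no rhyme")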
Also, I'm not sure the variation in pronunciation is critical - any accent variation that affects every instance of some sound similarly would preserve rhyming. There are exceptions, e.g. I recall a discussion about some parts of Shakespeare which rhymed perfectly in the pronunciation of that period but don't in modern English, but I think the variation among modern English accents should be mostly OK from that perspective.
> So you either have to bring in external information (i.e. pronunciation models, or dictionary data with pronunciation guides, or audio recordings) or you have to have sufficient rhyming text inside your training data - i.e. train it on corpora of poetry instead of prose.
Yes, the latter. Just mix some rhyming text into your corpus. If you take eg Wikipedia, there's already plenty of rhyming stuff in there. Nursery rhymes, song lyrics, etc. Similar for other corpora.