His data formatting could be improved here. Title + author would be better denoted somehow, like using quotes, and the separate songs should be explicitly delimited using '<|endoftext|>' - looking at the samples in https://github.com/EugenHotaj/beatles/blob/master/gpt_2_gene... , GPT-2 does manage to mostly figure out that the songs are separate, but omitting '<|endoftext|>' makes it harder on GPT-2, more prone to run-ons (already a problem with GPT-2), and also makes prompting less effective (since you can't prompt it like '<|endoftext|>"On The Run" by John Lennon\n' to make it generate lyrics for a specific title & author). It also wouldn't hurt to include the specific commands + hyperparameters for the nshepperd repo he's apparently using, even if only the defaults, along the lines of the examples in my own writeup ( https://www.gwern.net/GPT-2 ).
I'm not surprised that GPT-2-117M has memorized songs by the end of training; it's not a very large corpus of songs, so it's hard to learn and generalize well from it. If one were working more on this, it'd probably make sense to train on a much larger and more varied corpus of songs (with inline metadata properly formatted to allow controllable generation); something like RapGenius, maybe?
Yeah, I did the delimiting you mentioned when "training" a bigram model. For GPT-2 I was mostly interested in how well the model would be able to pick up signals from the raw data, so I didn't do any kind of preprocessing at all (it's also not very fun ;)). I think it's interesting that the model was able to pick up titles, authors, and starts/ends of songs on its own.
I didn't try generating specific songs, but that's a good idea. Having the delimiters would probably improve things, but feeding in "On the Run\nJohn Lennon" should work with the current approach as well.
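Something like this is what I have in mind - a minimal sketch using the Hugging Face transformers library rather than the nshepperd TF code I actually used, and assuming the fine-tuned checkpoint was saved to ./beatles-gpt2 (both of those are just placeholders):

    # Conditional sampling sketch: prompt with 'title\nauthor' as in the raw data.
    # Uses Hugging Face transformers, not the nshepperd TF code from the post;
    # ./beatles-gpt2 is a placeholder path for a fine-tuned checkpoint.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    model = GPT2LMHeadModel.from_pretrained("./beatles-gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    prompt = "On the Run\nJohn Lennon\n"  # title + author, mirroring the corpus layout
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Sample a continuation; top_k/temperature are arbitrary starting points.
    output = model.generate(
        input_ids,
        max_length=200,
        do_sample=True,
        top_k=40,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))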
Using a RapGenius corpus is also something interesting that I didn't think about. The goal of the post was to generate Beatles lyrics, not song lyrics in general. To that end, I'd like to see what you get if you first fine-tune on RapGenius to learn general things like song structure, rhyme, etc., then fine-tune even further on the Beatles corpus. I suspect you'd get much nicer, less memorized songs.
Pish-posh! It's a single simple search-and-replace: replace '\n\n\n' with '\n<|endoftext|>\n' or so. For bonus points, you can use regexp capture groups to rewrite the metadata simultaneously - something like '\n\n\n\(.*\)\n\(.*\)' → '\n<|endoftext|>\n"\1", by \2\n'.
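In Python, the equivalent is a few lines of re.sub - a rough sketch, where the filenames are placeholders and the assumption is that songs in the raw dump are separated by blank lines with the title and author on the first two lines:

    # Preprocessing sketch: delimit songs with <|endoftext|> and rewrite the
    # 'title\nauthor' header into a quoted, labeled form. Filenames and the
    # blank-line-separated layout are assumptions about the raw dump.
    import re

    with open("beatles_raw.txt") as f:
        text = f.read()

    # '\n\n\n<title>\n<author>' -> '\n<|endoftext|>\n"<title>", by <author>\n'
    text = re.sub(r"\n\n\n(.*)\n(.*)", r'\n<|endoftext|>\n"\1", by \2\n', text)

    with open("beatles_formatted.txt", "w") as f:
        f.write(text)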
> The goal of the post was to generate Beatles lyrics not song lyrics in general. To that end, I'd like to see what you get if you first fine tune on RapGenius to learn general things like song structure, rhyme, etc, then fine tune even further on the Beatles corpus. I suspect you'd get much nicer, less memorized songs.
You can do it either way: either train a single model on a multi-artist corpus and then simply prompt it appropriately, or train that single model and then further finetune it on just the specific artist. I've tried both in various ways with GPT-2 and StyleGAN, and it's not clear which is best, although I hypothesize that the two-stage pretraining works best with very small corpuses, where in the single multi-artist model all the other artists might 'squeeze out' the desired artist (a kind of class imbalance), eliminating the transfer benefits.
With StyleGAN, a major motivation for the two-stage pretraining approach is that there's no easy way to 'condition' on a specific class or input; so with my anime face generator (https://www.gwern.net/Faces), when I wanted specific characters, I'd just finetune on that character alone, because it's easy to select out just their data and create character-specific corpuses.
>To that end, I'd like to see what you get if you first fine tune on RapGenius to learn general things like song structure, rhyme, etc, then fine tune even further on the Beatles corpus. I suspect you'd get much nicer, less memorized songs.
OT: Is that how fine-tuning actually works with GPT-2? It makes sense that it'd just be strengthening connections on the most-recently-fine-tuned corpus, with previous fine-tunes still around in some way.
Should you expect that first fine-tune to pick up and solidify song structure, rhyme, etc., and the second fine-tune to keep those concepts in place while muddying up other aspects like the specific lyrics used?
(Hope this doesn't come off as "you're wrong" or too off topic -- I'm just very interested and would love to read more about how all this works. :) )
I would expect it to (but I haven't thought about it too deeply so I could be extremely wrong). My thinking is as follows:
At the end of the day, all we're doing is maximum likelihood estimation. So we're trying to find model parameters which define a probability distribution under which our observed data is the most probable. For the original GPT-2, that observed data is WebText, i.e. the text from well-upvoted outbound Reddit links. Since this data is so diverse, there isn't really any special structure for the model to pick up on, besides whatever structure exists in the English language.
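In symbols (my notation, not anything from the post), every stage optimizes the same next-token log-likelihood; the only thing that changes between pre-training and each fine-tune is the dataset D:

    \theta^* = \arg\max_\theta \sum_{x \in D} \sum_t \log p_\theta(x_t \mid x_{<t})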
However, when we fine-tune on RapGenius, the observed data is now songs. These songs have a certain structure to them such as stanzas, rhyming, etc. In order to maximize the likelihood of this data, the model must learn to model the structure.
Finally, if we further fine-tune on Beatles lyrics, the model is again trying to find parameters which maximize the likelihood of the data, so it will try to match both the lyrics and the structure of Beatles songs. The structure of Beatles songs is probably pretty similar to that of the other songs on RapGenius, so mostly what will change are the lyrics. Changing the lyrics also seems like the most straightforward way to maximize the likelihood, since by definition we want these particular lyrics to be the most likely.
That being said, this is all just conjecture. It would be interesting to try out both methods and see if you get better results doing this two-stage fine-tuning vs the original fine-tuning (or just fine-tuning on RapGenius and then conditionally sampling Beatles songs, as @gwern suggested).
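If anyone wants to try it, the two-stage version would look roughly like this - a sketch using Hugging Face transformers/PyTorch rather than the nshepperd TF code I actually used, with placeholder file paths and hyperparameters:

    # Two-stage fine-tuning sketch: GPT-2 -> RapGenius -> Beatles.
    # Uses Hugging Face transformers/PyTorch, not the nshepperd TF code from the
    # post; file paths, batch size, and learning rate are placeholder guesses.
    import torch
    from torch.utils.data import DataLoader
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    def load_chunks(path, block_size=512):
        """Tokenize a lyrics file and split it into fixed-length training chunks."""
        ids = tokenizer.encode(open(path).read())
        return [torch.tensor(ids[i:i + block_size])
                for i in range(0, len(ids) - block_size, block_size)]

    def finetune(model, path, epochs=1, lr=2e-5):
        """One fine-tuning stage: plain next-token (language modeling) loss on one corpus."""
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        loader = DataLoader(load_chunks(path), batch_size=2, shuffle=True)
        model.train()
        for _ in range(epochs):
            for batch in loader:
                # labels=input_ids gives the standard shifted next-token loss
                loss = model(batch, labels=batch).loss
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()

    finetune(model, "rapgenius.txt")  # stage 1: learn general song structure
    finetune(model, "beatles.txt")    # stage 2: specialize on the Beatles corpus
    model.save_pretrained("./beatles-gpt2")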