Finetuning GPT-2 to Generate Beatles Lyrics (towardsdatascience.com)
55 points by eugenhotaj on Oct 23, 2019 | 14 comments



His data formatting could be improved here. Title + authors would be better off denoted somehow, like using quotes, and the separate songs should be explicitly delimited using '<|endoftext|>' - looking at the samples in https://github.com/EugenHotaj/beatles/blob/master/gpt_2_gene... , GPT-2 does manage to mostly figure out that the songs are separate, but omitting '<|endoftext|>' makes it harder on GPT-2, more prone to run-ons (already a problem with GPT-2), and also makes prompting less effective (since you can't prompt it like '<|endoftext|>"On The Run" by John Lennon\n' to make it generate lyrics for a specific title & author). It also wouldn't be bad if he had included the specific commands + hyperparameters for the nshepperd repo he's apparently using, even if only the defaults along the lines of the examples in my own writeup ( https://www.gwern.net/GPT-2 ).
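For concreteness, a rough sketch of what the reformatted corpus and a title/author prompt could look like (the data layout and file names here are just illustrative, not taken from the post or repo):

    # Illustrative sketch: write songs to one training file with explicit
    # <|endoftext|> delimiters and quoted title/author metadata.
    # (The 'songs' structure and file names are hypothetical.)
    songs = [
        {"title": "Come Together", "author": "John Lennon",
         "lyrics": "Here come old flat top..."},
        {"title": "Something", "author": "George Harrison",
         "lyrics": "Something in the way she moves..."},
    ]

    with open("beatles_formatted.txt", "w") as f:
        for song in songs:
            f.write('<|endoftext|>\n"{title}", by {author}\n{lyrics}\n'.format(**song))

    # With that format, generation can be steered toward a specific
    # title & author by prompting with the same pattern:
    prompt = '<|endoftext|>\n"On The Run", by John Lennon\n'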

I'm not surprised that GPT-2-117M has memorized songs by the end of training; it's not a very large corpus of songs, so it's hard to learn and generalize well from it. If one were working more on this, it'd probably make sense to train on a much larger and more varied corpus of songs (with inline metadata properly formatted to allow controllable generation); something like RapGenius, maybe?


Hi, author here.

Yea, I did the delimiting you mentioned when "training" a bigram model. For GPT-2 I was mostly interested in how well the model would be able to pick up signals from the raw data, so I didn't do any kind of preprocessing at all (it's also not very fun ;)). I think it's interesting that the model was able to pick up titles, authors, and starts/ends of songs on its own.

I didn't try generating specific songs but that's a good idea. Having the delimiters would probably improve things but feeding in "On the Run\nJohn Lennon" would work as well with the current approach.

Using a RapGenius corpus is also something interesting that I didn't think about. The goal of the post was to generate Beatles lyrics, not song lyrics in general. To that end, I'd like to see what you get if you first fine tune on RapGenius to learn general things like song structure, rhyme, etc, then fine tune even further on the Beatles corpus. I suspect you'd get much nicer, less memorized songs.


> it's also not very fun

Pish-posh! It's a single simple search-and-replace: replace '\n\n\n' with '\n<|endoftext|>\n' or so. For bonus points, you can use regexp capture groups to rewrite the metadata simultaneously - something like '\n\n\n\(.*\)\n\(.*\)' → '\n<|endoftext|>\n"\1", by \2\n'.
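In Python, that amounts to roughly the following (assuming each song is separated by two blank lines, with the title and then the author on the first two lines; the file names are made up):

    import re

    # Rough Python equivalent of the search-and-replace above: delimit songs
    # with <|endoftext|> and quote the title/author lines.
    with open("beatles.txt") as f:
        raw = f.read()

    formatted = re.sub(
        r"\n\n\n(.*)\n(.*)",
        r'\n<|endoftext|>\n"\1", by \2\n',
        raw,
    )

    with open("beatles_formatted.txt", "w") as f:
        f.write(formatted)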

> The goal of the post was to generate Beatles lyrics not song lyrics in general. To that end, I'd like to see what you get if you first fine tune on RapGenius to learn general things like song structure, rhyme, etc, then fine tune even further on the Beatles corpus. I suspect you'd get much nicer, less memorized songs.

You can do it either way: either train a single model on a multi-artist corpus and then simply prompt it appropriately, or train the single model and then further finetune on just the specific artist. I've tried both in various ways with GPT-2 and StyleGAN, and it's not clear which is best, although I hypothesize that the two-stage pretraining works best with very small corpuses, where, in the single multi-artist model, all the other artists might 'squeeze out' the desired artist (a kind of class imbalance), eliminating the transfer benefits.

With StyleGAN, a major benefit of the two-stage pretraining approach is that there's no easy way to 'condition' on a specific class or input; so with my anime face generator (https://www.gwern.net/Faces), when I wanted specific characters, I'd just finetune on that character alone, because it's easy to select out just their data and create character-specific corpuses.


> To that end, I'd like to see what you get if you first fine tune on RapGenius to learn general things like song structure, rhyme, etc, then fine tune even further on the Beatles corpus. I suspect you'd get much nicer, less memorized songs.

OT: Is that how fine-tuning actually works with GPT-2? It makes sense that it'd just be strengthening connections on the most-recently-fine-tuned corpus, with previous fine-tunes still around in some way.

Should you expect that first fine tune to pick up and solidify song structure, rhyme, etc, and the second fine tune to keep those concepts in place while muddying up other aspects like the specific lyrics used?

(Hope this doesn't come off as "you're wrong" or too off topic -- I'm just very interested and would love to read more about how all this works. :) )


I would expect it to (but I haven't thought about it too deeply so I could be extremely wrong). My thinking is as follows:

At the end of the day, all we're doing is maximum likelihood estimation. So we're trying to find model parameters which define a probability distribution where our observed data is the most probable. In the original GPT-2, this observed data is the text from quality outgoing links on Reddit. Since this data is so diverse, there will not really be any special structure that the model can pick up on, besides whatever structure exists in the English language.

However, when we fine-tune on RapGenius, the observed data is now songs. These songs have a certain structure to them such as stanzas, rhyming, etc. In order to maximize the likelihood of this data, the model must learn to model the structure.

Finally, if we further fine-tune on Beatles lyrics, the model is again trying to find parameters which maximize the likelihood of the data. So the model will try to match both the lyrics and the structure of Beatles songs. It's likely that the structure of Beatles songs is pretty similar to the other songs from RapGenius, so mostly what will change are the lyrics. Also, changing the lyrics seems to be the most straightforward way to maximize the likelihood since by definition we want these particular lyrics to be the most likely.
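To put the maximum-likelihood framing in concrete terms: fine-tuning just keeps minimizing the same per-token negative log-likelihood, only computed over the new corpus. A minimal sketch of that objective using Hugging Face's GPT-2 (not the setup from the post, which used a TF repo; the file name is hypothetical):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    # Minimal sketch of the objective only: fine-tuning keeps minimizing the
    # same per-token negative log-likelihood, just over the new corpus.
    # Real training would batch and chunk the text properly.
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    text = open("beatles_formatted.txt").read()[:2000]  # toy-sized chunk
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    # labels=input_ids makes the model compute cross-entropy against the
    # next-token targets (shifted internally by one position).
    loss = model(input_ids, labels=input_ids).loss
    loss.backward()
    optimizer.step()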

That being said, this is all just conjecture. It would be interesting to try out both methods and see if you get better results doing this two-step fine-tuning vs. the original fine-tuning (or just fine-tuning on RapGenius then conditionally sampling Beatles songs as @gwern suggested).


Or any lyrics: http://billion.dev.losttech.software:2095/

And the blog article: https://habr.com/post/453232/ (also there's no paywall here)


Really cool stuff, thanks for sharing!


Tricks in beam search to force rhyme schemes, or techniques like constrained Markov chains (cf. https://redylan.neocities.org/#/how-it-works/ and https://github.com/gabrielebarbieri/markovchain), can give really strong results in lyric / structured text generation.

Might be worth investigating if you are interested in this application.


Is beam search a good idea? Whenever anyone tries beam search on a neural language model like a char-RNN or GPT-2, it seems to generally either do little or make it much worse (by exacerbating the repetition problem), and get worse the more beams/computation you do: e.g. https://github.com/karpathy/char-rnn/issues/138 or https://arxiv.org/abs/1904.09751


If I'm interpreting "Tricks in beam search to force rhyme schemes" correctly, the idea is to filter the beams and only keep those which correspond to the chosen scheme. You don't have to use beam search to be able to do that; you could also rollback the generation process whenever it doesn't rhyme and try again with a different alternative.
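The rollback version is basically a rejection loop; a toy sketch, where generate_line and rhymes are hypothetical stand-ins for the actual sampler and a rhyme checker:

    # Toy rollback loop: resample the second line until it rhymes with the
    # first. generate_line() and rhymes() are hypothetical stand-ins for a
    # real sampler and a pronunciation-based rhyme check.
    def generate_rhyming_couplet(generate_line, rhymes, max_tries=50):
        first = generate_line(prompt="")
        candidate = generate_line(prompt=first)
        for _ in range(max_tries):
            if rhymes(first.split()[-1], candidate.split()[-1]):
                break
            candidate = generate_line(prompt=first)  # roll back and try again
        return first, candidate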


Yes - the crux is just to add some logic and throw out beams which don't match your constraint, then rank candidates based on sequence probability.

You can roll back the generation process and/or mask the probability distribution using simple secondary logic, but I find beam search gives generally better results, especially when the word I want to force is very low probability - most of my sequence models kind of go off the rails when they are forced into a low-probability sequence ("the man went to the xylophone zebra sdawoqhdjwna"). Also I find this problem gets worse in domains without "reset" tokens like spaces, where there are always high-entropy possibilities (the letter after a space has a lot of good choices) followed by lower ones (after the first letter, there are often fewer good choices - at least until you hit another space). Particularly in music generation, models that sample a "surprising" sequence tend to go off the rails. It is also a behavior that seems worse in RNNs than transformers for me.
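The filtering step itself is simple; roughly something like this, where each beam is a (tokens, log_prob) pair and satisfies_constraint is whatever rhyme/pivot-word check you care about (both names are made up for the example):

    # Sketch of constraint-filtered beam pruning: drop candidates that violate
    # the constraint, then keep the top-k survivors by sequence log-probability.
    def prune_beams(candidates, satisfies_constraint, beam_width):
        valid = [(toks, lp) for toks, lp in candidates if satisfies_constraint(toks)]
        # Fall back to the unconstrained pool if the constraint kills everything,
        # so generation can still continue instead of stalling.
        pool = valid if valid else candidates
        return sorted(pool, key=lambda c: c[1], reverse=True)[:beam_width]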


I've seen some research [1] where the authors use beam search with an explicit diversity penalty to get around the repetition problem. They seem to get good results.

[1] https://arxiv.org/pdf/1610.02424.pdf


There are many flavors of beam search - I have found that for adding explicit checks and constraints (for example rhyme constraints or certain pivot words) the resulting proposals are generally a lot better. Even with simple markov chains I see pretty diverse behavior depending on beam search style.

Some of the better ones I used were variants of diverse beam search, and stochastic beam searches usually combined together. The "classic" / pure variant has generally not been as useful in generative modeling for me, it tends to collapse to basically one or two effective candidates (with maybe some filler words changed) fairly quickly.

Also it seems to generally work better for me in conditional generation than in unconditional generation (e.g. charRNN / some uses of GPT-2). However, things like the "repetition problem" can be removed by construction if you are willing to hack in the beam search just a little bit. See https://badsamples.tumblr.com/post/160777871547/stochastic-s... (stochastic, diverse beam search w Markov iirc) vs https://badsamples.tumblr.com/post/160767248407/a-markov-arg... (fixed beam search, where I didn't try to remove repetition or anything special, same Markov setup)

Sometimes I also manipulate probabilities with masks and things directly, and that also combines fine with beam search in the experiments I have done.

Nucleus sampling works well, and if you don't want to control or constrain the output in unconditional generation I don't know that beam search really does much. But for conditional generation, or post-hoc hacks to get more control over a generator I find beam search variants really useful. Especially combined with a specifically conditional architecture.

For example, conditioning the language model on a particular bag-of-rhyme-words + (stochastic, probably) beam search to force rhyme pairs at the start and end of lines, probably further modified by input and output masks to "blank out" invalid tokens and tell the model which tokens will be blanked out. I've used some blend of these tricks in speech, music, and text experiments and it can be helpful if you have structure that is important to replicate and a simple model, with simple sampling just isn't replicating the necessary structure.
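The output-mask piece is just setting disallowed logits to -inf before sampling; a sketch, where allowed_token_ids is a made-up stand-in for whatever the bag-of-rhyme-words or other constraint produces:

    import torch

    # Sketch: restrict the next token to an allowed set (e.g. a bag of rhyme
    # words) by masking every other logit to -inf before sampling.
    # Assumes a 1-D logits vector for a single generation step.
    def mask_logits(logits, allowed_token_ids):
        masked = torch.full_like(logits, float("-inf"))
        masked[allowed_token_ids] = logits[allowed_token_ids]
        return masked

    # probs = torch.softmax(mask_logits(next_token_logits, allowed_ids), dim=-1)
    # next_token = torch.multinomial(probs, num_samples=1)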

EDIT: One practical reason to do this would be plagiarism detection, especially if fine-tuning on a small corpus. There are ways with guarantees by construction (https://www.researchgate.net/profile/Pierre_Roy2/publication...) but simple setups using beam searches and tries can also do constraint checks for n-grams of certain lengths. Concretely, set up tries for 1-, 2-, 3-, 4-, ..., (n-1)-grams, which are considered "valid" transitions, then set up a "bad" trie for n-grams. Check these tries during generation, and throw out any candidates which violate the "bad" trie, but still match in the good one.
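A stripped-down version of that check, using plain Python sets of n-grams in place of tries:

    # Crude sketch of the n-gram overlap check: candidates may reuse up to
    # (n-1)-grams from the training corpus, but any full n-gram copy is rejected.
    def build_banned_ngrams(corpus_tokens, n):
        return {tuple(corpus_tokens[i:i + n]) for i in range(len(corpus_tokens) - n + 1)}

    def violates(candidate_tokens, banned, n):
        return any(
            tuple(candidate_tokens[i:i + n]) in banned
            for i in range(len(candidate_tokens) - n + 1)
        )

    banned = build_banned_ngrams("let it be let it be let it be".split(), n=4)
    print(violates("whisper words of wisdom let it be".split(), banned, n=4))  # False
    print(violates("let it be let it be yeah".split(), banned, n=4))           # True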

See the line of Max Order work from the Sony CSL lab (formerly run by Francois Pachet) for some examples of this.


Tomorrow, anybody?



