
Is beam search a good idea? Whenever anyone tries beam search on a neural language model like a char-RNN or GPT-2, it seems to either do little or make things much worse (by exacerbating the repetition problem), and to get worse the more beams/computation you use: e.g. https://github.com/karpathy/char-rnn/issues/138 or https://arxiv.org/abs/1904.09751



If I'm interpreting "Tricks in beam search to force rhyme schemes" correctly, the idea is to filter the beams and only keep those which correspond to the chosen scheme. You don't have to use beam search to be able to do that; you could also roll back the generation process whenever it doesn't rhyme and try again with a different alternative.
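
A minimal sketch of that rollback idea in Python (sample_next_token and rhymes_with are hypothetical stand-ins for the model's sampler and the rhyme check, not anything from the linked code):

    def generate_rhyming_line(prefix, rhyme_target, max_attempts=50):
        # Rollback-and-retry: regenerate the line until its last word rhymes
        # with the target, instead of filtering beams during the search.
        # sample_next_token() and rhymes_with() are hypothetical helpers.
        for _ in range(max_attempts):
            tokens = list(prefix)
            while not tokens or tokens[-1] != "<eol>":
                tokens.append(sample_next_token(tokens))
            if rhymes_with(tokens[-2], rhyme_target):   # word before the <eol> token
                return tokens
        return None                                     # every attempt failed to rhyme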


Yes - the crux is just to add some logic and throw out beams which don't match your constraint, then rank candidates based on sequence probability.
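
A rough sketch of that filtering step, assuming beams are (log_prob, token_list) pairs and that expand_fn / constraint_fn are placeholders for the model's top continuations and the rhyme-scheme check:

    import heapq

    def constrained_beam_step(beams, expand_fn, constraint_fn, beam_width):
        # One beam-search step: expand every beam, throw out candidates that
        # break the constraint, keep the top-k by total sequence log-probability.
        # expand_fn(prefix) -> [(token, log_prob), ...] and
        # constraint_fn(prefix) -> bool are hypothetical placeholders.
        candidates = []
        for log_prob, tokens in beams:
            for token, token_lp in expand_fn(tokens):
                new_tokens = tokens + [token]
                if constraint_fn(new_tokens):
                    candidates.append((log_prob + token_lp, new_tokens))
        return heapq.nlargest(beam_width, candidates, key=lambda c: c[0])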

You can roll back the generation process and/or mask the probability distribution using simple secondary logic, but I find beam search generally gives better results, especially when the word I want to force is very low probability - most of my sequence models kind of go off the rails when they are forced into a low-probability sequence ("the man went to the xylophone zebra sdawoqhdjwna"). Also, I find this problem gets worse in domains without "reset" tokens like spaces, where there are always high-entropy positions (the letter after a space has a lot of good choices) followed by lower-entropy ones (after the first letter, there are often fewer good choices - at least until you hit another space). Particularly in music generation, models that sample a "surprising" sequence tend to go off the rails. It is also a behavior that seems worse in RNNs than in transformers, for me.
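
The masking mentioned above can be as simple as setting disallowed logits to -inf before sampling; a sketch with numpy (the allowed_ids argument is a hypothetical list of permitted token ids):

    import numpy as np

    def masked_sample(logits, allowed_ids, temperature=1.0):
        # Mask the distribution: tokens outside allowed_ids get -inf logits,
        # i.e. zero probability mass, then sample from what is left.
        masked = np.full_like(logits, -np.inf)
        masked[allowed_ids] = logits[allowed_ids]
        scaled = masked / temperature
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        return int(np.random.choice(len(logits), p=probs))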


I've seen some research [1] where the authors use beam search with an explicit diversity penalty to get around the repetition problem. They seem to get good results.

[1] https://arxiv.org/pdf/1610.02424.pdf
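
For reference, the diversity penalty in [1] (Hamming diversity between beam groups) amounts to roughly this per-group rescoring; a sketch with numpy, not the authors' code:

    import numpy as np

    def diverse_group_scores(logits, tokens_chosen_by_earlier_groups, strength=0.5):
        # Diverse beam search splits the beams into groups; each later group
        # penalizes tokens that earlier groups already picked at this time step
        # (the Hamming diversity term), so the groups drift apart instead of
        # collapsing onto the same high-probability continuation.
        log_probs = logits - np.logaddexp.reduce(logits)
        penalty = np.zeros_like(log_probs)
        for token_id in tokens_chosen_by_earlier_groups:
            penalty[token_id] += strength
        return log_probs - penalty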


There are many flavors of beam search - I have found that when adding explicit checks and constraints (for example rhyme constraints or certain pivot words), the resulting proposals are generally a lot better. Even with simple Markov chains I see pretty diverse behavior depending on the style of beam search.

Some of the better ones I have used were variants of diverse beam search and stochastic beam search, usually combined together. The "classic" / pure variant has generally not been as useful in generative modeling for me; it tends to collapse to basically one or two effective candidates (with maybe some filler words changed) fairly quickly.

Also, it seems to generally work better for me in conditional generation than in unconditional generation (e.g. char-RNN / some uses of GPT-2). However, things like the "repetition problem" can be removed by construction if you are willing to hack on the beam search just a little bit. See https://badsamples.tumblr.com/post/160777871547/stochastic-s... (stochastic, diverse beam search with a Markov model, iirc) vs https://badsamples.tumblr.com/post/160767248407/a-markov-arg... (fixed beam search, where I didn't try to remove repetition or anything special, same Markov setup)
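
One way to get that "removed by construction" behavior is a candidate filter that rejects any beam whose newest n-gram already occurs in its own history; a small sketch, not the exact check used in those posts:

    def repeats_ngram(tokens, n=3):
        # Used as a beam filter: if the newest n-gram already occurred earlier
        # in this candidate, the candidate is dropped, so repetition loops
        # cannot survive the search at all.
        if len(tokens) < n + 1:
            return False
        newest = tuple(tokens[-n:])
        earlier = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n)}
        return newest in earlier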

Sometimes I also manipulate probabilities with masks and things directly, and that also combines fine with beam search in the experiments I have done.

Nucleus sampling works well, and if you don't want to control or constrain the output in unconditional generation, I don't know that beam search really does much. But for conditional generation, or post-hoc hacks to get more control over a generator, I find beam search variants really useful - especially combined with a specifically conditional architecture.
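
For comparison, nucleus (top-p) sampling just truncates the distribution to the smallest high-probability set and renormalizes; a sketch:

    import numpy as np

    def nucleus_sample(logits, p=0.9):
        # Keep the smallest set of tokens whose cumulative probability
        # exceeds p, renormalize over that set, and sample from it.
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        order = np.argsort(probs)[::-1]                     # most probable first
        cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
        nucleus = order[:cutoff]
        nucleus_probs = probs[nucleus] / probs[nucleus].sum()
        return int(np.random.choice(nucleus, p=nucleus_probs))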

For example: condition the language model on a particular bag-of-rhyme-words, then use (probably stochastic) beam search to force rhyme pairs at the start and end of lines, probably further modified by input and output masks to "blank out" invalid tokens and tell the model which tokens will be blanked out. I've used some blend of these tricks in speech, music, and text experiments, and it can be helpful if you have structure that is important to replicate and a simple model with simple sampling just isn't replicating the necessary structure.
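
A sketch of what those output masks might look like in practice: at line-end positions only the bag of rhyme-word ids is allowed, everywhere else the full vocabulary (all names here are hypothetical):

    def rhyme_position_masks(rhyme_word_ids, vocab_size, line_end_positions, length):
        # Per-position allowed-token sets: the mask at line-end positions is
        # restricted to the bag-of-rhyme-words, so sampling (or beam search)
        # can only end a line on a word that fits the scheme.
        full_vocab = list(range(vocab_size))
        return [rhyme_word_ids if t in line_end_positions else full_vocab
                for t in range(length)]

Each per-position set could then be handed to a masking sampler like the one sketched further up the thread.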

EDIT: One practical reason to do this would be plagiarism detection, especially if fine-tuning on a small corpus. There are ways to get guarantees by construction (https://www.researchgate.net/profile/Pierre_Roy2/publication...), but simple setups using beam searches and tries can also do constraint checks for n-grams of certain lengths. Concretely, set up tries for 1-grams, 2-grams, ..., (n-1)-grams, which are considered "valid" transitions, then a "bad" trie for n-grams. Check these tries during generation and throw out any candidates which hit the "bad" trie, keeping only those that still match in the good ones.
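
A sketch of that good-trie / bad-trie check, with plain sets of n-gram tuples standing in for real tries (an assumption made for brevity):

    def build_ngram_sets(corpus_tokens, n):
        # 'good': every k-gram with k < n counts as a valid transition;
        # 'bad': any full n-gram would be a verbatim copy of the corpus.
        good = {k: {tuple(corpus_tokens[i:i + k])
                    for i in range(len(corpus_tokens) - k + 1)}
                for k in range(1, n)}
        bad = {tuple(corpus_tokens[i:i + n])
               for i in range(len(corpus_tokens) - n + 1)}
        return good, bad

    def violates_constraint(candidate, good, bad, n):
        # Throw out candidates whose newest n-gram hits the 'bad' set, and
        # require the shorter suffix to still match the 'good' transitions.
        if not candidate:
            return False
        if len(candidate) >= n and tuple(candidate[-n:]) in bad:
            return True
        k = min(len(candidate), n - 1)
        return tuple(candidate[-k:]) not in good[k]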

See the line of Max Order work from the Sony CSL lab (formerly run by Francois Pachet) for some examples of this.



