It's beam search. You don't generate one word at a time; you keep several candidate continuations in parallel and score them against the conditional probability of the whole sequence, picking the best-scoring ones at each step.
That's the miracle of "talk like a pirate": a style of speech is just a conditional probability distribution.
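As a minimal sketch of what that looks like in practice with the Hugging Face transformers library (the model name and prompt here are purely illustrative), beam search is enabled by passing num_beams to generate():

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small causal LM will do for illustration; gpt2 is just an example.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Arr matey, the treasure be", return_tensors="pt")

# Beam search: keep the num_beams highest-scoring partial sequences at each
# step instead of greedily committing to a single next token.
outputs = model.generate(
    **inputs,
    num_beams=4,
    max_new_tokens=30,
    early_stopping=True,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```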
Also the underlying model is trained in a bidirectional manner. You mask out about 15% of the words and the model tries to put them back. I remember trying to generate case studies like the one from PubMed one character at a time with RNNs, and it was a terrible struggle for many reasons; the bidirectional nature of BERT-like models was a revolution, as was the use of subword features.
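A rough sketch of that masked-language-modeling objective at inference time, again using transformers (the model and sentence are just examples): the model fills in a [MASK] token from both its left and right context.

```python
from transformers import pipeline

# BERT-style model: during pretraining roughly 15% of tokens are masked
# and the model learns to reconstruct them from surrounding context.
fill = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill("The patient was treated with [MASK] for the infection."):
    print(prediction["token_str"], round(prediction["score"], 3))
```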
https://huggingface.co/docs/transformers/generation_strategi...