Selecting the likeliest token is only one of many sampling options, and it's extremely poor for most tasks, more so when you consider the relationships between multiple executions of the model. _Some_ probability renormalization trained into the model (not necessarily softmax) is essential for a lot of techniques.
To expand on this, one of the most common tricks is nucleus sampling. Roughly, you zero out the lowest probabilities so that the remaining ones sum to just above some threshold you choose (often around 80%), then renormalize what's left and sample from it.
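A minimal sketch of that step in Python/NumPy, assuming you already have the model's softmax output as a 1-D `probs` array over the vocabulary (the function and variable names here are just illustrative, not any library's API):

```python
import numpy as np

def nucleus_sample(probs: np.ndarray, p: float = 0.8, rng=None) -> int:
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches p, renormalize, and sample one token id."""
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(probs)[::-1]               # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # just enough tokens to reach p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()  # renormalize the survivors
    return int(rng.choice(kept, p=kept_probs))
```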
The idea is that this is more general than, e.g., changing the temperature of the softmax, or using top-k, where you just keep the k most probable outcomes.
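For comparison, temperature and top-k in the same style (again just a sketch; the parameter defaults are arbitrary, not recommendations):

```python
def temperature_probs(logits: np.ndarray, t: float = 1.0) -> np.ndarray:
    """Softmax with a temperature: t < 1 sharpens the distribution, t > 1 flattens it."""
    z = logits / t
    z = z - z.max()                               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def top_k_sample(probs: np.ndarray, k: int = 40, rng=None) -> int:
    """Keep only the k most probable tokens, renormalize, and sample."""
    if rng is None:
        rng = np.random.default_rng()
    kept = np.argsort(probs)[::-1][:k]            # token ids of the k largest probabilities
    kept_probs = probs[kept] / probs[kept].sum()  # renormalize the survivors
    return int(rng.choice(kept, p=kept_probs))
```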
Note that if you do nucleus sampling (a.k.a. top-p) with the threshold p = 0%, you just pick the maximum likelihood estimate, i.e. greedy argmax decoding.
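With the `nucleus_sample` sketch above you can see that limit directly: a vanishingly small p keeps only the single most probable token, which is exactly argmax/greedy decoding.

```python
probs = np.array([0.5, 0.3, 0.15, 0.05])
# With p effectively 0, the nucleus collapses to the single most probable token.
assert nucleus_sample(probs, p=1e-9) == int(np.argmax(probs))
```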
That's true, but they didn't go into any other applications in this explainer and were presenting it strictly as a next-word predictor. If they are going to include the final softmax, they should explain why it's useful. It would be improved by being simpler (skip the softmax) or more comprehensive (present a use case for the softmax), but complexity without reason is bad pedagogy.