Selecting the likeliest token is only one of many sampling options, and it's extremely poor for most tasks, more so when you consider the relationships between multiple executions of the model. _Some_ probability renormalization trained into the model (not necessarily softmax) is essential for a lot of techniques.
To expand on this, one of the most common tricks is nucleus sampling. Roughly, you zero out the lowest probabilities so that the remaining ones sum to just above some threshold you choose (often around 80%), then renormalize what's left and sample from it.
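A minimal sketch of that step in Python/NumPy, assuming you already have the model's softmax output as a 1-D `probs` array over the vocabulary (the function and variable names here are just illustrative, not any library's API):

```python
import numpy as np

def nucleus_sample(probs: np.ndarray, p: float = 0.8, rng=None) -> int:
    """Keep the smallest set of most-probable tokens whose cumulative
    probability reaches p, renormalize, and sample one token id."""
    if rng is None:
        rng = np.random.default_rng()
    order = np.argsort(probs)[::-1]               # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # just enough tokens to reach p
    kept = order[:cutoff]
    kept_probs = probs[kept] / probs[kept].sum()  # renormalize the survivors
    return int(rng.choice(kept, p=kept_probs))
```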
The idea is that this is more general than, e.g., changing the temperature of the softmax, or using top-k, where you just keep the k most probable outcomes.
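For comparison, temperature and top-k in the same style (again just a sketch; the parameter defaults are arbitrary, not recommendations):

```python
def temperature_probs(logits: np.ndarray, t: float = 1.0) -> np.ndarray:
    """Softmax with a temperature: t < 1 sharpens the distribution, t > 1 flattens it."""
    z = logits / t
    z = z - z.max()                               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def top_k_sample(probs: np.ndarray, k: int = 40, rng=None) -> int:
    """Keep only the k most probable tokens, renormalize, and sample."""
    if rng is None:
        rng = np.random.default_rng()
    kept = np.argsort(probs)[::-1][:k]            # token ids of the k largest probabilities
    kept_probs = probs[kept] / probs[kept].sum()  # renormalize the survivors
    return int(rng.choice(kept, p=kept_probs))
```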
Note that if you do nucleus sampling (a.k.a. top-p) with the threshold p = 0%, you just pick the maximum likelihood estimate, i.e. greedy argmax decoding.
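With the `nucleus_sample` sketch above you can see that limit directly: a vanishingly small p keeps only the single most probable token, which is exactly argmax/greedy decoding.

```python
probs = np.array([0.5, 0.3, 0.15, 0.05])
# With p effectively 0, the nucleus collapses to the single most probable token.
assert nucleus_sample(probs, p=1e-9) == int(np.argmax(probs))
```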
That's true, but they didn't go into any other applications in this explainer and were presenting it strictly as a next-word predictor. If they are going to include the final softmax, they should explain why it's useful. It would be improved by being simpler (skip the softmax) or more comprehensive (present a use case for the softmax), but complexity without reason is bad pedagogy.