They just throw the extra heads' predictions (the +2, +3, and +4 tokens) away, and only generate them because it makes training more efficient.
The abstract doesn't make that clear, but from the description of figure 1: "During inference, we employ only the next-token output head. Optionally, the other three heads may be used to speed-up inference time".
Maybe you can use all three extra heads if you take the top prediction from each of them, but that rules out any of the common sampling strategies. I'm not sure how many people actually run an LLM at temperature 0 outside of benchmarks, unless they're doing something even better than applying a temperature.
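To make that objection concrete: the extra heads are conditioned only on the context up to the current token, not on whichever t+1 you actually sample. A toy illustration (again my own construction, with hypothetical helpers):

```python
# Why sampling breaks the shortcut: a draft for t+2 is computed before
# t+1 has been sampled, so it can't reflect the sampled prefix.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100

def sample(logits: np.ndarray, temperature: float) -> int:
    """Plain temperature sampling from a logits vector."""
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return int(rng.choice(VOCAB, p=probs))

# Pretend these came from one forward pass of a four-head model.
head_logits = rng.standard_normal((4, VOCAB))   # heads for t+1..t+4

t1 = sample(head_logits[0], temperature=0.8)    # sampled next token
t2_draft = int(head_logits[1].argmax())         # greedy draft for t+2

# t2_draft approximates the model's best guess for t+2 given only the
# context, not a sample from p(t+2 | context, t1). Under greedy decoding
# the two coincide often enough to be useful; with sampling there's no
# cheap way to reconcile them, so "take the top prediction from every
# head" stops working.
print(t1, t2_draft)
```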