They just throw the extra heads' predictions (the +2, +3, and +4 tokens) away, and only generate them because it makes training more efficient.
The abstract doesn't make that clear, but from the description of figure 1: "During inference, we employ only the next-token output head. Optionally, the other three heads may be used to speed-up inference time".
Maybe you can use all three extra heads if you take the top prediction from each of them, but that rules out any of the common sampling strategies. I'm not sure how many people actually run an LLM at temperature 0 outside of benchmarks, unless they're doing something even better than applying a temperature.
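To make that objection concrete: the extra heads are conditioned only on the context up to the current token, not on whichever t+1 you actually sample. A toy illustration (again my own construction, with hypothetical helpers):

```python
# Why sampling breaks the shortcut: a draft for t+2 is computed before
# t+1 has been sampled, so it can't reflect the sampled prefix.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100

def sample(logits: np.ndarray, temperature: float) -> int:
    """Plain temperature sampling from a logits vector."""
    z = logits / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    return int(rng.choice(VOCAB, p=probs))

# Pretend these came from one forward pass of a four-head model.
head_logits = rng.standard_normal((4, VOCAB))   # heads for t+1..t+4

t1 = sample(head_logits[0], temperature=0.8)    # sampled next token
t2_draft = int(head_logits[1].argmax())         # greedy draft for t+2

# t2_draft approximates the model's best guess for t+2 given only the
# context, not a sample from p(t+2 | context, t1). Under greedy decoding
# the two coincide often enough to be useful; with sampling there's no
# cheap way to reconcile them, so "take the top prediction from every
# head" stops working.
print(t1, t2_draft)
```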