
The problem with speculative decoding is that there are hardly any models that support it and adding support takes extra GPU time. If speculative decoding also improves planning performance, then it will be more readily adopted.


What do you mean? Speculative decoding can be done with any auto-regressive model. Normally you use another, much faster model to predict the next N subwords, and then you use the big model to verify whether it would produce the same output (or to rerank the candidates). Verifying N subwords in a single forward pass is much cheaper than generating them one by one, which is where the speedup comes from. Not all N subwords will match, so you may need to redo the prediction for the remaining M < N subwords, but in many simple cases the faster, weaker model is still accurate enough. In the degenerate case where the draft is almost always wrong, it would be slightly slower than plain decoding, but usually you get quite a big speedup, e.g. 3x or so.
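
To make that concrete, here is a minimal toy sketch of the draft-and-verify loop for the greedy case. The target/draft callables are stand-ins, not a real library API, and a real implementation would verify all N positions in a single batched forward pass rather than a Python loop:

    from typing import Callable, List

    Token = int
    # A "model" here is just greedy next-token prediction from a context.
    Model = Callable[[List[Token]], Token]

    def speculative_step(target: Model, draft: Model,
                         context: List[Token], n_draft: int) -> List[Token]:
        """Draft n_draft tokens cheaply, then keep the longest prefix
        the target model agrees with (greedy acceptance)."""
        # 1. The cheap draft model proposes n_draft tokens autoregressively.
        proposal: List[Token] = []
        ctx = list(context)
        for _ in range(n_draft):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)

        # 2. The big model checks the proposal. In practice this is ONE
        #    forward pass over all n_draft positions; the loop below just
        #    simulates the per-position comparison.
        accepted: List[Token] = []
        ctx = list(context)
        for t in proposal:
            t_target = target(ctx)
            if t_target == t:
                accepted.append(t)         # match: this token came for free
                ctx.append(t)
            else:
                accepted.append(t_target)  # mismatch: take the target's token
                break                      # and discard the rest of the draft
        return accepted

Note the guarantee: with greedy decoding the output is identical to running the big model alone, since every accepted token is one the big model would have produced anyway. With sampling, real implementations add a rejection-sampling step at verification so the output distribution still matches the big model exactly.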

The nice thing here is that you don't actually need a separate smaller model: the model itself already predicts the next N subwords.
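
A sketch of that self-drafting variant, assuming a hypothetical forward() that returns the next token plus the model's own lookahead guesses from extra prediction heads; the signature is made up for illustration and is not a real library call:

    from typing import Callable, List, Tuple

    Token = int
    # Hypothetical signature: one forward pass returns the next token AND
    # the model's own guesses for the following positions (the draft).
    Forward = Callable[[List[Token]], Tuple[Token, List[Token]]]

    def self_speculative_decode(forward: Forward, context: List[Token],
                                max_new: int) -> List[Token]:
        out: List[Token] = []
        ctx = list(context)
        _, draft = forward(ctx)  # initial lookahead guesses, no extra model
        while len(out) < max_new:
            if not draft:        # no guesses: fall back to plain decoding
                t, draft = forward(ctx)
                ctx.append(t)
                out.append(t)
                continue
            next_draft: List[Token] = []
            for guess in draft:
                # Conceptually one batched pass verifies the whole draft;
                # the same pass also emits the NEXT draft for free.
                t, next_draft = forward(ctx)
                ctx.append(t)
                out.append(t)
                if t != guess or len(out) >= max_new:
                    break        # mismatch: throw away the rest of this draft
            draft = next_draft
        return out

The design win over the two-model version above is that drafting costs nothing extra: the same verification pass that checks the current draft produces the next one.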

Or maybe you mean it's not implemented in some of the common software? I'm not sure about that, but I thought it was quite a popular feature by now.


For anyone interested in exploring this, llama.cpp has an example implementation here:

https://github.com/ggerganov/llama.cpp/tree/master/examples/...



