
The problem with speculative decoding is that there are hardly any models that support it and adding support takes extra GPU time. If speculative decoding also improves planning performance, then it will be more readily adopted.


What do you mean? Speculative decoding can be done with any auto-regressive model. Normally you use another, much faster model to predict the next N subwords, and then you use the big model to verify whether it would produce the same output (or to rerank the candidates). Verifying N subwords in a single forward pass is much cheaper than generating them one by one, which is where the speedup comes from. Not all N subwords will match, so you may need to redo the prediction for the remaining M < N subwords, but in many simple cases the faster, weaker model is still accurate enough. In the degenerate case where the draft is almost always wrong, it would be slightly slower than plain decoding, but usually you get quite a big speedup, e.g. 3x or so.
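
To make that concrete, here is a minimal toy sketch of the draft-and-verify loop for the greedy case. The target/draft callables are stand-ins, not a real library API, and a real implementation would verify all N positions in a single batched forward pass rather than a Python loop:

    from typing import Callable, List

    Token = int
    # A "model" here is just greedy next-token prediction from a context.
    Model = Callable[[List[Token]], Token]

    def speculative_step(target: Model, draft: Model,
                         context: List[Token], n_draft: int) -> List[Token]:
        """Draft n_draft tokens cheaply, then keep the longest prefix
        the target model agrees with (greedy acceptance)."""
        # 1. The cheap draft model proposes n_draft tokens autoregressively.
        proposal: List[Token] = []
        ctx = list(context)
        for _ in range(n_draft):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)

        # 2. The big model checks the proposal. In practice this is ONE
        #    forward pass over all n_draft positions; the loop below just
        #    simulates the per-position comparison.
        accepted: List[Token] = []
        ctx = list(context)
        for t in proposal:
            t_target = target(ctx)
            if t_target == t:
                accepted.append(t)         # match: this token came for free
                ctx.append(t)
            else:
                accepted.append(t_target)  # mismatch: take the target's token
                break                      # and discard the rest of the draft
        return accepted

Note the guarantee: with greedy decoding the output is identical to running the big model alone, since every accepted token is one the big model would have produced anyway. With sampling, real implementations add a rejection-sampling step at verification so the output distribution still matches the big model exactly.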

The nice thing here is that you don't actually need a separate smaller model: the model itself already predicts the next N subwords.
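
A sketch of that self-drafting variant, assuming a hypothetical forward() that returns the next token plus the model's own lookahead guesses from extra prediction heads; the signature is made up for illustration and is not a real library call:

    from typing import Callable, List, Tuple

    Token = int
    # Hypothetical signature: one forward pass returns the next token AND
    # the model's own guesses for the following positions (the draft).
    Forward = Callable[[List[Token]], Tuple[Token, List[Token]]]

    def self_speculative_decode(forward: Forward, context: List[Token],
                                max_new: int) -> List[Token]:
        out: List[Token] = []
        ctx = list(context)
        _, draft = forward(ctx)  # initial lookahead guesses, no extra model
        while len(out) < max_new:
            if not draft:        # no guesses: fall back to plain decoding
                t, draft = forward(ctx)
                ctx.append(t)
                out.append(t)
                continue
            next_draft: List[Token] = []
            for guess in draft:
                # Conceptually one batched pass verifies the whole draft;
                # the same pass also emits the NEXT draft for free.
                t, next_draft = forward(ctx)
                ctx.append(t)
                out.append(t)
                if t != guess or len(out) >= max_new:
                    break        # mismatch: throw away the rest of this draft
            draft = next_draft
        return out

The design win over the two-model version above is that drafting costs nothing extra: the same verification pass that checks the current draft produces the next one.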

Or maybe you mean it's not implemented in some of the common software? I'm not sure about that, but I thought it was quite a popular feature by now.


For anyone interested in exploring this, llama.cpp has an example implementation here:

https://github.com/ggerganov/llama.cpp/tree/master/examples/...



