What do you mean? Speculative decoding can be done with any auto-regressive model. Normally you use another, much faster model to predict the next N subwords, and then the big model verifies whether it would produce the same output (or maybe just reranks the candidates). Evaluating N subwords in one batched forward pass is much faster than doing it subword by subword, which is where the speedup comes from. Not all N subwords will necessarily match, so you might need to redo the prediction for the remaining M < N subwords, but in many simple cases a faster, weaker model is still accurate enough. In the very extreme case where N-1 of the N subwords are always wrongly predicted, it would be slightly slower, but usually you get quite a big speedup, e.g. 3x faster or so.
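The draft/verify loop can be sketched like this. This is a toy sketch, not a real implementation: the two "models" are stand-in functions I made up (a deterministic target and a draft that disagrees on certain "hard" contexts), and the verification runs sequentially here, whereas in a real system the N target-model checks are done in a single batched forward pass.

```python
def target_model(context):
    # Hypothetical stand-in for the big, accurate model:
    # next token = sum of context mod 10.
    return sum(context) % 10

def draft_model(context):
    # Hypothetical stand-in for the small, fast draft model: agrees with
    # the target except when the context ends in 7 (a "hard" case).
    guess = sum(context) % 10
    return (guess + 1) % 10 if context[-1] == 7 else guess

def speculative_step(context, n=4):
    """Draft n tokens with the small model, then verify with the big one.

    Accepts the longest prefix of drafted tokens the target model agrees
    with; at the first mismatch it emits the target model's own token, so
    every step produces at least one correct token.
    """
    # 1. Draft phase: the small model proposes n tokens autoregressively.
    drafted, ctx = [], list(context)
    for _ in range(n):
        tok = draft_model(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. Verify phase: check each drafted token against the target model.
    #    (In practice these n checks are one batched forward pass.)
    accepted, ctx = [], list(context)
    for tok in drafted:
        correct = target_model(ctx)
        if tok == correct:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(correct)  # fall back to the target's token
            break
    return accepted

# Easy context: all 4 drafted tokens verify, so we got 4 tokens for
# roughly the cost of one big-model pass.
print(speculative_step([1, 2, 3], n=4))  # → [6, 2, 4, 8]

# Hard context (ends in 7): the first drafted token is wrong, so only the
# target model's own token is emitted, same as plain decoding.
print(speculative_step([1, 2, 7], n=4))  # → [0]
```

The two calls illustrate both regimes from the paragraph above: when the draft model is right you accept many tokens per verification pass, and in the worst case you still make one token of progress per step.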
The nice thing here is that you don't even need a separate smaller model: the model itself already predicts the next N subwords.
Or maybe you mean it's not implemented in some of the common software? I'm not sure about that, but I thought it's quite a popular feature now.