
They said in the announcement that they've implemented speculative decoding, so that might have a lot to do with it.

A big question is what they're using as their draft model; there are ways to do it losslessly, but they could also choose to trade off accuracy for a bigger increase in speed.

It seems they also support only a very short sequence length (1k tokens).




Speculative decoding does not trade off accuracy. You reject the speculated tokens if the original model does not accept them, kind of like branch prediction. All these providers and third parties benchmark each other's solutions, so if there is a drop in accuracy, someone will report it. Their sequence length is 8k.
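
To make the accept/reject step concrete, here is a minimal sketch of the greedy variant, assuming both models are plain functions from a token prefix to the next token. The names `speculative_decode`, `toy_target`, and `toy_draft` are made up for illustration and are not any provider's actual implementation. With sampling instead of greedy decoding, the exact-match check is replaced by a rejection-sampling rule, which this sketch does not cover.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: generate n_new tokens one at a time with a single model."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft guesses k tokens, the target checks them."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2. The target model checks each proposed position. A real system
        #    does this in one batched forward pass; here it is a plain loop.
        for i, tok in enumerate(proposal):
            expected = target(seq + proposal[:i])
            if tok != expected:
                # Reject from the first mismatch on, keep the target's own token.
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            # Every guess was accepted; the target also yields one extra token.
            seq.extend(proposal)
            seq.append(target(seq))
    return seq[:goal]


def toy_target(seq: List[int]) -> int:
    """Stand-in for the large model (an arbitrary deterministic rule)."""
    return (seq[-1] * 3 + len(seq)) % 11


def toy_draft(seq: List[int]) -> int:
    """Stand-in for the draft model: agrees with the target except at some positions."""
    guess = toy_target(seq)
    return (guess + 1) % 11 if len(seq) % 3 == 0 else guess


if __name__ == "__main__":
    prompt = [1, 2]
    same = speculative_decode(toy_target, toy_draft, prompt, 20) == greedy_decode(toy_target, prompt, 20)
    print("outputs identical:", same)
```

Every token that ends up in the output is one the target would have produced for that prefix, so a worse draft model only lowers the acceptance rate (and the speedup), not the output quality.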



