
I don't understand how that parallel prediction can work...

Let's say I give it as input the sentence:

I . . . . . . . . happily.

The second word to be predicted depends on the first word.




Give the model the tokens "happily" and "I", and add to each input token its respective position embedding and the position embedding for the token to be predicted. You can do this in parallel for all tokens to be predicted. The model has been trained so it can predict tokens in any position.
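A rough sketch of that idea, for anyone who wants to see the shape of it. Everything here is made up for illustration: the random embeddings, the `predict` helper, and the mean-pooling "model" are hypothetical stand-ins, not the architecture under discussion. The only point is that each unknown position is predicted from the known tokens plus a target position embedding, so the calls do not depend on one another.

```python
import random

DIM = 4
rng = random.Random(0)

def rand_vec():
    return [rng.uniform(-1, 1) for _ in range(DIM)]

VOCAB = ["I", "walk", "home", "happily"]
token_emb = {tok: rand_vec() for tok in VOCAB}  # hypothetical token embeddings
pos_emb = [rand_vec() for _ in range(10)]       # hypothetical position embeddings
out_proj = {tok: rand_vec() for tok in VOCAB}   # hypothetical output projection

def add(*vecs):
    """Element-wise sum of equally sized vectors."""
    return [sum(xs) for xs in zip(*vecs)]

def predict(known, target_pos):
    """Logits for one target position, given known (position, token) pairs."""
    # Each known token gets its own position embedding plus the target's.
    inputs = [add(token_emb[tok], pos_emb[pos], pos_emb[target_pos])
              for pos, tok in known]
    # Mean pooling is a toy stand-in for the real network.
    pooled = [sum(xs) / len(inputs) for xs in zip(*inputs)]
    return {tok: sum(p * w for p, w in zip(pooled, out_proj[tok]))
            for tok in VOCAB}

known = [(0, "I"), (9, "happily")]  # "I . . . . . . . . happily"
targets = range(1, 9)               # the eight unknown positions

# No call sees another call's output, so these could run as a single batch.
logits_by_pos = {t: predict(known, t) for t in targets}
print(sorted(logits_by_pos))
```

Note that nothing in this sketch lets position 2 see what was picked for position 1, which is exactly the concern raised above.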


Yes, but is there any guarantee that the complete sentence makes sense?


That is indeed an issue. Their sampling method rejects impossible combinations.


That guarantee didn't exist with regular GPT LLMs, did it? It just came about as an emergent property of throwing more and more compute, training data, and training time at the problem.


I think it’s effectively built in to the design. The model outputs a probability distribution for the first unknown token [0]. Then some code outside the model chooses a token and runs the model again with that token provided to the model. So the second output token’s probability distribution is automatically conditioned on the first output token, etc.

Sometimes people will attempt to parallelize this by using a faster model to guess a few tokens and then evaluating them as a batch with the main model to determine whether the choices were good.

[0] Usually it outputs “logits”, which become a probability distribution when combined with a “temperature” parameter.
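Here is a minimal sketch of that loop, assuming a toy `toy_logits` function in place of a real model (the vocabulary and scoring rule are invented for the example). The model only ever returns a distribution for the next token; the choice and the feedback happen in ordinary code around it.

```python
import math
import random

VOCAB = ["I", "walk", "home", "happily", "."]

def toy_logits(context):
    """Hypothetical stand-in for a model: one logit per vocabulary entry."""
    # Favor whichever token follows the last context token in VOCAB order.
    last = VOCAB.index(context[-1])
    return [3.0 if i == (last + 1) % len(VOCAB) else 0.0
            for i in range(len(VOCAB))]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, steps, rng):
    tokens = list(prompt)
    for _ in range(steps):
        probs = softmax(toy_logits(tokens))  # distribution for the next token only
        next_token = rng.choices(VOCAB, weights=probs, k=1)[0]  # picked outside the model
        tokens.append(next_token)            # fed back in on the next call
    return tokens

print(generate(["I"], 4, random.Random(0)))
```

Because each call sees the tokens already chosen, every later distribution is conditioned on the earlier picks.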


> I think it’s effectively built in to the design.

It isn't. There is no guarantee that successive tokens will be comprehensible.

> Usually it outputs “logits”, which become a probability distribution when combined with a “temperature” parameter.

The logits are the probability distribution (well technically, you would apply softmax). Temperature is a parameter for how you sample those logits in a non-greedy fashion.


> Temperature is a parameter for how you sample those logits in a non-greedy fashion.

I think temperature is better understood as a pre-softmax pass over logits. You'd divide logits by the temp, and then their softmax becomes more/less peaky.

    # PyTorch-style: softmax over the vocabulary axis
    probs = (logits / temp).softmax(dim=-1)

Sampling is a whole different thing.
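To make the "more/less peaky" part concrete, here is a small stdlib-only sketch. The logit values are made up, and `softmax_with_temperature` is just a name chosen for this example.

```python
import math

def softmax_with_temperature(logits, temp):
    """Divide logits by temp, then apply a numerically stable softmax."""
    scaled = [x / temp for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up values for illustration

# Lower temperature concentrates mass on the largest logit; higher flattens it.
for temp in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

The top probability shrinks as `temp` grows, while the ordering of the tokens stays the same.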


Sure, my comment about softmax was simply about the probability distribution. But temperature is still part of sampling. If you’re greedy decoding, temperature doesn’t matter.
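A quick check of the greedy case, with made-up logits: dividing by any positive temperature preserves the ordering, so the argmax never moves.

```python
import math

def softmax(logits, temp=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [x / temp for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits, temp):
    """Index of the most probable token after temperature scaling."""
    probs = softmax(logits, temp)
    return max(range(len(probs)), key=probs.__getitem__)

logits = [0.3, 2.1, -1.0, 1.9]  # made-up values for illustration
picks = {temp: greedy(logits, temp) for temp in (0.1, 1.0, 10.0)}
print(picks)  # {0.1: 1, 1.0: 1, 10.0: 1}
```

Temperature only changes the outcome once you actually draw from the distribution.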


No, but it makes more conceptual sense given the model can consider what was said before it.


Isn't this bag of words all over again? Except with positional hints?



