Hacker News new | past | comments | ask | show | jobs | submit login

The way the article presents this is misleading. The attention mechanism builds a new vector as a linear combination of other vectors, but after the first layer these have also all been altered by passing through a transformer layer so it makes less sense to talk about "other tokens" in most cases (it becomes increasingly inaccurate the deeper into the model you go). It's also not really moving closer so much as adding, and what it's adding isn't the embedding-derived-vector but a transform of the embedding-derived-vector after it's been projected into a lower-dimensional-space for that attention head.

It would be more accurate to say that it's integrating information stored in other vectors-derived-from-token-embeddings-at-some-point (which can also entail erasing information)




Isn't "adding" the same as "moving closer" ?

E.g. the vector for "bank" is mid-way between the geographical and financial meaning, "bank + money" is closer while "bank + river" if further away.


It depends on the values of the vectors. (4, 4) + (3, 3) results in a new vector (7, 7) which is further away from both contributing vectors than either one was to each other originally. Additionally, negative coefficients are a thing.


You still have one vector per token, that's what they meant, also the fact that the vector associated with each token will ultimately be used to predict the next token, once again showing that it makes sense to talk about other tokens even though they're being transformed inside the model.


Prediction happens at the very end (sometimes functionally earlier, but not always) - most of what happens in the model can be thought of as collecting information in vectors-derived-from-token-embeddings, performing operations on those vectors, and then repeating this process a bunch of times until at some point it results in a meaningful token prediction.

It's pedagogically unfortunate that the residual stream is in the same space as the token embeddings, because it obscures how the residual stream is used as a kind of general compressed-information conduit through the model that attention heads read and write different information to to enable the eventual prediction task.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: