Prediction happens at the very end (sometimes functionally earlier, but not always): most of what the model does can be thought of as collecting information into vectors derived from the token embeddings, performing operations on those vectors, and repeating this process until, at some point, the result supports a meaningful token prediction.
It's pedagogically unfortunate that the residual stream lives in the same space as the token embeddings, because it obscures how the residual stream serves as a kind of general compressed-information conduit through the model, which attention heads read from and write to in order to enable the eventual prediction task.
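The conduit picture above can be made concrete with a toy sketch: the stream starts as token embeddings, each block only *adds* a vector to it, and the prediction is read off the final state. Everything here (dimensions, the `layer` stand-in, tied unembedding weights) is an illustrative assumption, not any particular model's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration, not a real model)
d_model, vocab, n_layers, seq_len = 16, 50, 4, 5

W_embed = rng.normal(size=(vocab, d_model)) * 0.1
W_unembed = W_embed.T  # tied embedding/unembedding, a common (not universal) choice

def layer(x, seed):
    """Stand-in for an attention/MLP block: reads the stream, computes a delta."""
    r = np.random.default_rng(seed)
    W = r.normal(size=(d_model, d_model)) * 0.1
    return np.tanh(x @ W)  # the "information" this block writes back

tokens = np.array([3, 1, 4, 1, 5])
x = W_embed[tokens]             # residual stream starts as the token embeddings
for i in range(n_layers):
    x = x + layer(x, seed=i)    # each block only adds to the stream (residual connection)
logits = x @ W_unembed          # prediction read off only at the very end
pred = int(logits[-1].argmax())
```

Note that `x` stays shape `(seq_len, d_model)` throughout: intermediate blocks never "decode" it, which is exactly why its being in embedding space is incidental rather than essential.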