Prediction happens at the very end (sometimes functionally earlier, but not always): most of what the model does can be thought of as collecting information into vectors derived from the token embeddings, performing operations on those vectors, and repeating this process until, at some point, the result supports a meaningful token prediction.
It's pedagogically unfortunate that the residual stream lives in the same space as the token embeddings, because it obscures how the residual stream serves as a kind of general compressed-information conduit through the model, which attention heads read from and write to in order to enable the eventual prediction task.
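The conduit picture above can be made concrete with a toy sketch: the stream starts as token embeddings, each block only *adds* a vector to it, and the prediction is read off the final state. Everything here (dimensions, the `layer` stand-in, tied unembedding weights) is an illustrative assumption, not any particular model's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions for illustration, not a real model)
d_model, vocab, n_layers, seq_len = 16, 50, 4, 5

W_embed = rng.normal(size=(vocab, d_model)) * 0.1
W_unembed = W_embed.T  # tied embedding/unembedding, a common (not universal) choice

def layer(x, seed):
    """Stand-in for an attention/MLP block: reads the stream, computes a delta."""
    r = np.random.default_rng(seed)
    W = r.normal(size=(d_model, d_model)) * 0.1
    return np.tanh(x @ W)  # the "information" this block writes back

tokens = np.array([3, 1, 4, 1, 5])
x = W_embed[tokens]             # residual stream starts as the token embeddings
for i in range(n_layers):
    x = x + layer(x, seed=i)    # each block only adds to the stream (residual connection)
logits = x @ W_unembed          # prediction read off only at the very end
pred = int(logits[-1].argmax())
```

Note that `x` stays shape `(seq_len, d_model)` throughout: intermediate blocks never "decode" it, which is exactly why its being in embedding space is incidental rather than essential.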