
It is literally not. Roughly two-thirds of the weights sit in the multi-layer perceptron layers, which act as a dynamic information encoding and retrieval machine, and the attention mechanism allows for very complex interrelationships within the data.
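The two-thirds figure falls out of the matrix shapes in a standard transformer block: the attention projections (Q, K, V, output) are each d x d, while the MLP has two matrices of size d x 4d under the usual 4x hidden expansion. A rough count, with a made-up model width just for illustration:

```python
# Rough parameter count for one transformer block, assuming the
# standard 4x MLP expansion (d is a hypothetical model width).
d = 4096                 # model width (assumed for illustration)
attn = 4 * d * d         # Q, K, V, and output projections
mlp = 2 * d * (4 * d)    # up-projection and down-projection
total = attn + mlp
print(mlp / total)       # 8 d^2 / 12 d^2 = 2/3 of the block's weights
```

Real architectures vary (gated MLPs, grouped-query attention, embedding tables), so the exact fraction shifts, but the MLP dominating the parameter count is typical.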

At the very end of an extremely long and sophisticated process, the final mapping is softmax-transformed and the resulting distribution is sampled. That is one operation among the hundreds of billions leading up to it.
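That final step, taken in isolation, is just this (a minimal NumPy sketch; the logits vector here is made up, standing in for the end product of all the preceding computation):

```python
import numpy as np

# Hypothetical logits over a tiny vocabulary -- in a real model these
# are produced by the billions of operations that come before this step.
logits = np.array([2.0, 1.0, 0.1, -1.0])

# Softmax: exponentiate (shifted by the max for numerical stability)
# and normalize to a probability distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Sample one token id from the distribution.
rng = np.random.default_rng(0)
token_id = rng.choice(len(probs), p=probs)
print(token_id)
```

All the interesting work lives in how `logits` got their values, not in these few lines.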

It’s like saying a Jeopardy player is a random-word-generating machine: they see a clue and generate “what is ” followed by a random word, random in the sense that there is some uncertainty in their mind even at the final moment. That is technically true, but incomplete, and it entirely misses the point.
