Non-determinism is a red herring, and the token layer is the wrong abstraction for this, because determinism is completely orthogonal to correctness. The model can express the same thing in different ways while still being consistently correct or consistently incorrect for the vague input you give it, and equally, nothing prevents it from assigning 100% probability to the only correct output for a particular input. Internally, the model works with ideas, not tokens, and it learns a mapping from ideas to ideas, not from tokens to tokens (that's why base64, for example, is essentially just another language it can easily work with).
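To make the orthogonality concrete, here is a toy sketch. It is not a real LLM: the four "models" are just hand-written probability distributions over output strings, and the `MEANING` table, the `MODELS` dict and the `evaluate` helper are all made up for illustration. The only point is that the number of distinct outputs and the accuracy vary independently.

```python
import random

# Hypothetical toy setup: each surface string maps to the "idea" it expresses.
MEANING = {
    "4": "four", "four": "four", "2 + 2 = 4": "four",
    "5": "five", "five": "five",
}
CORRECT = "four"  # the right answer to the imagined prompt "what is 2 + 2?"

# Four toy "models", each a probability distribution over surface strings.
MODELS = {
    "deterministic, correct": {"4": 1.0},
    "deterministic, wrong": {"5": 1.0},
    "varied, correct": {"4": 0.4, "four": 0.3, "2 + 2 = 4": 0.3},
    "varied, wrong": {"5": 0.5, "five": 0.5},
}

def evaluate(dist, n=1000, seed=0):
    """Sample n outputs; return (number of distinct strings, accuracy)."""
    rng = random.Random(seed)
    samples = rng.choices(list(dist), weights=list(dist.values()), k=n)
    accuracy = sum(MEANING[s] == CORRECT for s in samples) / n
    return len(set(samples)), accuracy

for name, dist in MODELS.items():
    distinct, acc = evaluate(dist)
    print(f"{name:24s} distinct outputs: {distinct}  accuracy: {acc:.2f}")
```

The "varied, correct" model never repeats itself reliably at the string level, yet every sample means the same thing, while the "deterministic, wrong" one is perfectly reproducible and always wrong.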
That's irrelevant semantics, as terms like ideas, thinking, knowledge etc. are ill-defined. Sure, you can call them points in the hidden state space if you want, no problem. The fact is that correctness is distinct from determinism, and the forest of what's happening inside doesn't come down to the trees of most likely tokens. That is well supported by research and by very basic intuition if you've ever tinkered with LLMs: they can easily express the same thing in a different manner if you tweak the autoregressive transport a bit by modifying its output distribution or banning some tokens.
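The token-banning experiment can be sketched on a single decoding step. This is a toy under stated assumptions: the logits below are invented numbers, not taken from any real model, and `softmax`, `ban` and `top_token` are hypothetical helpers. With a real model you would apply the same mask to the logits at every generation step.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn a {token: logit} dict into probabilities, with optional temperature."""
    scaled = {t: v / temperature for t, v in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def ban(logits, banned):
    """Mask banned tokens by setting their logit to -inf (probability 0)."""
    return {t: (-math.inf if t in banned else v) for t, v in logits.items()}

def top_token(probs):
    """Greedy choice: the single most likely token."""
    return max(probs, key=probs.get)

# Invented next-token logits after an imagined prompt "2 + 2 equals".
logits = {"4": 5.0, "four": 4.0, "5": 1.0, "banana": -2.0}

before = softmax(logits)
after = softmax(ban(logits, {"4"}))

print("greedy before ban:", top_token(before))
print("greedy after ban: ", top_token(after))
print("P('4') after ban: ", after["4"])
```

In this toy, banning the most likely token moves the greedy choice from "4" to "four": a different token, but the same answer. Raising `temperature` flattens the distribution in a similar spirit without changing which tokens carry the correct meaning.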
There are a few models of what's happening inside, each with different predictive power, just as physics has different formalisms for, e.g., classical mechanics. You can probably use the same models for biological systems and for entire organizations, collectives, and processes that exhibit learning/prediction/compression on a certain scale, regardless of the underlying architecture.