The LLM sees tokens, and predicts next tokens. These tokens encode a vast world, as experienced by humans and communicated through written language. The LLM is seeing the world, but through a peephole. This is pretty neat.
The peephole will expand soon, as multimodal models come into their own, and as the models start getting mixed with robotics, allowing them to go and interact with the world more directly, instead of through the medium of human-written text.
Strictly speaking, the model doesn't even see the tokens themselves: it sees embeddings, vectors that are trained to encode semantic meaning.
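As a minimal sketch of what that lookup step looks like, here is a PyTorch-style embedding table; the vocabulary size and embedding dimension are made-up illustration values, and the token IDs happen to be GPT-2's encoding of "Hello world", though any in-range integers would do:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration: a 50k-token vocabulary,
# each token mapped to a 768-dimensional vector.
vocab_size, d_model = 50_000, 768

# The embedding table is just a learned lookup: one row per token ID.
embed = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([[15496, 995]])  # GPT-2's IDs for "Hello world"
vectors = embed(token_ids)                # shape: (1, 2, 768)
print(vectors.shape)                      # torch.Size([1, 2, 768])
```

Everything downstream of this lookup operates on those vectors, not on characters or words.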
The way we tokenize is just a design choice. Character-level models exist (e.g. Karpathy's nanoGPT, whose Shakespeare example trains on individual characters) and are mostly used for teaching. Because such a model sees every letter as its own token, you can train it to count the number of 'r's in a word, something BPE-tokenized models famously struggle with. A sketch of that tokenizer is below.
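To make the contrast concrete, here is a minimal character-level tokenizer in the style of nanoGPT's shakespeare_char data prep (building the vocabulary from a single word is a toy assumption for illustration): each letter is its own token, so counting 'r's reduces to counting token IDs.

```python
# Character-level tokenization: every distinct character gets its own ID.
text = "strawberry"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> ID
itos = {i: ch for ch, i in stoi.items()}      # ID -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

ids = encode("strawberry")
print(ids)                    # one ID per character
print(decode(ids))            # "strawberry"
print(ids.count(stoi["r"]))   # 3 -- each 'r' is its own visible token
```

A BPE tokenizer would instead merge "strawberry" into a handful of subword chunks, so the individual 'r's are never directly visible to the model.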