Simple: token sequences are fact-checked and the training data is annotated to update the accuracy value of the tokens in context. The sequences "Donald Trump is the new Jesus" and "Donald Trump is the new Hitler" would then have different accuracy scores (probably represented as a mean/deviation pair to capture uncertainty/divisiveness).
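Roughly what I have in mind, as a toy sketch (the annotation scheme and the verdict numbers are made up for illustration, not a real pipeline):

```python
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class AnnotatedSequence:
    text: str
    accuracy_mean: float  # average fact-check verdict across checkers
    accuracy_std: float   # spread, capturing uncertainty/divisiveness

def annotate(text: str, verdicts: list[float]) -> AnnotatedSequence:
    """verdicts: per-checker accuracy scores in [0, 1] for the sequence in context."""
    return AnnotatedSequence(
        text=text,
        accuracy_mean=mean(verdicts),
        accuracy_std=stdev(verdicts) if len(verdicts) > 1 else 0.0,
    )

# Two sequences a plain language model treats symmetrically end up with
# different annotations once fact-check verdicts are attached.
seq_a = annotate("Donald Trump is the new Jesus",  [0.05, 0.10, 0.00, 0.20])
seq_b = annotate("Donald Trump is the new Hitler", [0.10, 0.60, 0.05, 0.70])
print(seq_a, seq_b, sep="\n")
```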
When I say "solving the language-switch issue", I mean something akin to adding a translation layer to transformers: you simultaneously learn a translation into a meta-language and the token transition probabilities of that meta-language.
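Something in the spirit of this minimal sketch (module names, vocabulary sizes, and the soft-assignment trick are all my own assumptions, just to show the shape of the idea):

```python
import torch
import torch.nn as nn

class MetaLanguageLM(nn.Module):
    """Toy sketch: jointly learn (1) a 'translation' from surface tokens to a
    shared meta-language vocabulary and (2) transition probabilities over
    those meta-tokens."""
    def __init__(self, surface_vocab=32000, meta_vocab=4096, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(surface_vocab, d_model)
        # "translation layer": soft assignment of each position to a meta-token
        self.to_meta = nn.Linear(d_model, meta_vocab)
        self.meta_embed = nn.Embedding(meta_vocab, d_model)
        # transition model over meta-tokens (one transformer layer here)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transition = nn.TransformerEncoder(layer, num_layers=1)
        self.next_meta = nn.Linear(d_model, meta_vocab)

    def forward(self, surface_ids):                      # (batch, seq)
        h = self.embed(surface_ids)
        meta_probs = self.to_meta(h).softmax(-1)         # translation into the meta-language
        meta_h = meta_probs @ self.meta_embed.weight     # soft meta-token embeddings
        mask = nn.Transformer.generate_square_subsequent_mask(surface_ids.size(1))
        ctx = self.transition(meta_h, mask=mask)         # causal transition model
        return self.next_meta(ctx)                       # logits over the next meta-token

model = MetaLanguageLM()
logits = model(torch.randint(0, 32000, (2, 16)))
print(logits.shape)  # torch.Size([2, 16, 4096])
```

Since both the translation layer and the transition model sit in one differentiable graph, a single training loss would shape them together, which is what I mean by learning the two simultaneously.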