
Their tokenizer seems very algorithmic and predetermined. I suppose that is a typical burden of multi-class classifiers. It might be interesting to have a way of dynamically tokenizing words/sentences based on confidence about them. Some unknown words might need a letter-for-letter representation, while some commonly occurring phrases could almost be encoded as a single token. That'd probably require a completely different way of representing tokens and computing the loss, since the representations would change throughout training as the confidence about some context changes.
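The fallback idea could be sketched roughly like this (a toy example, not anything from the actual system: the vocabulary, the membership test as a stand-in for "confidence", and all names are assumptions):

```python
# Toy sketch of variable-granularity tokenization: words the model is
# "confident" about (proxied here by membership in a known vocabulary)
# become a single token, while unknown words fall back to
# letter-for-letter tokens. Purely illustrative.

KNOWN = {"the", "cat", "sat"}  # hypothetical high-confidence vocabulary

def tokenize(text):
    tokens = []
    for word in text.lower().split():
        if word in KNOWN:
            tokens.append(word)        # one token for a known word
        else:
            tokens.extend(list(word))  # letter-for-letter fallback
    return tokens

print(tokenize("the cat sgqx"))
# known words stay whole; the unknown word is split into characters
```

In a trained model the membership test would presumably be replaced by some confidence signal, which is what makes the loss tricky: the tokenization itself shifts as training progresses.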

