SentencePiece is a tool and library for training and using tokenizers, and supports two algorithms: Byte-Pair Encoding (BPE) and Unigram. You could almost say it is the library for tokenizers, as it has been standard in research for years now.
Tiktoken is a library that only supports BPE. It has also become synonymous with the tokenizer used by GPT-3, ChatGPT, and GPT-4, even though that is actually just one specific tokenizer shipped with tiktoken.
What Mistral is saying here (in marketing speak) is that they trained a new BPE model on data that is more multilingually balanced than their previous BPE model's training data. It so happens that they trained one with SentencePiece and the other with tiktoken, but that by itself shouldn't make any difference in tokenization quality or compression efficiency: both are learning BPE merges, and the result depends on the training data and vocabulary size, not the library. The switch to tiktoken probably had more to do with runtime performance (tiktoken's encoder core is written in Rust) or tooling convenience than with the tokenizer itself.
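To make the point concrete that "which library" matters less than "which data": BPE is just a greedy procedure that repeatedly merges the most frequent adjacent symbol pair, so whatever merges get learned are entirely a function of the training corpus. Here's a minimal toy sketch of that procedure in plain Python (the function names `train_bpe` and `encode` are my own, not SentencePiece's or tiktoken's APIs, and real implementations add byte fallback, pre-tokenization regexes, and much faster data structures):

```python
from collections import Counter

def train_bpe(corpus: str, num_merges: int):
    """Toy BPE trainer: learn merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Apply the merge to every word in the working vocabulary.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

def encode(word: str, merges) -> list[str]:
    """Tokenize a word by replaying the learned merges in order."""
    toks = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(toks):
            if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(toks[i])
                i += 1
        toks = out
    return toks
```

Train it on an English-only toy corpus and English words compress well while anything else falls back to characters; retrain on more balanced data and the merge table shifts accordingly. That is exactly the "more balanced multilingually" lever, and it is orthogonal to whether SentencePiece or tiktoken ran the training loop.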