Wonder what the split is between Russian and English in the model?

Open the vocab file (from the script in the download directory) and you can get a pretty good idea.

Looks to be approximately 50/50 from my random scrolling through the list.
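
A rough way to quantify this instead of eyeballing: classify each token by script and count. A minimal sketch, assuming a plain-text vocab file with one token per line (real vocab files are often JSON or token-id pairs, so adjust the parsing accordingly):

  import unicodedata

  def script_of(token: str) -> str:
      """Classify a token by the script of its letters."""
      letters = [c for c in token if c.isalpha()]
      if not letters:
          return "other"  # digits, punctuation, byte tokens, etc.
      names = [unicodedata.name(c, "") for c in letters]
      if all(n.startswith("CYRILLIC") for n in names):
          return "cyrillic"
      if all(n.startswith("LATIN") for n in names):
          return "latin"
      return "mixed"

  counts = {"cyrillic": 0, "latin": 0, "mixed": 0, "other": 0}
  with open("vocab.txt", encoding="utf-8") as f:  # hypothetical file name
      for line in f:
          counts[script_of(line.strip())] += 1

  print(counts)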


That's because English and Russian have pretty similar vocabulary sizes. The vocabulary doesn't reflect the amount of data.


In this case, it does, because the vocab is not a list of words but a list of tokens. Each token may be a word, but it might also be a phrase or part of a word. The tokens are generated to be optimal for the input data, i.e., chosen so that, for a given vocab size, the number of tokens needed to represent that data is minimized.

Therefore, the makeup of the vocab is a good guide to the makeup of the data: if there were 10x more English-language data, the optimal allocation would dedicate more token space to English than to Russian.
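
You can see this effect directly by training a small BPE tokenizer on a lopsided corpus. A sketch using the Hugging Face `tokenizers` library (the file names and the 10:1 size ratio are made up for illustration; this is not the model's actual training pipeline):

  import itertools
  from tokenizers import Tokenizer, models, pre_tokenizers, trainers

  def lines(path):
      """Stream a text file line by line."""
      with open(path, encoding="utf-8") as f:
          yield from f

  tokenizer = Tokenizer(models.BPE())
  tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
  trainer = trainers.BpeTrainer(vocab_size=32_000)

  # Suppose english.txt holds ~10x as much text as russian.txt.
  tokenizer.train_from_iterator(
      itertools.chain(lines("english.txt"), lines("russian.txt")),
      trainer=trainer,
  )

  def has_cyrillic(token: str) -> bool:
      return any("\u0400" <= c <= "\u04ff" for c in token)

  vocab = tokenizer.get_vocab()  # token -> id
  cyrillic = sum(1 for t in vocab if has_cyrillic(t))
  print(f"{cyrillic} of {len(vocab)} tokens contain Cyrillic")
  # Frequent English substrings win most of the 32k merge slots,
  # so Cyrillic tokens end up a clear minority of the vocab.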



