Wonder what the split is between Russian and English in the model?

Open the vocab file (from the script in the download directory) and you can get a pretty good idea.

Looks to be approximately 50/50 from my random scrolling through the list.
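
A rough way to quantify this instead of eyeballing: classify each token by script and count. A minimal sketch, assuming a plain-text vocab file with one token per line (real vocab files are often JSON or token-id pairs, so adjust the parsing accordingly):

  import unicodedata

  def script_of(token: str) -> str:
      """Classify a token by the script of its letters."""
      letters = [c for c in token if c.isalpha()]
      if not letters:
          return "other"  # digits, punctuation, byte tokens, etc.
      names = [unicodedata.name(c, "") for c in letters]
      if all(n.startswith("CYRILLIC") for n in names):
          return "cyrillic"
      if all(n.startswith("LATIN") for n in names):
          return "latin"
      return "mixed"

  counts = {"cyrillic": 0, "latin": 0, "mixed": 0, "other": 0}
  with open("vocab.txt", encoding="utf-8") as f:  # hypothetical file name
      for line in f:
          counts[script_of(line.strip())] += 1

  print(counts)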


That's because English and Russian have pretty similar vocabulary sizes. The vocabulary doesn't reflect the amount of data.


In this case, it does, because the vocab is not a list of words but a list of tokens. Each token may be a word, but it might also be a phrase or part of a word. The tokens are generated to be optimal for the input data, i.e., chosen so that, for a given vocab size, the number of tokens needed to represent that data is minimized.

Therefore, the makeup of the vocab is a good guide to the makeup of the data: if there were 10x more English-language data, the optimal allocation would dedicate more token space to English than to Russian.
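
You can see this effect directly by training a small BPE tokenizer on a lopsided corpus. A sketch using the Hugging Face `tokenizers` library (the file names and the 10:1 size ratio are made up for illustration; this is not the model's actual training pipeline):

  import itertools
  from tokenizers import Tokenizer, models, pre_tokenizers, trainers

  def lines(path):
      """Stream a text file line by line."""
      with open(path, encoding="utf-8") as f:
          yield from f

  tokenizer = Tokenizer(models.BPE())
  tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
  trainer = trainers.BpeTrainer(vocab_size=32_000)

  # Suppose english.txt holds ~10x as much text as russian.txt.
  tokenizer.train_from_iterator(
      itertools.chain(lines("english.txt"), lines("russian.txt")),
      trainer=trainer,
  )

  def has_cyrillic(token: str) -> bool:
      return any("\u0400" <= c <= "\u04ff" for c in token)

  vocab = tokenizer.get_vocab()  # token -> id
  cyrillic = sum(1 for t in vocab if has_cyrillic(t))
  print(f"{cyrillic} of {len(vocab)} tokens contain Cyrillic")
  # Frequent English substrings win most of the 32k merge slots,
  # so Cyrillic tokens end up a clear minority of the vocab.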



