
Site seems overloaded. Internet Archive has a capture.

https://web.archive.org/web/20221028111744/http://simia.net/...



The corpus looks very much web-based and not properly cleaned. I conclude that because digits are pretty rare in a normal corpus, much rarer than x and y, which is apparently not the case here. The English list also includes some punctuation and half of the Greek alphabet. I suppose the counting didn't exclude proper names and formulas. So if you want to identify the domain of a Wikipedia page based on 1-grams, this is helpful; otherwise, less so.



