
If this is like most n-gram analyses, the percentages are of the total corpus, i.e. a percentage of words, not articles. So 60,000 articles could be 12,000,000 words, and 0.02% of that is 2,400 positives, if there are 200 words per article (a SWAG).
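As a quick check of that back-of-the-envelope estimate, here is a minimal sketch. The 200 words per article is the commenter's guess and the 0.02% rate is the figure discussed in this thread; the variable names are illustrative only.

```python
# Back-of-the-envelope estimate; every input here is an assumption from the thread.
articles = 60_000
words_per_article = 200      # a guess (SWAG), not a measured value
rate_per_10k = 2             # 0.02% expressed as 2 per 10,000 words

total_words = articles * words_per_article
# Integer arithmetic avoids floating-point rounding noise.
positives = total_words * rate_per_10k // 10_000

print(f"{total_words:,} words, {positives:,} positives")
```

Doubling the words-per-article guess doubles the estimated count, so the result is only as good as that guess.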


Looks like you are right. From the FAQ:

>What does the y-axis mean exactly? The y-axis represents the frequency of each phrase, as a percentage of all phrases that contain the same number of words. For example, if you search for from New York, the graph shows the number of times those words appear in exact order, divided by the total number of 3 word phrases in all of the articles.

I think doing it at a per-article level makes more sense for an analysis like this, but 0.02% is actually pretty significant when n is on the order of millions.
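To make the FAQ's definition concrete, here is a rough sketch of a phrase frequency computed that way: occurrences of the phrase divided by the total number of phrases of the same length. It assumes lowercase, whitespace-only tokenization and counts n-grams within each article; the actual tool may tokenize and count differently, and `ngram_frequency` is a hypothetical helper, not part of any real API.

```python
def ngram_frequency(articles, phrase):
    """Fraction of all n-word phrases in `articles` that equal `phrase`.

    Assumes naive lowercase, whitespace tokenization (an illustration only).
    """
    target = phrase.lower().split()
    n = len(target)
    hits = 0
    total = 0
    for text in articles:
        words = text.lower().split()
        # Slide a window of n words across the article.
        for i in range(len(words) - n + 1):
            total += 1
            if words[i:i + n] == target:
                hits += 1
    return hits / total if total else 0.0


# Made-up sample articles, purely for illustration.
sample = [
    "I flew from New York to Boston",
    "She is from New Jersey",
]
freq = ngram_frequency(sample, "from New York")
print(f"{freq:.3%}")
```

A per-article measure would instead count the articles containing the phrase at least once and divide by the number of articles, which is why the two percentages can differ so much.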

Thanks for the clarification.



