
What did you use to compute P(D|Ci)?

This uses your notation from the Dr. Dobb's article, so that Ci (a category) is a user and D is a document (a comment?).

Did you use something like trigram signatures?

Also, is P(Ci) equal to (number of comments made by Ci) / (total number of users)?

Interesting stuff, jgrahamc!
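For readers following the notation, here is a minimal sketch of the naive Bayes setup the question seems to have in mind. It is an illustration only, not the method actually used: the function names (`train`, `log_posterior`, `most_likely_author`), the add-one smoothing, and the toy data are all assumptions. It also assumes the more usual prior, P(Ci) = comments by Ci / total comments, rather than dividing by the number of users.

```python
import math
from collections import Counter, defaultdict

def train(comments):
    """comments: iterable of (user, list_of_tokens) pairs."""
    word_counts = defaultdict(Counter)  # per-user token counts
    comment_counts = Counter()          # per-user comment counts
    for user, tokens in comments:
        word_counts[user].update(tokens)
        comment_counts[user] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, comment_counts, vocab

def log_posterior(tokens, user, word_counts, comment_counts, vocab):
    """log P(Ci) + sum of log P(w|Ci), i.e. log P(Ci|D) up to a constant."""
    # Assumed prior: this user's share of all comments.
    log_p = math.log(comment_counts[user] / sum(comment_counts.values()))
    total = sum(word_counts[user].values())
    for w in tokens:
        # Add-one smoothing so an unseen word does not zero out the product.
        log_p += math.log((word_counts[user][w] + 1) / (total + len(vocab)))
    return log_p

def most_likely_author(tokens, model):
    """Return the user with the highest (unnormalised) log posterior."""
    _, comment_counts, _ = model
    return max(comment_counts, key=lambda u: log_posterior(tokens, u, *model))

# Toy data, invented purely for illustration.
model = train([
    ("alice", ["pure", "ad", "spam"]),
    ("alice", ["spam", "filter"]),
    ("bob", ["bayes", "rule"]),
])
print(most_likely_author(["spam", "filter"], model))
```

Working in log space avoids underflow when a comment has many words; the constant P(D) is dropped because it is the same for every candidate user.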




I just used whitespace-separated words after stripping punctuation.
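A rough sketch of what that tokenization might look like (a guess at the idea, not the actual code). The name `tokenize` and the use of Python's `string.punctuation` as the set of characters to strip are assumptions; whether case was folded isn't stated, so this leaves case alone.

```python
import string

# Assumption: "punctuation" means the ASCII set in string.punctuation.
_STRIP = str.maketrans("", "", string.punctuation)

def tokenize(text):
    """Strip punctuation, then split on runs of whitespace."""
    return text.translate(_STRIP).split()

print(tokenize('A very unusual hyphenation, "pure-ad", for example.'))
```

With this ordering, a hyphenated word collapses into a single token ("pure-ad" becomes "puread"), so the hyphen itself is no longer available as a feature.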


I'd have thought that capitalisation and punctuation were key elements in any textual analysis. In the subject text there is a very unusual hyphenation, "pure-ad", for example.


In authorship identification, punctuation can be very important, as can non-grammatical features like the distribution of the number of syllables per word.
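To make those two kinds of feature concrete, here is a small sketch. The function names (`estimate_syllables`, `style_features`) are hypothetical, and the syllable counter is a crude vowel-run heuristic that miscounts things like silent "e"; a real study would use a pronunciation dictionary or a better estimator.

```python
import re
import string
from collections import Counter

def estimate_syllables(word):
    """Crude heuristic: count runs of vowels (including y), minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def style_features(text):
    """Return (punctuation counts, syllables-per-word distribution)."""
    # How often each punctuation character appears in the raw text.
    punct = Counter(ch for ch in text if ch in string.punctuation)
    # Alphabetic runs only, so "pure-ad" is counted as two words here.
    words = re.findall(r"[A-Za-z]+", text)
    syllables = Counter(estimate_syllables(w) for w in words)
    n = len(words) or 1  # avoid dividing by zero on empty input
    dist = {k: v / n for k, v in sorted(syllables.items())}
    return punct, dist

punct, dist = style_features('A very unusual hyphenation, "pure-ad", for example.')
print(punct)
print(dist)
```

Either result could be compared across candidate authors, or fed into a classifier alongside word counts.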



