I'd have thought that capitalisation and punctuation were key elements in any textual analysis. The subject text, for example, contains a very unusual hyphenation: "pure-ad".
In authorship identification, punctuation can be very important, as can non-grammatical features like the distribution of the number of syllables per word.
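To make that concrete, here is a rough sketch of the kind of features I mean. The function names are mine, and the syllable counter is only a crude vowel-run heuristic (it over-counts words with a silent "e", for instance), so treat it as an illustration rather than a proper stylometry tool.

```python
import re
import string
from collections import Counter

def count_syllables(word):
    """Crude estimate (an assumption, not a real syllabifier):
    count runs of consecutive vowels, with a minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def style_features(text):
    """Collect a few simple stylistic features, keeping case and punctuation."""
    # Words are runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    # How often each punctuation character occurs.
    punctuation = Counter(ch for ch in text if ch in string.punctuation)
    # Distribution of (estimated) syllables per word.
    syllables = Counter(count_syllables(w) for w in words)
    # Fraction of words that start with a capital letter.
    capitalised = sum(1 for w in words if w[0].isupper())
    return {
        "punctuation": punctuation,
        "syllables_per_word": syllables,
        "capitalised_fraction": capitalised / len(words) if words else 0.0,
    }
```

Calling `style_features(comment_text)` on each user's comments would give per-user profiles to compare, and an odd hyphen like the one in "pure-ad" would show up in the punctuation counts.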
I'm using your notation from the Dr. Dobb's article here, so Ci (a category) is a user and D is a document (a comment?).
Did you use something like trigram signatures?
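By trigram signatures I mean something along these lines. This is only my guess at one possible approach, using character trigrams with case and punctuation left intact; the function names are made up for the example and I'm not claiming this is what you did.

```python
import math
from collections import Counter

def trigram_counts(text):
    """Count overlapping character trigrams, preserving case and punctuation.
    Runs of whitespace are collapsed to a single space first."""
    normalised = " ".join(text.split())
    return Counter(normalised[i:i + 3] for i in range(len(normalised) - 2))

def cosine_similarity(a, b):
    """Cosine similarity between two trigram count profiles (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

You would build one profile per known user from their comments, then compare the profile of an unattributed comment against each of them.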
Also, is P(Ci) equal to (number of comments made by Ci) / (total number of users)?
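For comparison, the usual empirical estimate of a class prior in naive Bayes divides by the total number of documents (comments here) rather than the number of classes, so that the priors sum to 1. A minimal sketch of that estimate, with a hypothetical input format of one author name per comment; whether this matches what you actually did is exactly what I'm asking.

```python
from collections import Counter

def empirical_priors(comment_authors):
    """Estimate P(Ci) as (comments by Ci) / (total comments).
    `comment_authors` is assumed to hold one author name per comment."""
    counts = Counter(comment_authors)
    total = sum(counts.values())
    return {user: n / total for user, n in counts.items()}
```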
Interesting stuff, jgrahamc!