Hacker News

I tried to train Bayesian (and other) classifiers to reliably pick the same stories to read as I would. Despite looking at a variety of things (title, poster, domain, the text of the article, the text of the comments), I found their accuracy was never really better than 60%.

Then I tried rating the same set of articles myself several times. My own ratings only agreed with each other around 60% of the time too.

Figures.
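
For anyone wanting to repeat that self-consistency check, here is a minimal sketch. It assumes each rating pass is a dict mapping a story ID to a read/skip boolean; the function name and the sample data are made up for illustration and are not the code I actually ran.

```python
from itertools import combinations


def self_agreement(passes):
    """Mean pairwise agreement between repeated rating passes.

    `passes` is a list of dicts mapping story id -> bool (would read or not).
    Only stories rated in both passes of a pair are compared.
    """
    rates = []
    for a, b in combinations(passes, 2):
        shared = a.keys() & b.keys()
        if not shared:
            continue
        same = sum(1 for story in shared if a[story] == b[story])
        rates.append(same / len(shared))
    if not rates:
        raise ValueError("need at least two passes with stories in common")
    return sum(rates) / len(rates)


# Hypothetical data: the same five stories rated on two different days.
monday = {1: True, 2: False, 3: True, 4: False, 5: True}
friday = {1: True, 2: True, 3: False, 4: False, 5: True}
print(self_agreement([monday, friday]))  # 3 of 5 stories match
```

If a classifier is scored against just one of those passes, a number near this self-agreement rate is probably about as good as it can be expected to get.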



Yeah, I think it shows how hard it is to classify something with such a diverse set of stories. Each week for my Hacker Newsletter project I have to come up with a short list of links to share. I don't ever want to make it "automated", but at the same time I need to narrow down what to pick from. I have tried several things in the past, but what has worked best for me is a combination of: what I voted up, votes, number of comments, whether <user> commented, time on the front page, and finally a lot of regex filters. Tying all that together with a simple interface/tool lets me find a list of articles that I think my subscribers will enjoy.
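
A rough sketch of how those signals could be combined into a shortlist, in case it helps. Everything here is an assumption for illustration: the `Story` fields, the weights, the watched user name, and the regex patterns are all invented, and this is not the actual tool.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Story:
    title: str
    points: int
    num_comments: int
    hours_on_front_page: float
    i_upvoted: bool = False
    commenters: set = field(default_factory=set)


# All names, weights, and patterns below are hypothetical.
WATCHED_USERS = {"some_user"}
EXCLUDE = [re.compile(r"\bhiring\b", re.I)]
BOOST = [re.compile(r"^show hn", re.I)]


def score(story):
    """Return a rough ranking score, or None if a regex filter rejects it."""
    if any(p.search(story.title) for p in EXCLUDE):
        return None
    total = 0.0
    total += 5.0 if story.i_upvoted else 0.0
    total += story.points / 50
    total += story.num_comments / 25
    total += 2.0 if story.commenters & WATCHED_USERS else 0.0
    total += story.hours_on_front_page / 6
    total += 1.0 if any(p.search(story.title) for p in BOOST) else 0.0
    return total


def shortlist(stories, limit=20):
    """Drop filtered stories and return the highest-scoring ones first."""
    scored = [(score(s), s) for s in stories]
    kept = [(sc, s) for sc, s in scored if sc is not None]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in kept[:limit]]
```

The output is only a candidate list to pick from by hand, not the final selection.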



