
As someone who has been doing this for over a decade with Twitter data in a specific niche (https://reviewsignal.com - for web hosting), it's interesting to see this getting so much attention. I played with Reddit data years ago as well but never moved forward with it. The volume really wasn't there for my needs. Volume mattered for exactly the reason you touch on: fake reviews. You need an overwhelmingly large sample size to drown out fake reviews.

I am also curious what sort of spam filtering mechanisms you have in place. In my data, the spam filters alone removed 98% of content before it ever hit sentiment analysis or relevancy analysis. I imagine Reddit is better than Twitter, but there is still going to be spam. What measures do you have in place, and how do you determine them?

Do you take upvotes into account when weighting reviews? That was/is a concern I had when working with Reddit data. I used retweets as a proxy for popularity when sorting, but not for any other weighting.

I'm definitely interested in this segment as I've been doing it for a long time. If you want to talk, please reach out; my contact info is in my profile.


