
Not a statistician, but this still seems flawed. The pretend votes need to be related to the person seeing the list of items. These normally come from the population (e.g. if you were ranking Netflix movies, the pretend votes would be the sum of all existing votes across every movie, grouped by star count). This makes sense, because if you had no other information, your guess would just be the average of all the existing ratings.

The problem is that the pretend votes need to be culled in order to be predictive. Otherwise they dominate in the arithmetic. They need to be more specific to the user looking at the ranking. Continuing with the Netflix example, if a user was looking for scary movies, the pretend votes need to come from the corpus of all scary movies, rather than all movies that exist.
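
To make "dominate in the arithmetic" concrete, the usual pretend-vote score is just a weighted average of the prior and the real ratings, something like this (my own sketch; the names and numbers are made up):

    # Bayesian / "pretend vote" average: blend an item's real ratings with
    # C pretend votes placed at the population mean.
    def pretend_vote_average(ratings, prior_mean, C):
        n = len(ratings)
        return (C * prior_mean + sum(ratings)) / (C + n)

    # With only a few real ratings, the prior dominates:
    # pretend_vote_average([5, 5], prior_mean=3.2, C=20) is about 3.36,
    # even though the item's own average is 5.0.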

Here's the problem: there doesn't seem to be a good way to narrow the pretend votes. Worse, there isn't a good way to combine multiple sources of them. If the pretend votes came from two sources, it's not clear what to do. For example, if the user is from California, the California pretend votes (priors?) would need to be combined with the scary-movie pretend votes.

How can we add pretend votes without justifying where they came from?



It doesn't have to be correct, just a plausible starting point. The "pretend votes" have less importance as more real votes come in.

I do think this article suggests adding too many pretend votes. Without the kind of justification you're talking about, it's usually better to add only a couple (reflecting low confidence in the prior).
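
For example (rough numbers I made up), the number of pretend votes controls how quickly real votes take over:

    # Same weighted-average idea, for up/down votes this time.
    def score(up, down, prior_ratio, pretend):
        return (pretend * prior_ratio + up) / (pretend + up + down)

    print(score(4, 1, prior_ratio=0.5, pretend=2))   # ~0.71, real votes already dominate
    print(score(4, 1, prior_ratio=0.5, pretend=20))  # ~0.56, still pinned near the prior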


I'm just getting into stats, but the way I see it, a new item has 0 votes with an error bar +/- the number of possible votes. Each sorting should then include a randomization factor related to the error bar, and so randomly promote some new items into top rankings so they get some exposure to gather votes. As they accumulate votes, the error bar shrinks as the ranking becomes a little more certain.
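
Something like this, maybe (very rough sketch; the shrinking-noise rule is just illustrative, not a proper method):

    import random

    # Rank by (observed ratio + noise), where the noise shrinks as an item
    # accumulates votes, so new items occasionally surface near the top.
    def noisy_key(upvotes, downvotes):
        n = upvotes + downvotes
        estimate = upvotes / n if n else 0.5     # no data: assume the middle
        error_bar = 1.0 / (n + 1) ** 0.5         # crude stand-in for uncertainty
        return estimate + random.uniform(-error_bar, error_bar)

    items = [("new", 0, 0), ("ok", 60, 40), ("great", 900, 100)]
    ranked = sorted(items, key=lambda x: noisy_key(x[1], x[2]), reverse=True)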


I believe the right way to think about it isn't error bars, but the entire probability distribution -- what's the probability that if everybody voted, the upvote/downvote ratio would be 75/25, 80/20, 85/15, etc. Once you've figured out the probability distribution, you can calculate error bars any way you like (e.g. 95% confidence interval).

The beta distribution is one model you can choose for that probability distribution, which happens to have some nice properties that make it easy to work with.

The other question is, what's the "zero knowledge" probability distribution? I think your "0 votes with an error bar +/- the number of possible votes" would translate to "uniform probability of any result", which I think is beta(1,1).

Depending on the scenario, though, you might look at the data and observe that extreme values are very uncommon, and therefore start with something like beta(2,2) instead (a bell curve rather than a flat distribution). That has minimal impact once you have lots of real upvote/downvote data, but it makes a huge difference to how the first few votes are interpreted.
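
For example, with scipy (just a sketch; the priors and vote counts are made up):

    from scipy.stats import beta

    def posterior_summary(up, down, prior_a=1, prior_b=1):
        a, b = prior_a + up, prior_b + down
        mean = a / (a + b)
        lo, hi = beta.ppf([0.025, 0.975], a, b)   # 95% credible interval
        return mean, lo, hi

    # First vote is an upvote:
    print(posterior_summary(1, 0, 1, 1))   # beta(1,1) prior: mean ~0.67, very wide interval
    print(posterior_summary(1, 0, 2, 2))   # beta(2,2) prior: mean 0.60, pulled toward 0.5

    # With lots of votes the prior barely matters:
    print(posterior_summary(800, 200, 2, 2))   # mean ~0.80 either way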


Right, that sounds more like what I meant. Still familiarizing myself with the terminology, thanks!


I like to think about it this way. By not explicitly imposing a prior, you are implicitly imposing a prior that each item will receive no votes. This is totally nonsensical, because of course these items will get votes.

Just because we don't know what the true value of p will be doesn't mean we don't have some expectation. If I asked you what you expect the popularity of a given item to be, you wouldn't say 0; you'd say something like the average. So why assume all items will have 0 votes in our model?



