
Not a statistician, but this still seems flawed. The pretend votes need to be related to the person seeing the list of items. These normally come from the population (e.g. if you were ranking Netflix movies, the pretend votes would be the sum of all existing votes across every movie, grouped by star count). This makes sense, because if you had no other information, your guess would just be the average of all the existing ratings.

The problem is that the pretend votes need to be culled in order to be predictive. Otherwise they dominate in the arithmetic. They need to be more specific to the user looking at the ranking. Continuing with the Netflix example, if a user was looking for scary movies, the pretend votes need to come from the corpus of all scary movies, rather than all movies that exist.
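
To make "dominate in the arithmetic" concrete, the usual pretend-vote score is just a weighted average of the prior and the real ratings, something like this (my own sketch; the names and numbers are made up):

    # Bayesian / "pretend vote" average: blend an item's real ratings with
    # C pretend votes placed at the population mean.
    def pretend_vote_average(ratings, prior_mean, C):
        n = len(ratings)
        return (C * prior_mean + sum(ratings)) / (C + n)

    # With only a few real ratings, the prior dominates:
    # pretend_vote_average([5, 5], prior_mean=3.2, C=20) is about 3.36,
    # even though the item's own average is 5.0.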

Here's the problem: there doesn't seem to be a good way to narrow the pretend votes. Worse, there isn't a good way to combine multiple sources of them. If the pretend votes came from two sources, it's not clear what to do. For example, if the user is from California, the California pretend votes (priors?) would need to be combined with the scary-movie pretend votes.

How can we add pretend votes without justifying where they came from?



It doesn't have to be correct, just a plausible starting point. The "pretend votes" have less importance as more real votes come in.

I do think this article suggests adding too many pretend votes. Without the kind of justification you're talking about, it's usually better to add only a couple (reflecting low confidence in the prior).
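
For example (rough numbers I made up), the number of pretend votes controls how quickly real votes take over:

    # Same weighted-average idea, for up/down votes this time.
    def score(up, down, prior_ratio, pretend):
        return (pretend * prior_ratio + up) / (pretend + up + down)

    print(score(4, 1, prior_ratio=0.5, pretend=2))   # ~0.71, real votes already dominate
    print(score(4, 1, prior_ratio=0.5, pretend=20))  # ~0.56, still pinned near the prior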


I'm just getting into stats, but the way I see it, a new item has 0 votes with an error bar +/- the number of possible votes. Each sorting should then include a randomization factor related to the error bar, and so randomly promote some new items into top rankings so they get some exposure to gather votes. As they accumulate votes, the error bar shrinks as the ranking becomes a little more certain.
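
Something like this, maybe (very rough sketch; the shrinking-noise rule is just illustrative, not a proper method):

    import random

    # Rank by (observed ratio + noise), where the noise shrinks as an item
    # accumulates votes, so new items occasionally surface near the top.
    def noisy_key(upvotes, downvotes):
        n = upvotes + downvotes
        estimate = upvotes / n if n else 0.5     # no data: assume the middle
        error_bar = 1.0 / (n + 1) ** 0.5         # crude stand-in for uncertainty
        return estimate + random.uniform(-error_bar, error_bar)

    items = [("new", 0, 0), ("ok", 60, 40), ("great", 900, 100)]
    ranked = sorted(items, key=lambda x: noisy_key(x[1], x[2]), reverse=True)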


I believe the right way to think about it isn't error bars, but the entire probability distribution -- what's the probability that if everybody voted, the upvote/downvote ratio would be 75/25, 80/20, 85/15, etc. Once you've figured out the probability distribution, you can calculate error bars any way you like (e.g. 95% confidence interval).

The beta distribution is one model you can choose for that probability distribution, which happens to have some nice properties that make it easy to work with.

The other question is, what's the "zero knowledge" probability distribution? I think your "0 votes with an error bar +/- the number of possible votes" would translate to "uniform probability of any result", which I think is beta(1,1).

Depending on the scenario, though, you might look at the data and observe that extreme values are very uncommon, and therefore start with something like beta(2,2) instead (a bell curve rather than a flat distribution). That has minimal impact once you have lots of real upvote/downvote data, but it makes a huge difference to how the first few votes are interpreted.
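
For example, with scipy (just a sketch; the priors and vote counts are made up):

    from scipy.stats import beta

    def posterior_summary(up, down, prior_a=1, prior_b=1):
        a, b = prior_a + up, prior_b + down
        mean = a / (a + b)
        lo, hi = beta.ppf([0.025, 0.975], a, b)   # 95% credible interval
        return mean, lo, hi

    # First vote is an upvote:
    print(posterior_summary(1, 0, 1, 1))   # beta(1,1) prior: mean ~0.67, very wide interval
    print(posterior_summary(1, 0, 2, 2))   # beta(2,2) prior: mean 0.60, pulled toward 0.5

    # With lots of votes the prior barely matters:
    print(posterior_summary(800, 200, 2, 2))   # mean ~0.80 either way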


Right, that sounds more like what I meant. Still familiarizing myself with the terminology, thanks!


I like to think about it this way. By not explicitly imposing a prior, you are implicitly imposing a prior that each item will receive no votes. This is totally nonsensical, because of course these items will get votes.

Just because we don't know what the true value of p will be doesn't mean we don't have some expectation. If I asked you what you expect the popularity of a given item to be, you wouldn't say 0; you'd say something like the average. So why assume all items will have 0 votes in our model?



