Brewing a Better Rating System

mpotter · on Oct 29, 2009

Hi, I'm Mike from Steepster. We thought we'd share our new ratings system we just deployed with HN as we think it's relevant for products with customer reviews, ratings, etc. It's our attempt to combat the 4.3 dilemma (discussed here recently: http://news.ycombinator.com/item?id=883890).

Background: Steepster is a community site for tea drinkers to share their tasting notes, get recommendations, and discover new teas.

Feedback appreciated!

callmeed · on Oct 29, 2009

Mike, great job on this. Very informative. I have 2 questions for you:

1. is your slider from the jQuery UI or other js framework?

2. with regard to combating the 4.3 dilemma, have you found the average ratings on steepster to be lower? maybe it's too early to tell, but I'd love to see some sort of curve on your ratings distribution in a future post ...

thanks

mpotter · on Oct 29, 2009

Thanks, callmeed.

1. Yep, slider implementation is jQuery UI.

2. It is too early to tell, but we're definitely planning to share a follow-up. As mentioned in the post, we had a simple thumbs up/down for ratings and were seeing a greater than 90% positive average, so we were definitely experiencing that bias. Just today, albeit with far too small a sample size, we're starting to see a more diverse mix of averages. We still expect to have that positive skew, but because we're now operating with a 100 point scale in the UI, we hope the granularity will help users distinguish subtler differences in rating.

physcab · on Oct 30, 2009

It'd also be interesting to know if the number of ratings decreases or increases. I wonder if your users will find the added granularity a nuisance or an incentive.

mpotter · on Oct 30, 2009

It will be interesting. It's important to note the nature of our community and whom we expect to contribute. Generally, we're geared toward a more passionate user who we find to be more than willing to contribute at this level of granularity. So we've made the choice to cater toward their needs while still trying to remain accessible.

But, this is a good point, and I think an important one to consider when evaluating the mechanic that works best for your community/site.

cninja · on Oct 30, 2009

Very clever. Have you considered making the slider non-linear (the distance on the slider between Yuck and Meh is smaller than between Good and Awesome)? If most people are going to rate their tea somewhere between Good and Awesome, it allows more of the slider to be used.

Have you received enough ratings with this new interface to know if my assumption of ratings being clustered is accurate?
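
A rough sketch of the non-linear slider idea, in Python for illustration only. The anchor positions and the name `slider_to_rating` are made up for this example and are not anything Steepster uses. The mapping is piecewise linear, with the lower half of the rating scale squeezed into the first 30% of the track so the upper half gets the rest.

```python
from bisect import bisect_right

# (slider position as a fraction of the track, rating) pairs.
# Hypothetical anchors: ratings 0-50 occupy 30% of the track, 50-100 occupy 70%.
ANCHORS = [(0.0, 0.0), (0.10, 25.0), (0.30, 50.0), (0.60, 75.0), (1.0, 100.0)]


def slider_to_rating(pos):
    """Map a slider position in [0, 1] to a rating in [0, 100]."""
    pos = min(max(pos, 0.0), 1.0)  # clamp out-of-range input
    xs = [x for x, _ in ANCHORS]
    # Index of the anchor that closes the segment containing pos.
    i = min(bisect_right(xs, pos), len(ANCHORS) - 1)
    (x0, y0), (x1, y1) = ANCHORS[i - 1], ANCHORS[i]
    # Linear interpolation inside the segment.
    return y0 + (pos - x0) * (y1 - y0) / (x1 - x0)
```

With these anchors the midpoint of the track lands in the upper half of the scale, which is the trade-off being discussed: more room where ratings cluster, at the cost of a position that no longer reads directly as a score.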

mpotter · on Oct 30, 2009

We haven't considered making it non-linear. It's an interesting suggestion, though I'd hesitate to go that way only because the user then lacks a clear 1:1 model of how the slider directly affects their rating (without explanation). We haven't received enough ratings yet to prove your assumption. When we do and if it does hold true, our assumption now is that we have _enough_ of a scale to expose meaningful differences.

You've given us good food for thought, thanks. My general feeling now is that it's important to leave the negative portion of the slider intact (however little it's used) to maintain a solid mental model. Might be something to test down the road though.

samstokes · on Nov 15, 2009

I'm late to comment on this post, but I've read several posts on rating systems recently and not many seemed to mention this.

Have you tested the theory that response bias is skewing the ratings upward?

In other words, given that not every tea drinker is going to bother logging on to your site, finding each tea they've tasted and leaving a rating, it seems plausible to me that people who've had a good tea-drinking experience are more likely to make that effort than those who've had an unremarkable tea.

The figures that you quote seem to support this theory. With a yes/no rating system you had 90% yes votes, or an average vote (assuming one yes vote cancels out one no) of 0.8, skewed 80% up from the unweighted mean you'd expect. With a 5-star rating you expect an average rating of 4.3, which is 65% up from the unweighted mean ((4.3 - 3.0) / (5.0 - 3.0)). So adding granularity is decreasing the skew, but not very rapidly. I'd be very interested to know what your averages are like now, with your new system.
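
The arithmetic above can be checked with a few lines of Python. This is only a sketch: `normalized_skew` is a name invented here, and it assumes yes/no votes are scored as +1/-1 and that a 5-star scale runs from 1 to 5 with a midpoint of 3.

```python
def normalized_skew(mean, low, high):
    """How far the mean sits above the scale midpoint, as a fraction of the
    distance from the midpoint to the top of the scale."""
    midpoint = (low + high) / 2.0
    return (mean - midpoint) / (high - midpoint)


# Yes/no voting: 90% yes, scored +1 for yes and -1 for no (assumption).
yes_share = 0.90
yes_no_mean = yes_share * 1 + (1 - yes_share) * -1
yes_no_skew = normalized_skew(yes_no_mean, -1, 1)

# 5-star voting with the oft-quoted 4.3 average, scale assumed to be 1..5.
five_star_skew = normalized_skew(4.3, 1, 5)

print(round(yes_no_skew, 2), round(five_star_skew, 2))
```

The two values come out at roughly 0.8 and 0.65, the same 80% and 65% figures quoted in the paragraph.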

Granted, some of the skew is due to what teas are available to be rated: presumably people are less likely to enter awful teas into your system in the first place. I realise the point of your article is about redesigning the rating to combat that skew, rather than necessarily about finding the one true rating. But if you're concerned about bias, it seems worth at least investigating all possible sources of bias.

There's an obvious asymmetry in this hypothesis - why would the response bias be in favour of strong positive experiences, rather than just strong experiences in general? Even if drinkers of mediocre tea can't be bothered to vote, why wouldn't people who've had terrible tea be just as likely to vote as those who've had great tea? I can think of two explanations. One is an innate sense that a good experience is worth more effort than a bad one - so after a bad pot of tea, there's less of an impulse to run off and tell everyone how bad it was, more to just write it off and go do something else. Another is that people motivated to tell people about bad experiences might want to do so in words, to explain what was so bad about it.

I realise this is all hand-waving - I don't have hard evidence to back up this theory. I do think it would be an interesting theory to test, for those with sufficient levels of usage of a rating system to do so.

Some anecdotal evidence comes from my use of other UGC review sites, particularly restaurant reviews. These sites usually have a disproportionately large number of negative reviews. Reading reviews for nearly any restaurant leads you to conclude that restaurant has terrible service - apparently because those who've had bad experiences are always keen to vent about them. Yet nearly all restaurants, even those with tens of vocal unhappy customers, have above-average ratings.

alabut · on Oct 29, 2009

I love that the sliding meter shows tick marks for your previous rankings of other teas - the UI reflects that your judgement of a particular tea is relative to your other experiences.

pbhjpbhj · on Oct 30, 2009

I suspect under closer analysis one's scoring breaks down to be inconsistent - "in retrospect I like tea X better than Y but not as much as Z, but I rated Z lower than Y because I didn't like it as much as P which had a higher rating", if you follow.

snprbob86 · on Oct 30, 2009

It seems that, pairwise, it is pretty easy to decide. Maybe one could go further than this and eliminate the absolute scale altogether (at least at rating time).

I'm imagining a UI which asks you to pick a favorite among the item you are viewing and one similar item. You could stop there, or ask repeatedly with new comparison items until the viewed item's position on the absolute scale is unambiguous. The user could provide some rating data with just a single binary decision, but some ajax-y fade out/in of another pair could enable further ratings if they desired.
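
One possible way to drive the repeated questions, sketched in Python. It assumes the user already has a list of teas ranked worst to best and answers consistently; `place_by_comparisons` and the `prefers` callback are hypothetical names standing in for the UI. Each answer halves the remaining range, so the position is pinned down in about log2(n) questions.

```python
def place_by_comparisons(ranked, new_item, prefers):
    """Find where new_item belongs in `ranked` (ordered worst to best).

    prefers(a, b) should return True when the user likes a more than b.
    Returns (insert_index, number_of_questions_asked).
    """
    lo, hi = 0, len(ranked)
    questions = 0
    while lo < hi:
        mid = (lo + hi) // 2
        questions += 1
        if prefers(new_item, ranked[mid]):
            lo = mid + 1  # new item belongs above the comparison item
        else:
            hi = mid      # new item belongs at or below it
    return lo, questions


# Simulated user whose hidden preferences are these made-up scores.
hidden = {"A": 10, "B": 20, "C": 40, "D": 50, "E": 60, "F": 80, "G": 90, "new": 45}
ranked = ["A", "B", "C", "D", "E", "F", "G"]
index, asked = place_by_comparisons(ranked, "new", lambda a, b: hidden[a] > hidden[b])
```

Stopping after the first question still yields one usable comparison, which matches the "single binary decision" case.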

m_eiman · on Oct 30, 2009

If you do that, the Elo rating system is a good place to start algorithm-wise.

http://en.wikipedia.org/wiki/Elo_rating_system
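
For reference, a minimal Python sketch of the standard Elo update applied to "which of these two do you prefer" answers. The K-factor of 32 and the starting rating of 1500 are common conventions chosen here as assumptions, and the function names are illustrative.

```python
def expected_score(rating_a, rating_b):
    """Elo's predicted probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner, loser, k=32.0):
    """Return the new (winner, loser) ratings after one comparison."""
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


# Two teas that both start at 1500; the first is preferred.
new_winner, new_loser = update_elo(1500.0, 1500.0)
```

An upset (a low-rated tea preferred over a high-rated one) moves both ratings further than an expected result does, so a handful of comparisons against well-established teas can place a new tea fairly quickly.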

jurjenh · on Oct 30, 2009

Taking this idea further, one could add multiple orthogonal axes (eg sweetness, bitterness, after-taste etc). Then you could rate each tea against others on each axis - either in a star-slider, scatter plot or on several individual sliders. This would allow you to rate teas against each other based on several aspects, and possibly allow recommendations based on how other people have rated teas - eg 'I want a tea that is not-too-sweet, a little bitter with a lingering aftertaste' Then again, it does add more features / visual clutter and possibly complicates things for people...

dylanz · on Oct 29, 2009

Exactly. I think the tick marks are the killer feature here. Great idea!

marcus · on Oct 30, 2009

An interesting idea might be to let the user modify a tick mark in retrospect when scoring a new tea.

lonestar · on Oct 30, 2009

The problem with this system is in the sorting. The list of "Highest Rated" teas is dominated by results where 1 person rated the tea 100.

Steepster should use a Bayesian average (http://en.wikipedia.org/wiki/Bayesian_average) so that the uncertainty of a small number of ratings is reflected in the sorting.
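
A minimal sketch of that suggestion in Python. The prior mean (for example the site-wide average) and the prior weight are tuning choices; the values 70 and 5 below are made up for illustration, as is the function name.

```python
def bayesian_average(ratings, prior_mean, prior_weight):
    """Average the ratings together with `prior_weight` imaginary votes
    that all sit at `prior_mean`."""
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + len(ratings))


# One enthusiastic rating versus twenty solid ones (hypothetical data).
single_rave = bayesian_average([100], prior_mean=70, prior_weight=5)
well_rated = bayesian_average([90] * 20, prior_mean=70, prior_weight=5)
```

With these numbers the tea with a single 100 scores 75 while the tea with twenty ratings of 90 scores 86, so the better-supported tea sorts first.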

mpotter · on Oct 30, 2009

Yeah, sorting is an issue we're still looking at (and is still very much in transition considering the new rating system). Appreciate the suggestion! We'll add it to our list of potential solutions.

selven · on Oct 30, 2009

Start everything off with a single 50-point score. That way one person ranking it 100 will bump it up to 75, the next to 83, and so on. Such a system would cause teas that have more people upvoting them to rank higher than those that just happen to have one or two good opinions.
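
The numbers in that example work out as follows. This is a small Python sketch with an illustrative function name, treating the starting score as one extra vote of 50.

```python
def seeded_average(ratings, seed=50.0):
    """Average the real ratings together with one built-in vote of `seed`."""
    return (seed + sum(ratings)) / (1 + len(ratings))


after_one = seeded_average([100])        # (50 + 100) / 2
after_two = seeded_average([100, 100])   # (50 + 200) / 3
```

One rating of 100 gives 75 and a second gives about 83.3, so a tea needs many high ratings before it can approach 100.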

Eliezer · on Oct 30, 2009

That's a special case of a particular sort of Bayesian average.

ErrantX · on Oct 29, 2009

The genius is adding some previous scores. I always struggle to rate stuff fairly without anything obvious to compare it with.

mkinsella · on Oct 29, 2009

This is THE best implementation of a ratings system I've seen. Very good job.

TrevorJ · on Oct 29, 2009

I feel like this really combines the best of the granular 4 star systems with the specificity of a percentage rating. Really good stuff, I'd be interested to hear a follow up with user feedback on this approach and how it holds up long term.

mhartl · on Oct 31, 2009

This is cool, but I think virtually all rating systems suffer from the same basic problem: there's no way to turn it up to 11.

Take movies, for example. They are usually rated on a four-star scale. And yet, a three-star movie is a clear success. Few movies can realistically aspire to more than three stars. Even many four-star movies are really just trying desperately to avoid two-star land. Francis Ford Coppola was sure he was going to be fired any day from The Godfather. The production crew and actors on Star Wars thought it was practically a joke. Please, God, let Star Wars not be a B movie, they must have been thinking.

When you say ★★★ out of ★★★★, you make it look like it wasn't good enough: 75%. Movies really should be rated on a three-star scale: ★★★ out of ★★★; ★★★ = A = 100%. Anything else is gravy.

So, rate tea on a three-star scale. Three stars means "excellent tea, no clear way to make it better". ★★★½ means "Whoa, there is something better than ★★★!" ★★★★ means "This is The Godfather of tea! This tea makes me an offer I can't refuse."

robryan · on Oct 30, 2009

Something else you could think about in a rating system like this would be, instead of using generic faces, to associate each with a common tea that most tea lovers have tried.

The notches kind of do this, but there's always the risk of someone rating their first tea 80, then deciding subsequent teas are better so they need to be rated higher, when the first one should have been more around 60.

fuzzythinker · on Oct 29, 2009

I think the main reason sliders aren't used is that users find them too troublesome, hence up/down and 5 stars are mainly used. I remember from my pys class that a 7 point rating system is best. But the 5 stars' simplicity and ubiquity probably trump the benefits gained by a 7 point system. I think the best compromise is a 5 star UI implemented as 6 points by allowing 0 point assignments.

fuzzythinker · on Oct 30, 2009

For those who marked me down, would you please comment on the reason? I'm getting tired of spending my time commenting and getting disapproval without reason. I don't think down votes should be for disagreements; they should be for spammy or childish comments, or comments that do not add anything to the topic. My main point is that sliders aren't used much because they are too troublesome for a typical user. If you disagree with that, please add your opinion. I'm not trying to take anything away from the author. In fact, I think it's an ingenious idea. But I usually dislike repeating "wow, cool" comments since so many others have done so already. It's part of my DRYness kicking in.

nkurz · on Nov 1, 2009

I voted you down because you asserted that a system was 'best' because you remember from a 'pys' class. This has to be one of the weakest 'arguments from authority' I have seen. You then asserted that a 5-star allowing zero is even better. Then why didn't your (psychology?) professor say so?

I didn't vote down because I disagree, but because you haven't made much of a case. I also downvote the 'wow, cool' comments as unhelpful, and upvote the comments that seem like they will lead to useful discussion. Without intending offense, I didn't think your comment was pitched at the right level for this audience.

Personally, I think you are on the right track, although I think 5 stars allowing halves is even better. Interestingly, Netflix (experts in this field) started out with allowing half-stars and then got rid of them, making me worry that they know better than I.

fuzzythinker · on Nov 2, 2009

You are taking every single word of my comment too seriously. If every assertion needs to have strong backing in order to be posted, the hn comments will probably be only < 10% of what they are now (again, just a guesstimate, don't take this one too seriously either). I forget whether my professor had research backing for a 7 point sys being "best", maybe he did, maybe he didn't. But I don't think I need to remember if there was indeed research backing for it to add to the discussion. Again, I don't think you should down vote every comment just because it didn't state the research backing, but I'm not the one to tell you that, maybe others can comment on this.

As for the 7 point system being "best" (for general purpose rating), I remember it's because 5 star does not give enough granularity, while 10 points is too much. Maybe that's why Netflix took that out. Now why not 6, 8, or 9? I forgot, again, maybe there was research being done.

As for my "idea" of allowing a 0 on a 5 point system; it makes it a 6 point system while retaining a 5 point UI that everyone is accustomed to. What is wrong with that? Again, just asking for discussion, not trying to say it IS the best.

Now back to the topic of down voting because I don't have enough backing. If I need backing in order to comment, I wouldn't even be able to comment on any of this. Is this what you think is the way hn should work? Also, in order to not make you think I have the research to back up my thoughts, I need to say that in almost every sentence. I also don't think that should be the way hn works.

nkurz · on Nov 3, 2009

As for my "idea" of allowing a 0 on a 5 point system; it makes it a 6 point system while retaining a 5 point UI that everyone is accustomed to. What is wrong with that?

Nothing is wrong with it necessarily --- it's all a matter of implementation and audience. I think the first thing you are going to run into is a need to visually differentiate a non-vote from a zero. I'm also not sure what problem it's trying to solve.

What I would find more useful (from a 'build a better recommendations engine' perspective) is a 5+: a short list of favorites that can stand in for someone's favorites. Personally, I'd also like a better way to differentiate the gradations between standard, good, and great. Whether I hate something or 'hate-hate' it isn't going to make much of a difference. Do you think your audience is going to be persuaded to reduce their average rating by a point, or are you still going to find the oft-quoted 4.3 average? I'm doubtful, but this doesn't make it a bad idea to try.

As to the downvote, I stand by it. My goal is to rearrange the page so that the comments that are most useful to me are at the top. If others find your comment useful, they will see the injustice and bring it back up to the top.

As to the need for 'strong backing', I think we just have different worldviews. With due respect to my friends who are psychology professors, "a [nameless] psychology professor told me" is barely a step up from "I'm not a doctor but I play one on TV". We obviously respect different authorities in our lives.

fuzzythinker · on Nov 3, 2009

Re: "Need for 'strong backing'": I think this goes back to how seriously you take the discussions here. For me, forum/msg threads are just casual discussions (one level lower than blog posts); it would be nice if the person stated where they get their backing for the idea or assertion, but it's neither "necessary" nor "too helpful". If every idea/assertion needed backing, there would be almost no discussions at all. Innovation often comes from ideas/assertions that have no backing.

It is not "necessary" because, to me, if I believe in it and it's important to me, I will test it out. It doesn't matter if the idea came from Steve Jobs or Joe Doe. It's not "too helpful" because often the research being done on it is flawed, outdated, or just not really trustworthy. For example, I remember reading that some group had done research on the max width of cell phones that people like before feeling discomfort. The Motorola design team was the first to ignore that when they designed the RAZR.

robin_reala · on Oct 30, 2009

Got anything to back up the 'users find sliders troublesome' assertion? I’d agree with you if you said developers, but I haven’t heard about any user problems.

fuzzythinker · on Nov 1, 2009

I don't have any evidence of users disliking sliders for ratings, nor do I claim to. I only said I "think/believe" (as in imho). But it is a fact that sliders take more time and energy - both physical (you have to hold your mouse button) and mental (you need to think more about it) - than selection buttons. Thus unless the use case calls for it, users are less likely to prefer a slider to buttons. Some use cases may be user settings that have more than a few choices or selecting from a number scale.

Now, the new steepster slider definitely calls for it since he uses a 100 point rating scale. However, you have to look at the alternative, which for a product rating is normally a 5 point clicking UI w/o a slider. Thus, you have to look at both solutions and see if this new rating system that requires the slider is "better". If the number of user ratings does not decline, it would show that users are not bothered by the slider, which means the steepster slider may be "better" for your product rating. But if a typical user finds the slider too troublesome and doesn't bother leaving a rating, would this new system be better than a simple 5 star system with a better distribution of samples? Steepster may have logs that show that there isn't a decline in user ratings, but that is only 1 company (sample) with a unique set of users. Before another company follows, I believe it may be wise to ponder or do some testing/research on the use of sliders vs. clicks for ratings.

stuartjmoore · on Oct 29, 2009

In this context, it looks like the user has incentive to rate (to get better suggestions), but on rating in general:

Why even ask people how they feel?

Depending on the content, you can analyze how they use it to get a much more accurate rating. For video: Did they watch the entire thing? Did they leave after a few seconds? Did they share it somehow?

That (slightly off-topic) being said, this looks great.

thinksketch · on Oct 30, 2009

This is very cool, thank you. I posted earlier today about the need for a better rating system than the five star system. I'm really glad to see you working on a great solution. Thanks!

zeeone · on Oct 30, 2009

Meh...