Ask PG: Would it be possible to do an A/B test on HN?
46 points by jacquesm on Aug 4, 2009 | 42 comments
One week (or even a day) with comment and article points hidden from view, to see what that does for the quality of the site?

I'm very curious about this because I suspect that the visibility of the points is what is starting to create a negative undercurrent.

Or maybe leave them on articles but drop them on comments...




Actually, one of the biggest Russian IT social news sites (habrahabr) has had this for quite a while: you can't see an article's rating until you vote (up, down, or "zero"; if you vote "zero" you can't upvote or downvote afterwards, you just get to see the rating). It has worked pretty well ever since it was implemented.

As a counterpoint: bash.org.ru, which is more or less a clone of bash.org, tested this idea, but then went back to showing the rating beforehand.


Yep, but bash.org.ru did not give you the option to just see the score without voting. Given how high the trash-to-not-trash ratio is there, I think it's only reasonable that many people wanted to see a quote's score before reading it (and upvoted the quote just to see it).


This sounds really good. I'd like to see what kind of change this makes in our voting habits.


Sounds like an interesting experiment, but how do you measure quality?


Subjectively, I'd wager.


You can't isolate the quality variable. The quality of the site ebbs and flows; there are good days and bad days. The quality on the days when points are off is more likely to be random than connected to whether or not the points are displayed.


What if you could view the downvotes made by any particular user from their profile page? A little passive accountability...

Personally I think the comment voting system is working all right. The overall signal-to-noise ratio is more of an issue for me; it would be nice to see some kind of karma-based throttle for posting new articles.


that's a good one!

There is a downside to that though. Plenty of people use the 'up' as "I agree" and 'down' as "I disagree". I know that's not the way they are intended to be used, but it opens the door to 'I don't like you' and 'I like you' votes as well.

Without having them split out like that, you would only be able to interpret those votes through your own perception of what was voted on.

This will eventually lead to meta-moderation (not necessarily a bad thing), but that's a lot more work than a simple switch. That's one of the reasons I asked this: it's dead simple, takes two minutes to implement, and we'd have results in a couple of days.


> Plenty of people use the 'up' as "I agree" and 'down' as "I disagree". I know that's not the way they are intended to be used

Darn, I wish I had bookmarked the comment in which pg said it was okay to vote to indicate agreement or disagreement. SearchYC is my friend:

http://news.ycombinator.com/item?id=117171

(And gojomo was my friend too, as I first found his comment linking to pg's comment when I did my SearchYC search.)


onclick="return confirm('Are you up-modding because you feel the comment added value to the discussion?')"


Interesting idea, but:

You have to worry about measurability. How would you measure the quality of the site? And how would you do it objectively?


Total number of comments and clicks on the articles would be the metric. Time spent on site would be a great measure too.


I don't think either of those are terribly good metrics.

Using two of my own submissions...

Compare Google's blog post on robots crawling news articles: http://news.ycombinator.com/item?id=708417 to an inflammatory post about Techcrunch: http://news.ycombinator.com/item?id=658308

One of those is 'quality', and the other has comments (and 'dead' status).


I think both of those KPIs are quite a good idea, but they can also lead to quite wrong measurements: what if this just leads to people clicking on every comment link to find out if there is a discussion or a new post, etc.?

It doesn't measure the quality of the discussion or the submissions.


How many comments there are could be visible without any detrimental effect.

I simply suspect that the 'points' system has a shadow side, and that as the site grows the shadow side starts to overpower the positive portion. By temporarily switching points off, then asking the community how they felt about being 'point blind' for a short period and whether they thought the quality improved, you get a 'metric' that is much easier to measure than some of the more technical tricks you could pull:

  Customer Satisfaction.
A simple poll after the experiment would suffice. Or you could make it switchable on a per-user basis if the result is a toss-up or too close to call a clear preference.

If on average more people feel better without the points visible than with them after a short trial period then it's something that you could consider doing permanently.
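
To make that concrete, here is a minimal sketch of how such a poll could be read, using purely hypothetical response counts: treat each response as a coin flip and check whether the preference for hidden points is distinguishable from 50/50.

  from math import comb

  prefer_hidden, prefer_visible = 312, 258   # hypothetical poll counts
  n = prefer_hidden + prefer_visible

  # One-sided binomial test: the chance of seeing at least this many
  # "prefer hidden" responses if opinion were really a 50/50 coin flip.
  p_value = sum(comb(n, k) for k in range(prefer_hidden, n + 1)) / 2 ** n

  print(f"share preferring hidden points: {prefer_hidden / n:.1%}")
  print(f"one-sided p-value: {p_value:.4f}")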

Another option would be to keep author and points hidden until after you've voted for a comment, but that may have other side effects.

I think the content should stand on its own; regardless of the voting history and the author, it is what you think about it that counts.

Right now the choice to vote or not to vote is often made based on the current number of votes, which leads to plenty of feedback loops. I've seen 'flip-flops' (bistable and tristable: 0/1 and -1/0/1), positive feedback loops, and negative feedback loops.

By breaking the loop we could end up with a more balanced view.

Imagine the effect of a running tally during an election: it would completely affect the outcome, and not necessarily in a positive way.
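
For what it's worth, here is a toy simulation of that running-tally effect. The voter model and herding parameter are made up; this illustrates the feedback loop, it is not a claim about how HN voters actually behave.

  import random
  from statistics import mean, pstdev

  def final_score(voters, base_p, herding, visible):
      """Each voter upvotes with probability base_p; if the running
      tally is visible, its current sign nudges them (herding)."""
      score = 0
      for _ in range(voters):
          p = base_p
          if visible and score != 0:
              p += herding if score > 0 else -herding
          score += 1 if random.random() < min(max(p, 0.0), 1.0) else -1
      return score

  random.seed(42)
  blind = [final_score(100, 0.5, 0.15, False) for _ in range(1000)]
  shown = [final_score(100, 0.5, 0.15, True) for _ in range(1000)]

  # With the tally shown, scores herd toward the bistable extremes.
  print(f"blind: mean {mean(blind):+.1f}, stdev {pstdev(blind):.1f}")
  print(f"shown: mean {mean(shown):+.1f}, stdev {pstdev(shown):.1f}")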

Again, it's just a gut feeling, but I think there is some truth to it, and it's a very easy experiment to do; worst case we learn that it did not work.


Thank you for further elaborating your idea.

I think the experiment would be a great idea, and doing it the way you propose, with empirical data to back up the machine-generated statistics, could provide very valuable insights.

Also, one could start experimenting with variations of the idea, e.g. letting the number of votes appear only after voting, or not displaying a number at all.


All this ultimately depends on what the goal of HN is. I believe the goal is to foster discussion, so the number of comments would be a great metric.

Regarding clicks, I think even if people just click every comment link to find out what the discussion is about, the site/design has achieved its goal. It may also reveal whether the points cloud new interesting articles.

Perhaps this experiment could also include hiding the domain, to see whether techcrunch and other popular domains get extra juice thanks to their popularity.


One thing I've learned from A/B testing large sites is that you change only one variable at a time, but the domain-hiding one sounds like an excellent candidate as well.


I agree that a given A/B test should change only one variable at a time. However, I've had excellent results from running multiple A/B tests in parallel. As long as inclusion in each is independent and random, the results of each are informative; and if you're worried about interaction effects, you can analyze for signs of a potential interaction in a post-mortem, then do a more expensive multivariate test if you find cause for concern.
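
A minimal sketch of that independence requirement, assuming assignment by hashing a user id with a per-experiment salt (the function and experiment names are illustrative, not anything HN actually runs):

  import hashlib

  def bucket(user_id, experiment):
      """Deterministically assign a user to arm 'A' or 'B' of one experiment."""
      digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
      return "A" if digest[0] % 2 == 0 else "B"

  # Because each experiment uses its own salt, the same user can land in
  # different arms of different experiments, keeping assignments independent.
  print(bucket("user42", "hide-comment-points"))
  print(bucket("user42", "hide-domain-names"))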


Yep, that's right. Though you can do multivariate analysis, sticking with a plain A/B test is best.

That said, you can (and should) measure performance on multiple benchmarks, like clicks, comments, and time spent. That gives you a correct picture of the tradeoffs.


Doesn't changing only one variable at a time tend to lead to getting stuck in a local maximum?


A surprisingly simple question that is very hard to answer well, I think.

If you only test one variable at a time, then you simplify your tests to the point where you can extract some metric to determine whether or not you've improved compared to the old situation.

Nothing stops you from then doing more A/B testing with other combinations relative to your 'new best'. This may include going back to the original setting with some other variable changed; that way you avoid the local maximum problem.

So say we have a site in position 'A', we make version 'B' and we test them against each other. If we find out according to our chosen metric that 'B' performs better we now have several choices:

We can do another A/B test starting from 'B', changing some value to see if we can improve on 'B' directly; or alternatively we can test 'B' against 'A' plus some new modification that is not 'B'.

I hope that's clear....


If you really believe that the A/B parameter space and the C/D parameter space interact with each other, then yes, you could get stuck in a local maximum. So in that case you should test all combinations simultaneously. However, it will take a lot longer to collect enough data that way. So if you think it is likely that the parameters are independent, it is better to change only one set at a time.
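
To make the tradeoff concrete, a small sketch with made-up variable names: with k on/off variables, one-at-a-time testing needs k + 1 cells, while a full factorial (which can detect interactions) needs 2^k, so each factorial cell collects data more slowly.

  from itertools import product

  variables = ["hide_comment_points", "hide_article_points", "hide_domains"]

  # One-at-a-time: a baseline plus one cell per variable (k + 1 = 4 cells).
  one_at_a_time = [tuple(v == flipped for v in variables)
                   for flipped in [None] + variables]

  # Full factorial: every combination (2**k = 8 cells), so the same traffic
  # is split across twice as many cells in this example.
  full_factorial = list(product([False, True], repeat=len(variables)))

  print(len(one_at_a_time), "cells one-at-a-time")
  print(len(full_factorial), "cells full-factorial")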


Any 'solution' to the 'quality problem' that fails to acknowledge that some people are obviously voting up the (politics|techcrunch|zed|etc.) articles you perceive as low quality is going to fail.

You need to account for the fact that some people love the hyperbole you always find in (politics|techcrunch|zed|etc.) articles, and some people don't. This suggests implementing 'cliques'. If you upvote a submission, you grow one vote closer to the clique of people who also upvoted that submission. People whose vote history matches yours very well will very easily influence your front page, and articles of a kind you rarely upvote will be less likely to cloud your front page.

(More feasibly, you could just add some 'coolfinders' manually, whose upvoted articles automatically jump to your page.)
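
Here is a minimal sketch of how such a clique score might work, using Jaccard overlap between upvote histories to decide how much another user's upvotes should influence your front page. The data and weighting are purely illustrative.

  def jaccard(a, b):
      """Overlap between two users' sets of upvoted item ids."""
      return len(a & b) / len(a | b) if a | b else 0.0

  my_upvotes = {101, 102, 205, 307}
  others = {
      "alice": {101, 102, 205, 999},   # mostly matches my taste
      "bob":   {400, 401, 402},        # no overlap at all
  }

  # An item upvoted by alice would get a weight of 0.6 on my front page;
  # one upvoted only by bob would get essentially none.
  for user, upvotes in others.items():
      print(user, round(jaccard(my_upvotes, upvotes), 2))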


I've sent you some mail about this. Very interesting indeed!


I haven't had time to look it over in full yet, but I will respond when I can.


thanks!


How about having different categories?


Categories just invite meta posts of the 'not <category>' type. Articles that fall into multiple categories are another problem, and tags don't quite solve that one either. They also inorganically split the community and add an extra step to the voting or submission process.


Is it time to start treating posts about points like political posts?


Apologies for breaking the first rule of HN (the first rule of HN: do not talk about HN on HN), but there seems to be no other good spot to have a discussion about an idea like this.

What would you suggest I do instead?


I believe the feature request link at the bottom goes to an appropriate place to have such conversations.


That thread looks quite dead to me.


You could fork the source and run your own A/B testing.


Except that wouldn't be sampling the HN community, which would make the test unable to answer the original question.


Code in hand goes further than request in hand.


> I believe the feature request link at the bottom goes to an appropriate place to have such conversations.

Indeed it does. I count on the curators of the site to read that thread.


This is the first meta-discussion that hasn't started off on a sour note for me, because the OP is not just saying "Heyyyyy I hate it when people downvote boo hoo etc." He's offered his thoughts on the causes of downmodding tension and has put forward his idea of how to approach the problem. Sure, it's still meta-discussion, but at least it's productive.


I didn't criticise meta-discussion in general; I only commented on points discussions.

Hiding points is not a new proposal. It came up yesterday (under http://news.ycombinator.com/item?id=739969 ) and has probably come up before.

My success criterion would be whether hiding points results in fewer points discussions than exposing them.
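
As a rough sketch of how that criterion could be measured (the pattern and the sample comments below are made up), count the share of comments that talk about points in each period:

  import re

  # Crude proxy for "a points discussion": a comment that mentions
  # karma or voting.
  POINTS_TALK = re.compile(r"karma|upvot|downvot|downmod|\bpoints?\b", re.I)

  def points_discussion_rate(comments):
      return sum(bool(POINTS_TALK.search(c)) for c in comments) / len(comments)

  before = ["Why was I downvoted?", "Great article on Lisp macros."]
  after = ["Great article on Lisp macros.", "Here is a counterexample..."]
  print(f"before: {points_discussion_rate(before):.0%}, "
        f"after: {points_discussion_rate(after):.0%}")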


> My success criterion would be whether hiding points results in fewer points discussions than exposing them.

That's an excellent observation.





