Here's what everyone is missing. Don't use bandits to A/B test UI elements, use ...

whatever_dude · on April 6, 2016

Love that example, even if frequently retold.

http://blog.codinghorror.com/quantity-always-trumps-quality/

kens · on April 6, 2016

I tried to find the original source of the quality-vs-quantity pottery class story a while back. I think it originates in the book "Art and Fear" but in that book it reads like a parable rather than a factual event. I'm highly suspicious of whether this event actually happened. Anyone have solid evidence?

areyousure · on April 6, 2016

Having researched this myself moderately extensively, I do not believe that the event actually happened.

Moreover, the book "Thinking fast and slow" does not contain the word "pottery" (nor "ceramics").

SonicSoul · on April 6, 2016

yeah this is referenced all over the place (Derek Sivers, Jeft Atwood, Kevin Kelly) but always just this one paragraph and comes from this book http://kk.org/cooltools/art-fear/ don't see any references there either.

orasis · on April 6, 2016

It was featured in "Thinking Fast and Slow", and those authors seem quite academically rigorous.

tripzilch · on April 7, 2016

Yes they're very good at "seeming" the part. Until you check the cites/references.

Please excuse me for lumping Kahneman, Tversky and Taleb into the same bin (stating beforehand, feel free to dismiss or form your own opinions because of this). My justification is they cite eachother all the time, write about the same topics and are quoted as doing their research by "bouncing ideas off each other" (only to later dig up surveys or concoct experiments to confirm these--psych & social sciences should take some cues from psi research and do pre-registration).

This is now the second time I notice that one of their anecdotes posed as "research" doesn't quite check out to be as juicy and clear-cut (or even describing the same thing) as the citation given.

The other one was the first random cite I checked in Taleb's Black Swan. A statistics puzzle about big and a smaller hospital and the number of baby boys born in them on a specific day. The claim being research (from a metastudy by Kahneman & Tversky) showing a large pct of professional statisticians getting the answer to this puzzle wrong. Which is quite hard to believe because the puzzle isn't very tricky at all. Checking the cited publication (and the actual survey it meta'd about), turns out it was a much harder statistical question, and it wasn't professional statisticians but psychologists at a conference getting the answer wrong.

Good thing I already finished the book before checking the cite. I was so put off by this finding, I didn't bother to check anything else (most papers I read are on computational science, where this kind of fudging of details is neither worth it, nor easy to get away with).

Which is too bad because the topics they write about are very interesting, and worthwhile/important areas of research. I still believe it's not unlikely that a lot of their ideas do in fact hold kernels of truth, but I'm fairly sure that also a number of them really do not. And their field of research would be stronger for it to figure out which is which. Unfortunately this is not always the case for the juicy anecdotes that sell pop-sci / pop-psych books.

ves · on April 7, 2016

You should read silent risk (currently freely available as a draft on talebs website). It's mathematical, with solid proofs and derivations, so it doesn't really suffer from the same problems as The Black Swan. I had basically the same issues with the Black Swan that you did.

tripzilch · on April 8, 2016

Thanks for the tip, I'll check it out! On that note, I also read Fooled by Randomness, an earlier work by Nicholas Taleb. It covers (somewhat) similar topics as Black Swan does, but from a bit more technical perspective. I found it a lot more pleasant and educational read. IIRC, it talks less about the financial world and rare catastrophal risks ("Black Swans"), and more about how humans are wired to reason intuitively quite well about certain types of probability/estimates/combinatorics, and particularly on how we really suck at certain others.

Fooled by Randomness also doesn't needlessly bash people for wearing ties or being French, like Black Swan does :) ... only later did I learn that Taleb was a personal friend of Benoit Mandelbrot (who was French), which put it in a bit more context (as a friendly poke, I'm assuming), but at the time I found it a bit off-putting and weird, as it really had nothing to do with the subject matter at all.

orasis · on April 7, 2016

This is very interesting, thank you. Modern non-fiction is so frustrating.

SonicSoul · on April 6, 2016

that's an appeal to authority, sounds like @kens actually tried to verify this story. I'd be curious to hear more examples as well. I believe I heard something similar on You're Not So Smart podcast but that might have referenced the same example

soperj · on April 6, 2016

I do think the example is bullshit, simply because you could throw ONE very thick pot and have it weight more than 50 pounds. Of course it's a horrible pot and you wouldn't get any better since you only made one piece, but it would get you an A since it would use more clay. Total # of pieces thrown would make more sense than pounds of clay used.

coverband · on April 7, 2016

(Not that I think the anecdote is real, but) that's the idea -- the parable was about how the pot with the best quality was found in the 'quantity' team even though this wasn't their objective.

diab0lic · on April 7, 2016

Forgive me if I've missed something but the original comment says the group that threw the most clay scored higher on creativity, not scored higher on pounds of clay thrown.

soperj · on April 7, 2016

Guess you missed something. "The second group was graded on only the total number of pounds of clay they threw."

mrspeaker · on April 6, 2016

I don't understand what you mean by "use them to optimize your content" - how are you doing that with your app? Are you serving different messages to different groups of people? How are you grouping/testing/rating them?

orasis · on April 6, 2016

My bandit system generates an ordered list of the content for each individual user. I then will track if the user came back tomorrow or churned. Yes, they may churn due to many other factors, but the signal of the content itself is strong enough.

In the past I have used the share rate to optimize, but I've realized that retention is more important.

andreasklinger · on April 6, 2016

i dont know the app but i assume he has a core action (-> which he iterates/churns out) and evaluates if people act (-> used as rating)

eg (guesses here)

new notification types "think for 9 seconds about a loved one" "what was your favorite vacation, go there"

if people act -> reward and reuse if not -> reuse less (only by random chance)

alternative to notification could be "completed medidation screen"

wlesieutre · on April 6, 2016

> The first group would have the entirety of their grade based on the creativity of a single piece they submit. The second group was graded on only the total number of pounds of clay they threw.

I feel like that works partly because an important part of practice the feedback loop between continually practicing and having a sense of whether you did well or not.

Your strategy of not evaluating your own work sounds a bit like mushing clay into shapes with a blindfold on and then tossing it in kiln before you even check whether or not it's shaped like a pot. The users can sort through them later!

If the end goal is just ending up with a volume of work that's been culled down to the better ones, I guess you still get that. But it's inherently different from the Thinking Fast and Slow example where they're in a class and the goal is to learn and get better, rather than see who's made the nicest pot by the end of the semester.

orasis · on April 6, 2016

I get the same effect using bandits. I practice my craft by spewing out volume rather than focusing on quality. The penalty I pay for the bad content is negligible because the bayesian bandits cull them very quickly. I am learning and getting better, but it is because I am not paralyzed by perfectionism.

hinkley · on April 6, 2016

You might want to be careful with your conclusions.

We don't know from this study whether the second group was more or less likely to get trapped in the Expert Beginner phase of development.

You definitely don't get anywhere without practice, but you are likely to get nowhere fast without theory.

doug1001 · on April 7, 2016

insightful analogy.

so then "Expert Beginner phase" = local minima, which might be reached quickly compared to a more systematic search (practice informed/infused with theory) and once in that local state, the "expert beginner" might be unable to escape because of over-reliance on small, poorly guided, step size; the real prize anyway is the global minima.

spyder · on April 7, 2016

solid 5 stars, 100+ reviews because I use bandits to optimize my content.

It's great app, but were the ratings lower in the beginning before the optimization or how do you know the optimization helped the ratings? I'm asking because it seems the app could have good ratings regardless of the content optimization because it's a "feel good" app. Are there counterexamples of of other meditation apps were the UI is good but it has bad reviews because of their low quality content?

fapjacks · on April 6, 2016

There were two epiphanies hidden in this comment. HN is good for one or two every once in a while, but for me personally, this struck gold. Thanks!!

tswartz · on April 6, 2016

Good point! What do you use to do in-app A/B testing?

orasis · on April 6, 2016

I implemented my own that I'll be making available for others soon at improve.ai (the site isn't up yet)

tswartz · on April 6, 2016

cool! good luck. let me know what it is when it launches. I'm in the market for a good iOS sdk. @tylerswartz

searine · on April 6, 2016

>I am free to just spew out content regardless of its quality.

Oh, that's a great goal to have...

orasis · on April 6, 2016

By focusing on volume, I get quality as a side effect. I'm very proud of the the app I've created and the feedback I get from my users:

"This app gives a text reminder to do what everyone wants to do: relax, love one's self and others, and bring peace and light into the world. I smile when the alert message appears and feel grateful to the makers of this app for creating a pleasant, quick way to meditate at the most stressful point of my day. I have recommended it to many people"

https://itunes.apple.com/us/app/7-second-meditation-daily/id...

searine · on April 7, 2016

Volume and efficiency are great, but nobody likes a shit cannon.

squeaky-clean · on April 6, 2016

I actually think that's a great idea, because then the users and algorithm decide what's good for you, and it will hide the low quality content automatically.

You have to do that sort of "curation" anyways anytime you make something. You have to continually decide whether it's worth it to keep working on something, and then decide if it's good enough to release. People tend to be pretty bad judges of this (especially for more creative tasks. There are many examples of someone's most famous song/book/painting not being their own favorite, or even being their least favorite!).

So why not relieve some of that stress, and let the users pick what they really like, instead of you guessing for them?

searine · on April 7, 2016

"I promise, there is a diamond somewhere in this bathtub of shit".

Efficiency is great, but there still needs to be some quality standards.

harperlee · on April 12, 2016

Most probably the input is not random words, and instead orasis writes pieces aiming to have quality. He just doesn't bleed each and every word and accepts a maximum time per text piece, accepting that some variation in quality is inescapable, and that it is most efficient to allocate less time per piece so he gets more pieces. Note that this "less time" is relative, and probably does not go to zero. It's just orasis has found a level of effort per piece that is effective, and he relies on the algorithm so he does not need to establish a threshold below which he will not publish - as those pieces will be swallowed by the rest of higher quality corpus.

He might have leaned toward a little bit of hyperbole in "just spew out content regardless of its quality" for the purposes of clarity and expressiveness, which most of the readers have correctly parsed, but I don't think you have the right to tag orasis' output as shit based solely on the information of the page. That's a little bit rude, in fact.