The basic model that Optimizely uses is a Z-test approximation to a binomial distribution. To run a proper experiment with that model, you should calculate the required sample size ahead of time and then run the test to completion. Each visitor should be independent, and not affected by things like the day of the week or the time of day. The end result tells you whether the two distributions are different, but says less than you might think about the size of the difference. The reported confidence also can't be 100%: the normal distribution has an infinite range, so no finite cutoff can ever capture 100% of it.
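For concreteness, here is a minimal sketch of the kind of fixed-horizon test described above: a two-proportion z-test plus the up-front sample-size calculation. These are generic textbook formulas, not Optimizely's code, and the example numbers are made up.

```python
# A minimal sketch of the fixed-horizon test described above: a
# two-proportion z-test plus the up-front sample-size calculation.
# Generic textbook formulas, not Optimizely's code; example numbers are made up.
from math import sqrt
from statistics import NormalDist

def required_sample_size(p_base, min_detectable_lift, alpha=0.05, power=0.8):
    """Visitors needed per variation to detect an absolute lift of
    `min_detectable_lift` over a baseline conversion rate `p_base`."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_var = p_base + min_detectable_lift
    p_bar = (p_base + p_var) / 2
    numerator = (z_a * sqrt(2 * p_bar * (1 - p_bar)) +
                 z_b * sqrt(p_base * (1 - p_base) + p_var * (1 - p_var))) ** 2
    return numerator / min_detectable_lift ** 2

def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Returns (z, two-sided p-value). This says whether the rates differ,
    not how large the difference is, and the confidence is never 100%."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# Plan for detecting a 10% -> 12% lift, then evaluate one (made-up) run.
print(round(required_sample_size(0.10, 0.02)))      # roughly 3,800 per variation
print(two_proportion_z_test(400, 4000, 470, 4000))  # (z, two-sided p-value)
```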
Optimizely is in a rough spot. People don't like having to think through experimental design, and they are really, really bad at reasoning about p-values. To try to fix the people part, they came out with the sequential stopping rule stuff (their "stats engine"), but they never really published much justifying it. The other alternative would be to move the experiments into a Bayesian framework, but that has a lot of its own problems. When they acquired Synference, that was one of the likely directions to take (along with offering bandits), but that didn't work out and those guys have since left.
[Disclaimer: I previously worked for Optimizely as a predictive analytics PM, but no longer work there and don't speak for the company.]
Optimizely has a bandit-based 'traffic auto-allocation' feature in production on select enterprise plans [1]. Bandits are excellent in a wide range of situations and have many advantages, but like anything they have design parameters, and there are some caveats you have to be aware of to use them effectively.
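To make the "multi-armed bandit" idea in [1] concrete, here is a generic Thompson-sampling sketch of traffic auto-allocation for a conversion goal. The help article doesn't say which algorithm Optimizely actually uses, so this is only an illustration of the general technique, not their implementation.

```python
# A generic Thompson-sampling sketch of traffic auto-allocation for
# conversion goals. This only illustrates the "multi-armed bandit" idea the
# help article points at; it is not Optimizely's actual algorithm.
import random

class BetaBandit:
    def __init__(self, n_variations):
        # Beta(1, 1) prior per variation: alpha tracks conversions,
        # beta tracks non-conversions (plus the prior pseudo-counts).
        self.alpha = [1.0] * n_variations
        self.beta = [1.0] * n_variations

    def choose_variation(self):
        # Sample a plausible conversion rate for each variation and send
        # the next visitor to whichever draw is currently best.
        draws = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(draws)), key=draws.__getitem__)

    def record(self, variation, converted):
        if converted:
            self.alpha[variation] += 1
        else:
            self.beta[variation] += 1

# Toy run: variation 1 truly converts better, so it gradually soaks up
# most of the traffic instead of a fixed 50/50 split.
true_rates = [0.10, 0.12]
bandit = BetaBandit(len(true_rates))
for _ in range(10_000):
    v = bandit.choose_variation()
    bandit.record(v, random.random() < true_rates[v])
print([a + b for a, b in zip(bandit.alpha, bandit.beta)])  # visitors per variation
```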
On Frequentist and Bayesian:
Optimizely's stats engine combines elements of both Frequentist and Bayesian statistics. They have a blog post that touches on this issue [2].
But this is subtle stuff, with a lot of trade-offs and different perspectives; look at the Bayesian/frequentist debate, which has been going on for decades among statisticians.
But, FWIW, I definitely saw Optimizely as an organisation make a big investment to produce a stats engine which had the right trade-offs for how their customers were trying to test; and I think the end result was way more suitable than 'traditional' statistics were.
[1] https://help.optimizely.com/hc/en-us/articles/200040115-Traf...
"Traffic Auto-allocation automatically adjusts your traffic allocation over time to maximize the number of conversions for your primary goal. [...]
To learn more about how algorithms like this work, you might want to read about a popular statistics problem called the “multi-armed bandit.”"
[2] https://blog.optimizely.com/2015/03/04/bayesian-vs-frequenti...
"Yet as we developed a statistical model that would more accurately match how Optimizely’s customers use their experiment results to make decisions (Stats Engine), it became clear that the best solution would need to blend elements of both Frequentist and Bayesian methods to deliver both the reliability of Frequentist statistics and the speed and agility of Bayesian ones."
I didn't realize that the auto-allocation ever shipped, but I'm glad it finally did. Hopefully there was work done to empirically show that they solved a lot of the issues around time to convert and other messy parts of the data that killed earlier efforts, but I think everyone who knew about those was gone before you joined :)
There are very subtle issues with both frequentist and Bayesian stats, which makes combining them sound insane to me.
> but they never really published much justifying it
Not that I'm trying to defend Optimizely (I'm not a huge fan, but for other reasons...).
I can't vouch for the quality either, but they did publish something about it [0] that at least looks quite scientific. Happy to read any critique, of course.
LaTeX is a wonderful way to make a marketing paper look like a scientific one. It doesn't accurately describe the method, but that isn't really its purpose: it's a more technical description of the blog post, meant for people using the product to understand some of the trade-offs and get more accurate results.
They are still having people make fundamentally flawed assumptions about the data, which results in incorrect conclusions, and they are still not presenting results in a way that people interpret correctly. That said, those problems are really hard to solve, and models that tried to correct for them would likely require a lot more data and be overly conservative for most people.
Hi, this is Leo, Optimizely's statistician. If you're looking for a more scientific paper, maybe take a look at this one we wrote recently: http://arxiv.org/abs/1512.04922
Should have everything you would ever want to know about the method.
I agree with you that the problem of inference and interpretation between A/B data, algorithms, and the people who make decisions from them is a hard one and worth working on.
That said, I do think the two sources of error our stats engine addresses - repeatedly checking results, and cherry-picking from many metrics and variations - made real progress in helping folks correctly interpret A/B tests. It does make the results more conservative, but the benefit is that the variations that do become significant are more trustworthy. I think this was absolutely the right trade-off to make for our customers, and trustworthiness is a pretty important aspiration for stats/ML/data science in general.
Of course I did write the thing, so I'm not very impartial.
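To illustrate the first of those two error sources, here is a toy simulation (mine, not Optimizely's) showing how repeatedly checking a fixed-horizon p-value and stopping at the first "significant" result inflates the false positive rate well beyond the nominal 5%, even when the two variations are identical.

```python
# Toy simulation of the "repeatedly checking results" problem: with two
# identical variations (a true null), stopping the first time a fixed-horizon
# p-value dips below 0.05 fires far more often than 5% of the time.
# Illustrative only; the numbers are specific to this made-up setup.
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(n_experiments=1000, n_peeks=20,
                                visitors_per_peek=250, rate=0.10):
    norm = NormalDist()
    false_positives = 0
    for _ in range(n_experiments):
        conv_a = conv_b = n = 0
        for _ in range(n_peeks):
            for _ in range(visitors_per_peek):
                conv_a += random.random() < rate   # both arms share the same true rate
                conv_b += random.random() < rate
            n += visitors_per_peek
            p_pool = (conv_a + conv_b) / (2 * n)
            if p_pool in (0, 1):
                continue
            se = sqrt(p_pool * (1 - p_pool) * 2 / n)
            z = (conv_b - conv_a) / n / se
            if 2 * (1 - norm.cdf(abs(z))) < 0.05:  # "significant" -- stop and declare a winner
                false_positives += 1
                break
    return false_positives / n_experiments

print(peeking_false_positive_rate())  # typically ~0.2-0.3, not the nominal 0.05
```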
I was tempted to make a snarky comment about using LaTeX, but I'm not sure it's entirely fair. It doesn't seem like just a bunch of MarketingSpeak wrapped in LaTeX to be honest.
My issue with Optimizely is mainly how they essentially ditched us as (paying) customers. We were admittedly small fish, but we were paying and willing to pay more, and then they switched to Enterprise-vs-Free with no middle ground. Enterprise was way too expensive for us, and Free didn't include essential features, so we were just stuck.
I ended up writing an open-source JavaScript A/B test client [0] (and recently also an AWS Lambda backend [1]), but it still has a way to go...
It may be marketing, but I was able to implement a sequential A/B test based on it. Admittedly, I did need to do some work beyond merely copy/pasting an algorithm, but all I really needed to do was read their paper and some of its citations. I do believe the document describes a viable frequentist test, and my implementation of it worked pretty well.
Disclaimer: I do stats work at VWO, an Optimizely competitor.
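For anyone curious what such an implementation roughly looks like, here is a bare-bones sketch of a mixture sequential probability ratio test with always-valid p-values, in the spirit of the arXiv paper linked above. The class name, the streaming interface, and the mixture variance `tau_sq` are my own illustrative choices, not the paper's notation or any production settings.

```python
# A bare-bones mixture SPRT for a streaming difference in conversion rates,
# with an "always-valid" p-value that can be checked after every visitor.
# Sketch only: tau_sq and the interface are illustrative assumptions.
import random
from math import exp, log, sqrt

class SequentialABTest:
    def __init__(self, tau_sq=0.01, alpha=0.05):
        self.tau_sq = tau_sq          # variance of the N(0, tau_sq) mixing prior on the lift
        self.alpha = alpha
        self.n = 0                    # paired visitors observed so far
        self.conv_a = self.conv_b = 0
        self.p_value = 1.0            # always-valid p-value; never increases

    def observe(self, converted_a, converted_b):
        """Feed one visitor per arm (booleans); returns the current p-value."""
        self.n += 1
        self.conv_a += converted_a
        self.conv_b += converted_b
        pa, pb = self.conv_a / self.n, self.conv_b / self.n
        theta = pb - pa                                   # observed lift
        v = max(pa * (1 - pa) + pb * (1 - pb), 1e-9)      # variance of one paired difference
        n, t2 = self.n, self.tau_sq
        # Log of the mixture likelihood ratio against H0: lift == 0.
        log_lam = (0.5 * log(v / (v + n * t2))
                   + (n * theta) ** 2 * t2 / (2 * v * (v + n * t2)))
        self.p_value = min(self.p_value, exp(-log_lam))
        return self.p_value

    def is_significant(self):
        return self.p_value < self.alpha

# Monitor continuously and stop whenever you like; the p-value stays valid.
test = SequentialABTest()
while not test.is_significant() and test.n < 200_000:
    test.observe(random.random() < 0.10, random.random() < 0.12)
print(test.n, test.p_value)
```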
Optimizely did an A/B test that showed that customers respond better to rounded-up numbers. ;-) The real world is messy.
Years ago my team's statistician did a competitive review of various A/B test apps, and reported various ways in which their UIs make statistically invalid statements to the user.
You're wrong. What you are thinking about is Optimizely's "Chance to Beat Baseline" number. That's different from the statistical significance, which is a setting you can change on the Settings page.
Being smug and condescending really backfires when you don't know what you're talking about.
That's not how statistical significance works...