Information is as useful as your ability to act on it--no more, no less. Real-time analytics is something that sounds sexy and gets a lot of headlines (and probably sales), but it's not particularly useful, especially compared to the cost to implement. Most organizations aren't capable of executing quick decisions of any significance. In fact, quite a few business models wouldn't have much to gain even if they were capable of it.
My experience is that there are three types of companies, with very little overlap:
1. Companies large enough to receive statistically significant amounts of data in under an hour.
2. Companies small enough to make decisions regarding significant site updates in under an hour.
3. Companies whose name is "Google."
Fact of the matter is, any change to your site more significant than changing a hex value will require time overhead to think up, spec out, test, and apply. Except in the most pathological cases of cowboy coding, it will take at least a day for minor changes. Changing, say, the page flow of your registration process will take a week to a month. You won't be re-allocating your multi-million-dollar media budget more often than once a quarter, and you have to plan it several months in advance anyways because you need to sign purchase orders.
In short, you can usually wait 'til tomorrow to get your data. Really, you can. Sure, you can probably stop an A/B test at the drop of a hat, but if it took you a week to build it, you ought to let it run longer than that.
I have had one client who really did benefit from real-time-ish (same-day) data. It was a large celebrity news site. They could use data from what stories were popular in the morning to decide which drivel to shovel out that afternoon. This exception nonetheless proves the rule: Of the 6 "requirements" listed in the article, only 1.5 were needed in this particular case: hard yes on accessibility, and timeliness was relaxed from 5 minutes to 30.
(Note that when I say analytics, I mean tools for making business decisions. Ops teams have use for real-time data collection, but the data they need is altogether different, and they are better served by specialized tools).
"They could use data from what stories were popular in the morning to decide which drivel to shovel out that afternoon."
This doesn't seem to be the same thing as a business decision to me - this is more of a process. It's the same thing as in a JIT supply line, such as ordering more car doors because the line ran quicker in the morning. What you've done in this case is set up a just-in-time celebrity-drivel creation process and are making standard operational decisions instead of business decisions.
If you were deciding to change how the share functionality of the site worked because people were not using it in the morning, then this would be a business decision, and I have a feeling you would not make this decision on just a morning's worth of data.
Google is in category 1, but the realities of how Google runs are so different from most category 1 companies that generalizations about category 1 companies tend to break for Google.
Likewise note that category 1 companies are also in category 2, but the realities of being a category 1 company are very different from the realities of being a category 2 company.
Gah, yet another article that links to Evan Miller's article on how not to run an A/B test. I really need to finish writing my article that explains why it is wrong, and how you can do better without such artificial restrictions.
His math is right, but the logic misses a basic fact. In A/B testing nobody cares if you draw a conclusion when there is really no difference, because that is a bad decision that costs no money. What people properly should care about is drawing the wrong conclusion when there is a real difference. But if there is a significant difference, only for small sample sizes is there a realistic chance of drawing a wrong conclusion, and after that the only question is whether the bias has been strong enough to make the statistical conclusion right.
He also is using 95% confidence as a cut-off. Don't do that. You don't need much more data to massively increase the confidence level, and so if the cost of collecting it is not prohibitive you absolutely should go ahead and do that. Particularly if you're tracking multiple statistics. If you test regularly those 5% chances of error add up fast.
> He also is using 95% confidence as a cut-off. Don't do that. You don't need much more data to massively increase the confidence level, and so if the cost of collecting it is not prohibitive you absolutely should go ahead and do that.
Statistical significance grows roughly like the square root of the number of samples. Moving from 2-sigma (95%) to the physics gold-standard of 5-sigma requires drastically more data in almost all cases.
Selecting a measurement's uncertainty is something which should be carefully considered. Sometimes you only care about something to 10%, sometimes a 1-in-a-million part failure kills someone's Mom.
If you're doing lots of A/B testing, where trials penalties add up, it might be worth looking into the way that LIGO handles False Alarm Rates. They have to contend with a lot of non-Gaussian noise/glitches.
> Statistical significance grows roughly like the square root of the number of samples.
No, no, no. You are confusing the growth of the standard deviation (which does grow like the square root of the number of samples) with the increase in certainty as you add standard deviations. That falls off like e^(-O(t^2)) where t is the number of standard deviations. This literally falls off faster than exponential.
What does this mean in the real world? In a standard 2-tailed test you get to 95% confidence at 1.96 standard deviations, 99% confidence at 2.58 standard deviations, and 99.9% confidence at 3.29 standard deviations. These numbers are all a long ways away from 5 standard deviations.
Let's flip that around and take 95% confidence as your base. If you are measuring a real difference, then on average 99% confidence requires a test to get 32% more data, and 99.9% confidence requires a test to get 68% more data. Depending on your business, the number of samples that you get are often proportional to the time it takes to run the test. If making errors with x% of your company involves significant dollar figures, the cost of running all of your tests to higher confidence tends to be much, much less than the cost of one mistake.
That is why I say that if the cost of collecting more data is not prohibitive, you shouldn't be satisfied with 95% confidence.
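For anyone who wants to sanity-check the thresholds quoted above, the two-tailed z-values come straight from the inverse normal CDF. A minimal Python check, standard library only:

    from statistics import NormalDist

    # Two-tailed thresholds: confidence c leaves (1 - c)/2 in each tail.
    for confidence in (0.95, 0.99, 0.999):
        z = NormalDist().inv_cdf(0.5 + confidence / 2)
        print(f"{confidence:.1%} two-tailed -> {z:.2f} standard deviations")
    # Prints roughly 1.96, 2.58, and 3.29, matching the numbers above.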
Assume a random variable is barely resolved at 1-sigma off zero with N samples. If I wish to increase my confidence that it really is off zero (and the mean with N samples is actually the mean of the distribution), then I'll need 4N samples to halve my uncertainty and double the significance of the observation (as measured in sigma-units). It is in that sense that the significance of a measurement increases like \sqrt(N).
Viewed from my perspective, if you'd like to go from 2-sigma (95%) to 3.29-sigma, you'd need (3.29^2)/(2^2)=2.7 times the amount of data used to get the 2-sigma result, or 170% more samples.
It looks like you've reached your conclusion that I'd need 68% more data to reach 99.9% by taking 3.29/1.96=1.68. I believe that this is in error. Uncertainty (in standard deviation) decreases like 1/\sqrt(N), not 1/N.
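For concreteness, the standard approximate sample-size formula for comparing two conversion rates is n per arm ~ (z_alpha + z_beta)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2, which grows with the square of the z-values. A minimal sketch of it in Python; the 10% baseline rate, 5% relative lift, and 80% power are assumptions chosen purely for illustration:

    from statistics import NormalDist

    def samples_per_arm(p_base, p_variant, confidence, power=0.80):
        """Approximate per-arm sample size for a two-sided two-proportion z-test."""
        z_alpha = NormalDist().inv_cdf(0.5 + confidence / 2)  # significance threshold
        z_beta = NormalDist().inv_cdf(power)                   # power requirement
        variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
        return (z_alpha + z_beta) ** 2 * variance / (p_base - p_variant) ** 2

    baseline = samples_per_arm(0.10, 0.105, 0.95)
    for conf in (0.95, 0.99, 0.999):
        n = samples_per_arm(0.10, 0.105, conf)
        print(f"{conf:.1%}: ~{n:,.0f} visitors per arm ({n / baseline:.2f}x the 95% run)")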
\sqrt(N) has driven me to depression more than once.
Neat to see a different side of a coin. In our lab, individual measurements can take as long as a year. \sqrt(N), when constrained by human realities, presents a wall beyond which we cannot pass without experimental innovation.
As the derivative of \sqrt(N) is 1/2*1/\sqrt(N), your first measurement teaches you the most. Every measurement teaches you less than the last. In general, we measure as much as we must, double the size of the dataset as a consistency check, and move on. The allocation of time is one of the most important decisions of an experimenter.
Ah. Well I talk about the cost of data acquisition for a reason.
I've seen a number of businesses that have a current body of active users, and that body does not change very fast. So when they run an A/B test, before long their active users are all in it, and before too much longer those of their active users who would have done X will have done X, and data stops piling up. In that case there is a natural amount of data to collect, and you've got to stop at that point and do the best you can.
Businesses are as alike as snowflakes - I am happy to talk about generalities but in the end you have to know what your business looks like and customize to that.
When you run confidence tests on a continuous process at 99%, you will hit a false positive about one day in a hundred on average. If you're doing multiple A/B tests per day, well...
EDIT: And another important issue: if you are testing on Valentine's Day and the pink color/A test is beating out the orange color/B test, then this result may only hold on that one day per year. This applies to hot news topics and numerous other fads as well.
Also make sure your tests really are independent - if you do A/B testing off of fixed IP address hashing and have a small user base, by random chance your top customer may always end up in 'group A', giving group A a higher-than-expected base average - an extra bonus click for every group A test may skew your statistical tests, since they assume independence.
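The "error rates add up" point in this sub-thread is just the family-wise error rate. A quick sketch of how fast it grows with the number of tests, and what a plain Bonferroni correction does about it (the 5% threshold is only an example):

    # Chance of at least one false positive across k independent tests at alpha = 0.05,
    # and the Bonferroni per-test threshold that keeps the family-wise rate near 5%.
    alpha = 0.05
    for k in (1, 10, 50, 100):
        family_wise = 1 - (1 - alpha) ** k
        print(f"{k:3d} tests: P(at least one false positive) = {family_wise:.1%}, "
              f"Bonferroni per-test alpha = {alpha / k:.4f}")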
Counterpoint to your first point - do I, as the Practical A/B Testing Guy, really care if I get one in 100 tests wrong?
If I'm running 100 A/B tests a year, 50% of which produce a 5% improvement in my conversion rate, who cares if I get one of them wrong and drop 5% instead? I'll still have improved my conversions by 1100% over the course of the year.
(Obviously, I'm not taking into account local maxima here - but any statistical misstep will only temporarily roll your testing down the hill, and subsequent tests will push back up to the local maximum anyway.)
In reality no, because nobody will notice that the company is about 10% smaller than it could have been, and you're doing good stuff. However the cost of that 10% change in your company probably exceeds the cost of running 100 tests.
Note that in the real world, after a while you exhaust the easy 5% wins. Then you wind up chasing a lot of 1% and 2% marginal wins, and those tests take a lot longer to run. But even so, if your company is pulling down, say, a million per month, a 2% drop in business from getting a 1% test wrong probably exceeds the costs of all of the tests you run per year. So unless extending tests is prohibitively expensive, or there are significant opportunity costs from not being able to run the tests you want, you should go to higher confidence.
In my experience, the primary limiter of how many A/B tests you can run is available traffic.
Let's assume that I'm running the 2% tests you mention. Over a year, say, I can either run 60 2% tests at 95% confidence or (if the math posted elsewhere in this thread is right) approximately 30 2% tests at 99.9% confidence - that's how much traffic I have. Let's once again assume a third of those pan out.
I'm still not seeing why I'd prefer 10 2% wins (from the 99.9% approach) to 20 2% wins plus 3 2% losses (from the 95% approach). Yes, there are more errors, but overall I end up with a 40% improvement as opposed to a 24% improvement.
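The compounding arithmetic in this exchange is easy to check. A quick sketch, assuming wins and losses simply multiply:

    # Rough check of the compounding claims in this sub-thread.
    fifty_wins_at_5pct = 1.05 ** 50               # ~11.5x over the year
    one_flipped = 1.05 ** 49 * 0.95               # same year, with one 5% win turned into a 5% loss
    ten_wins_at_2pct = 1.02 ** 10                 # the 99.9%-confidence approach
    twenty_wins_three_losses = 1.02 ** 20 * 0.98 ** 3   # the 95%-confidence approach
    print(f"{fifty_wins_at_5pct:.2f}x  {one_flipped:.2f}x  "
          f"{ten_wins_at_2pct:.2f}x  {twenty_wins_three_losses:.2f}x")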
On the first point - I may be being dense here. Do you mean a) segmenting traffic, and running A vs B on one segment, and A vs C on another, b) A/B testing, say, headline and CTA on the same page at the same time, or c) testing different parts of the funnel at the same time?
On the second point - this may be an industry difference, but I've never really had any problem coming up with things to A/B test on a landing page, for example. Just off the top of my head, you could test:
- Headlines - at least 5 of them.
- Headline and subhead vs no subhead.
- Font face.
- Color scheme (overall)
- Images - probably want to test 6-10 of them.
- Call To Action (CTA) text.
- Call To Action button vs link
- CTA placement
- Multiple vs single CTAs.
- Long copy vs short.
- Layout. Image on left, right or top? More than one image? Etc. On the average LP I could come up with 10 possible layouts before pausing for breath.
- Testimonials. Placement, which ones, how long, picture vs audio vs video vs text.
- Video vs image-and-text.
- Ugly vs good-looking.
- Text highlighting, bolding etc - yellow highlighter style vs bold and italic vs nothing.
- Other social proofing elements - which media icons to use, where to place them, etc.
That's at least 50 A/B tests right there, on a single LP. And all of those elements have been shown in one test or another to affect conversion rates.
I mean that you can take the same traffic, using random assignment, and assign it into multiple A/B tests at once. Sure, there may be interaction effects, but they are random and statistically your evaluation of each test is unaffected by the others.
You need to be careful if there is reason to believe that tests will interact. For instance, if you're testing different font colors and different background colors, the possibility of red text on a red background would be unfair to both tests. But in general, if you avoid the obvious stuff, you can do things in parallel. (If you have enough traffic you can analyze for interaction effects, but don't plan on doing that unless you know that you have enough traffic to actually follow the plan.)
Re-reading, I realize that I was as clear as mud here about random interaction vs. non-random interaction.
The first paragraph is talking about random interaction. So, for instance, version A of test 1 was really good, and version B of test 2 got more A's from test 1 than version A of test 2 did. This gives version B a random boost. As long as things are random, it is OK to completely ignore this type of random interaction from the fact that you are running multiple tests on the same traffic.
The second paragraph is talking about non-random interactions. People who are in version A of test 1 and also in version B of test 2 get a horrible interaction that hurts both of those. If you have reason to believe that you have causal interactions like this, you can't ignore it but have to think things through carefully.
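One common way to run several tests on the same traffic with random, mutually independent assignment is to hash the user ID together with a per-experiment salt. A minimal sketch; the experiment names and two-way split are made up:

    import hashlib

    def bucket(user_id: str, experiment: str, variants=("A", "B")) -> str:
        """Deterministic per-experiment assignment via a salted hash."""
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
        return variants[int(digest, 16) % len(variants)]

    # The same user lands in effectively independent buckets across experiments,
    # because each experiment name acts as a different salt.
    for experiment in ("headline-test", "cta-color-test"):
        print(experiment, bucket("user-12345", experiment))

Using a fresh salt per experiment is also what avoids the fixed-IP-hash problem mentioned upthread, where one heavy user always lands in the same group in every test.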
You raise a bunch of good and valid points. Please do finish your article. I don't view "how not to run an A/B test" as the final authority on this without qualifications, either.
So I built and use a realtime analytics dashboard that tracks revenue, projected revenue, revenue by hour for a portfolio of social games. I find it incredibly useful, but I will give a couple tips that address some of the issues in the article:
1) You have to provide context for everything. Current real time revenue is presented right next to the 14 day average revenue up to that point in time, and also how many standard deviations the delta between the two is. I.e.: current revenue is $100 at 10am, vs. a 14 day average of $90, which is 0.2 standard deviations of revenue at that time.
2) Hourly revenue is presented the same way, right next to the 14 day average revenue for that hour and the SD delta. (A small sketch of this kind of computation follows the list.)
3) Look at it a lot. I've been looking at this sheet regularly for over a year now, and I have a really good feel/instinct for what a normal revenue swing is, and an even better feel for the impact of different features/content/events/promotions on our revenue.
4) This approach also works better when the impact of your releases is high. A big release typically spikes revenue 2-3 SD above baseline, and causes an immediate and highly visible effect. So while I'm not strictly testing for statistical significance, it's one of those things where it's pretty obvious.
5) It also works better if you use it in conjunction with other metrics. We validate insights/intuitions gained from looking at realtime data against weekly cohorted metrics for the last several months of cohorts.
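The "standard deviations off the 14 day average" context in points 1 and 2 boils down to a z-score against the same time of day in prior days. A minimal sketch; the revenue numbers are invented:

    from statistics import mean, stdev

    def revenue_context(today_so_far, prior_days_so_far):
        """Compare today's revenue-to-date against the same point in prior days."""
        baseline = mean(prior_days_so_far)
        z = (today_so_far - baseline) / stdev(prior_days_so_far)
        return baseline, z

    # e.g. revenue up to 10am today vs. the last 14 days' revenue up to 10am
    prior = [92, 88, 95, 90, 87, 91, 94, 89, 93, 90, 88, 92, 86, 91]
    baseline, z = revenue_context(100, prior)
    print(f"today so far: $100, 14 day average: ${baseline:.0f}, {z:+.1f} SD")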
While this sounds very cool, and my inner geek would love this dashboard on the wall of the office - is it actually useful or is it a distraction?
What actions or decisions would you make within minutes of seeing the results? If product changes take days or weeks, daily analytics is just as useful, and stops people wasting time on looking at the data more than once per day.
There are three major classes of decisions that get made from this. In descending order:
1) We generally have pretty tight revenue targets, so one thing I and my teams do now is forecast out expected revenue on a daily basis for the next 30 days, as revenue is highly dependent on a stream of new content, features and what we call live ops.
Live ops are basically a combination of promotions, contests, events and tournaments. We have decent tools, so we can generally spin up a live op in a matter of minutes to a couple hours without doing an engineering push. We also have content we can release with a few hours of work, so if today is not going to go well, it gives us enough time to prep something for tomorrow.
It's highly useful to know if a Live Op or new feature/content release is "working" or not. With the real time dashboard and experience, I can know with high confidence whether it's working within 30 minutes of release, which gives me more than enough time to spin something else up so that we don't lose the day.
2) On many days, we actually do multiple releases or live ops in a single game -- the pace of releases is fast even compared to other web companies. As unscientific as this is, it's often easier for me to evaluate/get a feel for the impact of a release/live op based on how much it moves the projection in the first hour after release, and before the next thing comes along.
Obviously with stuff like that, you validate after the fact against more complete data, but I've found that after looking at this a lot for over a year, I have a strong "feel" for what's working and what's not.
3) Related to 1) and 2), but I've found that using the tool gives me a much tighter feedback loop. When I make a feature release or live op and I'm sitting on the dashboard watching the release move or not move the projection, I have a much tighter feel on the impact. Most of my peers who don't have something like this often are not as rigorous in understanding and evaluating the different levers they have to impact revenue.
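A crude version of that intraday projection is just "revenue so far, divided by the fraction of a typical day that is normally complete by this hour." A rough sketch, with the hourly shape left flat purely to keep it short; in practice it would come from something like the 14 day hourly averages described above:

    def project_end_of_day(revenue_so_far, hour, avg_hourly_shape):
        """Project today's total from intraday pacing against a typical day's shape."""
        fraction_done = sum(avg_hourly_shape[:hour]) / sum(avg_hourly_shape)
        return revenue_so_far / fraction_done

    # avg_hourly_shape[h] = average share of a day's revenue earned in hour h.
    typical_day = [1.0] * 24
    print(f"projected total: ${project_end_of_day(500, 10, typical_day):,.0f}")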
I like this rant. Seldom do I see the need for a real-time system, and sometimes I think engineers and program managers gravitate towards the concept to better answer questions of "why" a problem happens. But analytics problems most of the time can't be solved in real time. You have to put on your thinking cap, take a step back, do some background research, and be patient. And as an analyst it is bad for your credibility to jump to conclusions. Unlike in engineering, it's better to be slow and right on your first try than to "move fast and break things".
Nice post. Ops guys, though, like to see the bushes rustling right away so that we can reboot that switch before all hell breaks loose :-)
The central theme is a good one though, tactics or strategies have an innate timeline associated with them, and deciding on tactics or strategies with data that doesn't have a similar timeline leads to poor decisions. The coin flip example in the article is a great one.
Ideally one could ask "What is the shortest interval of coin flips I can measure to 'accurately' determine a fair coin?" and realize that accuracy only asymptotically approaches 100%. One of the things that separates experienced people from inexperienced ones is having lived through a number of these 'collect-analyze-decide' cycles and getting a feel for how much data is enough.
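The coin-flip question has a clean numerical face: the margin of error on the observed heads rate shrinks like 1/\sqrt(N), so certainty improves steadily but never reaches 100%. A small illustration of the 95% margin at different sample sizes:

    from math import sqrt
    from statistics import NormalDist

    # 95% margin of error on an observed heads rate near 0.5 after n flips
    z = NormalDist().inv_cdf(0.975)
    for n in (100, 1_000, 10_000, 100_000):
        margin = z * sqrt(0.5 * 0.5 / n)
        print(f"{n:>7} flips: 0.5 +/- {margin:.3f}")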
> It's important to divorce the concepts of operational metrics and product analytics.
I don't think operational and product analytics are different in principle. In both cases, the best speed for analytics matches the speed of taking action on the results.
In ops, there are people whose very job is to respond to a server crash in 2 minutes. In product design, there are no actions taken within minutes of seeing the change in data - so faster analytics just wastes time on refresh-procrastination, and encourages needlessly hasty decisions.
> One of the things that separates experienced people from inexperienced ones is having lived through a number of these 'collect-analyze-decide' cycles and getting a feel for how much data is enough.
If you're going off of "feel" instead of statistics then you're doing it wrong. Period.
>If you're going off of "feel" instead of statistics then you're doing it wrong. Period.
I disagree, at least to a point. Take an experienced Operations/monitoring guy who's been around the block more than once, then sit him in front of the monitoring utilities for a new company developing a new service.
Then, take a total newbie and put him in the same place.
Train them both to equal skill on your tools and operations.
Who do you think will make the most proper calls? Why?
At this point, those statistics and that documentation do not exist yet. What constitutes a "false positive" vs a "drop everything and spin up more VMs and get on the load balancer" can be more of an art than a science, especially when you're first starting out.
As hokey as this sounds, certain systems have a "personality" that varies between installations and companies, that nothing short of day to day use will educate one in.
For operations, I agree. A lot of the numbers you have don't have a rigorous statistical interpretation - for instance is a load average of 20 fine or a problem? Depends if you're looking at the Oracle database.
But the original article was talking about A/B testing, and that is the context that I was thinking about. There you both can and should use statistics.
It's possible (though deceptively difficult) to train one's statistical intuitions to be pretty accurate, or at least accurate enough to give them a positive EV. I was a professional poker player for several years, and this is what becoming a strong player is all about.
There's a lot of intuition and experience that is required to be a talented statistician; once the test has begun things are pretty much locked down (ie, if you know the math there is only one "right" answer from the data). The intuition and experience side of things comes with how the test itself is setup. Are things properly controlled, are you collecting the right data for what you're testing, is the data distributed properly for the test statistic you're using...the list goes on and on and on and on and on.
There are plenty of places that require intuition and these are the places that errors are often introduced. Statistics is an art.
You can build a mathematical intuition. This is a valid thing. You just know how to dig down and prove you are right if asked, so it's ok to leave out details.
Of course you are. My comment was that when you are presented with a decision, along with the data that was collected and analyzed to support it, you are much better at catching problems if you've seen a few of these before.
If you are the one doing the collecting and analyzing and you just think it feels like enough data but you can't actually reason to that point, then yes, you are doing it wrong.
I once interviewed for a lead webdev role at a small startup. They had 10-12 people, and a product that was doing OK. (I was thoroughly unconvinced by it, but that's another story). One of the things they talked about was their upcoming plan to build a real-time analytics system to track user behaviour. A big project! That I would get to spearhead! They'd budgeted 2-3 months and 6-8 people to implement it. We talked about their plans for a bit, before I asked (what I thought was) the obvious question:
"So, what's the real-time system going to help you decide that the current system won't?"
There was a long, uncomfortable pause as the two people looked at each other, each hoping the other would answer.
"Well... it's not so much the real-time element, per se..." one managed. "But we want more granular data about how people are using our app."
"Okay. But you're currently doing analytics via HTTP callbacks, right? Why not just extend that to hit some new endpoints for your more granular data? You've already got infrastructure in place on the front and back end to support that."
No answer. We moved on. I don't know if I actually saved them 1-2 man-years of work or if they plowed ahead anyway.
And we shouldn't have calculators because we may forget the relationships between numbers?
I use analytics to do significant A/A testing on every configuration the site's users are actually using, to determine what will work for my A/B testing later...
Should I maintain separate realtime analytics, or delay deployments by 24 hours when I would like a little more assurance? This is not a rhetorical question; whether I should keep maintaining separate tracking for the 20% of the time where Google Analytics is unfit is an open problem for me.
Similarly, I would like to know if there is a sudden plummet in some demographic the second I start a test. It usually isn't significant, but the client panic will be. It is better to cancel the test and do a post-mortem before restarting. An A/B test doesn't have to get its day in court.
Giving delayed numbers for routine reports is perfectly valid; dressing up that pig is luddism.
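For readers unfamiliar with A/A testing: you split traffic but serve the identical experience to both halves, then run your usual significance test. It validates the assignment and measurement pipeline and calibrates how often "significant" results appear by chance - at 95% confidence, roughly 5% of A/A comparisons should flag a difference. A small simulation sketch, with the traffic numbers invented:

    import random
    from math import sqrt
    from statistics import NormalDist

    random.seed(0)
    z_crit = NormalDist().inv_cdf(0.975)      # 95% two-tailed threshold
    runs, n, p = 1000, 5000, 0.10             # 1000 A/A tests, 5000 users per arm, 10% conversion
    false_alarms = 0

    for _ in range(runs):
        a = sum(random.random() < p for _ in range(n))
        b = sum(random.random() < p for _ in range(n))
        pooled = (a + b) / (2 * n)
        se = sqrt(pooled * (1 - pooled) * (2 / n))
        false_alarms += abs(a / n - b / n) / se > z_crit

    print(f"{false_alarms / runs:.1%} of A/A comparisons flagged 'significant'")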
Presumably anything along the lines of the "demographics drop" you mention would be in live operational metrics if it is something that is an effective metric for monitoring the health of the system.
While I agree with the basic premise that Real-time analytics are rarely helpful, here are a couple places where they could be very useful:
* Conferences - Being able to see live user analytics on a conference site, since it is ephemeral, would be great.
* Pop-up Sites - Again, the short nature of the site means seeing a blocking action or a broken link early is tremendously valuable.
Basically there are a couple circumstances where real-time analytics might make sense, but they're generally short duration engagements. Getting analytics info for a site which is no longer being hammered is useless unless it's a long term project.
What action will the conference take based on any immediate information or do you mean the information will go on the site?
Broken links etc. are probably in the operational category, although a validator is a better solution for that issue. Logging errors and maybe tracking accesses to ensure every page is being reached can be done with realtime operational stats and doesn't contradict this article.
So think of conferences where there's a site up for one day.
If one wants users to click through to a particular page, but they're all going elsewhere, that's something that's only actionable on the day of the event (as changing it 12 or 24 hours later does little good).
I see your point about not contradicting the article, but I think there are instances where evaluating the performance of a website in real time (for time-sensitive events) could have a real impact.
I don't think we're disagreeing so much as talking about the same point from different angles.
"You just need to understand cause and effect," said Apollo.
"He's right, mortal. This isn't what you would call rocket science," added Athena.
"Okay, and my business will succeed if I can understand cause and effect?"
"Yes," said Apollo.
"Of course! Why are you wasting time? Go write some software", said Athena.
So yeah, real-time A/B testing seems like a bad idea, but real-time analytics sounds fine. On the other hand, maybe the Gods gave you the idea of cause and effect to destroy you. I bet more than one story on hacker news today pretends to understand the causes for an effect.
I agree with this in general, but there are exceptions. For example, it would be nice to know immediately if a new change has caused your conversion rate to drop precipitously for some reason, so that you can turn it back off and take a minute to see if you can figure out why before you lose a full day's worth of revenue.
Agreed - seeing real-time changes is helpful for responding to drop-offs or spikes.
Also, if you are aware of the general trend of Tuesdays being higher traffic/results than Saturdays (to take his Etsy example) and don't take those to heart around product decisions, then watching real-time numbers to respond to changes as they happen can help you hop on waves with supplemental content or messaging.
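The "turn it back off before you lose a day" case is usually handled as a guardrail check rather than a full A/B analysis: compare the post-release conversion rate to a recent baseline and alert only on a drop too large to be noise. A rough sketch; the thresholds and numbers are placeholders, not recommendations:

    from math import sqrt

    def guardrail_triggered(baseline_rate, conversions, visitors,
                            min_visitors=2000, max_drop=0.20, z_required=3.0):
        """Alert if post-release conversion sits far enough below baseline to act on."""
        if visitors < min_visitors:
            return False                      # too early to say anything
        rate = conversions / visitors
        se = sqrt(baseline_rate * (1 - baseline_rate) / visitors)
        big_drop = rate < baseline_rate * (1 - max_drop)
        clearly_real = (baseline_rate - rate) / se > z_required
        return big_drop and clearly_real

    # e.g. baseline 5% conversion; 3,000 visitors and 90 conversions since the release
    print(guardrail_triggered(0.05, 90, 3000))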
Interesting post, though I feel the author is somewhat missing the forest for the trees; the issue isn't "real-time", the issue is that many people conducting A/B tests don't understand what the statistics are telling them, nor do they understand when an adequate "sample" has been pulled.
Real-time data isn't needed for A/B testing but this falls into the PEBKAC category.