As a scientist who has moved into the business world, it amazes me how statistics are abused.
When I was conducting scientific research, the goal was to come up with an air-tight (or as air-tight as possible) case for your hypothesis. If you presented your findings at a meeting, you better be prepared for the onslaught of questions like "Did you consider X?" and "What about Y?".
Then I moved to the business side and holy crap are the standards lower. Of course it's easier to prove something in a lab than in the real world, but in so many cases I've seen somebody say "If you do X, you will get Y result, based on the data I analyzed". Then I raise my hand and say "But what about Z? That could explain your results." and all I get is blank stares like I just solved a differential wave function in my head.
Thanks so much for posting this! The perfect tool to irritate people with odd correlations. Reminds me of the good old Church of the Flying Spaghetti Monster "belief" in the correlation between the decline of pirates and global warming.
It's a great topic, especially for businesses and startups. I think the problem is much, much deeper than correlation != causation. The basic problem is that we don't understand how to deal with statistics, especially aggregate numbers. This is a funny way to make a point, but the problem is waaaay deeper than just confusion about correlation. The errors in scientific studies, for instance, are just one example of the harm caused by these kinds of cognitive blind spots. (I say blind spot instead of lack of math education because I don't believe the root problem is a lack of understanding math. In my opinion, something else is going on.)
There is a logical explanation. Drivers are spending more time sorting through lemons at the grocery store and end up driving less, thus traffic fatalities fall.
All these comments are true. But let's not make the similar error of saying that correlations are bad.
Sometimes, knowing what caused something is the necessary answer, and for those, a root cause analysis and proper experimental design for validation are important. But sometimes, in business and in life, just knowing that things hang together can be pretty handy.
Correlations are important clues. The entire "recommendation" world, from Amazon's collaborative filtering to Hunch's "everything you might be interested in", is predicated on correlations.
No argument, saying correlation implies causation is bad. But it's just as bad to say "therefore, correlation is bad". DanielBMarkham's article and this BW.com post both show that it comes down to interpretation of what the data says. It's understanding the limitations of what a number, or a trend, or even a distribution can reveal. It's understanding what regression to the mean actually means, or why we consider a distribution "normal"... and that outliers actually can be profitable.
And it's a recognition that with the democratization of big data, it will get worse before it gets better... but it will get better. 40 years ago, no one ever saw the stock market on the news, or had access to its ups and downs every second. We now all have a better understanding of stocks (well, ok, that's a bit of a stretch, but you get my drift), and their dangers. Similarly, as we get used to seeing lots more data, and discovering that if you interpret it wrong, bad things happen... well, I expect more folks to ask that next level of questions. Not all, and not much past that... but it will be a start.
"There are three kinds of lies: lies, damned lies, and statistics."
I'd go so far as to say the problem is 100x more complicated than "Correlation != Causation". Given a set of factual statistics, it's not terribly difficult to present them in a truthful, reasonable manner that supports any side of a given argument.
Well, most people seem to forget that if you are looking for correlations among N variables, you can't compare each pair with the same standards as if you only had 2 variables. (Remember the recent article about neuroscience papers? Same thing.)
So the damned lie of statistics is pretty subtle: you just have to omit the number of variables you actually looked at when you present your data.
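To put rough numbers on the point about N variables, here is a minimal sketch in Python. The function name and the 0.05 threshold are my own illustrative choices, and the "at least one" figure treats the pairwise tests as independent, which is only an approximation since pairs share variables.

```python
from math import comb


def multiple_comparison_summary(n_variables, alpha=0.05):
    """Rough arithmetic for testing every pair among n_variables.

    Hypothetical helper for illustration; assumes each pairwise test is
    run at significance level alpha and (approximately) independently.
    """
    n_pairs = comb(n_variables, 2)  # number of distinct pairs to compare
    return {
        "pairs": n_pairs,
        # Expected count of "significant" pairs when nothing is really related
        "expected_false_positives": alpha * n_pairs,
        # Chance of at least one false positive (independence approximation)
        "p_at_least_one": 1 - (1 - alpha) ** n_pairs,
        # Bonferroni-corrected per-test threshold
        "bonferroni_alpha": alpha / n_pairs,
    }


summary = multiple_comparison_summary(20)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With 20 variables there are already 190 pairs, so reporting only the one pair that "worked" hides a lot of failed comparisons. The Bonferroni threshold is one common (and conservative) way to hold each pair to a stricter standard.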
The field of statistics is very concerned with mis-representing data, and has developed a huge array of methods for dealing with uncertainty. It is important for people to appreciate that there are some minimum steps that must be taken for statistical analysis to claim validity. One of the first, which is almost universally ignored in pop-stats like the OP, is to state your assumptions, then justify them.
Stats are open to interpretation, which is why academia favors peer review, where faulty underlying assumptions can be checked.
I wish schools taught math leading to statistics and probability instead of leading to calculus. I believe that would be much more useful for the average citizen.
> wish schools taught math leading to statistics and probability instead of leading to calculus
This is silly. All cumulative distribution functions are càdlàg, so how can you even teach probability without the notion of right-continuous with left limits? That means you have to resort to limits & derivatives => Calc.
Actually, the argument for combining Calc & Stats is very compelling, because there is so much synergy. How can you teach a continuous probability distribution like, say, the Gaussian without teaching how to integrate under the curve for the cumulative distribution function, or obtaining the probability density function via the derivative, or obtaining the variance aka second central moment via the moment generating function, which means you now have to teach at least some Fourier transforms, which again means Calculus? At both UChicago & Stanford, where I learnt all of my probability, calculus was quite intertwined with the teaching of probability. I believe it's the same case in most other schools as well.
Without calc in probability, you can do "lame" stuff like discrete distributions (Binomial, Poisson, etc.), but even there, the key insight is to show how the CDFs of the discrete distributions, which will generally have terribly complicated formulae with giant factorial expressions, can be very nicely approximated by the continuous distributions for large n, small p etc. (with a continuity correction, see http://en.wikipedia.org/wiki/Continuity_correction ). So for a large number of coin flip trials, you use a Normal to approximate the CDF because otherwise the original binomial CDF is too hard to compute with your TI-84s (because you have one giant factorial divided by another giant factorial and the numerical overflows will kill the computation unless you are very careful about how you go about computing the result).
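For anyone who wants to see the approximation in action, here is a small sketch in Python using only the standard library. The function names are mine, and the example parameters (100 fair coin flips, at most 55 heads) are arbitrary choices for illustration.

```python
from math import comb, erf, sqrt


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma) distribution via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


def binomial_cdf_normal_approx(k, n, p):
    """Normal approximation to P(X <= k) with a continuity correction.

    The +0.5 shifts the cutoff to the midpoint between k and k + 1, so the
    continuous curve covers the whole discrete bar at k.
    """
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return normal_cdf(k + 0.5, mu, sigma)


n, p, k = 100, 0.5, 55
print("exact: ", binomial_cdf(k, n, p))
print("approx:", binomial_cdf_normal_approx(k, n, p))
```

Python's arbitrary-precision integers make the exact sum painless at this size, which a calculator would not. For much larger n the naive sum can still run into floating-point range limits, which is where the approximation (or a more careful computation in log space) earns its keep.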
From what I can tell (and remember), elementary & high school math is specifically designed to take you from 0 to calculus. (well, maybe not ALL the way unless you take AP math)
Personally, I find basic stats and prob far more valuable in day to day life than calculus. So my point was just that I wish schools would focus on that area of math as the goal.
You are right in the sense you absolutely cannot get a deep understanding of Statistics without Calculus.
But with a mere background of high school algebra, you can learn more about Stats than most college graduates have, and that knowledge is far more relevant to the day-to-day lives of the average person in America than Calculus is.
You're talking college stats and calc. And even then Chicago and Stanford. It's like saying that Harvard's Math 55 should be the model for intro math courses in college.
And what Freedman's book does probably better than any other text in the field is teach how to think about statistics. It doesn't have a lot of formalisms, but if you can come to an understanding of what he teaches in that book, you'll have a rich understanding of stats.
With that said, if Calc were taught in the context of functions and probability, as in the Gemignani text, then I think we'd be better off than with how Calc today is focused around engineering.
In fact, this is the origin of the verb data-mining: to find whatever you need in data. Funny how it changed from a derogatory term to a respected --or at least well-paid-- practice.
You are over-simplifying and therefore trivializing what data-mining is. Data-mining is about deriving fact-based conclusions from complex information as an alternative to making decisions based on intuition or ignorance. Like almost anything complex, it can be done very poorly (as in the OP), or it can be done well. That doesn't mean that it originates in mis-representing information for the sake of 'finding whatever you need'.
This is a massive exaggeration. Of course you can find correlation over specific periods between random series. But when you're doing real analysis, the series you use aren't random (like the shape of a mountain). The idea is to draw an inference first, then see what the associated data says.
Of course, anyone beyond the base level of wisdom in this field understands this. It just annoys me that people attempt to diminish the value of statistics with an argument like this.
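On the narrower point that random series can correlate over specific periods: a quick simulation sketch in Python, with all names, lengths, and thresholds being my own illustrative assumptions. It compares pairs of independent random walks (trending series) with pairs of independent white-noise series and counts how often the sample correlation looks "strong".

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def white_noise(rng, length):
    """Independent Gaussian draws: no trend, no memory."""
    return [rng.gauss(0, 1) for _ in range(length)]


def random_walk(rng, length):
    """Cumulative sum of Gaussian steps: wanders, so it looks like a trend."""
    position, walk = 0.0, []
    for _ in range(length):
        position += rng.gauss(0, 1)
        walk.append(position)
    return walk


def fraction_strongly_correlated(make_series, trials=200, length=200,
                                 threshold=0.5, seed=0):
    """Share of independent pairs whose |r| exceeds the threshold."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        a = make_series(rng, length)
        b = make_series(rng, length)
        if abs(pearson(a, b)) > threshold:
            hits += 1
    return hits / trials


print("random walks:", fraction_strongly_correlated(random_walk))
print("white noise: ", fraction_strongly_correlated(white_noise))
```

None of the pairs is related by construction, so any strong correlation here is an artifact of the series wandering. That is consistent with both sides of this exchange: the effect is real for trending series, and it is a reason to form the hypothesis before looking at the data rather than a reason to distrust statistics.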
I don't think they are trying to diminish the value of statistics at all, but rather point out that it is easy to misunderstand or even deliberately abuse them.
This is more a warning to people without an understanding of statistics, because most people out there do not have a deep grasp of the fact that correlation does not imply causation.
Articles like this tend to elicit an interesting response from people. On one hand, many seem to believe that statistics = deceptive manipulation. On the other, many call for better statistics education in schools. Sometimes, both claims are made by the same individuals. So it seems that statistics education has at least two goals: explain what statistics actually is, and then explain how to do it correctly.