Uhhhh, Evan Miller here. Not sure why my name is in the submitted title, but whatever.
The current selection on that page is somewhat limited, but I hope to grow it over time. The stuff at the beginning is pretty basic (e.g. standard deviation), but things get pretty gnarly by the time you get to the Kiefer equation. At some point I'll add some more references on how to implement things, e.g. finding successive zeros of Bessel functions. For now it should be a good jumping-off point. Enjoy!
Well Evan Miller, I thank you for this anyway. I look forward to checking it out more soon.
I have been thinking about collecting some stories about how people have put their knowledge of stats to use when programming (performance testing, examining user patterns, etc.). What do you think? It would be a kind of applied-stats-for-programmers thing, I guess.
If you're looking for things to add big-picture-wise, it might be helpful to specify what assumptions go into various tests/methods. In my experience, this is the biggest hangup and mistake, because a) it's more difficult to understand and b) ignoring it gives the appearance of rigor even if the test used is inappropriate for the data.
I should emphasize that this is not a nitpick or even a criticism, just a feature I would love to see. It's also what I spend a large portion of my time trying to track down, so having it in a convenient location would be nice.
Just as a rule of thumb, in general the most important assumption that is violated is independence. If the data is not independent and is instead positively correlated (the most likely form of correlation in practice), most equations will overestimate statistical significance.
A small amount of correlation can lead to very biased estimates of significance.
Correlation between observations will generally appear in any time series data or data that is arranged spatially.
Again, just a rule of thumb, but you should be very wary of the lack of independence between observations.
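To make that rule of thumb concrete, here's a small simulation sketch (my own illustration, not from the linked page): feed a naive one-sample test, which assumes independence, some AR(1)-correlated data whose true mean really is zero, and watch the nominal 5% false-positive rate inflate.

```python
import random
import statistics

# Illustrative sketch: a test that assumes independence, applied to
# positively autocorrelated AR(1) data whose true mean is 0. The nominal
# 5% false-positive rate inflates badly.
random.seed(0)

def ar1_series(n, phi):
    """AR(1) process x[t] = phi*x[t-1] + noise, true mean 0."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + random.gauss(0, 1)
        out.append(x)
    return out

def rejects_null(data, crit=1.96):
    """Naive test of H0: mean = 0, treating observations as independent."""
    se = statistics.stdev(data) / len(data) ** 0.5
    return abs(statistics.mean(data) / se) > crit

trials = 2000
naive = sum(rejects_null(ar1_series(100, 0.5)) for _ in range(trials)) / trials
indep = sum(rejects_null(ar1_series(100, 0.0)) for _ in range(trials)) / trials
print(naive, indep)  # naive lands well above 0.05; indep stays near 0.05
```

With moderate positive autocorrelation (phi = 0.5) the variance of the sample mean is roughly tripled, so the "5%" test rejects a true null far more often than advertised.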
To add to this comment, strong correlation between (or among) predictor variables will often inflate estimates of standard error, leading to underestimates of statistical significance.
Great suggestion. I've been amazed to find out that many coders and amateur "data scientists" don't realize that testing the assumptions is an important part of conducting statistical analyses. Part of this may be due to the recent emphasis on machine learning techniques, which tend to be assumption-free (often just assuming independence of cases in the sample).
> machine learning techniques, which tend to be assumption-free
ML should be a rigorous exercise in Bayesian and classical/frequentist stats, computational methods, dataset integrity, visualization, etc., if you've been through the texts by Murphy or Bishop. It often happens that people a couple of years out of their last stats class only retain that a high R-squared and good p-, t-, and F-values are what they're looking for, and that heteroskedasticity and sphericity are just big words.
My evidence that ML is a rigorous exercise: the free texts listed (Barber's, MacKay's, and Smola's are excellent; ESL is not as accessible).
Thanks, @gtani, great resource. Yeah, I didn't mean to imply that ML techniques are free of ANY assumptions, just that several of the popular ones, like logistic regression, don't have distributional assumptions. (Actually, I really want to understand the VC inequality at some point, as it seems to allow us to draw conclusions about out-of-sample error rates without depending on distributional assumptions.)
For me formulas written in pseudocode are much easier to understand than classic mathematical notation.
For example, I've learned Bayesian classification, chi-square, etc. from Practical Common Lisp (http://www.gigamonkeys.com/book/practical-a-spam-filter.html) after failing to understand how to apply the formulas from Wikipedia. It was easier for me to learn Lisp than to decipher abstract declarative mathematical notation. (Admittedly I get a brain freeze whenever I see ∑, even though I know what it means. I prefer `for(…) acc += …`.)
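In that spirit, here's what the page's (N-1) standard-deviation formula looks like as a plain loop (Python rather than Lisp, but the same idea):

```python
from math import sqrt

# The sample standard deviation formula, written as a loop:
#   s = sqrt( (1/(N-1)) * sum over i of (x_i - mean)^2 )
def sample_std(xs):
    n = len(xs)
    mean = sum(xs) / n
    acc = 0.0
    for x in xs:            # the big sigma is just: for(...) acc += ...
        acc += (x - mean) ** 2
    return sqrt(acc / (n - 1))

print(sample_std([2, 4, 4, 4, 5, 5, 7, 9]))  # ≈ 2.138
```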
The first example, "unbiased standard deviation," is mislabeled. The estimator of the variance is unbiased, but the square root of an unbiased estimator is not itself unbiased. So it's not as nit-picky as it looks; it's either a brain fart or a hole in understanding (especially since the linked Wikipedia page discusses this issue). [1,2]
Not to be a dick, but getting the first example wrong like that doesn't inspire confidence in the rest of the post.
Seems more like a brain fart than an error in understanding, especially because he didn't highlight why unbiasedness was important in the first place. Having to explain the unbiased estimator for standard deviation would probably not have contributed much. It's more of a technicality than anything of philosophical importance: the square root of the unbiased estimator of sample variance is pretty accurate.
The usage of n-1 instead of n in the denominator was a question I was always asked when I TA'ed for intro statistics classes in grad school. An explanation of unbiasedness might be warranted if this is to be an introductory primer.
This was the first time I'd ever heard about biased/unbiased sample variances, all inspired by the original web page and seeing the (N-1) in the standard deviation formula. This inspired me to go off and read about Bessel's correction in wikipedia.
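For anyone else going down that rabbit hole, a quick simulation (my own illustrative sketch) makes Bessel's correction tangible:

```python
import random

# Illustrative sketch: draw many small samples from a population with known
# variance 1, then compare the divide-by-n variance estimator against the
# divide-by-(n-1) one. Only the latter averages out to the true value.
random.seed(1)

def var(xs, ddof):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

n, trials = 5, 20000
biased = unbiased = 0.0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    biased += var(xs, 0) / trials      # divide by n
    unbiased += var(xs, 1) / trials    # divide by n-1 (Bessel's correction)
print(biased, unbiased)  # ≈ 0.8 (i.e. (n-1)/n) versus ≈ 1.0
```

(As noted elsewhere in the thread, taking the square root of the unbiased variance still gives a slightly biased standard deviation; the correction only fixes the variance.)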
Thank you, Evan Miller! Your web page is great, it taught me new stuff from the very first formula. Much appreciated.
Hey Evan, from one statistics guy to another, thanks for fighting the good fight :).
The formulas might benefit from examples, especially with some of the more complicated cases (KS test and onwards). The important part of statistics comes from knowing _when_ to apply something, rather than _how_ to (that part is just math/numerical analysis).
A mention of the assumptions of each of these intervals would be good, too. Too often I see conclusions invalidated by using a probability model that doesn't make sense. This is a common failure I see with using a Wald interval for the slope of the regression line.
Agreed 100%! As the introduction says this is supposed to be a "cheat-sheet" and a jumping-off point for further discovery... I'd love to see (and write!) more posts on when to apply things.
Yankoff, you might want to be more specific. Intro statistics in general, or for computer scientists, or scientists, or looking to learn R at the same time? I liked Freedman, Pisani, and Purves [1], and have TA'ed using McClave, Sincich, and Mendenhall [2]. You may want something a little more advanced than these, but they are pretty good for intro level.
Yeah, I meant something for computer scientists. I'm going over coursera ML course currently and wanted to learn at least basics of statistics in parallel.
You will find the intro books don't talk much about parallel computing. Most of the data sets in intro books have no more than 30 observations. They are trying to teach classical methods more so than useful computational techniques. As for parallel statistics, I don't have a good book recommendation. Most of my knowledge on the topic comes from papers and vignettes from the R community, not books. Maybe check out one of those O'Reilly books about big data techniques?
I haven't seen this OpenIntro statistics before. I'll check it out!
Very intro (like, Stats 101 intro): Purves, Pisani, and Freedman's is the best book I've seen. I'd combine that with something like Tufte's or Wainer's statistical graphics books to try to get some sophistication (for lack of a better word).
The e-Handbook won't make you an expert statistician, but as an engineer needing to understand and apply statistical methods, I've found it to be a good starting point.
As a programmer and statistician, I should warn newcomers about the assumptions behind parametric tests like the t-test. The t-test, for example, assumes equal variances and normality, or you'll draw the wrong conclusions. These formulas and p-values shouldn't be used as substitutes for graphics (plot your data first, test later).
Nice, but a little knowledge is a dangerous thing. It is probably safer and more effective for non-statistician "data scientists" to use robust statistics: https://en.wikipedia.org/wiki/Robust_statistics
The Wikipedia article talks a lot about dealing with outliers. How can outliers be removed, replaced, or handled differently, as the article suggests? Aren't the outliers part of the data, after all? It seems like the goal of 'improving performance' here involves tweaking the data to get the results you want. What have I misunderstood?
The main difference between classical regression using ordinary least squares (OLS) and robust regression using iterative re-weighted least squares (IRLS) is that with OLS, all observations are given equal weight and with IRLS, observations may or may not be given equal weight. Essentially, IRLS gives outliers and/or influential [1] data points less weight, which may improve the performance of the overall model since these outliers/influential data would otherwise cause assumption violations using classical regression. If there are no outliers, then results from robust and classical regression converge.
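For the curious, here's a bare-bones sketch of that re-weighting loop for simple linear regression with Huber weights (my own toy illustration; R's MASS::rlm is the serious implementation, and the 1.345 tuning constant and MAD-style scale estimate below are just the conventional defaults):

```python
# Toy IRLS sketch for simple linear regression with Huber weights.
def weighted_fit(x, y, w):
    """Weighted least-squares slope and intercept."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = num / den
    return slope, my - slope * mx

def huber_irls(x, y, k=1.345, iters=50):
    w = [1.0] * len(x)                 # first pass is plain OLS
    for _ in range(iters):
        slope, intercept = weighted_fit(x, y, w)
        resid = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
        # crude MAD-type scale estimate (guarded against a zero scale)
        scale = sorted(abs(r) for r in resid)[len(resid) // 2] / 0.6745 or 1.0
        # Huber weights: full weight within k scale units, downweighted beyond
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
    return slope, intercept

# y = 2x + 1 with one gross outlier: the robust fit stays near (2, 1),
# while plain OLS gets dragged toward the outlier.
xs = list(range(10))
ys = [2 * xi + 1 for xi in xs]
ys[9] = 100
print(huber_irls(xs, ys))
```

The outlier's weight shrinks on every pass as the fit (and hence the residual scale) tightens around the other nine points, which is exactly the "may or may not be given equal weight" behavior described above.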
I would disagree with SagelyGuru in recommending robust regression for non-statisticians, though I can see where he or she is coming from. With robust regression, you don't have to worry as much about assumptions as with classical regression. But with robust regression, you need to be aware that the underlying analytical method is different and understand what that means. For example, the standard robust regression implementation in R (i.e., the rlm function in the MASS package) doesn't produce t-statistics or p-values. There are also warnings that, especially at lower sample sizes, the standard errors produced by rlm may be unreliable. One recommended way to obtain those p-values would be to get bootstrapped standard error estimates, so that the normal-theory approximation applies.
[1] There are different types of robust estimators (e.g., M, S, MM, etc.) that have different robustness properties.
A nice resource, however I'd expected to see some actual programming on this page. Perhaps some sample code in a few languages would be nice. You could put tabs on a code box to switch between a python and PHP implementation for example.
I think if trying to explain things like these to programmers, maybe you should consider using actual code (or even pseudo code) for this? It would have an added benefit of not requiring to know math-english (for non-native english speakers like me) or any advanced concepts of math at all.
Great effort, and I certainly hope more coders will get into statistics (most I know are only interested in machine learning). However, I think your definition of 1.3 "Confidence Interval around the Mean" could be improved. You state:
"A confidence interval reflects the set of statistical hypotheses that won't be rejected at a given significance level. So the confidence interval around the mean reflects all possible values of the mean that can't be rejected by the data."
That seems a bit vague and perhaps confusing. Might I suggest something more like this:
"The confidence interval specifies a range (+/- a multiple of the above standard error [SE]) around our estimate of the mean (x-bar) such that: if we repeated our sampling process an infinite number of times (i.e. with the same sample size and forming a new x-bar and SE each time [and therefore, a new confidence interval]), Confidence_Level% of those intervals would contain the population (true) mean."
In addition, I think in this case, at least, there are no assumptions about the data to worry about, given a moderately large sample size, thanks to the Central Limit Theorem (I'm confident about that in the case of the mean (x-bar), but I'll leave it to others to correct me if I'm wrong about this applying to the standard error (SE)).
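That repeated-sampling reading is easy to check by simulation (a sketch, using the 1.96 z-multiplier for a 95% interval):

```python
import random
import statistics

# Sketch of the repeated-sampling interpretation: draw many samples from a
# population with known mean, form xbar +/- 1.96*SE each time, and count how
# often the interval covers the true mean. It should be close to 95%.
random.seed(2)
true_mean, n, trials = 10.0, 50, 5000
covered = 0
for _ in range(trials):
    xs = [random.gauss(true_mean, 3) for _ in range(n)]
    xbar = statistics.mean(xs)
    se = statistics.stdev(xs) / n ** 0.5
    if xbar - 1.96 * se <= true_mean <= xbar + 1.96 * se:
        covered += 1
print(covered / trials)  # ≈ 0.95 (slightly below, since 1.96 is the z value)
```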
I feel this would be quite confusing for an average programmer. It is more like a cheat sheet for people who have some statistical training but always have to look the formulas up because they don't use them frequently enough. For an average programmer, really understanding how linear regression works and some basic linear algebra would be a good start. A lot of programmers have trouble even with these "simple" topics.
Most of these formulas are very rarely used even by quantitative analysts. The most used are for standard deviation and regression. The more complicated ones are generally used as a part of statistical routines, say, in R. It is very rare that someone has to code them.
> From a statistical point of view, 5 events is indistinguishable from 7 events.
What is this supposed to mean? There is a concept of statistical significance, but if an effect is not statistically significant, it does not follow that it does not exist. By the way, where is the Bayes formula? :)
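One way to unpack that sentence (my guess at the intended meaning): if two comparable groups produce 5 and 7 events respectively, an exact conditional test for two Poisson rates treats each of the 12 events as landing in either group with probability 1/2, and the resulting p-value is nowhere near significance:

```python
from math import comb

# Sketch: exact conditional test comparing two Poisson counts of 5 and 7.
# Given 12 events total, under H0 each event falls in either group with
# probability 1/2, so the count in one group is Binomial(12, 1/2).
n_events = 12
pmf = lambda k: comb(n_events, k) / 2 ** n_events
observed = 5
# two-sided p-value: total probability of outcomes at least as extreme
p_value = sum(pmf(k) for k in range(n_events + 1) if pmf(k) <= pmf(observed))
print(p_value)  # ≈ 0.774, nowhere near any conventional significance level
```

So "indistinguishable" here plausibly means "no test could tell these counts apart," not "the difference cannot exist."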
This is a good motivation, but it's plunging right into formulas, which is wrong-headed. Even the mean can be completely meaningless without eyeballing a plot of the data. Given that this is aimed at non-statisticians, it's critical that these points are made in Big Flashing Letters before we start handing out formulas that make people act like they have "superpowers".
The first superpower is to look at the data and see if it makes sense using your eyes and your brain, not to start spewing confidence intervals.
As other posters have pointed out, it is even more irresponsible to start waving around things like the t-test without discussing the parametric assumptions that these things depend on for their validity.
Great start on an important topic. Quick extra info: for drawing a trend line, it is often useful to have the intercept as well. Using y = mx + b line notation, the best-fit intercept is \hat{b} = \bar{y} - m * \bar{x} [1]
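Putting the usual least-squares slope together with that intercept in code (a sketch):

```python
# Least-squares slope plus the intercept from the parent comment:
#   b_hat = ybar - m * xbar
def fit_line(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    m = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    b = ybar - m * xbar
    return m, b

print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # recovers y = 2x + 1
```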
Be great to see some pictures to illustrate the formulas and some mention of robust statistics as I find outliers to be a huge issue in application of statistical techniques.
I've wondered for a long time if there's a way to condense certain statistical (and probability) information into a single How-Much-Should-I-Care number. Can someone shed some light on this?
To pick a couple examples from health news in the popular press:
NB: I'm making up all the numbers here for the sake of example.
(1) A study shows that people who consume more than 10g of added salt a day live shorter lives.
But how much shorter? If it's 30 minutes, I don't care about the study and I'm not going to change my behavior. If it's 6 months, then I'm interested and might very well do something.
(2) A study shows that people who drink 2 or more cups of coffee a day have lower risk of Alzheimer's Disease.
But how much lower risk? If the average lifetime risk is 1 in 50, and drinking coffee lowers it to 1 in 50.003, then I don't want to waste time even reading the article. If it lowers it to 1 in 1000, then yes, I might change my behavior.
So, in the above examples, is there any way to reduce the information into a single How-Much-Should-I-Care number?
Like this:
(1) A study shows that people who consume more than 10g of added salt a day have an ____x____ factor shorter life.
(2) A study shows that people who drink 2 or more cups of coffee a day have a ____y____ factor lower risk of Alzheimer's Disease.
Then, by looking at x and y, I can tell at a glance whether some result is irrelevant, trivial, useful, or groundbreaking. I understand that it'll still be subjective in the end -- like whether $1, $10, $1000, or $10,000,000 seems like a lot of money to an individual -- but at least it'll be one number.
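For what it's worth, epidemiologists get partway there with the absolute risk reduction and the "number needed to treat" (NNT). A sketch with made-up numbers, in the spirit of the examples above:

```python
# Sketch with made-up numbers: a relative effect only becomes interpretable
# once combined with the baseline risk. NNT = 1 / absolute risk reduction.
def absolute_risk_reduction(baseline_risk, relative_risk):
    return baseline_risk * (1 - relative_risk)

# "lowers risk by 20%" (relative risk 0.8) at two different baselines:
for baseline in (1 / 50, 1 / 5000):
    arr = absolute_risk_reduction(baseline, 0.8)
    print(f"baseline {baseline:.4%}: absolute reduction {arr:.4%}, "
          f"NNT about {1 / arr:.0f}")
```

The same relative "20% lower risk" headline can mean one averted case per 250 people or one per 25,000, which is close to the single How-Much-Should-I-Care number being asked for.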
Just a factor is often not enough. If a study showed that cell phone usage increased the risk of a certain cancer by 40%, that might still not be interesting if it's an extremely rare cancer and they found 7 cases instead of 5.
Or it doesn't show whether various confounding factors matter, or whether this is a statistical paradox. Did you know that babies of smokers are healthier than babies of non-smokers of the same weight? This is because a baby born to a smoker will have decreased weight because of the smoking, whereas if a baby of a non-smoker is underweight, it will be for other reasons that are often worse.
There are heaps of these kinds of paradoxes and pitfalls that need to be taken into account.
What we need in newspapers and other media is a simplified abstract of the paper and an explanation or endorsement from a real-life statistician with no relation to the study.
There are many other ways to accomplish what you're talking about. The biggest problem with your made-up examples is that they are just cases of "bad reporting."
Ugh, recipe-book statistics! This might be useful, but not to programmers. No one should be applying any of the more complex formulas on this page without a good basic grounding in the theory, which is something nearly all programmers will lack (and fair enough). The best place for programmers to start is understanding the basics of likelihood functions, maximum likelihood estimation, Bayes' rule, and likelihood ratio tests. Then look at e.g. the derivation of the chi-squared test statistic as an approximation to a likelihood ratio test of multinomial distributions to get a sense of where all the mysterious formulas of classical statistics come from.
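To illustrate that last point with a toy example (my sketch, with arbitrary made-up counts): on multinomial data, Pearson's X^2 is a second-order approximation to the likelihood-ratio statistic G, so the two nearly agree:

```python
from math import log

# Toy illustration with arbitrary counts: the likelihood-ratio (G) statistic
# and Pearson's X^2 for a multinomial goodness-of-fit test nearly agree,
# because X^2 is a second-order approximation to G around the null.
observed = [48, 35, 17]
expected = [50.0, 30.0, 20.0]   # counts implied by the null hypothesis

g_stat = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(g_stat, pearson)  # the two values are within a few percent of each other
```

Both are compared against the same chi-squared reference distribution, which is where the "mysterious" X^2 formula actually comes from.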
Maybe this solidifies the fact that I'm a programmer and not a statistician, but I got lost after the 2nd hyperbole in the beginning...but I like a challenge, I may have to read it multiple times until your presentation sinks in though :)
Great blog! I wish there was cookbook-style code with each formula, in C or even JavaScript, for those of us who have forgotten a lot of math but can think in code.