Uhhhh, Evan Miller here. Not sure why my name is in the submitted title, but whatever.
The current selection on that page is somewhat limited, but I hope to grow it over time. The stuff at the beginning is pretty basic (e.g. standard deviation), but things get pretty gnarly by the time you get to the Kiefer equation. At some point I'll add some more references on how to implement things, e.g. finding successive zeros of Bessel functions. For now it should be a good jumping-off point. Enjoy!
Well Evan Miller, I thank you for this anyway. I look forward to checking it out more soon.
I have been thinking about collecting some stories about how people have put their knowledge of stats to use when programming (performance testing, examining user patterns, etc.). What do you think? It would be a kind of applied-stats-for-programmers thing, I guess.
If you're looking for things to add big-picture-wise, it might be helpful to specify what assumptions go into various tests/methods. In my experience, this is the biggest hangup and mistake, because a) it's more difficult to understand and b) ignoring it gives the appearance of rigor even if the test used is inappropriate for the data.
I should emphasize that this is not a nitpick or even a criticism, just a feature I would love to see. It's also what I spend a large portion of my time trying to track down, so having it in a convenient location would be nice.
Just as a rule of thumb, in general the most important assumption that is violated is independence. If the data is not independent and is instead positively correlated (the most likely form of correlation in practice), most equations will overestimate statistical significance.
A small amount of correlation can lead to very biased estimates of significance.
Correlation between observations will generally appear in any time series data or data that is arranged spatially.
Again, just a rule of thumb, but you should be very wary of the lack of independence between observations.
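To make that rule of thumb concrete, here's a small simulation sketch (my own illustration, not from the linked page): feed a naive one-sample test, which assumes independence, some AR(1)-correlated data whose true mean really is zero, and watch the nominal 5% false-positive rate inflate.

```python
import random
import statistics

# Illustrative sketch: a test that assumes independence, applied to
# positively autocorrelated AR(1) data whose true mean is 0. The nominal
# 5% false-positive rate inflates badly.
random.seed(0)

def ar1_series(n, phi):
    """AR(1) process x[t] = phi*x[t-1] + noise, true mean 0."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + random.gauss(0, 1)
        out.append(x)
    return out

def rejects_null(data, crit=1.96):
    """Naive test of H0: mean = 0, treating observations as independent."""
    se = statistics.stdev(data) / len(data) ** 0.5
    return abs(statistics.mean(data) / se) > crit

trials = 2000
naive = sum(rejects_null(ar1_series(100, 0.5)) for _ in range(trials)) / trials
indep = sum(rejects_null(ar1_series(100, 0.0)) for _ in range(trials)) / trials
print(naive, indep)  # naive lands well above 0.05; indep stays near 0.05
```

With moderate positive autocorrelation (phi = 0.5) the variance of the sample mean is roughly tripled, so the "5%" test rejects a true null far more often than advertised.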
To add to this comment, strong correlation between (or among) predictor variables will often inflate estimates of standard error, leading to underestimates of statistical significance.
Great suggestion. I've been amazed to find out that many coders and amateur "data scientists" don't realize that testing the assumptions is an important part of conducting statistical analyses. Part of this may be due to the recent emphasis on machine learning techniques, which tend to be assumption-free (often just assuming independence of cases in the sample).
> machine learning techniques, which tend to be assumption-free
ML should be a rigorous exercise in Bayesian and classical/frequentist stats, computational methods, dataset integrity, visualization, etc., if you've been through the texts by Murphy or Bishop. It often happens that people a couple of years out of their last stats class only retain that a high R-squared and good p-, t-, and F-values are what they're looking for, and that heteroskedasticity and sphericity are just big words.
My evidence that ML is a rigorous exercise: the free texts listed (Barber's, MacKay's, and Smola's are excellent; ESL is not as accessible).
Thanks, @gtani, great resource. Yeah, I didn't mean to imply that ML techniques are free of ANY assumptions, just that several of the popular ones, like logistic regression, don't have distributional assumptions. (Actually, I really want to understand the VC inequality at some point, as it seems to allow us to draw conclusions about out-of-sample error rates without depending on distributional assumptions.)
For me formulas written in pseudocode are much easier to understand than classic mathematical notation.
For example, I've learned Bayesian classification, chi-square, etc. from Practical Common Lisp (http://www.gigamonkeys.com/book/practical-a-spam-filter.html) after failing to understand how to apply the formulas from Wikipedia. It was easier for me to learn Lisp than to decipher abstract declarative mathematical notation. (Admittedly I get a brain freeze whenever I see ∑, even though I know what it means. I prefer `for(…) acc += …`.)
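In that spirit, here's what the page's (N-1) standard-deviation formula looks like as a plain loop (Python rather than Lisp, but the same idea):

```python
from math import sqrt

# The sample standard deviation formula, written as a loop:
#   s = sqrt( (1/(N-1)) * sum over i of (x_i - mean)^2 )
def sample_std(xs):
    n = len(xs)
    mean = sum(xs) / n
    acc = 0.0
    for x in xs:            # the big sigma is just: for(...) acc += ...
        acc += (x - mean) ** 2
    return sqrt(acc / (n - 1))

print(sample_std([2, 4, 4, 4, 5, 5, 7, 9]))  # ≈ 2.138
```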
The first example, "unbiased standard deviation," is mislabeled. The estimator of the variance is unbiased, but the square root of an unbiased estimator is not itself unbiased. So it's not as nit-picky as it looks; it's either a brain fart or a hole in understanding (especially since the linked Wikipedia page discusses this issue). [1,2]
Not to be a dick, but getting the first example wrong like that doesn't inspire confidence in the rest of the post.
Seems more like a brain fart than an error in understanding, especially because he didn't highlight why unbiasedness was important in the first place. Having to explain the unbiased estimator for standard deviation would probably not have contributed much. It's more of a technicality than anything of philosophical importance: the square root of the unbiased estimator of sample variance is pretty accurate.
The usage of n-1 instead of n in the denominator was a question I was always asked when I TA'ed for intro statistics classes in grad school. An explanation of unbiasedness might be warranted if this is to be an introductory primer.
This was the first time I'd ever heard about biased/unbiased sample variances, all inspired by the original web page and seeing the (N-1) in the standard deviation formula. This inspired me to go off and read about Bessel's correction in wikipedia.
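For anyone else going down that rabbit hole, a quick simulation (my own illustrative sketch) makes Bessel's correction tangible:

```python
import random

# Illustrative sketch: draw many small samples from a population with known
# variance 1, then compare the divide-by-n variance estimator against the
# divide-by-(n-1) one. Only the latter averages out to the true value.
random.seed(1)

def var(xs, ddof):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

n, trials = 5, 20000
biased = unbiased = 0.0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    biased += var(xs, 0) / trials      # divide by n
    unbiased += var(xs, 1) / trials    # divide by n-1 (Bessel's correction)
print(biased, unbiased)  # ≈ 0.8 (i.e. (n-1)/n) versus ≈ 1.0
```

(As noted elsewhere in the thread, taking the square root of the unbiased variance still gives a slightly biased standard deviation; the correction only fixes the variance.)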
Thank you, Evan Miller! Your web page is great, it taught me new stuff from the very first formula. Much appreciated.
Hey Evan, from one statistics guy to another, thanks for fighting the good fight :).
The formulas might benefit from examples, especially with some of the more complicated cases (KS test and onwards). The important part of statistics comes from knowing _when_ to apply something, rather than _how_ to (that part is just math/numerical analysis).
A mention of the assumptions of each of these intervals would be good, too. Too often I see conclusions invalidated by using a probability model that doesn't make sense. This is a common failure I see with using a Wald interval for the slope of the regression line.
Agreed 100%! As the introduction says this is supposed to be a "cheat-sheet" and a jumping-off point for further discovery... I'd love to see (and write!) more posts on when to apply things.
Yankoff, you might want to be more specific. Intro statistics in general, or for computer scientists, or scientists, or looking to learn R at the same time? I liked Freedman, Pisani, and Purves [1], and have TA'ed using McClave, Sincich, and Mendenhall [2]. You may want something a little more advanced than these, but they are pretty good for intro level.
Yeah, I meant something for computer scientists. I'm going over coursera ML course currently and wanted to learn at least basics of statistics in parallel.
You will find the intro books don't talk much about parallel computing. Most of the data sets in intro books have no more than 30 observations. They are trying to teach classical methods more so than useful computational techniques. As for parallel statistics, I don't have a good book recommendation. Most of my knowledge on the topic comes from papers and vignettes from the R community, not books. Maybe check out one of those O'Reilly books about big data techniques?
I haven't seen this OpenIntro statistics before. I'll check it out!
Very intro (like, Stats 101 intro): Purves, Pisani, and Freedman's is the best book I've seen. I'd combine that with something like Tufte's or Wainer's statistical graphics books to try to get some sophistication (for lack of a better word).
The e-Handbook won't make you an expert statistician, but as an engineer needing to understand and apply statistical methods, I've found it to be a good starting point.
As a programmer and statistician, I should warn newcomers about the assumptions behind parametric tests like the t-test. The t-test, for example, assumes equal variances and normality, or you'll draw the wrong conclusions. These formulas and p-values shouldn't be used as substitutes for graphics (plot your data first, test later).
Nice, but a little knowledge is a dangerous thing. It is probably safer and more effective for non-statistician "data scientists" to use robust statistics: https://en.wikipedia.org/wiki/Robust_statistics
The Wikipedia article talks a lot about dealing with outliers. How can outliers be removed, replaced, or handled differently, as the article suggests? Aren't the outliers part of the data, after all? It seems like the goal of 'improving performance' here involves tweaking the data to get the results you want. What have I misunderstood?
The main difference between classical regression using ordinary least squares (OLS) and robust regression using iterative re-weighted least squares (IRLS) is that with OLS, all observations are given equal weight and with IRLS, observations may or may not be given equal weight. Essentially, IRLS gives outliers and/or influential [1] data points less weight, which may improve the performance of the overall model since these outliers/influential data would otherwise cause assumption violations using classical regression. If there are no outliers, then results from robust and classical regression converge.
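For the curious, here's a bare-bones sketch of that re-weighting loop for simple linear regression with Huber weights (my own toy illustration; R's MASS::rlm is the serious implementation, and the 1.345 tuning constant and MAD-style scale estimate below are just the conventional defaults):

```python
# Toy IRLS sketch for simple linear regression with Huber weights.
def weighted_fit(x, y, w):
    """Weighted least-squares slope and intercept."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = num / den
    return slope, my - slope * mx

def huber_irls(x, y, k=1.345, iters=50):
    w = [1.0] * len(x)                 # first pass is plain OLS
    for _ in range(iters):
        slope, intercept = weighted_fit(x, y, w)
        resid = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
        # crude MAD-type scale estimate (guarded against a zero scale)
        scale = sorted(abs(r) for r in resid)[len(resid) // 2] / 0.6745 or 1.0
        # Huber weights: full weight within k scale units, downweighted beyond
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
    return slope, intercept

# y = 2x + 1 with one gross outlier: the robust fit stays near (2, 1),
# while plain OLS gets dragged toward the outlier.
xs = list(range(10))
ys = [2 * xi + 1 for xi in xs]
ys[9] = 100
print(huber_irls(xs, ys))
```

The outlier's weight shrinks on every pass as the fit (and hence the residual scale) tightens around the other nine points, which is exactly the "may or may not be given equal weight" behavior described above.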
I would disagree with SagelyGuru in recommending robust regression for non-statisticians, though I can see where he or she is coming from. With robust regression, you don't have to worry as much about assumptions as with classical regression. But with robust regression, you need to be aware that the underlying analytical method is different and understand what that means. For example, the standard robust regression implementation in R (i.e., the rlm function in the MASS package) doesn't produce t-statistics or p-values. There are also warnings that, especially at lower sample sizes, the standard errors produced by rlm may be unreliable. One recommended way to obtain those p-values would be to get bootstrapped standard error estimates, so that the normal-theory approximation applies.
[1] There are different types of robust estimators (e.g., M, S, MM, etc.) that have different robustness properties.
A nice resource, however I'd expected to see some actual programming on this page. Perhaps some sample code in a few languages would be nice. You could put tabs on a code box to switch between a python and PHP implementation for example.
I think if trying to explain things like these to programmers, maybe you should consider using actual code (or even pseudo code) for this? It would have an added benefit of not requiring to know math-english (for non-native english speakers like me) or any advanced concepts of math at all.
Great effort, and I certainly hope more coders will get into statistics (most I know are only interested in machine learning). However, I think your definition of 1.3 "Confidence Interval around the Mean" could be improved. You state:
"A confidence interval reflects the set of statistical hypotheses that won't be rejected at a given significance level. So the confidence interval around the mean reflects all possible values of the mean that can't be rejected by the data."
That seems a bit vague and perhaps confusing. Might I suggest something more like this:
"The confidence interval specifies a range (+/- a multiple of the above standard error [SE]) around our estimate of the mean (x-bar) such that: if we repeated our sampling process an infinite number of times (i.e. with the same sample size and forming a new x-bar and SE each time [and therefore, a new confidence interval]), Confidence_Level% of those intervals would contain the population (true) mean."
In addition, I think in this case, at least, there are no assumptions about the data to worry about, given a moderately large sample size, thanks to the Central Limit Theorem (I'm confident about that in the case of the mean (x-bar), but I'll leave it to others to correct me if I'm wrong about this applying to the standard error (SE)).
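That repeated-sampling reading is easy to check by simulation (a sketch, using the 1.96 z-multiplier for a 95% interval):

```python
import random
import statistics

# Sketch of the repeated-sampling interpretation: draw many samples from a
# population with known mean, form xbar +/- 1.96*SE each time, and count how
# often the interval covers the true mean. It should be close to 95%.
random.seed(2)
true_mean, n, trials = 10.0, 50, 5000
covered = 0
for _ in range(trials):
    xs = [random.gauss(true_mean, 3) for _ in range(n)]
    xbar = statistics.mean(xs)
    se = statistics.stdev(xs) / n ** 0.5
    if xbar - 1.96 * se <= true_mean <= xbar + 1.96 * se:
        covered += 1
print(covered / trials)  # ≈ 0.95 (slightly below, since 1.96 is the z value)
```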
I feel this would be quite confusing for an average programmer. It is more like a cheat sheet for people who have some statistical training but always have to look the formulas up because they don't use them frequently enough. For an average programmer, really understanding how linear regression works and some basic linear algebra would be a good start. A lot of programmers have trouble even with these "simple" topics.
Most of these formulas are very rarely used even by quantitative analysts. The most used are for standard deviation and regression. The more complicated ones are generally used as a part of statistical routines, say, in R. It is very rare that someone has to code them.
> From a statistical point of view, 5 events is indistinguishable from 7 events.
What is this supposed to mean? There is a concept of statistical significance, but if an effect is not statistically significant, it does not follow that it does not exist. By the way, where is the Bayes formula? :)
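One way to unpack that sentence (my guess at the intended meaning): if two comparable groups produce 5 and 7 events respectively, an exact conditional test for two Poisson rates treats each of the 12 events as landing in either group with probability 1/2, and the resulting p-value is nowhere near significance:

```python
from math import comb

# Sketch: exact conditional test comparing two Poisson counts of 5 and 7.
# Given 12 events total, under H0 each event falls in either group with
# probability 1/2, so the count in one group is Binomial(12, 1/2).
n_events = 12
pmf = lambda k: comb(n_events, k) / 2 ** n_events
observed = 5
# two-sided p-value: total probability of outcomes at least as extreme
p_value = sum(pmf(k) for k in range(n_events + 1) if pmf(k) <= pmf(observed))
print(p_value)  # ≈ 0.774, nowhere near any conventional significance level
```

So "indistinguishable" here plausibly means "no test could tell these counts apart," not "the difference cannot exist."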
This is a good motivation, but it's plunging right into formulas, which is wrong-headed. Even the mean can be completely meaningless without eyeballing a plot of the data. Given that this is aimed at non-statisticians, it's critical that these points are made in Big Flashing Letters before we start handing out formulas that make people act like they have "superpowers".
The first superpower is to look at the data and see if it makes sense using your eyes and your brain, not to start spewing confidence intervals.
As other posters have pointed out, it is even more irresponsible to start waving around things like the t-test without discussing the parametric assumptions that these things depend on for their validity.
Great start on an important topic. Quick extra info: for drawing a trend line, it is often useful to have the intercept as well. Using y = mx + b line notation, the best-fit intercept is \hat{b} = \bar{y} - m * \bar{x} [1]
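Putting the usual least-squares slope together with that intercept in code (a sketch):

```python
# Least-squares slope plus the intercept from the parent comment:
#   b_hat = ybar - m * xbar
def fit_line(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    m = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    b = ybar - m * xbar
    return m, b

print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # recovers y = 2x + 1
```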
Be great to see some pictures to illustrate the formulas and some mention of robust statistics as I find outliers to be a huge issue in application of statistical techniques.
I've wondered for a long time if there's a way to condense certain statistical (and probability) information into a single How-Much-Should-I-Care number. Can someone shed some light on this?
To pick a couple examples from health news in the popular press:
NB: I'm making up all the numbers here for the sake of example.
(1) A study shows that people who consume more than 10g of added salt a day live shorter lives.
But how much shorter? If it's 30 minutes, I don't care about the study and I'm not going to change my behavior. If it's 6 months, then I'm interested and might very well do something.
(2) A study shows that people who drink 2 or more cups of coffee a day have lower risk of Alzheimer's Disease.
But how much lower risk? If the average lifetime risk is 1 in 50, and drinking coffee lowers it to 1 in 50.003, then I don't want to waste time even reading the article. If it lowers it to 1 in 1000, then yes, I might change my behavior.
So, in the above examples, is there any way to reduce the information into a single How-Much-Should-I-Care number?
Like this:
(1) A study shows that people who consume more than 10g of added salt a day have an ____x____ factor shorter life.
(2) A study shows that people who drink 2 or more cups of coffee a day have a ____y____ factor lower risk of Alzheimer's Disease.
Then, by looking at x and y, I can tell at a glance whether some result is irrelevant, trivial, useful, or groundbreaking. I understand that it'll still be subjective in the end -- like whether $1, $10, $1000, or $10,000,000 seems like a lot of money to an individual -- but at least it'll be one number.
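For what it's worth, epidemiologists get partway there with the absolute risk reduction and the "number needed to treat" (NNT). A sketch with made-up numbers, in the spirit of the examples above:

```python
# Sketch with made-up numbers: a relative effect only becomes interpretable
# once combined with the baseline risk. NNT = 1 / absolute risk reduction.
def absolute_risk_reduction(baseline_risk, relative_risk):
    return baseline_risk * (1 - relative_risk)

# "lowers risk by 20%" (relative risk 0.8) at two different baselines:
for baseline in (1 / 50, 1 / 5000):
    arr = absolute_risk_reduction(baseline, 0.8)
    print(f"baseline {baseline:.4%}: absolute reduction {arr:.4%}, "
          f"NNT about {1 / arr:.0f}")
```

The same relative "20% lower risk" headline can mean one averted case per 250 people or one per 25,000, which is close to the single How-Much-Should-I-Care number being asked for.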
Just a factor is often not enough. If a study showed that cell phone usage increased the risk of a certain cancer by 40%, that might still not be interesting if it's an extremely rare cancer and they found 7 cases instead of 5.
Or it doesn't show whether various confounding factors matter, or whether this is a statistical paradox. Did you know that babies of smokers are healthier than babies of non-smokers of the same weight? This is because a baby born to a smoker will have decreased weight because of the smoking, whereas if a baby of a non-smoker is underweight, it will be for other reasons that are often worse.
There are heaps of these kinds of paradoxes and pitfalls that need to be taken into account.
What we need in newspapers and other media is a simplified abstract of the paper and an explanation or endorsement from a real-life statistician with no relation to the study.
There are many other ways to accomplish what you're talking about. The biggest problem with your made-up examples is that they are just cases of "bad reporting."
Ugh, recipe-book statistics! This might be useful, but not to programmers. No one should be applying any of the more complex formulas on this page without a good basic grounding in the theory, which is something nearly all programmers will lack (and fair enough). The best place for programmers to start is understanding the basics of likelihood functions, maximum likelihood estimation, Bayes' rule, and likelihood ratio tests. Then look at e.g. the derivation of the chi-squared test statistic as an approximation to a likelihood ratio test of multinomial distributions to get a sense of where all the mysterious formulas of classical statistics come from.
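To illustrate that last point with a toy example (my sketch, with arbitrary made-up counts): on multinomial data, Pearson's X^2 is a second-order approximation to the likelihood-ratio statistic G, so the two nearly agree:

```python
from math import log

# Toy illustration with arbitrary counts: the likelihood-ratio (G) statistic
# and Pearson's X^2 for a multinomial goodness-of-fit test nearly agree,
# because X^2 is a second-order approximation to G around the null.
observed = [48, 35, 17]
expected = [50.0, 30.0, 20.0]   # counts implied by the null hypothesis

g_stat = 2 * sum(o * log(o / e) for o, e in zip(observed, expected))
pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(g_stat, pearson)  # the two values are within a few percent of each other
```

Both are compared against the same chi-squared reference distribution, which is where the "mysterious" X^2 formula actually comes from.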
Maybe this solidifies the fact that I'm a programmer and not a statistician, but I got lost after the 2nd hyperbole in the beginning...but I like a challenge, I may have to read it multiple times until your presentation sinks in though :)
Great blog! I wish there was cookbook-style code with each formula, in C or even JavaScript, for those of us who have forgotten a lot of math but can think in code.