Hacker News
Statistical Formulas For Programmers (evanmiller.org)
395 points by bajames on May 20, 2013 | 64 comments



Uhhhh, Evan Miller here. Not sure why my name is in the submitted title, but whatever.

The current selection on that page is somewhat limited, but I hope to grow it over time. The stuff at the beginning is pretty basic (e.g. standard deviation), but things get pretty gnarly by the time you get to the Kiefer equation. At some point I'll add some more references on how to implement things, e.g. finding successive zeros of Bessel functions. For now it should be a good jumping-off point. Enjoy!


Well Evan Miller, I thank you for this anyway. I look forward to checking it out more soon.

I have been thinking about collecting some stories about how people have put their knowledge of stats to use when programming (performance testing, examining user patterns, etc.). What do you think? It would be a kind of applied-stats-for-programmers thing, I guess.


HN displays the URL's domain, which happens to be your name.


The title originally had the name in it too, but it seems that a moderator changed it.


If you're looking for things to add big-picture-wise, it might be helpful to specify what assumptions go into various tests/methods. In my experience, this is the biggest hangup and mistake, because a) it's more difficult to understand and b) ignoring it gives the appearance of rigor even if the test used is inappropriate for the data.

I should emphasize that this is not a nitpick or even a criticism, just a feature I would love to see. It's also what I spend a large portion of my time trying to track down, so having it in a convenient location would be nice.


This guy has written up many of the most common statistical tests, their interpretations, and their assumptions in a very human-readable way:

http://udel.edu/~mcdonald/statintro.html

I point a lot of newbs to pages on that site so that they can develop a better intuition for the methods.


This is an amazing resource, thank you.

I wish I had this last semester during my statistics course.


If you're looking for more of this, but in greater depth, pick up a copy of Biometry.

It's written for biologists, but you don't really need to know much biology to work through the examples, and the focus is inherently practical.


Just as a rule of thumb: in general, the most important assumption that gets violated is independence. If the data is not independent and is instead positively correlated (the most likely form of correlation in practice), most equations will overestimate statistical significance.

A small amount of correlation can lead to very biased estimates of significance.

Correlation between observations will generally appear in any time series data or data that is arranged spatially.

Again, just a rule of thumb, but you should be very wary of the lack of independence between observations.
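A seeded simulation (mine, not from the article) makes the rule of thumb concrete: a naive z-test applied to positively autocorrelated AR(1) data with a true mean of zero rejects the null far more often than the nominal 5%.

```python
# Sketch: positive autocorrelation makes a naive significance test reject
# a true null hypothesis far more often than the nominal 5% level.
import math
import random
import statistics

def rejects_null(xs, crit=1.96):
    """Naive z-test of H0: mean == 0, which assumes independent observations."""
    se = statistics.stdev(xs) / math.sqrt(len(xs))
    return abs(statistics.fmean(xs) / se) > crit

def false_positive_rate(rho, n=50, trials=2000):
    """How often the test rejects when the true mean really is 0."""
    rng = random.Random(0)
    count = 0
    for _ in range(trials):
        x, xs = 0.0, []
        for _ in range(n):
            x = rho * x + rng.gauss(0, 1)   # AR(1): correlated when rho > 0
            xs.append(x)
        count += rejects_null(xs)
    return count / trials

iid_rate = false_positive_rate(rho=0.0)   # close to the advertised 0.05
ar1_rate = false_positive_rate(rho=0.9)   # several times higher
```

With rho = 0.9 the naive test rejects the true null a large fraction of the time, which is exactly the overstated significance described above.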


To add to this comment, large correlation between (or among) variables will often inflate estimates of standard error, leading to underestimates of statistical significance.


Great suggestion. I've been amazed to find out that many coders and amateur "data scientists" don't realize that testing the assumptions is an important part of conducting statistical analyses. Part of this may be due to the recent emphasis on machine learning techniques, which tend to be assumption-free (often just assuming independence of cases in the sample).


Part of this may be due to the recent emphasis on machine learning techniques, which tend to be assumption-free...

No statistical technique is assumption free, unless it is purely descriptive.

Some of them are free of explicit assumptions known by the practitioner, but that's not the same thing. In much the same way, my code is all bug-free.


    machine learning techniques, which tend to be assumption-free 
ML should be a rigorous exercise in Bayesian and classical/frequentist stats, computational methods, dataset integrity, visualization, etc., if you've been through the texts by Murphy or Bishop. It often happens that people a couple years out of their last stats class only retain that high R-squared, p-, t-, and F-values are what they're looking for, and heteroskedasticity and sphericity are just big words.

My evidence that ML is a rigorous exercise: the free texts listed (Barber, Mackay and Smola's are excellent, ESL not as accessible)

http://metaoptimize.com/qa/questions/186/good-freely-availab...


Thanks, @gtani, great resource. Yeah, I didn't mean to imply that ML techniques are free of ANY assumptions, just that several of the popular ones, like logistic regression, don't have distributional assumptions. (Actually, I really want to understand the VC inequality at some point, as it seems to allow us to make conclusions about out-of-sample error rates without depending on distributional assumptions.)


I hoped for it to be more "for programmers", like this one:

http://gdr.geekhood.net/gdrwpl/metnum.php

For me formulas written in pseudocode are much easier to understand than classic mathematical notation.

For example I've learned bayesian classification, chi-square, etc. from Practical Common Lisp (http://www.gigamonkeys.com/book/practical-a-spam-filter.html) after failing to understand how to apply formulas from Wikipedia. It was easier for me to learn Lisp than to decipher abstract declarative mathematical notation (admittedly I get a brain freeze whenever I see ∑, even though I know what it means. I prefer `for(…) acc += …`).
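In that spirit, here is the first formula on most stats cheat-sheets written as `for(…) acc += …` loops, in Python (my sketch, not from any of the linked pages): every ∑ really is just an accumulator loop.

```python
# ∑ in code: the sample mean and standard deviation as plain loops.
import math

def mean(xs):
    acc = 0.0
    for x in xs:                  # this loop is the ∑ x_i in the formula
        acc += x
    return acc / len(xs)

def stdev(xs):
    m = mean(xs)
    acc = 0.0
    for x in xs:                  # ∑ (x_i - mean)²
        acc += (x - m) ** 2
    return math.sqrt(acc / (len(xs) - 1))   # n-1 is Bessel's correction

print(stdev([2, 4, 4, 4, 5, 5, 7, 9]))     # ≈ 2.138
```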


This is off-topic; but I couldn't help notice your user name. My RL given name is "Parnell" and people often mis-pronounce it as "Pornell"...


The first example, "unbiased standard deviation," is mislabeled. The estimator of the variance is unbiased, but the square root of an unbiased estimator is not itself unbiased. So it's not as nit-picky as it looks; it's either a brain fart or a hole in understanding (especially since the linked Wikipedia pages discuss this issue) [1][2].

Not to be a dick, but getting the first example wrong like that doesn't inspire confidence in the rest of the post.

[1] https://en.wikipedia.org/wiki/Standard_deviation

[2] https://en.wikipedia.org/wiki/Unbiased_estimation_of_standar...


Seems more like a brain fart than an error in understanding, especially because he didn't highlight why it was important to have unbiasedness in the first place. Having to explain the unbiased estimator for standard deviation would probably not have contributed much. It's more of a technicality than anything of philosophical importance: the square root of the unbiased estimator of sample variance is pretty accurate.

The usage of n-1 instead of n in the denominator was a question I was always asked when I TA'ed for intro statistics classes in grad school. An explanation of unbiasedness might be warranted if this is to be an introductory primer.
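A quick seeded simulation (mine, not from the article) shows what the n-1 buys you: averaging the divide-by-n estimator over many small samples undershoots the true variance by a factor of (n-1)/n, while divide-by-(n-1) comes out right.

```python
# Seeded simulation: the divide-by-n variance estimator is biased low by a
# factor of (n-1)/n; dividing by n-1 (Bessel's correction) removes the bias.
import random
import statistics

rng = random.Random(42)
n, trials = 5, 20000            # many small samples from a Normal(0, 1)

biased_sum = unbiased_sum = 0.0
for _ in range(trials):
    xs = [rng.gauss(0, 1) for _ in range(n)]
    m = statistics.fmean(xs)
    ss = sum((x - m) ** 2 for x in xs)
    biased_sum += ss / n            # divide by n
    unbiased_sum += ss / (n - 1)    # divide by n-1

biased = biased_sum / trials        # ≈ 0.8, i.e. (n-1)/n of the true 1.0
unbiased = unbiased_sum / trials    # ≈ 1.0
```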


Thanks! I have relabeled it. (In a previous draft the entry was for unbiased variance and the inaccuracy slipped in during the transition.)


Your post said, "draft," so I think you're covered for that sort of error. :)


This was the first time I'd ever heard about biased/unbiased sample variances, all inspired by the original web page and seeing the (N-1) in the standard deviation formula. This inspired me to go off and read about Bessel's correction in wikipedia.

Thank you, Evan Miller! Your web page is great, it taught me new stuff from the very first formula. Much appreciated.


I always make a point of reading comments that start with "Not to be a dick..." :P


Hey Evan, from one statistics guy to another, thanks for fighting the good fight :). The formulas might benefit from examples, especially with some of the more complicated cases (KS test and onwards). The important part of statistics comes from knowing _when_ to apply something, rather than _how_ to (that part is just math/numerical analysis). A mention of the assumptions of each of these intervals would be good, too. Too often I see conclusions invalidated by using a probability model that doesn't make sense. This is a common failure I see with using a Wald interval for the slope of the regression line.


Agreed 100%! As the introduction says this is supposed to be a "cheat-sheet" and a jumping-off point for further discovery... I'd love to see (and write!) more posts on when to apply things.


While you guys are here, can you recommend a good intro book for statistics?


Yankoff, you might want to be more specific. Intro statistics in general, or for computer scientists, or scientists, or looking to learn R at the same time? I liked Freedman, Pisani, and Purves [1], and have TA'ed using McClave, Sincich, and Mendenhall [2]. You may want something a little more advanced than these, but they are pretty good for intro level.

[1]: http://www.amazon.com/Statistics-4th-David-Freedman/dp/03939... [2]: http://www.amazon.com/Statistics-11th-Edition-Book-CD/dp/013... [3]: http://stats.stackexchange.com/questions/421/what-book-would...


Yeah, I meant something for computer scientists. I'm going over coursera ML course currently and wanted to learn at least basics of statistics in parallel.

Thanks, I'll check out your links.

Btw, what do you think of OpenIntro Statistics? http://www.openintro.org/stat/down/OpenIntroStatSecond.pdf


You will find the intro books don't talk much about parallel computing. Most of the data sets in intro books will be no more than 30 observations. They are trying to teach classical methods more than useful computational techniques. As for parallel statistics, I don't have a good book recommendation. Most of my knowledge on the topic comes from papers and vignettes from the R community, not books. Maybe check out one of those O'Reilly books about big-data techniques?

I haven't seen this OpenIntro statistics before. I'll check it out!


You may like Wasserman's All of Statistics:

http://www.stat.cmu.edu/~larry/all-of-statistics/

It was written as an introduction to statistics for people in CS and related fields.


That OpenIntro pdf looks great! Thanks for the link.


Very intro (like, Stats 101 intro): Purves, Pisani, and Freedman's is the best book I've seen. I'd combine that with something like Tufte's or Wainer's statistical graphics books to try to get some sophistication (for lack of a better word).


I like the NIST/SEMATECH e-Handbook of Statistical Methods (http://www.itl.nist.gov/div898/handbook/).

The e-Handbook won't make you an expert statistician, but as an engineer needing to understand and apply statistical methods, I've found it to be a good starting point.


Check out the Intro to Stats course on Udacity:

https://www.udacity.com/course/st101

Just took it earlier this year. It was informative and I enjoyed the class.


As a programmer and statistician, I should warn newcomers about the assumptions of tests (parametric tests like the t-test). The t-test, for example, requires equal variances and normality, or you'll make wrong decisions. These formulas and p-values shouldn't be used as substitutes for graphics (you should plot and inspect the data first and test later).

Just take care: there are a lot of problems that come from using statistics the wrong way. Take special care with small sample sizes, and even large ones [http://scienceblogs.com/mixingmemory/2006/10/31/jeffreylindl...]
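To make the equal-variance caveat concrete: when the two groups' variances look different, Welch's t statistic (which does not pool the variances) is a safer default than Student's. A stdlib-only sketch (the function name is mine):

```python
# Welch's t statistic: compares two means without assuming equal variances.
import math
import statistics

def welch_t(a, b):
    """Return (t, approximate degrees of freedom) for two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])   # t ≈ -1.90, df ≈ 5.88
```

The t value is then looked up against a t distribution with df degrees of freedom, as usual.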


Nice, but a little knowledge is a dangerous thing. It is probably safer and more effective for non-statistician "data scientists" to use Robust Statistics: https://en.wikipedia.org/wiki/Robust_statistics


The Wikipedia article talks a lot about dealing with outliers. How can outliers be removed, replaced, or handled differently, as the article suggests? Are the outliers not part of the data, after all? It seems like the goal of 'improving performance' here involves tweaking the data to get the results you want. What have I misunderstood here?


The main difference between classical regression using ordinary least squares (OLS) and robust regression using iterative re-weighted least squares (IRLS) is that with OLS, all observations are given equal weight and with IRLS, observations may or may not be given equal weight. Essentially, IRLS gives outliers and/or influential [1] data points less weight, which may improve the performance of the overall model since these outliers/influential data would otherwise cause assumption violations using classical regression. If there are no outliers, then results from robust and classical regression converge.

I would disagree with SagelyGuru in recommending robust regression for non-statisticians, though I can see where he or she is coming from. With robust regression, you don't have to worry as much about assumptions as with classical regression. But with robust regression, you need to be aware that the underlying analytical method is different and what that means. For example, the standard robust regression implementation in R (i.e., the rlm function in the MASS package) doesn't produce t-statistics or p-values. There are also warnings that, especially at lower sample sizes, the standard errors produced by rlm may be unreliable. One recommended way to obtain those p-values would be to get bootstrapped standard error estimates, so that the normal-theory approximation would apply.

[1] There are different types of robust estimators (e.g., M, S, MM, etc.) that have different robustness properties.
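For the curious, here is a bare-bones sketch of the IRLS idea described above, using Huber weights on a straight-line fit. This is illustrative only (the function names and data are mine); for real work use a vetted implementation such as R's MASS::rlm or statsmodels' RLM.

```python
# Bare-bones IRLS with Huber weights for a straight-line fit: points with
# large residuals are down-weighted on each pass. Illustrative only.
def wls_line(xs, ys, ws):
    """Weighted least-squares fit of y = m*x + b; returns (m, b)."""
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    m = (sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
         / sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs)))
    return m, ybar - m * xbar

def huber_line(xs, ys, k=1.345, iters=20):
    """Iteratively re-weighted least squares with Huber's weight function."""
    ws = [1.0] * len(xs)               # first pass is ordinary least squares
    for _ in range(iters):
        m, b = wls_line(xs, ys, ws)
        resid = [y - (m * x + b) for x, y in zip(xs, ys)]
        # robust scale via the median absolute deviation (0.6745 = normal MAD)
        s = sorted(abs(r) for r in resid)[len(resid) // 2] / 0.6745 or 1.0
        ws = [1.0 if abs(r) <= k * s else k * s / abs(r) for r in resid]
    return m, b

xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[9] = 100                                # one gross outlier
ols = wls_line(xs, ys, [1.0] * len(xs))    # slope dragged far above the true 2
rob = huber_line(xs, ys)                   # slope stays near 2
```

If there are no outliers, all the weights stay at 1 and the robust fit coincides with OLS, matching the convergence point made above.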


Thanks for the explanation, it was helpful.


Mere 'formulas' without theory are dangerous.


A nice resource; however, I'd expected to see some actual programming on this page. Perhaps some sample code in a few languages would be nice. You could put tabs on a code box to switch between a Python and a PHP implementation, for example.


Same here. Theory, formulas, assumptions, etc. are available in textbooks, on Wikipedia, and elsewhere.

This could be useful: PHP stats functions: http://www.php.net/manual/en/ref.stats.php


I think if you're trying to explain things like these to programmers, maybe you should consider using actual code (or even pseudocode)? It would have the added benefit of not requiring knowledge of math English (for non-native English speakers like me) or any advanced math concepts at all.


Great effort, and I certainly hope more coders will get into statistics (most I know are only interested in machine learning). However, I think your definition of 1.3 "Confidence Interval around the Mean" could be improved. You state:

"A confidence interval reflects the set of statistical hypotheses that won't be rejected at a given significance level. So the confidence interval around the mean reflects all possible values of the mean that can't be rejected by the data."

That seems a bit vague and perhaps confusing. Might I suggest something more like this:

"The confidence interval specifies a range (+/- a multiple of the above standard error [SE]) around our estimate of the mean (x-bar) such that: if we repeated our sampling process an infinite number of times (i.e. with the same sample size and forming a new x-bar and SE each time [and therefore, a new confidence interval]), Confidence_Level% of those intervals would contain the population (true) mean."

In addition, I think in this case, at least, there are no assumptions about the data to worry about, given a sufficiently moderate sample size due to the Central Limit Theorem (I'm confident about that in the case of the mean (x-bar), but I'll leave it up to others to correct me if I'm wrong about this applying to the standard error (SE)).
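That repeated-sampling definition is easy to check by simulation; a seeded sketch of mine: generate many samples, build x-bar ± 1.96·SE each time, and count how often the interval covers the true mean.

```python
# Seeded check of the repeated-sampling definition: about 95% of the
# intervals xbar ± 1.96*SE constructed this way cover the true mean.
import math
import random
import statistics

rng = random.Random(1)
true_mean, sigma = 10.0, 3.0
n, trials, z = 30, 5000, 1.96

covered = 0
for _ in range(trials):
    xs = [rng.gauss(true_mean, sigma) for _ in range(n)]
    xbar = statistics.fmean(xs)
    se = statistics.stdev(xs) / math.sqrt(n)
    covered += (xbar - z * se <= true_mean <= xbar + z * se)

coverage = covered / trials   # just under 0.95 (using z rather than t at n = 30)
```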


Confidence intervals are inherently confusing. I have yet to hear a definition that is both correct and easily understood and remembered.


I feel this would be quite confusing for an average programmer. It is more like a cheat sheet for people who have some statistical training but always have to look the formulas up because they don't use them frequently enough. For an average programmer, really understanding how linear regression works and some basic linear algebra would be a good start. A lot of programmers have trouble even with these "simple" topics.

Most of these formulas are very rarely used even by quantitative analysts. The most used are for standard deviation and regression. The more complicated ones are generally used as a part of statistical routines, say, in R. It is very rare that someone has to code them.

> From a statistical point of view, 5 events is indistinguishable from 7 events.

What is this supposed to mean? There is a concept of statistical significance, but if an effect is not statistically significant, it does not follow that it does not exist. Btw, where is the Bayes formula? :)


This is a good motivation, but it's plunging right into formulas, which is wrong-headed. Even the mean can be completely meaningless without eyeballing a plot of the data. Given that this is aimed at non-statisticians, it's critical that these points are made in Big Flashing Letters before we start handing out formulas that make people act like they have "superpowers".

The first superpower is to look at the data and see if it makes sense using your eyes and your brain, not to start spewing confidence intervals.

As other posters have pointed out, it is even more irresponsible to start waving around things like the t-test without discussing the parametric assumptions that these things depend on for their validity.


Great start on an important topic. Quick extra info: for drawing a trend line it is often useful to have the intercept as well. Using y = mx + b line notation, the best-fit intercept is: \hat{b} = \bar{y} - m \bar{x} [1]

Be great to see some pictures to illustrate the formulas and some mention of robust statistics as I find outliers to be a huge issue in application of statistical techniques.

[1] http://en.wikipedia.org/wiki/Simple_linear_regression
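In code, the slope and that intercept drop out of the same two passes over the data; a small sketch of mine (not from the linked page):

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    m = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    b = ybar - m * xbar            # the intercept formula from the comment
    return m, b

m, b = fit_line([0, 1, 2, 3], [1.1, 2.9, 5.1, 6.9])   # m ≈ 1.96, b ≈ 1.06
```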


I've wondered for a long time if there's a way to condense certain statistical (and probability) information into a single How-Much-Should-I-Care number. Can someone shed some light on this?

To pick a couple examples from health news in the popular press:

NB: I'm making up all the numbers here for the sake of example.

(1) A study shows that people who consume more than 10g of added salt a day live shorter lives.

But how much shorter? If it's 30 minutes shorter, I don't care about the study and I'm not going to change my behavior. If it's 6 months shorter, then I'm interested and might very well do something.

(2) A study shows that people who drink 2 or more cups of coffee a day have lower risk of Alzheimer's Disease.

But how much lower risk? If the average lifetime risk is 1 in 50, and drinking coffee lowers it to 1 in 50.003, then I don't want to waste time even reading the article. If it lowers it to 1 in 1000, then yes, I might change my behavior.

So, in the above examples, is there any way to reduce the information into a single How-Much-Should-I-Care number?

Like this:

(1) A study shows that people who consume more than 10g of added salt a day have an ____x____ factor shorter life.

(2) A study shows that people who drink 2 or more cups of coffee a day have a ____y____ factor lower risk of Alzheimer's Disease.

Then, by looking at x and y, I can tell at a glance whether some result is irrelevant, trivial, useful, or groundbreaking. I understand that it'll still be subjective in the end -- like whether $1, $10, $1000, or $10,000,000 seems like a lot of money to an individual -- but at least it'll be one number.


Just a factor is often not enough. If a study showed that cell phone usage increased the risk of a certain cancer by 40%, that might still not be interesting if it's an extremely rare cancer and they found 7 cases instead of 5.

Or it doesn't show whether various correlating factors matter, or whether this is a statistical paradox. Did you know that babies of smokers are healthier than babies of non-smokers of the same weight? This is because a baby born to a smoker will have decreased weight because of the smoking, whereas if the baby of a non-smoker is underweight, it will be for other reasons that are often worse.

There are heaps of these kinds of paradoxes and pitfalls that need to be taken into account.

What we need in newspapers and other media is a simplified abstract of the paper and an explanation or approval from a real-life statistician with no relation to the study.


See https://en.wikipedia.org/wiki/Odds_ratio for one answer. Another "single number" answer is generated by the class of tests that determine the statistical significance of an observation; see https://en.wikipedia.org/wiki/Alpha_level for more about this.

There are many other ways to accomplish what you're talking about. The biggest problem with your made-up examples is that they are just cases of "bad reporting."
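To illustrate the odds-ratio answer: reporting the odds ratio next to the absolute risk difference is one honest way to get a "how much should I care" number. A sketch of mine, with invented figures in the spirit of the coffee example:

```python
def two_by_two(exposed_cases, exposed_total, unexposed_cases, unexposed_total):
    """Odds ratio plus the absolute risk difference it can hide."""
    a, b = exposed_cases, exposed_total - exposed_cases
    c, d = unexposed_cases, unexposed_total - unexposed_cases
    odds_ratio = (a * d) / (b * c)
    risk_diff = a / exposed_total - c / unexposed_total
    return odds_ratio, risk_diff

# invented numbers: 15 of 1000 coffee drinkers vs 20 of 1000 non-drinkers
# develop the disease
oratio, rdiff = two_by_two(15, 1000, 20, 1000)
# oratio ≈ 0.75 ("25% lower odds!") while rdiff is only half a percentage point
```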


Great resource. For anybody struggling with ANY formula... or even maths class. I'd encourage you to learn how Wolfram Alpha works:

http://www.wolframalpha.com/

I used it extensively for the development of different trading bots and algorithms. They say you need to be a real hotshot at maths to make it in that field. Little do they know that this world will soon belong to script-kiddies and hackers! Here's one of my favorites from them, this will rearrange any complex equation and make anything you like the subject:

http://www.wolframalpha.com/widgets/view.jsp?id=4be4308d0f9d...

The one thing you will have to learn is how to represent an equation in text-based form, i.e., to use ^ to signify a power, etc.


Man, that would have been useful in college.


Check this out for a demo of what it was designed to do.

This will solve the following system of equations (find both x and y):

x+y=10, x-y=4

http://www.wolframalpha.com/input/?i=x%2By%3D10%2C+x-y%3D4&#...

... and that's just scratching the surface. Kids studying maths these days don't know how good they have it. I struggled so much when I was younger.


Ugh, recipe-book statistics! This might be useful, but not to programmers. No one should be applying any of the more complex formulas on this page without a good basic grounding in the theory, which is something nearly all programmers will lack (and fair enough). The best place for programmers to start is understanding the basics of likelihood functions, maximum likelihood estimation, Bayes' rule, and likelihood ratio tests. Then look at e.g. the derivation of the chi-squared test statistic as an approximation to a likelihood ratio test of multinomial distributions to get a sense of where all the mysterious formulas of classical statistics come from.
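That last derivation is easy to see numerically: for multinomial counts, Pearson's X² closely tracks the likelihood-ratio statistic G = 2·∑ O·ln(O/E) whenever the expected counts aren't tiny. A sketch of mine with made-up counts:

```python
# For multinomial counts, Pearson's X² is numerically close to the
# likelihood-ratio statistic G = 2 * sum(O * ln(O/E)) when expected counts
# aren't tiny; both are compared to the same chi-squared distribution.
import math

observed = [48, 35, 17]   # made-up counts in three categories
expected = [50, 30, 20]   # counts implied by the null hypothesis

pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
g_stat = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))
# pearson ≈ 1.363, g_stat ≈ 1.346: nearly identical
```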


Thanks for writing this, Evan.

Maybe this solidifies the fact that I'm a programmer and not a statistician, but I got lost after the 2nd hyperbole in the beginning... but I like a challenge; I may have to read it multiple times until your presentation sinks in, though :)


The link to "how-to-read-an-unlabeled-sales-chart.html" has an extra s after unlabeled.


Tangent: Why is your product not a SaaS offering!? (I would pay for it)


Great blog! I wish there was cookbook-style code with each formula, in C or even JavaScript, for those of us who have forgotten a lot of math but can think in code.


I like the spirit of this article, but I wish it put much more emphasis on explaining why/when you'd need this. Examples would be very helpful.


Isn't the plural form of formula called formulae?


Both formulae and formulas are pretty common (and my Firefox spellcheck is flagging the first and not the second, FWIW).


Same here, Firefox flags formulae.


Christ, if you're going to post statistical formulas for programmers, post them as algorithms, not as math formulas.


Evan Miller, I love you.



