This talk was great, especially its focus on non-parametric techniques. A lot of people are misled by the traditional statistics approach, where it looks like stats is about memorizing a relatively short list of (really scary) formulas that let you compute one thing or another. People do not think about where these formulas come from, or the simple ideas behind them.
On the other hand, I think there is not enough written about how to do statistics with real, large, complex data sets. I tried to write something like this (https://pavpanchekha.com/blog/stats1.html), but of course the difficulty is that investigating a complex dataset is by definition too complex to really fit into a blog post. With a large dataset, the hard part is how and when you make decisions about what to analyze.
But in many cases, data is not easily accessible, and not cheap to store and process. I want to switch to an actual big data workflow with Spark, but running the numbers with EC2 storage and processing, it won't be cheap and I may have to set up a Patreon to subsidize things.
Riiiiight.... Not many places can afford a server with 6TB of RAM. Dear Lord, I didn't even know that was possible with commodity hardware. This would be hard to justify at a data-driven business even.
Storing your data on disk costs less money for the hardware per byte stored; storing your data in RAM costs less money per (transaction * byte stored).
On EC2, 6TB of RAM would be 25 r3.8xlarge instances, which is US$70 per hour. If your six-terabyte task is batch processing rather than interactive, that might be cheaper than spending a lot of time optimizing your software for spinning-rust relics.
Contrariwise, if it's an interactive computing task that you can do with AWS Lambda and split into small pieces, 4000 Lambda functions with 1.5 GiB of RAM each, for a total of 6 TB, will cost you US$0.01 per 100 milliseconds, or US$0.10 per second. Supposing your overall in-RAM computing task can be done with 10 seconds per Lambda function (40,000 CPU-seconds in total, or about 11 CPU-hours), can actually be parallelized to that extent, and that 10 seconds is an acceptable response time, then you can do it for US$1. I feel that this must be a mistake, since this is 700 times cheaper than EC2.
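A quick sanity check of the arithmetic above (using the 2016-era prices quoted in this thread, which are assumptions, not current rates):

```python
# Back-of-the-envelope check of the Lambda figures quoted above.
# Prices are the ones from this thread (2016-era), not current AWS rates.

GIB_PER_LAMBDA = 1.5
N_LAMBDAS = 4000
total_ram_gib = GIB_PER_LAMBDA * N_LAMBDAS        # 6000 GiB, i.e. ~6 TB

fleet_rate_per_100ms = 0.01                        # US$ per 100 ms for all 4000
task_seconds = 10
lambda_cost = fleet_rate_per_100ms * (task_seconds / 0.1)   # US$1 for the task

ec2_rate_per_hour = 70.0                           # 25 x r3.8xlarge, as above

print(f"Total RAM: {total_ram_gib} GiB")
print(f"Lambda cost for the whole task: ${lambda_cost:.2f}")
print(f"EC2 fleet rate: ${ec2_rate_per_hour:.2f}/hour")
```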
(Lambda surely has some total capacity limitations, but I imagine they're quite a bit more than 6TB.)
I am deeply unhappy about the move to such a centralized infrastructure for the internet, but at least this version of centralized infrastructure allows you to run your own code on the centralized servers, unlike the API vision promoted by Facebook and (mostly) Google.
Every textbook on math I have ever read starts with the intuition behind an idea, and then presents definitions and proofs. Only some reference books, intended purely for lookups, merely enumerate formulas. I have never understood why anyone claims to learn math only by memorizing formulas.
Because they learn math in junior high where they really don't care about the subject, so it gets boiled down to "remember this to pass the exam", regardless of what the book says.
People claim they only learn math by memorizing formulas because, for most, that's all the math they see before getting irreversibly turned off by the subject. Most people learn math by memorizing plug-and-chug formulas in high school. Aside from 9th-grade Euclidean geometry, I can't recall an instance where I had to derive anything in my high school math courses. College math courses, of course, are usually much more rigorous.
For people who are interested in learning more about simulation, permutation tests and bootstrapping, the three techniques that the slide deck discusses, check out Allen Downey's Think Stats: http://greenteapress.com/thinkstats2/.
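For the curious, here is a minimal sketch of the first of those techniques, a two-sample permutation test, with made-up numbers (the data and group sizes are purely illustrative):

```python
import random

# Toy two-sample permutation test on made-up data. Under the null
# hypothesis the group labels are exchangeable, so shuffling them
# repeatedly builds the null distribution of the mean difference.
group_a = [12.1, 11.4, 13.0, 12.7, 11.9, 12.4]
group_b = [11.2, 10.9, 11.6, 11.1, 11.8, 10.7]

observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

pooled = group_a + group_b
n_a = len(group_a)
rng = random.Random(0)
trials = 10_000
count = 0
for _ in range(trials):
    rng.shuffle(pooled)
    diff = (sum(pooled[:n_a]) / n_a
            - sum(pooled[n_a:]) / (len(pooled) - n_a))
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / trials
print(f"observed diff = {observed:.3f}, p = {p_value:.4f}")
```

No formula lookup required: the p-value is just the fraction of shuffles that produce a difference at least as extreme as the observed one.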
It really is a pity that introductions to statistics spend so much time on analytic approximations and so little on the underlying concepts.
I don't think that the title or the content claimed that "presenting mathematics in an understandable way" is a hack. "Statistics for Hackers" sounds to me like hackers are the intended audience.
This method of presentation makes sense to me. Most statistics classes that I've experienced were taught from the point of view of abstract math. That's certainly one way to do it, but I knew a lot of people for whom that wasn't the optimal presentation strategy. Now that computing is cheap and there is a large audience of people with programming knowledge, I think that teaching statistics through examples of simulation, bootstrapping, shuffling, and cross-validation is a great way to learn things.
Isn't working around things to make them work for you exactly the essence of hacking?
Plus, doesn't every hacker start as a script kiddie? Using something others came up with usually sparks the imagination. The next step is usually using the scripts right, followed by using the scripts extensively and finding issues, followed by making your own scripts (or improving the existing ones) to avoid these issues.
In my opinion, the "for hackers" title is a reference to the multiple books that have been released with "X for Hackers" titles, targeting people who have hacking skills but no formal background in X.
EDIT:
I'm not the author, but you can find Bayesian Methods for Hackers (free, released by the author) at the link below. I think it's a great resource for anyone wanting to explore Bayesian methods using Python.
It's actually the opposite: the "hacking" is used to provide better data and analysis using programmatic techniques (bootstrapping, cross-validation) than just by taking a spot average in Excel as gospel.
I've seen many, many data scientists from startups write blog posts with skewed data without using such techniques and failing to identify the potential statistical problems.
Why are you so angry? He gave a talk and released the slides because he thought people may benefit from it - given Jake's (the speaker) background and skills, I would say he's doing everyone a favor by publicly releasing this.
Sometimes you can get a well-defined problem, but finding the "right" analytical solution will take you days of reading up on it, and the chance of getting it wrong is relatively high. Especially if you don't have someone with strong mathematical/statistical background to review your work.
In those cases, finding a programmatic hack around it is a very good approach for giving you reasonable results in a shorter timeframe.
"Physics for Poets" is physics presented in a way tailored to poets [1], not physics as done by poets—i.e., for poets, not by poets. Similarly, "Statistics for Hackers" is for hackers, not by hackers. Thus the focus on brute-force computational solutions via for loops, with which hackers would presumably be familiar.
[1]: Where poets is a possibly inaccurate metonym for "less mathematically sophisticated students"
What books would y'all recommend for someone who has taken a couple of college-level proof courses but never took anything on probability and statistics?
Best of luck. I can see from your post that you're thinking about performance tuning; I'm assuming you mean of software. That's a nice area. The nice part is that, compared to fields like medical genetics, data on software performance is relatively cheap to get, so a lot of issues around small sample sizes are surmountable.
I've said this elsewhere, but I recommend Casella & Berger's Statistical Inference. It will take you from probability theory to statistics... all the basics. I found it to be a very readable text, and it is used for many first graduate courses in stats - for me it was used for a 2 semester sequence first focusing on probability, then on statistics.
I like Casella and Berger, especially because it does so well at showing the connection between probability and statistics (first five chapters or so), but it should be noted that its approach is very different from the one in the slide deck above. Casella/Berger is purely frequentist rather than computational or Bayesian, and it spends a lot of time on likelihood theory and treats e.g. bootstrapping only very summarily.
I'd like to start with formalizing my knowledge of the basics -- hypothesis testing and confidence intervals. I do some performance tuning work, and it seems like a good idea to understand if my changes are making statistically significant improvements.
Those are what are usually thought of as the standard "frequentist" basics: practical tools originally invented for practitioners. Many basic "statistics for scientists" texts will cover them well, though unfortunately I don't have particular suggestions. Sorry about that!
"It doesn't always work" really is not a very useful thing to say unless you elaborate on when exactly it doesn't work and why.
Bootstrapping doesn't work for very low n (e.g. n=10) because the resamples are not smooth enough and it doesn't always work that well for estimating quantiles. But analytic methods fare pretty poorly in these circumstances too and in any case people are mostly interested in confidence intervals around the mean anyway.
You are assuming that the distributions we work with are normal, and that the sampled values are independent. If you take a fat-tailed distribution (for example, the Cauchy), the bootstrap won't work. If the values are dependent, it won't work either.
That's nonsense. The entire point of the bootstrap is that it does not require either the original or the sampling distribution to be normal. (For that matter, due to the Central Limit Theorem, neither do analytic approximations like a t-test for comparing different group means.)
You misunderstand the point about the Cauchy distribution in the answer on Cross Validated. The Cauchy distribution is a degenerate case, mostly interesting as an academic toy because it has infinite variance. Of course that's not going to fare well.
Dependent data can be tricky to deal with, but you can bootstrap such data by removing the dependence, bootstrapping the now-independent part, and adding the dependence back in. This sounds hard but is usually as easy as running a regression and subtracting/adding the component (x*beta) that causes the dependency. Alternatively, for time series there are block ("window") methods.
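To make the regression version concrete, here is a sketch of a residual bootstrap on simulated data with a linear trend (the data, seed, and replication count are all made up for illustration):

```python
import random
import statistics

# Residual bootstrap sketch: fit a regression, resample the (roughly
# independent) residuals, add the fitted component back, and refit.
rng = random.Random(42)
x = list(range(30))
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]   # true slope = 0.5

def ols(xs, ys):
    """Least-squares fit returning (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
         / sum((xi - mx) ** 2 for xi in xs))
    return my - b * mx, b

a_hat, b_hat = ols(x, y)
fitted = [a_hat + b_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

slopes = []
for _ in range(1000):
    # Remove the dependence (keep fitted), resample residuals, add back.
    resampled_y = [fi + rng.choice(residuals) for fi in fitted]
    slopes.append(ols(x, resampled_y)[1])

slopes.sort()
ci = (slopes[25], slopes[975])   # ~95% percentile interval for the slope
print(f"bootstrap 95% CI for slope: ({ci[0]:.3f}, {ci[1]:.3f})")
```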
Of course a short slide deck like the one linked to in this thread is not going to teach you all the finer points of bootstrapping and you can definitely do it wrong. But compared to all the assumptions that frequentist statistics makes to generate confidence intervals and the fact that you need a different method for every different scenario, bootstrapping is about as robust and idiot-proof as it's going to get. "It has its applicability" is beyond selling it short.
> For that matter, due to the Central Limit Theorem, neither do analytic approximations like a t-test for comparing different group means.
Actually, it doesn't. You need a confidence interval, and the central limit theorem doesn't give you a confidence interval. It just says that the distribution gets close enough to normal at some point. In some cases, it might take a very large n before it becomes close to normal.
And exactly how long it will take for the approximation to reach a certain level of accuracy can be ascertained by using the other technique mentioned in the slide deck: run a simulation. And so we have come full circle :-)
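Such a simulation is only a few lines. Here is a sketch (the exponential distribution and sample sizes are my own choices for illustration) that checks the actual coverage of a nominal-95% normal-approximation interval for the mean at different n:

```python
import random
import statistics

# How fast does the normal approximation kick in? Check the empirical
# coverage of a nominal 95% interval (mean +/- 1.96 * SE) for the mean
# of a skewed (exponential, true mean 1.0) distribution.
rng = random.Random(1)

def covers(n, trials=2000, true_mean=1.0):
    hits = 0
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        m = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        if m - 1.96 * se <= true_mean <= m + 1.96 * se:
            hits += 1
    return hits / trials

for n in (5, 20, 100):
    print(f"n = {n:3d}: coverage ~ {covers(n):.3f}")   # approaches 0.95
```

For small n the coverage falls well short of the nominal 95%, and it creeps toward 0.95 as n grows, which is exactly the "how large does n have to be" question answered empirically.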
> The entire point of the bootstrap is that it does not require either the original or the sampling distribution to be normal
What I wanted to say is that you can't just blindly apply the bootstrap to every distribution. You should carefully check the applicability conditions before using it, or you could easily get nonsensical results.
You can't blindly apply the bootstrap to any problem whatsoever. But you really, absolutely, unequivocally can blindly apply the bootstrap to generate a confidence interval around the mean of data generated from any distribution. You cannot easily get nonsensical results. In a few rare edge cases, you can get suboptimal results (e.g. 80% instead of 95% coverage, or 99% coverage instead of 95% coverage) but even these are not nonsensical. I really don't see why you want to be arguing over facts.
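The "blind" recipe really is this short. A minimal percentile-bootstrap sketch, on made-up skewed data (the sample size and replication count are arbitrary choices):

```python
import random
import statistics

# Percentile bootstrap CI for the mean of skewed data.
# No normality assumption appears anywhere.
rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(200)]   # skewed, true mean 1.0

B = 5000
means = []
for _ in range(B):
    resample = [rng.choice(data) for _ in range(len(data))]
    means.append(statistics.fmean(resample))

means.sort()
lo, hi = means[int(0.025 * B)], means[int(0.975 * B)]
print(f"95% CI for the mean: ({lo:.3f}, {hi:.3f})")
```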
In the example you link to, the goal is not to estimate the mean of a uniform distribution but rather its maximum: U(0, max). A maximum is like a quantile, and one of the very few shortcomings of bootstrapping in realistic scenarios is that it does not always do well at generating confidence intervals around quantiles. Still, scientists can go an entire lifetime without ever feeling the need to estimate the maximum of a uniform distribution. The only famous example of this distribution in the context of actual science is the German tank problem: https://en.wikipedia.org/wiki/German_tank_problem
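A quick simulated illustration of why the maximum is a bad fit for the bootstrap (sample size and seed are arbitrary):

```python
import random

# Bootstrap resamples of the maximum of U(0, 1) data can never exceed
# the observed sample maximum, so a percentile "CI" for the true
# maximum (1.0) essentially always misses it from below.
rng = random.Random(0)
data = [rng.random() for _ in range(50)]   # U(0, 1), true max = 1.0

maxima = []
for _ in range(2000):
    maxima.append(max(rng.choice(data) for _ in range(len(data))))

print(f"sample max: {max(data):.4f}")
print(f"largest bootstrap max: {max(maxima):.4f}")  # bounded by sample max
```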
The talk is very explicitly a tutorial on a few good statistical methods for those without a statistical background, using simple tools that "hackers" are familiar with and that are occasionally necessary, especially in the case of bootstrapping. This isn't hacking in the sense of "growth hacking," thankfully.
How so? I see anyone who refuses to accept things as they are, who has an innate curiosity and a drive to get to the root of a problem, and who finally changes the rules to make them fit their purposes, as a hacker.
George Bernard Shaw would refer to this individual as an unreasonable man, but I prefer hacker.