This talk was great, especially its focus on non-parametric techniques. A lot of people are misled by the traditional statistics approach, where it looks like stats is about memorizing a relatively short list of (really scary) formulas that let you compute one thing or another. People do not think about where these formulas come from, or the simple ideas behind them.
On the other hand, I think there is not enough written about how to do statistics with real, large, complex data sets. I tried to write something like this (https://pavpanchekha.com/blog/stats1.html), but of course the difficulty is that investigating a complex dataset is by definition too complex to really fit into a blog post. With a large dataset, the hard part is how and when you make decisions about what to analyze.
But in many cases, data is not easily accessible, and not cheap to store and process. I want to switch to an actual big data workflow with Spark, but running the numbers with EC2 storage and processing, it won't be cheap and I may have to set up a Patreon to subsidize things.
Riiiiight.... Not many places can afford a server with 6TB of RAM. Dear Lord, I didn't even know that was possible with commodity hardware. This would be hard to justify at a data-driven business even.
Storing your data on disk costs less money for the hardware per byte stored; storing your data in RAM costs less money per (transaction * byte stored).
On EC2, 6TB of RAM would be 25 r3.8xlarge instances, which is US$70 per hour. If your six-terabyte task is batch processing rather than interactive, that might be cheaper than spending a lot of time optimizing your software for spinning-rust relics.
Contrariwise, if it's an interactive computing task that you can do with AWS Lambda and split into small pieces, 4000 Lambda functions with 1.5 GiB of RAM each, for a total of 6 TB, will cost you US$0.01 per 100 milliseconds, or US$0.10 per second. Supposing your overall in-RAM computing task can be done with 10 seconds per Lambda function (40,000 CPU-seconds in total, or about 11 CPU-hours), can actually be parallelized to that extent, and that 10 seconds is an acceptable response time, then you can do it for US$1. I feel that this must be a mistake, since this is 700 times cheaper than EC2.
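A quick sanity check of the arithmetic above (using the 2016-era prices quoted in this thread, which are assumptions, not current rates):

```python
# Back-of-the-envelope check of the Lambda figures quoted above.
# Prices are the ones from this thread (2016-era), not current AWS rates.

GIB_PER_LAMBDA = 1.5
N_LAMBDAS = 4000
total_ram_gib = GIB_PER_LAMBDA * N_LAMBDAS        # 6000 GiB, i.e. ~6 TB

fleet_rate_per_100ms = 0.01                        # US$ per 100 ms for all 4000
task_seconds = 10
lambda_cost = fleet_rate_per_100ms * (task_seconds / 0.1)   # US$1 for the task

ec2_rate_per_hour = 70.0                           # 25 x r3.8xlarge, as above

print(f"Total RAM: {total_ram_gib} GiB")
print(f"Lambda cost for the whole task: ${lambda_cost:.2f}")
print(f"EC2 fleet rate: ${ec2_rate_per_hour:.2f}/hour")
```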
(Lambda surely has some total capacity limitations, but I imagine they're quite a bit more than 6TB.)
I am deeply unhappy about the move to such a centralized infrastructure for the internet, but at least this version of centralized infrastructure allows you to run your own code on the centralized servers, unlike the API vision promoted by Facebook and (mostly) Google.
Every textbook on math I have ever read starts with the intuition behind an idea, and then presents definitions and proofs. Only some reference books, intended purely for lookups, merely enumerate formulas. I have never understood why anyone claims to learn math only by memorizing formulas.
Because they learn math in junior high where they really don't care about the subject, so it gets boiled down to "remember this to pass the exam", regardless of what the book says.
People claim they only learn math by memorizing formulas because, for most, that's all the math they see before getting irreversibly turned off by the subject. Most people learn math by memorizing plug-and-chug formulas in high school. Aside from 9th-grade Euclidean geometry, I can't recall an instance where I had to derive anything in my high school math courses. College math courses, of course, are usually much more rigorous.
For people who are interested in learning more about simulation, permutation tests and bootstrapping, the three techniques that the slide deck discusses, check out Allen Downey's Think Stats: http://greenteapress.com/thinkstats2/.
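For the curious, here is a minimal sketch of the first of those techniques, a two-sample permutation test, with made-up numbers (the data and group sizes are purely illustrative):

```python
import random

# Toy two-sample permutation test on made-up data. Under the null
# hypothesis the group labels are exchangeable, so shuffling them
# repeatedly builds the null distribution of the mean difference.
group_a = [12.1, 11.4, 13.0, 12.7, 11.9, 12.4]
group_b = [11.2, 10.9, 11.6, 11.1, 11.8, 10.7]

observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

pooled = group_a + group_b
n_a = len(group_a)
rng = random.Random(0)
trials = 10_000
count = 0
for _ in range(trials):
    rng.shuffle(pooled)
    diff = (sum(pooled[:n_a]) / n_a
            - sum(pooled[n_a:]) / (len(pooled) - n_a))
    if abs(diff) >= abs(observed):
        count += 1

p_value = count / trials
print(f"observed diff = {observed:.3f}, p = {p_value:.4f}")
```

No formula lookup required: the p-value is just the fraction of shuffles that produce a difference at least as extreme as the observed one.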
It really is a pity that introductions to statistics spend so much time on analytic approximations and so little on the underlying concepts.
I don't think that the title or the content claimed that "presenting mathematics in an understandable way" is a hack. "Statistics for Hackers" sounds to me like hackers are the intended audience.
This method of presentation makes sense to me. Most statistics classes that I've experienced were taught from the point of view of abstract math. That's certainly one way to do it, but I knew a lot of people for whom that wasn't the optimal presentation strategy. Now that computing is cheap and there is a large audience of people with programming knowledge, I think that teaching statistics through examples of simulation, bootstrapping, shuffling, and cross-validation is a great way to learn things.
Isn't working around things to make them work for you exactly the essence of hacking?
Plus, doesn't every hacker start as a script kiddie? Using something others came up with usually sparks the imagination. The next step is usually using the scripts right, followed by using the scripts extensively and finding issues, followed by making your own scripts (or improving the existing ones) to avoid these issues.
In my opinion, the "for hackers" title is a reference to the multiple books that have been released with "X for Hackers" titles, targeting people who have hacking skills but no formal background in X.
EDIT:
I'm not the author, but you can find Bayesian Methods for Hackers (free, released by the author) at the link below. I think it's a great resource for anyone wanting to explore Bayesian methods using Python.
It's actually the opposite: the "hacking" is used to provide better data and analysis using programmatic techniques (bootstrapping, cross-validation) than just by taking a spot average in Excel as gospel.
I've seen many, many data scientists from startups write blog posts with skewed data without using such techniques and failing to identify the potential statistical problems.
Why are you so angry? He gave a talk and released the slides because he thought people may benefit from it - given Jake's (the speaker) background and skills, I would say he's doing everyone a favor by publicly releasing this.
Sometimes you can get a well-defined problem, but finding the "right" analytical solution will take you days of reading up on it, and the chance of getting it wrong is relatively high. Especially if you don't have someone with strong mathematical/statistical background to review your work.
In those cases, finding a programmatic hack around it is a very good approach for giving you reasonable results in a shorter timeframe.
"Physics for Poets" is physics presented in a way tailored to poets [1], not physics as done by poets—i.e., for poets, not by poets. Similarly, "Statistics for Hackers" is for hackers, not by hackers. Thus the focus on brute-force computational solutions via for loops, with which hackers would presumably be familiar.
[1]: Where poets is a possibly inaccurate metonym for "less mathematically sophisticated students"
What books would y'all recommend for someone who has taken a couple of college-level proof courses but never took anything on probability and statistics?
Best of luck. I can see from your post that you're thinking about performance tuning; I'm assuming you mean of software. That's a nice area. The nice part is that, compared to fields like medical genetics, data on software performance is relatively cheap to get, so a lot of issues around small sample sizes are surmountable.
I've said this elsewhere, but I recommend Casella & Berger's Statistical Inference. It will take you from probability theory to statistics... all the basics. I found it to be a very readable text, and it is used for many first graduate courses in stats - for me it was used for a 2 semester sequence first focusing on probability, then on statistics.
I like Casella and Berger, especially because it does so well at showing the connection between probability and statistics (first five chapters or so), but it should be noted that its approach is very different from the one in the slide deck above. Casella/Berger is purely frequentist rather than computational or Bayesian, and it spends a lot of time on likelihood theory and treats e.g. bootstrapping only very summarily.
I'd like to start with formalizing my knowledge of the basics -- hypothesis testing and confidence intervals. I do some performance tuning work, and it seems like a good idea to understand if my changes are making statistically significant improvements.
Those are what are usually thought of as the standard "frequentist" basics: practical tools originally invented for practitioners. Many basic "statistics for scientists" texts will cover them well, though unfortunately I don't have particular suggestions. Sorry about that!
"It doesn't always work" really is not a very useful thing to say unless you elaborate on when exactly it doesn't work and why.
Bootstrapping doesn't work for very low n (e.g. n=10) because the resamples are not smooth enough and it doesn't always work that well for estimating quantiles. But analytic methods fare pretty poorly in these circumstances too and in any case people are mostly interested in confidence intervals around the mean anyway.
You are assuming that the distributions we work with are normal, and that the sampled values are independent. If you take a fat-tailed distribution (for example, the Cauchy), the bootstrap won't work. If the values are dependent, it won't work either.
That's nonsense. The entire point of the bootstrap is that it does not require either the original or the sampling distribution to be normal. (For that matter, due to the Central Limit Theorem, neither do analytic approximations like a t-test for comparing different group means.)
You misunderstand the point about the Cauchy distribution in the answer on Cross Validated. The Cauchy distribution is a degenerate case, mostly interesting as an academic toy because it has infinite variance. Of course that's not going to fare well.
Dependent data can be tricky to deal with, but you can bootstrap such data by removing the dependence, bootstrapping the now-independent part, and adding the dependence back in. This sounds hard but is usually as easy as running a regression and subtracting/adding the component (x*beta) that causes the dependency. Alternatively, for time series there are block ("window") methods.
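To make the regression version concrete, here is a sketch of a residual bootstrap on simulated data with a linear trend (the data, seed, and replication count are all made up for illustration):

```python
import random
import statistics

# Residual bootstrap sketch: fit a regression, resample the (roughly
# independent) residuals, add the fitted component back, and refit.
rng = random.Random(42)
x = list(range(30))
y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]   # true slope = 0.5

def ols(xs, ys):
    """Least-squares fit returning (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys))
         / sum((xi - mx) ** 2 for xi in xs))
    return my - b * mx, b

a_hat, b_hat = ols(x, y)
fitted = [a_hat + b_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

slopes = []
for _ in range(1000):
    # Remove the dependence (keep fitted), resample residuals, add back.
    resampled_y = [fi + rng.choice(residuals) for fi in fitted]
    slopes.append(ols(x, resampled_y)[1])

slopes.sort()
ci = (slopes[25], slopes[975])   # ~95% percentile interval for the slope
print(f"bootstrap 95% CI for slope: ({ci[0]:.3f}, {ci[1]:.3f})")
```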
Of course a short slide deck like the one linked to in this thread is not going to teach you all the finer points of bootstrapping and you can definitely do it wrong. But compared to all the assumptions that frequentist statistics makes to generate confidence intervals and the fact that you need a different method for every different scenario, bootstrapping is about as robust and idiot-proof as it's going to get. "It has its applicability" is beyond selling it short.
> For that matter, due to the Central Limit Theorem, neither do analytic approximations like a t-test for comparing different group means.
Actually, it doesn't. You need a confidence interval, and the central limit theorem doesn't give you a confidence interval. It just says that the distribution gets close enough to normal at some point. In some cases, it might take a very large n before it becomes close to normal.
And exactly how long it will take for the approximation to reach a certain level of accuracy can be ascertained by using the other technique mentioned in the slide deck: run a simulation. And so we have come full circle :-)
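Such a simulation is only a few lines. Here is a sketch (the exponential distribution and sample sizes are my own choices for illustration) that checks the actual coverage of a nominal-95% normal-approximation interval for the mean at different n:

```python
import random
import statistics

# How fast does the normal approximation kick in? Check the empirical
# coverage of a nominal 95% interval (mean +/- 1.96 * SE) for the mean
# of a skewed (exponential, true mean 1.0) distribution.
rng = random.Random(1)

def covers(n, trials=2000, true_mean=1.0):
    hits = 0
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        m = statistics.fmean(sample)
        se = statistics.stdev(sample) / n ** 0.5
        if m - 1.96 * se <= true_mean <= m + 1.96 * se:
            hits += 1
    return hits / trials

for n in (5, 20, 100):
    print(f"n = {n:3d}: coverage ~ {covers(n):.3f}")   # approaches 0.95
```

For small n the coverage falls well short of the nominal 95%, and it creeps toward 0.95 as n grows, which is exactly the "how large does n have to be" question answered empirically.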
> The entire point of the bootstrap is that it does not require either the original or the sampling distribution to be normal
What I wanted to say is that you can't just blindly apply the bootstrap to every distribution. You should carefully check the applicability conditions before using it, or you could easily get nonsensical results.
You can't blindly apply the bootstrap to any problem whatsoever. But you really, absolutely, unequivocally can blindly apply the bootstrap to generate a confidence interval around the mean of data generated from any distribution. You cannot easily get nonsensical results. In a few rare edge cases, you can get suboptimal results (e.g. 80% instead of 95% coverage, or 99% coverage instead of 95% coverage) but even these are not nonsensical. I really don't see why you want to be arguing over facts.
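The "blind" recipe really is this short. A minimal percentile-bootstrap sketch, on made-up skewed data (the sample size and replication count are arbitrary choices):

```python
import random
import statistics

# Percentile bootstrap CI for the mean of skewed data.
# No normality assumption appears anywhere.
rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(200)]   # skewed, true mean 1.0

B = 5000
means = []
for _ in range(B):
    resample = [rng.choice(data) for _ in range(len(data))]
    means.append(statistics.fmean(resample))

means.sort()
lo, hi = means[int(0.025 * B)], means[int(0.975 * B)]
print(f"95% CI for the mean: ({lo:.3f}, {hi:.3f})")
```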
In the example you link to, the goal is not to estimate the mean of a uniform distribution but rather its maximum: U(0, max). A maximum is like a quantile, and one of the very few shortcomings of bootstrapping in realistic scenarios is that it does not always do well at generating confidence intervals around quantiles. Still, scientists can go an entire lifetime without ever feeling the need to estimate the maximum of a uniform distribution. The only famous example of this distribution in the context of actual science is the German tank problem: https://en.wikipedia.org/wiki/German_tank_problem
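A quick simulated illustration of why the maximum is a bad fit for the bootstrap (sample size and seed are arbitrary):

```python
import random

# Bootstrap resamples of the maximum of U(0, 1) data can never exceed
# the observed sample maximum, so a percentile "CI" for the true
# maximum (1.0) essentially always misses it from below.
rng = random.Random(0)
data = [rng.random() for _ in range(50)]   # U(0, 1), true max = 1.0

maxima = []
for _ in range(2000):
    maxima.append(max(rng.choice(data) for _ in range(len(data))))

print(f"sample max: {max(data):.4f}")
print(f"largest bootstrap max: {max(maxima):.4f}")  # bounded by sample max
```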
The talk is very explicitly a tutorial on a few good statistical methods for those without a statistical background, using simple tools that "hackers" are familiar with and that are occasionally necessary, especially in the case of bootstrapping. This isn't hacking in the sense of "growth hacking," thankfully.
How so? I see anyone who refuses to accept things as they are, who has an innate curiosity and a drive to get to the root of a problem, and who finally changes the rules to make them fit their purposes, as a hacker.
George Bernard Shaw would refer to this individual as an unreasonable man, but I prefer hacker.