Think Stats, using python to learn stats (greenteapress.com)
179 points by ms4720 on March 18, 2011 | 30 comments



Thanks, I'll put this on my iPad and go through it because I am interested in stats.

Did a professor use a non-commercial book for their class? If so, that is amazing.


Yes. Olin, the college where the author teaches, encourages that kind of stuff. I hope they are the start of a trend.


My compilers professor did. The notes were so extensive, though, that you didn't really need to look at the book. And it was written by a Finn, so the English was a little bit off.


I commend the effort to teach statistics to Python programmers, but I'd recommend interfacing with R for creating the figures. R will create beautiful figures by default; it can also handle sophisticated stats, should the need arise. The current figures are filled with unneeded inward-facing ticks that often overlap with the data, and have tick and axis labels of inconsistent font sizes. The negative numbers on the y-axis in Fig 2.3 and the x-axis in Figs 4.6 and 7.1 don't display properly, and the presentation of the data in Fig 3.1 is too noisy to deliver a useful message.
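For what it's worth, most of these issues are fixable in matplotlib itself; a minimal sketch using standard rc parameters (the sizes and data here are just illustrative):

  import matplotlib
  import matplotlib.pyplot as plt

  # Outward-facing ticks so they don't overlap the data
  matplotlib.rcParams['xtick.direction'] = 'out'
  matplotlib.rcParams['ytick.direction'] = 'out'
  # One consistent font size for axis and tick labels
  for key in ('axes.labelsize', 'xtick.labelsize', 'ytick.labelsize'):
      matplotlib.rcParams[key] = 10
  # Use ASCII hyphens for minus signs; works around fonts that
  # lack the Unicode minus glyph on negative axis labels
  matplotlib.rcParams['axes.unicode_minus'] = False

  plt.plot([-3, -2, -1, 0, 1, 2, 3], [9, 4, 1, 0, 1, 4, 9])
  plt.xlabel('x')
  plt.ylabel('x squared')
  plt.show()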


"One of the reasons we are using a general-purpose language rather than a stats language like R is that for many projects the "hard" part is preparing the data, not doing the analysis."

quote from week 2 of the accompanying lecture notes:

https://sites.google.com/site/thinkstats2010b/lecture-notes/...


I find preparing the data is actually easier in R than in something like Python, but that certainly depends on the type and amount of data. And, of course, to each their own.


I think the point is to learn statistics using an easier language than R, ugly figures be damned.


I see this as a real issue. R is useful but sorta crappy as languages go. I have hopes for Incanter [http://incanter.org/] (a Clojure-based stats environment), but it looks stalled out, though not quite dead in the water. Even so, for basic stats it's pretty viable.

I would love to see a stats environment based on a sane, full-featured language that was ready to take advantage of multiple cores and GPU computing. Clojure has all the building blocks, but it's not built yet.


I have to confess that whenever I do any statistics (which isn't much, these days) I struggle with R for a bit and then say "screw it" and use MATLAB.


Quick-R helped several of my friends try R: http://www.statmethods.net/

Once you're hooked, go to The R Inferno: http://www.burns-stat.com/pages/Tutor/R_inferno.pdf


I had been planning to work on Incanter over the summer as part of Google Summer of Code, but things didn't pan out. The organization promoting the project wasn't accepted as a mentoring organization.

Now I have to find another internship.


Well, the R syntax is more intuitive than statistics itself. Then again, what isn't?

Really, though: rpy2 is a nice enough interface to R, and things like MCMCpack make R worth the effort.
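A minimal sketch of what that looks like, assuming rpy2 is installed (the file name and data are made up):

  import rpy2.robjects as robjects

  r = robjects.r

  # Hand a Python list to R and let R draw the figure
  heights = robjects.FloatVector([1.62, 1.75, 1.68, 1.81, 1.59])
  r.png('heights.png')            # open a PNG graphics device
  r.hist(heights, main='Sample heights', xlab='meters')
  r['dev.off']()                  # close the device, writing the file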


What about R do you think makes it hard?


I find that ticks aren't enough to aid in real problems, and I always turn on the grid with 'grid(True)' in matplotlib. Also, the negative-number problems must be some kind of anomaly; I've never seen that problem, and I've used several revisions of matplotlib every day for the past couple of years.
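For example (the values here are made up):

  import matplotlib.pyplot as plt

  hours = [0, 1, 2, 3, 4]
  power_kw = [2.8, 3.1, 2.9, 3.4, 3.0]
  plt.plot(hours, power_kw)
  plt.grid(True)                  # grid lines at every major tick
  plt.xlabel('time (hours)')
  plt.ylabel('power (kW)')
  plt.show()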


Out of curiosity, why do you need a grid? (I need very few ticks myself and almost never a grid. I've found that if the visual message isn't clear enough or needs a different quantitative analysis or display, I'll request the raw data.)


I'm almost always interested in the actual values of things when debugging a part of our systems at work. I need to know that the generator exceeded 3 kW, or the PDF of the voltage was above 4 V, or things went haywire at 11:31 AM. It is easier to tell at a glance what's happening on a plot if there is a grid.


There are a million other plotting libraries you can use with Python if you don't like what pyplot spits out. I think R's pie charts are hideous myself. (And I wouldn't ever use R's default output for a BI chart.)


matplotlib is an excellent plotting package, and I've been using it in publications.


To me, Allen Downey is the Salman Khan of programming education.


How heavily is this tied to Python?

Can a Ruby/JS hacker with no Python experience jump into this book?


Come on, young Rubyist, embrace the dark side. You will like it. /jk

I went through the first couple of chapters, and all the Python code I've seen so far is closer to "pseudo-code" than to deeply idiomatic Python.

Considering that the professor also wrote "How To Think Like a Computer Scientist: Learning with {Python/Java/C++}", you can rest assured that he knows how not to trip up students with clever syntax.


Come on, young Rubyist, embrace the dark side. You will like it. /jk

I adopted Python because I was told Ruby is the dark side. oO


Probably.

  % gem install rubypython
  % wget http://thinkstats.com/Pmf.py
  % irb
  >> require 'rubypython'
  => true
  >> RubyPython.start
  => true 
  >> sys = RubyPython.import 'sys'
  => <module 'sys' (built-in)> 
  >> sys.path.append '.'
  => None 
  >> Pmf = RubyPython.import 'Pmf'
  => <module 'Pmf' from './Pmf.py'> 
  >> hist = Pmf.MakeHistFromList([1, 2, 2, 3, 5])
  => <Pmf.Hist object at 0x7f6681a4f490> 
  >> hist.Freq(2)
  => 2 
  >> hist.Freq(4)
  => 0 
  >> hist.Values()
  => [1, 2, 3, 5] 
  >> RubyPython.stop
  => true
The above is adapted from part of section 2.4 (http://www.greenteapress.com/thinkstats/html/book003.html#to...).


Python examples are embedded in the text, explaining what is being done. If you're familiar with Ruby, you should be able to do an on-the-fly mapping. But it's easy to see for yourself: http://www.greenteapress.com/thinkstats/thinkstats.pdf


I tried to keep the code simple -- no Python esoterica, only a few objects. So it should be easy to pick up the Python as you go along.

But let me know whether that turns out to be true.


Just wanted to say thanks for all you contribute. Your textbook manifesto is inspirational. Also, I wanted to offer some encouragement on your Complexity book. It looks like it's off to a great start.

Any chance we'll see a Clojure version of any of your books? It seems to be shaping up as a good choice for scientific computing.


Thanks! I am hoping to get back to the Complexity book soon. I want to finish the section on agent-based modeling (and I need to remove some of the stats material that I poached for Think Stats).

Clojure looks interesting, but I don't know much about it, so I probably won't do anything with it soon. But part of the reason I use free licenses is so that other people can adapt my books.

Nicholas Monje is working on Think OCaml (http://thinkocaml.com); that might be the best version to translate into Clojure.


Does anyone know of a similar resource for physics?


What area of physics do you want to know about? There are some free books (for example, I know of one on small-angle neutron scattering that you can find at NIST), and there are many lecture notes online. For texts, though, do you think it would be better to have a static PDF, or a wiki where an author could add what they could and the community could improve it?


I looked through Downey's book.

(A) Definition of Probability.

As far as I could tell, the book defined probability in only two ways:

(1) Relative frequency assuming finitely many trials (page 13) and

(2) Bayesian prior intuitive belief (page 52).

Here the book is doing readers a disservice.

(B) Exponential Distribution

The 'derivation' of the exponential distribution on page 37:

"I’ll start with the exponential distribution because it is easy to work with. In the real world, exponential distributions come up when we look at a series of events and measure the times between events, which are called inter-arrival times. If the events are equally likely to occur at any time, the distribution of inter-arrival times tends to look like an exponential distribution."

is a bit too vague.
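Though the claim itself is easy to check by simulation (numpy assumed; the numbers are arbitrary): scatter events uniformly in a window, and the gaps come out exponential.

  import numpy as np

  n, T = 100000, 100000.0
  # Events equally likely to occur at any time in [0, T]
  events = np.sort(np.random.uniform(0.0, T, n))
  gaps = np.diff(events)

  # With rate n/T = 1 event per unit time, the gaps should look
  # exponential with mean 1 and median ln(2) ~ 0.693
  print(gaps.mean())
  print(np.median(gaps))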

(C) Random.

The book also uses 'random' too frequently, too loosely, and unnecessarily. Instead, we can essentially just drop the word 'random' and avoid questions about what 'random' means and whether some data is 'truly random'.

For (A), essentially universally in advanced work, there is only one approach:

We have a non-empty set of 'trials'. Then an 'event' is a subset of the set of all trials. The set of all trials is also an event. The set of all events is closed under countable unions and relative complements. In particular, the empty set is an event.

A 'probability' is a function P from the set of all events to the interval [0,1]. For an event A, P(A) is its probability. The probability of the set of all trials is 1.

P is 'countably additive': for countably infinitely many pairwise disjoint events A_i, i = 1, 2, ..., the probability of the union of the A_i is the sum of the P(A_i).

A 'random variable' X is a function from the set of all trials to the real numbers such that for any real number x the set of all trials w such that X(w) <= x is an event. Usually we suppress the notation of the trials and just write the event X <= x. This is our only use of 'random', and we give no more definition of it.
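In the usual notation, with Omega the set of trials and F the set of events, that setup is:

  \Omega \neq \emptyset, \qquad \mathcal{F} \subseteq 2^{\Omega}, \qquad
  P : \mathcal{F} \to [0, 1], \qquad P(\Omega) = 1

  P\Bigl( \bigcup_{i=1}^{\infty} A_i \Bigr) = \sum_{i=1}^{\infty} P(A_i)
  \quad \text{for pairwise disjoint } A_i \in \mathcal{F}

  X : \Omega \to \mathbb{R} \quad \text{with} \quad
  \{ w : X(w) \le x \} \in \mathcal{F} \ \text{for all } x \in \mathbb{R}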

In practice, essentially any number at all that we observe can be regarded as a random variable. In particular, the number might have been something observed about a person. The statement on page 52 that

"Anything involving people is pretty much off the table."

would be a silly foundation for probability.

For Bayesian prior beliefs, we can regard such a belief as an estimate of a conditional probability where we condition on what we know. E.g., we know a lot about people, so that given a person, we can guess that with probability over 99% the person is less than 10 feet tall. This point does not mean that we have a 'Bayesian' foundation for probability. It's better just to drop mention of 'Bayesian'.

The 'distribution' of a random variable X is the function F_X(x) = P(X <= x).

Full details are in any of, say,

Jacques Neveu, 'Mathematical Foundations of the Calculus of Probability', Holden-Day.

Leo Breiman, 'Probability', ISBN 0-89871-296-3, SIAM.

M. Loeve, 'Probability Theory, I and II, 4th Edition', Springer-Verlag.

Kai Lai Chung, 'A Course in Probability Theory, Second Edition', ISBN 0-12-174650-X, Academic Press.

Yuan Shih Chow and Henry Teicher, 'Probability Theory: Independence, Interchangeability, Martingales', ISBN 0-387-90331-3, Springer-Verlag.

Essentially what is going on is that we are defining P as a non-negative real 'measure' with 'total mass 1' as in any of:

Paul R. Halmos, 'Measure Theory', D. Van Nostrand Company, Inc.

H. L. Royden, 'Real Analysis: Second Edition', Macmillan.

Walter Rudin, 'Real and Complex Analysis', ISBN 07-054232-5, McGraw-Hill.

This approach to probability is the one used in essentially all advanced work, e.g.,

Ioannis Karatzas and Steven E. Shreve, 'Brownian Motion and Stochastic Calculus, Second Edition', ISBN 0-387-97655-8.

Jean-Rene Barra, 'Mathematical Basis of Statistics', ISBN 0-12-079240-0, Academic Press.

Early in Neveu one can see some of the ways we get essentially pushed into this foundation for probability.

With this approach, for a random variable X, we define its expectation E[X] as just its integral with respect to measure P in the sense of measure theory. Then essentially by a routine 'change of variable' we can get E[X] as an integral in terms of the distribution F_X and the measure it defines on the reals.
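In symbols, the definition and the change of variable are:

  E[X] = \int_{\Omega} X \, dP = \int_{\mathbb{R}} x \, dF_X(x)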

With this approach, if, say, for some positive integer n we measure the height of n people, then we say that we have the values of the n random variables

X_1, X_2, ..., X_n

and do not say that we have the values of n trials of one random variable X.

Indeed, with this approach, all our experience, all of the universe, is only one trial.

Given a set of events, we can define what it means for the set to be 'independent'. And we can proceed similarly for random variables.

Then we can give a more careful statement of the central limit theorem and also state the law of large numbers. We can also say why we use an average to estimate an expectation, e.g., as in

Paul R. Halmos, "The Theory of Unbiased Estimation", 'Annals of Mathematical Statistics', Volume 17, Number 1, pages 34-43, 1946.

On page 12 there is:

"The mean of this sample is 100 pounds, but if I told you 'The average pumpkin in my garden is 100 pounds,' that would be wrong, or at least misleading. In this example, there is no meaningful average because there is no typical pumpkin."

No: An 'average' does not have to be 'typical' to be 'meaningful'. The expectation of the weight X of a pumpkin in the garden remains meaningful. And the law of large numbers will still apply.

For the exponential distribution, there is a good 'qualitative' derivation: if the arrival process has stationary, independent increments, then the inter-arrival times are independent and all have the same exponential distribution. Good details are early in the chapter on Poisson processes in

Erhan Cinlar, 'Introduction to Stochastic Processes', ISBN 0-13-498089-1, Prentice-Hall.
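The heart of that derivation is one functional equation: stationarity and independence make the survival function of an inter-arrival time T multiplicative, and the only right-continuous solutions are exponential:

  G(t) = P(T > t), \qquad G(s + t) = G(s) \, G(t) \ \text{for } s, t \ge 0
  \quad \Longrightarrow \quad G(t) = e^{-\lambda t} \ \text{for some } \lambda > 0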

Downey's book places a lot of emphasis on distributions, e.g., Weibull, and descriptions, e.g., skewness. So an implication is, given some data, we should try to find its distribution and various summary properties of it, but usually this is unpromising. It is good to have the concept of a distribution but, given some data, usually not good to try to find its distribution. Instead, we get our results by manipulating the data in ways where we need to know little or nothing about the distribution.

The strong emphasis on hypothesis testing is curious. There is more in:

E. L. Lehmann, 'Testing Statistical Hypotheses', John Wiley and Sons.

E. L. Lehmann, 'Nonparametrics: Statistical Methods Based on Ranks', ISBN 0-8162-4994-6.

Downey's book mentioned the chi-squared distribution: the main point is that if X_1, X_2, ..., X_k are independent random variables, each with Gaussian distribution with mean 0 and variance 1, then

Y = X_1^2 + X_2^2 + ... + X_k^2

has chi-squared distribution with k degrees of freedom.
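This is easy to check by simulation (numpy assumed):

  import numpy as np

  k, n = 5, 100000
  x = np.random.normal(0.0, 1.0, (n, k))   # n draws of k standard normals
  y = (x ** 2).sum(axis=1)                 # sum of k squared normals

  # chi-squared with k degrees of freedom has mean k and variance 2k
  print(y.mean())   # close to 5
  print(y.var())    # close to 10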

Downey's book mentioned the variance of a finite population; while variance is of high interest, the variance of a finite population is not. What is usually wanted is an estimate of variance, and for that the formula in the book should divide by n - 1 instead of n; this change yields an 'unbiased' estimate.
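numpy exposes both estimators directly; a quick check (data made up):

  import numpy as np

  x = np.random.normal(10.0, 2.0, 30)
  print(np.var(x))             # divides by n: biased low
  print(np.var(x, ddof=1))     # divides by n - 1: unbiased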

It would be good for students to get at least an intuitive understanding of most powerful hypothesis testing via the Neyman-Pearson result: using an analogy from real estate investing, regard the false alarm rate as money to be spent, and spend the money to get the best return on investment by buying the property with the highest ROI first, then the one with the second highest, etc. Yes, in principle this process leads to a knapsack problem, which is NP-complete. Likely the nicest proof of the Neyman-Pearson result is from the Hahn decomposition; this follows from the Radon-Nikodym result, with a famous proof by von Neumann in Rudin.
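In symbols, the Neyman-Pearson test rejects the null when the likelihood ratio, the 'return per unit of false alarm budget' in the analogy above, exceeds a threshold set by the budget alpha:

  \text{reject } H_0 \ \text{when} \ \Lambda(x) = \frac{f_1(x)}{f_0(x)} > k,
  \qquad k \ \text{chosen so that} \ P_{H_0}\bigl( \Lambda(X) > k \bigr) = \alpha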

The mention of 'resampling' is curious: There are cases where we won't have independent random variables but will have 'exchangeable' random variables, and exchangeability can be enough.

One example is extending nonparametric tests to multidimensional data and using them for anomaly detection in server farms and networks, where multidimensional data is ubiquitous, e.g.,

N. B. Waite, "A Real-Time System-Adapted Anomaly Detector", 'Information Sciences', volume 115, April, 1999, pages 221-259.



