An Introduction to Statistics with Python

Tycho · on June 1, 2015

IPython Notebook is great. Making it the norm for exploring/teaching ideas in a coding context is a great trend.

One other thing I hope comes along is annotated mathematical formulas/expressions. Any time I see a mathematical symbol I want to be able to hover over it and get its definition (in the context it's being used) and other relevant information. So often I see formulas in articles/books where they don't explain everything. It makes it really hard for outsiders/amateurs to catch on to the ideas being discussed.

baldfat · on June 1, 2015

IPython is becoming Jupyter [https://jupyter.org/] which has made IPython language agnostic. I have used it with Julia, R and Haskell.

It is a great REPL-in-a-browser experience. Personally, I think RStudio has done pretty amazing things, and it also has the ability to run with different kernels, but it is specifically focused on R.

> Syntax highlighting modes for many new languages including Clojure, CoffeeScript, C#, Graphviz, Go, Groovy, Haskell, Java, Julia, Lisp, Lua, Matlab, Perl, Ruby, Rust, Scala, and Stan. [http://www.rstudio.com/products/rstudio/download/preview-rel...]

nileshtrivedi · on June 1, 2015

Hover is not possible on touch devices. From 5 possible interactions (left click, right click, left double click, right double click, hover), we've gone to 2 (short press, long press). On the other hand, we have acquired gestures like pinch-zoom, so it may not be a big loss.

tangue · on June 1, 2015

High-end Samsung Galaxy phones have a kind of hover effect (called "air-view") through an infrared sensor, so it's quite possible that in the future this could become a common feature.

arikrak · on June 1, 2015

Macbooks and other laptops provide pinch-zoom and other gestures. Maybe phones and tablets could detect a very light touch as hover?

tunnuz · on May 31, 2015

Also relevant: http://greenteapress.com/thinkstats by Allen B. Downey (free PDF but available as O'Reilly print).

gjreda · on June 1, 2015

His Think Bayes is another good one: http://greenteapress.com/thinkbayes/

graffitici · on June 1, 2015

Can people comment as to the quality of this article? Would it be a good place for a refresher on statistics? Are there any other such projects that people can recommend?

I like the way it has a strong Python focus.

grayclhn · on June 1, 2015

I wouldn't use it as a stats book. The coverage is very spotty and seems a bit old fashioned. Lots of different test statistics, not too much intuition from what I could see, and too little coverage of most things to be useful (i.e., the bootstrap gets a few paragraphs, Bayesian stats gets just a little more, etc).

I don't know of good stats books focusing on Python, but I'm sure there are plenty. "An Introduction to Statistical Learning" is free online [1] but it emphasizes R and has very little overlap, so I don't know if it addresses your needs, but it's very good.

1: http://www-bcf.usc.edu/~gareth/ISL/

_ikuh · on June 1, 2015

Well Python is right there in the title! ;)

It looks like more of an ebook, and it seems to cover most of the essentials in your typical undergrad course + bootstrapping and an intro to Bayesian stats. From what I scanned it might be a little slow if you're just looking for a stats refresher -- depending on your level of experience I'd recommend the numpy and pandas docs as a "next step up" from the link (and maybe a bit more practical IMO).

fluential · on June 1, 2015

Also relevant - Statistics for Engineers Tutorial at SRECon15 - https://github.com/HeinrichHartmann/StatisticsTutorial

Lofkin · on June 1, 2015

More good python resources: http://web.bryant.edu/~bblais/statistical-inference-for-ever...

Harvard Data science class, in python: http://cs109.github.io/2014/

jeo1234 · on June 1, 2015

I still have to finish reading it, but I wonder how python will ultimately stack up against R?

Lofkin · on June 1, 2015

Python has more consistent stats, time series and programming syntax (for the packages it does have), and a better Bayesian inference package (pymc 3 > stan).

Python is also better than R for ad hoc statistical modeling and algorithm development (you can write Python code on the order of C fast with numba), general programming, scraping, natural language processing, agent-based modeling etc.

Python is also better for GIS, optimization, symbolic math and larger datasets with blaze and dask and pyspark.

R right now is a bit better for visualization, reporting and exploratory data analysis (I think this will change soon though with Bokeh and blaze) and has many more esoteric stats packages.

With statsmodels, pymc3, pandas and scikit-learn etc you can probably do 98% of everything you need in Python without needing to dip into the more esoteric packages of R (with some exceptions). For everything else, you can call R packages with rpy2. With this you get all the advantages of working in one language (not spread too thin) and the advantages that Python offers while leveraging R's wealth of packages.

The latter is a bit more difficult through Python, but not as hard as trying to remember the syntax of, and gluing together, a two-language workflow.

That is why I chose Python... Also I can write Excel add-ins with Python (xlwings) instead of using VBA.

data_scientist · on June 1, 2015

Genuine question, why do you find Pymc3 better than Stan? Pymc3 is still in alpha, while Stan has been stable for many years. They both implement the same algorithm (NUTS), so is it just the nicer syntax or is it anything else?

peatmoss · on June 1, 2015

I've made minimal use of Stan, and not really used Pymc3, but from a quick look, it seems Pymc3 is a bit more integrated than RStan. In RStan, you end up writing Stan code as an alien, wrapping the foreign syntax in quotes, and then shoveling the code as a string into Stan. There isn't really an R grammar for Stan as near as I can tell.

Lofkin · on June 1, 2015

Aside from being written in Python, which makes it more natural, extensible and amenable to messing around with the models, pymc 3 can sample directly from discrete parameters and Stan cannot. Pymc3 also has more sampling routines.

mikecb · on June 1, 2015

Just to note, if you like Stan, it has a Python lib just as much as an R one, and Julia, MATLAB, etc (Stan itself is written in C++).

As for vis, yhat wrote a Python version of ggplot: http://ggplot.yhathq.com/

graffitici · on June 1, 2015

xlwings looks absolutely amazing. Thanks for pointing it out!

hessenwolf · on June 1, 2015

Python is better for programming, R is better for statistics. A lot of data analysis packages require a bit of both, so I usually use a bit of both, and C++ underneath when something has to run faster.

To expand, R has every cool statistical model (outside finance) to hand. It's deadly buzz. It's finicky as hell to program in though. Classes are a bit crap. Brackets are all over the shop. Arrows? Really? Meh. Statistical models take ages to program and test, though, and the academic paper that was published last week has had the model as an R package for 6 months.

Python is a beautiful programming language, especially for anything involving transferring data from here to there and possibly back again. I particularly like it when working with a diverse team of context experts, who would think nothing of laying a faecal nugget in your code-base, to save the time it takes to drink a cup of coffee. It's a lot harder to write crap code in Python than in many other languages.

So, in our local technical landscape, R is used for prototyping, and is used in a slave fashion for running complex stats models. Python is used as the "Controller" of the program, and also for database type operations, web-scraping, calling QuantLib, ...

baldfat · on June 1, 2015

This is an unnecessary fight.

They both are great. I personally strongly prefer R over Python due to libraries and the flexibility. Others hate the flexibility of R and like Python's only one way to correctly do something.

It comes down to preferences. I like that R is for statistics specifically. There are people who call R an awful, horrible language and it isn't. It has grown incredibly over the last 5 years and has major funding from many corporations for a reason.

If it seems passionate for or against either language, it is more an emotional response to the question. R works best for me and Python might be someone else's preferred solution.

Myrmornis · on June 1, 2015

I agree, I like R and python for certain things. (python for most things and R for statistics and visualization).

However, I really don't agree with the "X works best for me but Y might work better for someone else so let's not fight" argument. Doesn't that imply that the question "Which is better, X or Y?" shouldn't be asked? On the contrary, I think that's often an important question when you need to practically get stuff done, choose between libraries, etc, and I want to know people's opinions.

dragonwriter · on June 1, 2015

> Doesn't that imply that the question "Which is better, X or Y"? shouldn't be asked?

No, it just implies that there is, in the domain in which it is applied, no objectively correct answer.

baldfat · on June 1, 2015

Just be warned I named my son Soren for the founder of Existential Philosophy Soren Kierkegaard. There is no spoon.

baldfat · on June 1, 2015

Here is my opinion on Python and R. I used Pandas and switched to R.

1) Python being 0 based as opposed to 1 based is big to me. I don't want to always have to minus one on columns to call them.

2) I much prefer Hadley Wickham's syntax to Pandas, which closely follows it, but R piping with %>% is amazing with library magrittr. Using dplyr and stringr, tidyr, rvest, httr, lubridate is awesome and consistent.

3) In R I can use dplyr, which is fast, or even use data.table, which uses a SQL-like syntax, for even more speed. In Python your choices are limited in comparison to R for tools.

4) HUGE and the reason I switched to R. I can output my graphics into an editable PowerPoint slide. My managers and company don't know that there is anything besides MS Office and I can give them slides that they can fiddle with for their own presentations.

5) RMarkdown and knitr for outputting reports in Word (UGH but it works great), PDF, LaTeX or HTML is fantastic. Reproducible output is great. If I want I can also use Project Jupyter (IPython's new name in the next version) with an R kernel.

6) Tools - I got RStudio installed on my work computers, and my personal server has RStudio Server installed. This is what I have settled into using because it is the best statistical programming tool out there. Works better than Jupyter/IPython 90% of the time, though I love Love LOVE Jupyter/IPython.

7) The community. R's community is fantastic [http://www.r-bloggers.com/] just look at all the daily information, new libraries and tutorials. The books are great in R (I really enjoyed the Python Pandas book by Wes). Hadley Wickham's newest book is available in physical form for purchase and free on the internet [http://adv-r.had.co.nz/]

8) Tutorials - Datacamp, Swirl in R tutorial, Coursera Courses, etc. are great and have just about anything you would like to learn.

9) The libraries and install system, with CRAN installs [http://cran.r-project.org/] and devtools to install and update GitHub libraries on your system [https://github.com/hadley/devtools]

I could go further but those are my opinions to why I like R so much.
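On point 1, a minimal sketch of the off-by-one bookkeeping being described, and of label-based lookup as one way around it. The table, column names and helper functions here are hypothetical and only for illustration; this is plain Python, not the pandas API.

```python
# Hypothetical table stored as a list of rows; column names are illustrative only.
columns = ["id", "name", "score"]
rows = [
    [1, "ada", 91.5],
    [2, "bob", 78.0],
]


def column_by_position(rows, position_1_based):
    """Return a column using a 1-based position, the way an R user would count."""
    # Python sequences are 0-based, so the "minus one" happens here.
    return [row[position_1_based - 1] for row in rows]


def column_by_name(rows, columns, name):
    """Return a column by its label, so no position arithmetic is needed."""
    idx = columns.index(name)
    return [row[idx] for row in rows]


third = column_by_position(rows, 3)
scores = column_by_name(rows, columns, "score")
```

Looking columns up by name avoids the counting question entirely, which may or may not be enough depending on how much positional indexing a given workflow needs.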

Lofkin · on June 1, 2015

Ultimately Julia combines the strengths of both and much more. It is the future IMO, not R or python.

hueving · on June 1, 2015

Way to miss the point. Part of a language selection for a real-world project is based on the libraries available for it. If I'm doing systems administration programming, there is a much higher probability of there being a Python library to interact with BGP or namespaces than there is for Julia.

Selecting Julia blindly is just as stupid as selecting R or python blindly. Language features are only a small part of the reason you choose a language for a large project.

apl · on June 1, 2015

The Julia FFI for Python is absolutely excellent -- calling a particular Python library from Julia takes very, very little effort. As a language for scientific programming, Julia is way ahead of Python. So I'm not sure if this particular argument holds much water.

https://github.com/stevengj/PyCall.jl

baldfat · on June 1, 2015

I think it is a chicken or the egg problem with Julia right now. Julia is very much a positively viewed language that isn't being used much.

There are not enough libraries (I looked 3 months ago) for me to do my work in Julia.

There aren't enough users to make the libraries for my use case.

Lofkin · on June 9, 2015

What is missing? If it's something like Shiny, Julia is well on its way to obviating that problem: https://shashi.github.io/Escher.jl/

kermatt · on June 1, 2015

Depends on what you want. R has arguably better coverage for packages that provide various statistical tools, but Python is arguably a better general purpose language.

jtth · on June 1, 2015

Python: We Still Can't Do Generalized Linear Mixed Models, So Call R.

tadlan · on June 1, 2015

Or use pymc 3 for the Bayes equivalent. It's better anyway. Yes, one can also use rpy2. I think GLMM is a GSoC project or an upcoming PR, but I'm not 100% sure.

graffitici · on June 1, 2015

I realize this is sarcasm; what's the underlying point? That there are no GLMM libraries in the Python world?