Practical Data Science in Python

gallamine · on Feb 5, 2015

For those not familiar with it: this blog post was written using IPython Notebooks - you can code, plot and then render to HTML all in the browser. Most of my data science work is done using this format. If Python isn't your language of choice, there are lots of plugins for IPython Notebook that let you effectively do an in-browser REPL with plotting and documentation: http://ipython.org/notebook.html

It's changed the way I work (and blog).

spot · on Feb 6, 2015

If you like that, you might consider Beaker Notebooks, which have some more advanced UI elements and better polyglot support: http://BeakerNotebook.com

You can combine multiple languages in the same notebook, and they can communicate.

otoburb · on Feb 6, 2015

First time stumbling across Beaker Notebook. Looks great! Recently heard about Jupyter[1][2], which is a project to generalize the IPython backend to be language agnostic. Are the various Two Sigma financial services entities using Beaker internally? Seems like a natural fit for the highly mathematical nature of the work Two Sigma staff and entities are engaged in.

Interesting tweet from one of the IPython/Jupyter devs[3] last year about Beaker. Jupyter and Beaker state similar missions, although Beaker seems to have focused more on "multiple languages within a single notebook" leveraging the IPython backend. Sounds like the two projects can both co-exist due to the slightly different emphasis influencing each project's trajectory.

[1] http://jupyter.org/

[2] https://speakerdeck.com/fperez/project-jupyter

[3] https://twitter.com/Mbussonn/status/481108119633035264

baldfat · on Feb 6, 2015

Jupyter is coming, BUT I have no idea why the community around IPython still doesn't know that IPython will only be the Python back end and Jupyter will be the overall project.

This switch is happening in IPython 3.0, and the lack of public information still concerns me. It feels like Jupyter has lost steam?

jasongrout · on Feb 6, 2015

Jupyter definitely hasn't lost interest - the IPython team is currently spending a lot of time finishing up IPython 3.0 and seeking funding (grants, etc.). IPython 3.0 is due to be released soon now (3.0 beta 1 was tagged a few days ago, and the development is somewhat in a soft freeze in preparation for 3.0). The general plan is to split up the IPython repo and transition to a Jupyter repo after 3.0 is released, and have a relatively quick iteration to 4.0 (mainly consisting of the repo split).

devilsdounut · on Feb 6, 2015

Interesting. Any advantages of this over IPython? I see fancy rendering of DataFrames, but there are libraries that let you do this in IPython.

spot · on Feb 6, 2015

Thanks.

Beaker has autotranslation for communicating among languages, and you can have multiple languages in one notebook.

Four months ago someone asked for compare and contrast: https://news.ycombinator.com/item?id=8366245

Since then we have fixed a ton of bugs and fleshed out the concept. The reflection API and JS scriptability are about to be released, along with a bunch of UI polish, performance, and all kinds of fixes.

There have been only 2-3 of us working on it for 2 years so far. The project is still young and developing, and in fact we are hiring! We especially need a JavaScript engineer, Angular experience a plus: http://www.twosigma.com/careers/position/935.html

IanCal · on Feb 6, 2015

This looks really interesting, thanks. The communication is great, often I want to munge the data / collect things in python, then graph them in R.

spot · on Feb 6, 2015

To respond more directly to your question:

Yes you can get better tables with IPy if you remember to load a lib and call a function. With Beaker it just works by default.

Ditto for sharing: Beaker has one-click sharing to the web built in, while with IPy you have to load an extension.

On the Mac, Beaker comes packaged as a native app that you just drag to Applications and the Dock.

spot · on Feb 6, 2015

PS: we have a prerelease of v1.2 available, including a Docker container. See the last question at the bottom of our FAQ page for the download links: http://beakernotebook.com/faq

calebm · on Feb 6, 2015

I love IPython Notebook too. I use it for most of my analysis work.

gaius · on Feb 6, 2015

Back in the 90s we used a program called MathCAD which did all this interactive notebook stuff, in MS Word.

gh02t · on Feb 6, 2015

MathCAD is still around, it is at least pretty popular in my field (nuclear engineering). I actually hate it to be honest, but I think it is good for people who are less inclined towards programming.

That said, I think Mathematica does a much better job of notebook style programming. You can do some truly amazing things manipulating the Mathematica notebook. The language itself is also pretty nice, something like APL flavored lisp with M-expressions instead of S-expressions. It isn't without its flaws, but it is one of my favorite tools in the toolbox (along with Python, C++, Haskell and Fortran).

gaius · on Feb 6, 2015

This may be prior to Mathematica getting the notebook - we were also using MATLAB and FORTRAN on Vaxes :-)

lqdc13 · on Feb 6, 2015

It's great for some things, but bad for others. If you're using Firefox for example, and the notebook is calculating something, it will lock up all Firefox windows. Also, if you are printing a lot, the html rendering is very slow and consumes a lot of CPU.

Great for previewing graphs and copy pasting and executing things out of order.

filmor · on Feb 6, 2015

The first part regarding Firefox is not true. Calculation is completely asynchronous in the IPython notebooks, i.e. if you run a cell that doesn't output anything your browser is idling.

Even if it has one pretty large output it works quite well. However, if you `print` each intermediate result and have a few hundred of those, you do get a problem; some kind of overflow protection is needed here.
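One way to sidestep the output problem described above is to keep intermediate results in memory and print only an occasional progress line. This is a minimal sketch, not anything built into IPython; the function name `run_steps`, the `report_every` parameter and the squared-step "computation" are all hypothetical stand-ins.

```python
def run_steps(n_steps, report_every=100):
    """Run a loop, storing every result but printing only now and then."""
    results = []
    for step in range(n_steps):
        value = step * step  # hypothetical stand-in for a real intermediate result
        results.append(value)
        # Emit one progress line per `report_every` steps instead of one per step.
        if (step + 1) % report_every == 0:
            print(f"step {step + 1}/{n_steps}: latest value = {value}")
    return results


results = run_steps(500)
print(f"collected {len(results)} results; last = {results[-1]}")
```

With 500 steps this writes a handful of lines to the cell output rather than several hundred, and the full list is still available for plotting or inspection afterwards.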

lqdc13 · on Feb 6, 2015

I'm not getting that Firefox issue now for some reason. I'll file a bug report if it happens again.

neverminder · on Feb 6, 2015

There's this thought constantly bugging me - Python is popular among data scientists, but it also happens to be quite a slow language (roughly speaking) in comparison to the likes of Java or Go for instance. Hypothetically speaking, would it not be more beneficial to use something like Rust instead?

IanOzsvald · on Feb 6, 2015

Don't forget that the higher level functionality (e.g. the scikit-learn routines Radim uses) are typically wrappers for underlying C/Fortran routines and they're the real bottleneck. The relatively few lines of VM'd Python are 'slow' compared to e.g. C but aren't the bottleneck.

The win with Python (and other dynamic languages) is that you can experiment quickly with ideas when you're formulating a solution; that's a big part of exploratory data science.

If you're curious about high-speed work in Python - Radim did a blog series on how he sped up word2vec to be faster than Google's original C code: http://radimrehurek.com/2013/09/deep-learning-with-word2vec-...

I'll also note [self promo!] that I wrote a book on High Performance Python, if that's your cup of tea (and Radim wrote a section in it): http://shop.oreilly.com/product/0636920028963.do

Radim · on Feb 6, 2015

(tutorial author here) Good answer, and I can only recommend Ian's book!

I cut the marketing speak down to minimum in my articles and tutorials, but if you're interested in cutting edge machine learning & no-nonsense data mining, get in touch! I run a world class consulting company, http://radimrehurek.com.

danieldk · on Feb 6, 2015

> The win with Python (and other dynamic languages) is that you can experiment quickly with ideas when you're formulating a solution, that's a big part of exploratory data science.

And in my experience, very hard to reproduce after a couple of years. With enough discipline, it's obviously possible to make well-structured Python programs that will last. But in practice that rarely happens with scientific software written in Python. Usually, there are many external dependencies, it's fragile (no static type checking), and platform-dependent (usually OS X or Linux). To add to the mess, most scientists like to hardcode paths to the input data, etc.
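On the hardcoded-paths point, a small amount of structure goes a long way. The sketch below is only an illustration, assuming a script that reads a single input file; the `--input` flag, the default `data/input.csv` and the example path are hypothetical names, not taken from any particular project.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Take the input location from the command line instead of hardcoding it."""
    parser = argparse.ArgumentParser(description="Hypothetical analysis script")
    parser.add_argument(
        "--input",
        type=Path,
        default=Path("data") / "input.csv",  # relative default, not a user-specific absolute path
        help="path to the input data file",
    )
    return parser.parse_args(argv)


# Passing an explicit argument list keeps the example reproducible;
# a real script would call parse_args() and read sys.argv.
args = parse_args(["--input", "experiments/run1.csv"])
print(f"reading from {args.input} (exists: {args.input.exists()})")
```

Building paths with `pathlib` also avoids baking in one platform's path separator.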

Although I am not a fan of Java, I usually don't encounter the same problems with older scientific Java software. If it's Mavenized you are usually ready to go after a 'mvn compile', otherwise, you just dump the project structure in an IDE and it usually works.

(The plague with scientific software in Java is that it is often not thread-safe.)

Also, I think the quick experimentation is not limited to Python and statically typed languages with a REPL can also provide that (Haskell, OCaml, Scala). And since Go was mentioned: since compilation time in Go is usually near-zero, it's the same.

gh02t · on Feb 6, 2015

> And in my experience, very hard to reproduce after a couple of years.

Well, let's be honest with ourselves... this isn't limited to Python. Scientific code that isn't a mess is almost nonexistent. For a lot of scientists, writing code is totally secondary and many simply aren't skilled programmers (nor should we necessarily expect them to be).

It is however deeper than that. As a graduate student, I was involved in a government initiative to write a high quality large scale code package. This was (still is, the program just got extended) a well funded and well organized effort with hundreds of people, including literally dozens of people who can legitimately claim to be the best in the world at their specialties. This included some genuinely amazing computer scientists and software engineers who enforced well planned coding practices.

And yet, the code is still far from ideal. A big part of this is its scale - millions of lines of very technical numerics code and libraries all working together. Most of what I consider to be the toughest work was on integrating various disparate pieces and unifying them under one common input structure.

Point being, even with effectively unlimited resources, rigorous development standards and statically typed languages (primarily C++11), there are still tons of issues. A lot of it is because of the incorporation of older codes, which is inescapable in any non-trivial scientific code.

bkcooper · on Feb 6, 2015

> I'll also note [self promo!] that I wrote a book on High Performance Python

I've really enjoyed this book so far, so thanks!

IanOzsvald · on Feb 7, 2015

Glad you enjoyed it :-) If you have a moment, leaving a review (e.g. on Amazon) would be most appreciated (there's a dearth of reviews as it is a bit of a niche subject!)

olavgg · on Feb 6, 2015

Nice! Just bought your book! :)

dr_zoidberg · on Feb 6, 2015

Also, you've given some amazing talks at various PyCons!

IanOzsvald · on Feb 6, 2015

Much obliged (assuming that's directed at me!) - Radim's started with some rather nice talks too :-)

lqdc13 · on Feb 6, 2015

Most of the libs that you use are written in C with Python bindings, so they're not that slow. It's only slow if you are implementing a new algorithm without using numpy/scipy to do matrix operations.
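A rough way to see this for yourself is to time the same dot product written as an interpreted loop and as a single NumPy call. This is a sketch under simple assumptions (one machine, one array size, a hypothetical helper named `dot_pure_python`); actual timings vary, so none are quoted here.

```python
import timeit

import numpy as np


def dot_pure_python(a, b):
    """Dot product as an interpreted Python loop."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


n = 100_000
a_list = [float(i) for i in range(n)]
b_list = [1.0] * n
a_arr = np.array(a_list)
b_arr = np.array(b_list)

# Both versions compute the same value.
py_result = dot_pure_python(a_list, b_list)
np_result = float(np.dot(a_arr, b_arr))

# Time ten repetitions of each; np.dot runs in compiled code.
py_time = timeit.timeit(lambda: dot_pure_python(a_list, b_list), number=10)
np_time = timeit.timeit(lambda: np.dot(a_arr, b_arr), number=10)
print(f"pure Python: {py_time:.4f}s, NumPy: {np_time:.4f}s")
```

The loop version is the "new algorithm without numpy" case; the `np.dot` call is the "C with Python bindings" case.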

jabl · on Feb 7, 2015

In my experience, when you're doing technical computing you spend a lot of time exploring your data (or some representative test data set) and doing quick one-off analyses. So an interactive environment with good plotting capabilities etc. is invaluable. See the success of things like Matlab, R, and yes, python (particularly IPython + the rest of the Scipy stack).

Secondly, since a lot of technical computing involves multidimensional arrays, you want good support for them in the language. Which means some kind of array syntax such as Matlab, R, python/numpy, etc., and also that they are efficiently handled behind the scenes (one array rather than the nested arrays somewhat popular in C code).

So in the end, there's not a whole lot to choose from if you're not willing to sacrifice either of the two features above. One language I'm excited about, Julia, is a bit special in that it tries to combine the high productivity of such high-level interactive environments with C/Fortran-like high performance. The language itself is really nice, IMHO, but of course the surrounding ecosystem is so far much less mature than that around scientific python.
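To make the array-syntax point concrete, here is a small sketch contrasting nested Python lists with a single NumPy array. The variable names and the 2x3 data are made up for illustration.

```python
import numpy as np

# Nested lists: each row is a separate Python object.
nested = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# One ndarray: a single block of memory with shape metadata.
grid = np.array(nested)

# Pulling out a column needs an explicit loop with nested lists...
col_nested = [row[1] for row in nested]
# ...but is a single slicing expression with array syntax.
col_array = grid[:, 1]

# Elementwise arithmetic applies to the whole array at once.
scaled = grid * 2.0

print(grid.shape, grid.flags["C_CONTIGUOUS"])
print(col_nested, col_array.tolist())
```

The `C_CONTIGUOUS` flag reports whether the data sits in one row-major block, which is the "one array instead of nested arrays" layout mentioned above.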

That being said, I'm also excited about Rust and I hope it will have a bright future, also in technical computing. Though I believe where Rust would be most useful, compared to Julia, say, is for writing low level libraries that can then be used from any language with a C FFI, as Rust doesn't require a big runtime with GC and whatnot.

onderkalaci · on Feb 6, 2015

For those looking for data science in Python, this is great. Thanks for sharing!

undergrowth54 · on Feb 6, 2015

If you're coming from web development and are used to virtualenv, Anaconda has environment management too. Run $(conda install conda-env). You can still pip install things into conda environments too. You'll probably want to $(conda install binstar) and use it to search for packages that don't come with stock Anaconda. For example, you can $(conda install --javascript node)

undergrowth54 · on Feb 6, 2015

That should be $(conda install --channel javascript nodejs).

mslate · on Feb 6, 2015

I gave a strikingly/humorously similar talk at a meetup in Boston ~1.5 years ago:

http://nbviewer.ipython.org/github/mmautner/email_classifier...