This might be useful to those not familiar with it: the blogpost was written using IPython Notebook - you can code, plot, and then render to HTML, all in the browser. Most of my data science work is done in this format. If Python isn't your language of choice, there are lots of plugins for IPython Notebook that let you effectively do an in-browser REPL with plotting and documentation: http://ipython.org/notebook.html
First time stumbling across Beaker Notebook - looks great! I recently heard about Jupyter[1][2], which is a project to generalize the IPython backend to be language-agnostic. Are the various Two Sigma financial services entities using Beaker internally? It seems like a natural fit for the highly mathematical nature of the work Two Sigma staff are engaged in.
There was an interesting tweet from one of the IPython/Jupyter devs[3] last year about Beaker. Jupyter and Beaker state similar missions, although Beaker seems to have focused more on "multiple languages within a single notebook" while leveraging the IPython backend. It sounds like the two projects can co-exist, given the slightly different emphasis shaping each project's trajectory.
Jupyter is coming, BUT I have no idea why the community around IPython still doesn't know that IPython will refer only to the Python backend, while Jupyter will be the overall project.
This switch is happening in IPython 3.0, and the lack of public information is still concerning to me. It feels like Jupyter has lost steam?
Jupyter definitely hasn't lost steam - the IPython team is currently spending a lot of time finishing up IPython 3.0 and seeking funding (grants, etc.). IPython 3.0 is due to be released soon (3.0 beta 1 was tagged a few days ago, and development is in a soft freeze in preparation for 3.0). The general plan is to split up the IPython repo and transition to a Jupyter repo after 3.0 is released, with a relatively quick iteration to 4.0 (consisting mainly of the repo split).
Since then we have fixed a ton of bugs and fleshed out the concept. The reflection API and JS scriptability are about to be released, along with a bunch of UI polish, performance, and all kinds of fixes.
There have only been 2-3 of us working on it over the 2 years so far. The project is still young and developing, and in fact we are hiring! In particular we need a JavaScript engineer; Angular experience is a plus: http://www.twosigma.com/careers/position/935.html
PS: we have a prerelease of v1.2 available, including a Docker container. See the last question at the bottom of our FAQ page for the download links: http://beakernotebook.com/faq
MathCAD is still around; it's pretty popular in my field (nuclear engineering), at least. I actually hate it, to be honest, but I think it's good for people who are less inclined towards programming.
That said, I think Mathematica does a much better job of notebook-style programming. You can do some truly amazing things manipulating the Mathematica notebook. The language itself is also pretty nice - something like an APL-flavored Lisp with M-expressions instead of S-expressions. It isn't without its flaws, but it's one of my favorite tools in the toolbox (along with Python, C++, Haskell, and Fortran).
It's great for some things, but bad for others. If you're using Firefox, for example, and the notebook is calculating something, it will lock up all Firefox windows. Also, if you print a lot, the HTML rendering is very slow and consumes a lot of CPU.
It's great for previewing graphs, and for copy-pasting and executing things out of order.
The first part, regarding Firefox, is not true. Calculation is completely asynchronous in the IPython notebook, i.e. if you run a cell that doesn't output anything, your browser is idle.
Even if a cell has one pretty large output it works quite well. However, if you `print` each intermediate result and have a few hundred of those, you do indeed get a problem; there's a need for some kind of output overflow protection here.
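A simple workaround (just a sketch - the loop body and refresh interval here are made up) is to throttle the output yourself, e.g. with IPython's clear_output:

    # Sketch: avoid flooding the notebook with hundreds of intermediate prints.
    # clear_output comes from IPython.display; wait=True reduces flicker.
    from IPython.display import clear_output

    results = []
    for i in range(10000):
        results.append(i * i)      # stand-in for the real intermediate result
        if i % 500 == 0:           # only refresh the displayed output occasionally
            clear_output(wait=True)
            print("processed %d items, last result: %d" % (i, results[-1]))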
There's this thought constantly bugging me: Python is popular among data scientists, but it also happens to be quite a slow language (roughly speaking) in comparison to the likes of Java or Go. Hypothetically speaking, would it not be more beneficial to use something like Rust instead?
Don't forget that the higher-level functionality (e.g. the scikit-learn routines Radim uses) is typically a set of wrappers around underlying C/Fortran routines, and that's where the real work happens. The relatively few lines of VM'd Python are 'slow' compared to e.g. C, but they aren't the bottleneck.
The win with Python (and other dynamic languages) is that you can experiment quickly with ideas when you're formulating a solution, that's a big part of exploratory data science.
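To make that concrete, here's a rough sketch (exact numbers will obviously vary by machine) timing a pure-Python reduction against the equivalent numpy call, which spends nearly all its time in compiled code:

    # Rough sketch: Python glue vs a C-backed numpy routine.
    import timeit
    import numpy as np

    data = list(range(1000000))
    arr = np.arange(1000000, dtype=np.float64)

    py_time = timeit.timeit(lambda: sum(x * x for x in data), number=10)
    np_time = timeit.timeit(lambda: np.dot(arr, arr), number=10)

    print("pure Python: %.3fs  numpy: %.3fs" % (py_time, np_time))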
(tutorial author here) Good answer, and I can only recommend Ian's book!
I cut the marketing speak down to a minimum in my articles and tutorials, but if you're interested in cutting-edge machine learning & no-nonsense data mining, get in touch! I run a world-class consulting company: http://radimrehurek.com.
> The win with Python (and other dynamic languages) is that you can experiment quickly with ideas when you're formulating a solution, that's a big part of exploratory data science.
And in my experience, it's very hard to reproduce after a couple of years. With enough discipline it's obviously possible to write well-structured Python programs that will last, but in practice that rarely happens with scientific software written in Python. Usually there are many external dependencies, the code is fragile (no static type checking), and it's platform-dependent (usually OS X or Linux). To add to the mess, most scientists like to hardcode paths to the input data, etc.
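To illustrate the path-hardcoding point (a minimal sketch - the script and the "analysis" are made up), even just taking the data location from the command line helps:

    # Sketch: take the input path as an argument instead of hardcoding it.
    # The analysis itself is a stand-in (it just counts lines).
    import argparse
    import os

    parser = argparse.ArgumentParser(description="reproducible analysis entry point")
    parser.add_argument("input_path", help="path to the input data file")
    args = parser.parse_args()

    if not os.path.exists(args.input_path):
        raise SystemExit("input file not found: %s" % args.input_path)

    with open(args.input_path) as f:
        n_lines = sum(1 for _ in f)
    print("read %d lines" % n_lines)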
Although I am not a fan of Java, I usually don't encounter the same problems with older scientific Java software. If it's Mavenized, you're usually ready to go after a 'mvn compile'; otherwise you just dump the project structure into an IDE and it usually works.
(The plague with scientific software in Java is that it is often not thread-safe.)
Also, I think quick experimentation is not limited to Python; statically typed languages with a REPL (Haskell, OCaml, Scala) can provide that too. And since Go was mentioned: its compilation time is usually near zero, so it's effectively the same.
> And in my experience, it's very hard to reproduce after a couple of years.
Well, let's be honest with ourselves... this isn't limited to Python. Scientific code that isn't a mess is almost nonexistent. For a lot of scientists, writing code is totally secondary and many simply aren't skilled programmers (nor should we necessarily expect them to be).
It is, however, deeper than that. As a graduate student, I was involved in a government initiative to write a high-quality, large-scale code package. This was (still is - the program just got extended) a well-funded and well-organized effort with hundreds of people, including literally dozens who can legitimately claim to be the best in the world at their specialties. That included some genuinely amazing computer scientists and software engineers who enforced well-planned coding practices.
And yet, the code is still far from ideal. A big part of this is its scale - millions of lines of very technical numerics code and libraries, all working together. Most of what I consider the toughest work was integrating the various disparate pieces and unifying them under one common input structure.
Point being: even with effectively unlimited resources, rigorous development standards, and statically typed languages (primarily C++11), there are still tons of issues. A lot of it comes from the incorporation of older codes, which is inescapable in any non-trivial scientific code.
Glad you enjoyed it :-) If you have a moment, leaving a review (e.g. on Amazon) would be most appreciated (there's a dearth of reviews, as it's a bit of a niche subject!)
Most of the libs that you use are written in C with Python bindings, so they're not that slow. It's only slow if you're implementing a new algorithm yourself without using numpy/scipy for the matrix operations.
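As a toy illustration of what "using numpy for the matrix operations" buys you (a made-up example, not from the article):

    # Toy example: naive Python loops vs one broadcasted numpy expression
    # for pairwise distances between points.
    import numpy as np

    points = np.random.rand(300, 3)
    n = len(points)

    # Slow: interpreted loops over every pair.
    dists_loop = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dists_loop[i, j] = np.sqrt(((points[i] - points[j]) ** 2).sum())

    # Fast: one vectorized expression; the work happens in C.
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    dists_vec = np.sqrt((diff ** 2).sum(axis=-1))

    assert np.allclose(dists_loop, dists_vec)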
In my experience, when you're doing technical computing you spend a lot of time exploring your data (or some representative test data set) and doing quick one-off analyses. So an interactive environment with good plotting capabilities etc. is invaluable. See the success of things like Matlab, R, and yes, python (particularly IPython + the rest of the Scipy stack).
Secondly, since a lot of technical computing involves multidimensional arrays, you want good support for them in the language. That means some kind of array syntax, as in Matlab, R, or python/numpy, and also that arrays are handled efficiently behind the scenes (one contiguous array instead of the nested arrays somewhat popular in C code).
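For example (a minimal sketch), here's what the array syntax and the flat memory layout give you over nested lists:

    # Minimal sketch: numpy gives you array syntax plus one contiguous buffer.
    import numpy as np

    nested = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # list of lists: a pointer per row
    a = np.array(nested)                         # a single contiguous block

    print(a[:, 1])                  # a whole column in one expression: [ 2.  5.]
    print(a.mean(axis=0))           # reduce along an axis: [ 2.5  3.5  4.5]
    print(a.flags["C_CONTIGUOUS"])  # True: one block, friendly to C/Fortran libs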
So in the end, there's not a whole lot to choose from if you're not willing to sacrifice either of the two features above. One language I'm excited about, Julia, is special in that it tries to combine the high productivity of such high-level interactive environments with C/Fortran-like performance. The language itself is really nice, IMHO, but of course the surrounding ecosystem is so far much less mature than the one around scientific python.
That being said, I'm also excited about Rust, and I hope it has a bright future in technical computing too. Though I believe where Rust would be most useful, compared to Julia, say, is in writing low-level libraries that can then be used from any language with a C FFI, since Rust doesn't require a big runtime with GC and whatnot.
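As a sketch of that FFI point (using libm here only so the snippet runs anywhere - a Rust library compiled as a cdylib would be loaded the same way):

    # Sketch: calling a C-ABI shared library from Python via ctypes.
    import ctypes
    import ctypes.util

    libm = ctypes.CDLL(ctypes.util.find_library("m"))  # any C-ABI .so/.dylib works
    libm.cos.argtypes = [ctypes.c_double]
    libm.cos.restype = ctypes.c_double

    print(libm.cos(0.0))  # 1.0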
If you're coming from web development and are used to virtualenv, Anaconda has environment management too: run `conda install conda-env`. You can still pip install things into conda environments. You'll probably also want to `conda install binstar` and use it to search for packages that don't come in stock Anaconda; for example, `conda install -c javascript node`.
It's changed the way I work (and blog)