> Python is obviously a big part of data science these days
Off topic, but why is that? I mean, Python doesn't seem any more suitable to "data science" than Ruby or Lua or Clojure or any number of other similar languages. So why do I keep hearing that Python is super popular with scientists?
From personal experience, having programmed in a small number of languages (C/C++, Java, COBOL, Pascal, Python, Ruby, various shells, Assembly (Intel), etc.), there are a few things that keep attracting me back to Python:
a) Somehow the fact that indentation is meaningful forces my code to be neat automatically without having to run any beautifier through it. I consider myself a bit OCD on how the code looks (can't explain it) and I like how Python code finally looks. I like code to be maintainable and self-explanatory without leaving puzzles for my successors to solve.
b) Batteries included. Granted there are libraries for most things in other languages, but nothing with the same ease of access as Python.
c) Portability without having to know the compiler intricacies of each Port/Platform/OS
d) Being able to deliver solutions on time and within budget without putzing around with irrelevant stuff that the customer never cares about.
Not being a Data Scientist, I don't know if any of this is relevant. But being educated in CS and responsible for delivering on promises I care about some of this a lot.
a) Having "beautiful code" helps keep focus where it matters: the "difficult" stuff (be it data science, machine learning, math problems, physics simulations, etc.).
b) True, there are beautiful libraries all around, and easy to use. In particular to data science, I think NumPy+SciPy+matplotlib are comparable to MATLAB. With beautiful code, portability and no need to pay an expensive license.
c) Totally agree. Though there are small nitpicks on different platforms, most of the time they can be overcome.
d) Totally agree.
Also: the Anaconda distribution solves most of the problems that anyone could have with (b), (c) and (d).
It's the libraries like NumPy, Pandas, SciPy, Matplotlib, SymPy, and so on that make the difference. There's been a ton of work put into developing those Python frameworks for "data science" which don't exist to such an extent in languages like Lua.
A lot of those actually started as wrappers to existing C/C++/other libs, something that is very easy to achieve in Python and not so much with other runtimes.
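As a rough illustration of how little glue a thin wrapper can need, the sketch below calls the C library's `strlen` through the standard `ctypes` module. It assumes a POSIX system where `ctypes.CDLL(None)` exposes the C library's symbols, and the `c_strlen` helper is made up for this example. `ctypes` is only one of several wrapping routes and is not necessarily what the libraries mentioned above use.

```python
import ctypes

# Assumption: POSIX (Linux/macOS), where CDLL(None) gives access to
# symbols already loaded into the process, including the C library.
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

def c_strlen(text):
    """Length in bytes of `text`, computed by the C library (hypothetical helper)."""
    return libc.strlen(text.encode("utf-8"))

print(c_strlen("NumPy"))  # 5
```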
To be sure, NumPy is the third implementation (Numeric, numarray, NumPy) of the basic array data type, so getting it right isn't as easy to achieve even in Python.
And when library X and unrelated library Y are wrapped, they often both use NumPy arrays, making them interoperable.
E.g. if you do geographical work with raster data, there is GDAL, a raster library with Python wrappers. Because it exposes the raster data as a NumPy array, you can easily make images from it to overlay on maps with matplotlib, or analyze it using SciPy. In C/C++ it would probably be a lot more hassle to combine them like that.
I don't think the libraries are the only reason Python has a strong following -- after all, Perl also has a huge collection of libraries. I realize there's a healthy dollop of subjectivity in all this, but here are what I think are the reasons for Python's success.
1. Python uses words where many other languages use symbols. In addition, the words Python uses are simple and clear. Instead of '!', '&&' and '||', Python has 'not', 'and' and 'or'. Often, constructs that look like function calls in other languages read more like English in Python. For example: if 'Python' in names: print("Found it!") .
2. Python uses indentation to control scoping. The merits of whitespace sensitivity are of course open for debate, however, I think it pushes the code closer to what pseudocode looks like, and this is probably a good thing.
3. Python has a simple and coherent object model. Unlike Lua or Javascript, Python has a real object system (i.e. not prototype-based) with inheritance. Although object oriented programming doesn't get a lot of love on HN, it does have several benefits, compared to, for example, functional programming. One advantage is that it is arguably well-understood and knowledge about OO has already been widely disseminated. Another advantage of OO is that the transition from using structs to hold mere data (a la C or Pascal) to OO is a simple matter of adding functions to the struct. With OO, you can start with code that is more procedural, and work your way towards a design that is more OO. This is a path that makes sense in the context of scientific programming, since one often starts with a short program that, say, implements an algorithm in a single function. (Then you build on it in an OO way by allowing loading input data, etc.)
5. Numerous well-defined, easily understood, and easily accessible customization points. For example, in Python, allowing the user to write "x + y" is as simple as implementing the "__add__" magic function for the class in question. Ability to overload the mathematical operators is critical for a language used in a scientific context, and this for example essentially eliminates Javascript. It also eliminates lisp in all its variations due to its lack of infix syntax (short of writing custom reader macros). The story is similar with decorators.
6. Python is easy to extend and embed. For some reason, everyone seems to rave on about how great Lua's C API is. Frankly, having written C++ extensions for Lua, Python, and Java, the one I found easiest was Python's (especially -- but not necessarily -- using boost.python), followed by Java (especially using JNA), and Lua I found quite painful due to the explicit manipulation of the VM's stack.
7. Others have mentioned it, but the libraries are obviously a huge part of Python's success. As others have talked about it, I won't say anything more about it here.
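A few of the points above (word operators in point 1, growing a plain record into a class in point 3, and `__add__` as a customization point in point 5) can be sketched together in a few lines. The `Vec2` class and the `names` list are made up for illustration and do not come from any library.

```python
from dataclasses import dataclass

# Point 1: words instead of symbols ('in', 'not', 'and').
names = ["Ruby", "Python", "Lua"]  # hypothetical data
found = "Python" in names and "COBOL" not in names
if found:
    print("Found it!")

# Point 3: start with a plain record, much like a C struct...
@dataclass
class Vec2:
    x: float
    y: float

    # ...then grow it by adding functions to the "struct".
    def length(self):
        return (self.x ** 2 + self.y ** 2) ** 0.5

    # Point 5: implementing __add__ is what allows "a + b".
    def __add__(self, other):
        if not isinstance(other, Vec2):
            # Signal that this operand type is not supported.
            return NotImplemented
        return Vec2(self.x + other.x, self.y + other.y)

a = Vec2(3.0, 4.0)
b = Vec2(1.0, 2.0)
print(a.length())  # 5.0
print(a + b)       # Vec2(x=4.0, y=6.0)
```

The `@dataclass` decorator is also a small instance of the decorator story mentioned in point 5: it generates `__init__`, `__repr__` and `__eq__` from the field declarations.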
6 is a very good point - Lua has advantages for embedding, but its API is not one of its better points.
Of course, when coding in C, your options are limited - Python's reference-counting isn't much fun either! But code using the Python API tends to be relatively readable, even though you run the risk of forgetting to put a decref in. Lua API code on the other hand, for all that it doesn't need anything like Python's incref/decref, has a tendency to be rather inscrutable.
People in scientific computing started using Python in the mid-1990s. This included Numeric (ancestor to NumPy) which Jim Fulton, Jim Hugunin and others started in 1995. Ruby 0.95 wasn't released until the end of that year. Moreover, van Rossum tweaked Python so it would be a better fit for matrix computing, such as multi-dimensional slices.
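To show what multi-dimensional slicing means at the language level, here is a minimal sketch. The `Grid` class is a made-up stand-in for a real array type, and it only handles the case where both indices are slices.

```python
class Grid:
    """Tiny 2-D container, a hypothetical stand-in for a real array type."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def __getitem__(self, key):
        # For g[a:b, c:d] Python passes a tuple of two slice objects.
        row_sel, col_sel = key
        return [row[col_sel] for row in self.rows[row_sel]]

g = Grid([[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]])
print(g[0:2, 1:3])  # [[2, 3], [5, 6]]
```

The language itself only delivers the tuple of slices to `__getitem__`; what the slices mean is left to the container, which is what lets an array library define its own indexing semantics.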
Then there was PyFort by Paul Dubois at Lawrence Livermore National Lab (1999), and SWIG by Dave Beazley at Los Alamos National Lab, which had Python support by 1998 (see https://web.archive.org/web/19981212033200/http://www.swig.o... ). Those made it much easier to access existing scientific libraries through Python modules. (A phrase at the time was that Python would 'steer' the low-level high performance code.)
At that time, Ruby was just becoming known in the English-speaking world, and it didn't really hit the mainstream until Ruby on Rails in 2005. This means Python had a 5-10 year head start, and Ruby hasn't caught up.
Since the data science folks don't also have a goal of becoming programmers, a language that had some early design goals of being easy to teach (see also CP4E [0]) and has appropriate "good enough" libraries lets them build custom analysis software without having to spend nearly as much time learning programming instead of more relevant things.
The early design goal of being easy to teach also caused python to have a culture that places a high value on doing the work to make the developer experience better for beginners. You can see this in things like the Django tutorial, but it is more powerfully felt as an undercurrent in conversations at python conferences (based on PyTennessee, PyConUK, DjangoCon, and PyCaribbean)
Fun bit of trivia: Hadley Wickham (of R's ggplot2 and dplyr fame, among many other things) said that if R hadn't been around when he first got into statistical programming, Ruby would have been his next choice, because of Ruby on Rails. He says now, his choice would be Python or Julia ("or maybe JavaScript"):
I'm not a scientist, but I think the answer would be orthogonal to why I (and probably many others) pick Python.
Is it tied to a single source?
No, many implementations exist and with the 2/3 split, the BDFL is weaker than ever (a good thing, overthrow all BDFLs).
Is it portable across platforms?
Yes. It runs anywhere C89 will.
Does it have a broad range of uses?
Yes. Already in largescale use for systems, web, administration scripting, game scripting, 3d applications, data modeling, game creation, crossplatform GUI applications including iOS/Android.
Is it easily maintainable?
Yes, as much as anything else. Nothing is a cure for the complexity issues at 500KLOC scale, not static typing, nothing.
Will it be around for a long time?
Since 1991, so likely in some form yes. The syntax has proven too popular with folks already. I'm considering creating my own Pythonic language.
Is it popular, to get help easily and borrow code?
Yes. #3 on Github (practical comparison) and #5 on Tiobe (abstract comparison).
The language probably had something to do with it, at least in that scientists got involved very early on[0] and language features were added as a result.
numpy, scipy, scikit-learn, matplotlib, pandas, biopython, sage, ipython and ipython notebook — now jupyter — (with both matplotlib and clusters management integrations), non-C accelerators (numba, cython), anaconda (a science and engineering semi-proprietary bundling & package manager).
Many of these have their roots in the late 90s or early 00s. I guess Python got its fangs (haha) into scientific computing at the turn of the century. From the late 90s you find articles mentioning the introduction of scripting languages as glue in scientific pipelines, variously mentioning Perl, Python and Tcl, and by and large it seems to have slowly accreted around Python.
> AFAIK, they trace their roots to numeric routines on Fortran punch cards.
Oh yes they can bind to much older libraries (e.g. BLAS), I was talking about the packages themselves. For instance NumPy is the unification of Numeric and Numarray, the former having gotten started circa 1995[0] on matrix-sig[1] and the latter originally intended to replace the former. So NumPy itself has its roots in 1995.
It's in a sweet spot. It's good enough for unstructured exploratory analysis, for production code, it's super easy to get started in, and there are some good libraries. There's really low overhead to poking around data in it, and you can put something into production with it.
As well as the packages that people have mentioned, there are nice bundles like Anaconda or Canopy which make it easier to get started.
Certainly working with people who know data but don't necessarily know much Python (even if they know it they might not know tools like pip, or want to spend time understanding the versions of their tools) it's way easier to say "just install anaconda" than "install python and all these packages we use".
I don't know if other languages have these bundles but Python's are very good.
Additionally, the jupyter/ipython notebook environment is very good, but I'm not sure that's converting people; I still see many researchers working in plain text editors.
They are all fine languages. There is another (key) reason why Python is keeping its momentum: the availability of skilled, professional Python developers with a lot of commercial experience.
I've been called in a number of times to help out small businesses or teams that started using Python because they were researchers, scientists, or even just regular developers, and have since found themselves with a growing business and customer requirements. Having a large, professional body of capable developers (especially contractors, here in London) makes a world of difference. You can find expert Julia, Haskell, Clojure, etc. developers -- but they are fewer and farther between.
I saw they ran a small "python for data science" intro course this year on edx, so I figure they're definitely behind using the language in some capacity.
I heard from a couple people that they don't exactly want to learn Fortran. Python stays out of your way and there is a major ecosystem around it by this point.