How Python became the language of choice for data science (mikiobraun.de)
175 points by davmre on Nov 22, 2013 | hide | past | favorite | 79 comments


I started out with MATLAB, played around enough with Python to realize that it wasn't enough better than MATLAB to be worth switching. (My impression was that Python was almost flexible enough as a language to make writing code that uses NumPy/SciPy feel natural, but it didn't quite achieve that.) I finally switched to Julia, which is both faster and (IMO) more enjoyable to write than either MATLAB or Python, although the current package ecosystem is pretty small. I think the main difficulty in creating a new programming language for numerical computing is that you need knowledgeable people to do it, and most of those people are more interested in analyzing their data than building a programming language (or core packages for a programming language) to do it better.

My impression on two small bits of this article:

Python is also somewhat restrictive with what you can say on a single line. In Matlab you would often load some data, start editing the functions and build your data analysis step by step, while in Python you tend to have files which you start from the command line (or at least that’s how I tend to do it).

I actually find that this kind of analysis is more conveniently done in IPython (or IJulia) Notebook than in MATLAB.

I still have dreams of a plotting library where the type of visualization is decoupled from the data (such that you can say “plot this matrix as a scatter plot, or as an image”).

This is the design goal of both R's ggplot2 and Julia's Gadfly.jl.


Similar to me. I started with MATLAB (I even worked for them for a time), but then moved to Python when I started working more in bioinformatics, and lately I've been exploring Julia a lot. The current issue with Julia is that most of the tutorials and docs available on the web are targeted at developers, not users, so it has a hard time converting the more casual MATLAB and, to a lesser degree, Python users. The major advantage of MATLAB over any of the other languages is its integration with the MATLAB Desktop and its casual simplicity for non-programmers. I don't really have to worry about data types; it just seems to work out of the box (which is an advantage for beginners but a serious handicap at a later stage). Julia has a higher initial barrier to adoption with its many different paths to success. Though I suspect that, especially with its macros, Julia will be able to create extremely intuitive and expressive packages for all kinds of problem domains.

I think a good start for Julia would be to include more workflow-oriented tutorials in the docs that also provide a path through the current mess of packages (220 at current count, with sometimes quite redundant functionality, e.g., Gadfly, Winston, PyPlot, Gaston). Heck, maybe I should just help with it once I'm done with my current project.


Speaking as someone who has a lot of work invested in the numeric Python system, I share a lot of the same sentiment. I'm unhappy with how a lot of the current thinking is centered around building essentially large embedded DSLs to invoke C libraries. Although it's a technique that works and has been successful, it has become a big pain to maintain and extend over the years. Just look at how much work has been spent just trying to get simple things like NA values into NumPy.

My recommendation to other numeric Python devs is that if you don't have any investment in a specific numeric Python library, think about moving your project to Julia. I suspect the ecosystem will bloom pretty rapidly and provide a much more future-proof set of tools for technical computing.


One of the nice things about Python is that it is not limited to scientific computation.

I remember the days when I tried to do command line automation in Matlab, or tried to read some XML from some internet resource.

This stuff is super simple in Python, but very hard in Matlab, because Matlab focuses so much on science and so little on general usefulness.

Python is not as integrated with regards to the scientific core functionality, but it doesn't limit you to it, either.


Yes, for many things, Python is better than MATLAB. But the choice of language is no longer dichotomous, and Julia manages to be better for command line automation than Python. A pipe operator combined with custom interpolation for command literals gives you really beautiful syntax:

  fname = "/tmp/myfile with spaces"
  f = open(fname, "w")
  run(`echo hi` |> f)
  close(f)
  pipe, process = readsfrom(`cat $fname`)
  cmdoutput = readall(pipe)
  println(cmdoutput)
I'd write all of my command line scripts in Julia, except the interpreter currently takes far too long to start. But people are working on that.
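
For comparison, here is a rough sketch of roughly the same steps in Python using the standard subprocess module (nothing here is from the original comment, and the file name is just an example). It works, though the list-of-arguments style is arguably less pleasant than Julia's backtick literals:

  import subprocess

  fname = "/tmp/myfile with spaces"
  with open(fname, "w") as f:
      subprocess.call(["echo", "hi"], stdout=f)        # write "hi" into the file
  cmdoutput = subprocess.check_output(["cat", fname])  # read it back through cat
  print(cmdoutput.decode())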


Matlab leaves something to be desired for general programming tasks. If you can, it's sometimes nice to define stages and pipe between them. I love Python for text processing and XML, but I find Matlab way easier for shoving matrices around. I liked using the strengths of both by pre-processing raw text and XML files through a Python script, spitting out a CSV or tab-delimited file, and slurping that up in Matlab.

Matlab has come a long way though. Coming from a functional programming background, I was delighted to find that you can pass functions as arguments to some matrix operators and get some very terse map+reduce operations.


There are Oct2py and mlabwrap, which let you call Octave and MATLAB functions from Python. I've used Oct2py a few times just like you describe and it worked fine: read, parse, and format my data in Python, pass it to some MATLAB functions (which happened to work in Octave), and then format and write the results back in Python.


I have used MATLAB a little bit - it was indeed more concise for matrix math than Python + Numpy.

But pure matrix math is a small part of my analyses and workflow. Most of the code is in data preprocessing from SQL servers, CSV files, JSON files, and various cleanups. Once you have your data in an array, MATLAB can be great, but writing general-purpose code (or even a while loop) was horrible.


My only problem with the Julia language is that it requires code be written without vector notation in order to be fast:

In general, unlike many other technical computing languages, Julia does not expect programs to be written in a vectorized style for performance. http://docs.julialang.org/en/latest/manual/arrays/

In Fortran, MATLAB and Python, the opposite is true: vector notation is fastest.
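
A small sketch to illustrate that point in NumPy (array size chosen arbitrarily): the vectorized form is one call into compiled code, while the equivalent explicit loop goes through the interpreter on every iteration and is typically orders of magnitude slower.

  import numpy as np

  x = np.random.rand(10**6)

  # Vectorized: a single call into compiled NumPy code.
  y_fast = x * 2.0 + 1.0

  # Equivalent explicit loop: each iteration pays interpreter overhead.
  y_slow = np.empty_like(x)
  for i in range(len(x)):
      y_slow[i] = x[i] * 2.0 + 1.0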


I'm quite certain Mikio knows about ggplot2.

ggplot2 and friends are great, but there's plenty of room for even better tools in the data vis space.


Definitely true, especially for >3D data sets.


Do you have any recommendations wrt getting started with Julia? It's a language I've wanted to explore a bit, but I haven't been able to find any good community resources or jumping-off points.


This "Learn X in Y minutes" snippet [1] showing a bunch of examples in Julia is the best starting point. I always return to this one as it is usually much faster to find what you're looking for than searching the docs.

1: http://learnxinyminutes.com/docs/julia/


I wrote Learn Julia in Y minutes. Is there anything unclear or that you'd find useful to have included/changed?


I might not be the best person to ask, since I don't generally have the attention span for tutorials and usually prefer to learn by reading/writing code. With that said, the manual (http://docs.julialang.org/en/release-0.2/manual/) is quite good.


This thread led me to search for updated information about Julia's maturity. I came across a pretty informative post from one of the devs and posted it at https://news.ycombinator.com/item?id=6782015.


I'm confused. Is python really so far ahead of the competition to be considered the "language of choice for data science?"

My impression (and I am not a statistician/data scientist in my day job, so I would very much like to hear opposing perspectives) is that the R ecosystem is far more mature and widely adopted than scikit-learn for things like regression, classification, clustering etc.

The author also cites expensive MATLAB licenses as a driving force behind python adoption, but here too I'm skeptical. As a grad student, I get MATLAB for free. But I switched to R/pandas for data analysis because R has a native data structure for working with multidimensional datasets (i.e. data.frame).

To illustrate the utility of this, let's say you asked developers all over the US for their zipcode and salary and recorded the results in salary.poll.data. Here's an interesting question: what is the mean salary in each zipcode? In R, all sorts of libraries make this computation concise and highly readable. Using the excellent data.table package, you would do `salary.poll.data[, list(mean.salary.by.zip.code=mean(salary)), by=zipcode]`.

No such libraries exist in popular usage for MATLAB. You'd have to roll your own, or more likely, write a lot of crufty loops and conditional statements. (Or use higher order functions with map/reduce/filter, which, by the way, you would have to implement yourself).

For me at least, just having the right data structures for working with data makes R/pandas a clear winner for doing statistical analysis of data.
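
For what it's worth, a rough pandas equivalent of the data.table expression above looks something like the sketch below (the toy DataFrame just mirrors the column names from the example):

  import pandas as pd

  # Toy stand-in for the poll data described above.
  salary_poll_data = pd.DataFrame({
      'zipcode': ['10001', '10001', '94105'],
      'salary':  [90000, 110000, 130000],
  })
  mean_salary_by_zipcode = salary_poll_data.groupby('zipcode')['salary'].mean()
  print(mean_salary_by_zipcode)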


My impression is that R and MATLAB (and Julia) certainly still have their advantages. But Python with pandas/scikit-learn/matplotlib is almost as good as R at data munging and exploration, and with numpy/scipy/Cython, as good or better than MATLAB at complicated matrix calculations. Meanwhile, Python has its own unique features, like IPython notebooks. So it's at least competitive with R and MATLAB at a feature level, if maybe not for every possible use case.

The thing that draws me towards Python is that it's a well designed, general purpose programming language. The syntax is sensible, the language allows real OO and FP abstractions, you have easy access to basic data structures like lists and hashmaps, and there's a huge ecosystem of third-party libraries to build on. Things that are stupidly difficult in MATLAB and R, like file/network IO, string processing, or building a GUI or web interface, are straightforward in Python. If you've ever tried to run MATLAB on a cluster, it's an absolute nightmare. You would never, ever think about building a production system in MATLAB or R. But these things are easy in Python. There's something that's just really nice about having access to first-class analytics tools in the same language that you're building your systems in.


The syntax is sensible, the language allows real OO and FP abstractions, you have easy access to basic data structures like lists and hashmaps, and there's a huge ecosystem of third-party libraries to build on

Other than the last one, all of these are available in Matlab and R as well; and all three have a huge set of libraries -- the difference is about the focus of those libraries imo. Python does have more general-purpose libraries, just as R has more statistical packages.

I have certainly run a lot of Matlab on a cluster; in fact, I'd say safely about 70% of the code I see running on clusters around here is Matlab. I've also seen a few production systems in (mostly) R -- I actually suspect that at least on Windows, deploying an R system may be easier, since you just install R and then use internal functionality to get the packages you need; Python (with all necessary extensions) seems relatively tricky to get running.

All that being said, I do agree that if you are building a production system where most of the code is related to interfacing with other systems, or GUIs etc., and the data analysis is a small and non-interactive part (i.e. no data exploration), Python is a very reasonable choice if you can keep the whole project in it.


I think your point is very fair in that the author is exaggerating. But,

"highly readable"

Are we talking about the same language? The only reason R ever saw use is its adoption by the statistical community, but all other things considered, R is a shitty language.

-Awfully difficult syntax

-Extremely high learning curve

-Nightmare to debug (error and warning messages are the most cryptic I've ever worked with)

-Some of the worst documentation I've ever encountered

-Many transformations are not visible to the user (how special characters are handled for example)...

-The largest set of statistical libraries...but have you ever tried using any? Good luck! Next to nonexistent documentation for most, and even with a background in mathematical statistics, I still have a hard time following examples

-More often than not, lack of backwards compatibility

I'm a statistician by training and a data scientist by day job, and have used R for years. I really admire what some have tried to do with the language (Hadley). But it's a result of the statistical community being light years behind on computational training and is by no means a good language; it just happened to come along at the right time and had some really genius initial developers.

Frankly, the "language of choice for data science" simply doesn't exist yet. You just use what works best for the problem at the time, and more often than not that involves switching between many languages.


I just wanted to interject here that this limitation is no longer true in Matlab. Take a look at dataset[1] and grpstats[2] in the Statistics Toolbox. They make data frame-style computations much nicer. I was banging my head against Matlab control structures until I found those.

  means_by_zip = grpstats(ds, {'zipcode'})
[1] http://www.mathworks.com/help/stats/dataset.html [2] http://www.mathworks.com/help/stats/grpstats.html


FYI, they finally included a dataframe-like data type in base MATLAB as well and called it `table`:

http://blogs.mathworks.com/loren/2013/09/10/introduction-to-...

So you don't need a stats license for that anymore. The question is how many functions will accept this datatype, which is one of the key problems with some of the more advanced datatypes (timeseries has similar issues).


Oh, very nice! It's been about a year since I've worked in Matlab on a daily basis, so it's good to see they brought that into base.

I didn't have much trouble with datasets because it's fairly painless to convert to a matrix with double(ds) if you set up your text columns as ordinals and nominals rather than cell arrays.


In the post it says "next to R", so I don't really see a contradiction here. Both have their pros and cons. I think any data scientist should know how to use both.

R:

- has more ml libraries.

- is more mature than pandas (look at how pandas handles categorical data).

Python:

- has more general purpose libraries (e.g. language detection, html boilerplate extraction, query apis)

- can handle out-of-core learning (datasets which don't fit into memory); see the sketch after this list.

- can run in production.

- is the language of choice for deep learning
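
On the out-of-core point above, a minimal sketch of how this typically looks in scikit-learn using partial_fit; the chunk generator here is a toy stand-in for streaming batches from disk or a database:

  import numpy as np
  from sklearn.linear_model import SGDClassifier

  def iter_chunks(n_chunks=10, chunk_size=1000):
      # Toy stand-in for streaming batches that never fit in memory at once.
      for _ in range(n_chunks):
          X = np.random.rand(chunk_size, 20)
          y = (X[:, 0] > 0.5).astype(int)
          yield X, y

  clf = SGDClassifier()
  classes = np.array([0, 1])  # all class labels must be declared up front for partial_fit
  for X_chunk, y_chunk in iter_chunks():
      clf.partial_fit(X_chunk, y_chunk, classes=classes)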


> look at how pandas handles categorical data

Could you expand on this point?


> My impression (and I am not a statistician/data scientist in my day job, so I would very much like to hear opposing perspectives) is that the R ecosystem is far more mature and widely adopted than scikit-learn for things like regression, classification, clustering etc.

Python scores over R in one aspect - text processing.


Perl is better at that than Python.


Why did I get a "-1" on this? Has anyone compared processing text between Perl and Python? I really cannot agree that Python is superior to Perl at this task, judging from my daily work. A language being popular doesn't mean it's good at everything.


Because in a discussion of "a or b?", a comment on the usefulness of c is irrelevant. If you were attempting to propose that perl is a better data science language for reasons including its text processing, then you needed to say that. If you weren't attempting to do that, then nobody cares about perl in this context.


not (a > b) does not imply (b > a), particularly if there is no clearly defined ordering over a and b


Given that the author indicates that even he's an exception to his supposed rule, I think I'd like to see a more comprehensive argument to buy this particular line.

Python is definitely a major player in data processing (heck, I'd even just be interested in how he's defining data science), but it's definitely not the only game in town by a long shot.


I recently saw this (which goes into far more depth): http://www.talyarkoni.org/blog/2013/11/18/the-homogenization...

Summary: "Constantly switching languages is a chore, and while Python isn't the best at everything, it's the best at some things and good enough at almost everything else."


Recently indeed, that link is the first reference in the OP's article :-)


I was at the PyData conference in NYC two weekends ago. There was a huge range of academic and commercial interests there. It had a definite sense of energy and excitement about all the tools available (and being built) to expand the scientific python universe.

As a long-time Matlab user, I'm currently working on migrating my skills to Python simply because the small amount of functionality I may lose (decent IDE with integrated debugger and workspace viewer) I gain back with loads of other tools (like extensive out-of-box support for databases, way better memory management, and exciting developments for targeting GPUs and compute clusters).


Note that it says "next to R".


The first line: "Nowadays Python is probably the programming language of choice (besides R) ..."

Personally I am not ready to hand the title "language of choice" to Python, although it is trending that way. We should give R credit where credit is due. There is still a huge population of R users and code. Python has advantages in that the syntax is easier to understand, and so is the structure (object-oriented versus scripting). What Python lacks is a simple setup. I think that once Python becomes more accessible to everyone (as far as downloading Python, setting up directories, packages, libraries, etc.), it will have huge leaps in usage.


I've heard good things about the anaconda distribution as far as setup and everything goes for scientific computation purposes. Comes with most packages you'd need: https://store.continuum.io/cshop/anaconda/

I don't actually have it myself, but it does seem to be trying to solve the problem you point out.


This is amazing! I'm happy I came across this comment. I've always been frustrated installing Python packages as a Python newbie. In R (in RStudio) it was always just search and click, and, suddenly, "I know kung fu" like Neo in The Matrix, except, err, "I know ggplot2."


That's great! If this can match RStudio as an IDE, then Python is heading in the right direction. Thanks for showing me this.


Anaconda is a fabulous way of installing the majority of Python tools you'd need for data science. It also includes Spyder - a Matlab-like IDE. Before you go back to RStudio, you should at least check out the IPython Notebook style of workflow. It grows on you. PyCharm is another IDE I've been looking at - it has a free community edition too.


In the short term, you should check out Spyder:

http://code.google.com/p/spyderlib/

We ship it with Anaconda.


People are always telling me why I should jump from Python to Node. This is basically my argument against doing so.

Python, Matlab, Fortran, Cobol, will be around for a VERY long time because so many of the smartest people THINK in these languages. The number and quality of people who think in a language is more important than the number who develop in it.

I don't yet think in Python. It is not where I learned programming. I am more a Lisp thinker, but for practical application python is a better choice.

I don't trust people who think in JavaScript. Or rather I don't like to bet on them.


Well, for some practical applications Lisp/Racket could be a better choice. If we categorize Lisps/Schemes/Haskells as "not for real world use" too easily, we won't enjoy nearly as much innovation.

I also believe it's undeniable that parts of these languages would be a godsend in some more "real world" languages.

There are two sides to the argument, however I'd like to caution against dismissing languages as "not for real world use" too quickly since it's a trend I've seen.


I've been using Python for what I guess you could consider as pre-processing for data science.

One thing people might want to consider before investing time in Python is that I found it to be quite memory inefficient: data structures take up a lot of space, and the garbage collection didn't seem to be as effective as in other languages (I spent some time studying/improving GC in JVMs). So if you're dealing with large amounts of data and/or complex data structures, I wonder if Matlab might be more appropriate (AFAIK R is also not very good at memory management yet).


As a rule of thumb, if you run out of memory in Python, you will also run out of memory in Matlab. I've done a lot of work in both and found that while Python's memory performance may not be ideal, at least when you run into trouble in Python you have options. With Matlab I found that when I got that dreaded "Out of memory" message at the prompt, there was little I could do. The internals are completely opaque, shipping my code to C is a pain in the ass, and there are very few language constructs to help you control how you use memory.

In most cases running out of memory in matlab meant either making the problem smaller or running it on a beefier machine. I think this is the reason why you see a lot of labs at universities with machines that have 96GB of memory, even though their datasets seem to be much smaller.

FWIW, as far as processing lots of data is concerned, python is not without issues. If you do it naively you will run out of memory really quickly. But by picking your tools correctly you can go a long way. Use Pandas and/or Sparse arrays whenever possible. Learn how numpy broadcasting operations contribute to memory explosions. Take a gander at the source of that sklearn method you're using, since it's often quite obvious that the particular implementation will choke.
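
To make the broadcasting point concrete, here is a small sketch (array sizes chosen arbitrarily) where an innocent-looking pairwise difference materializes a multi-gigabyte temporary:

  import numpy as np

  a = np.random.rand(20000)
  b = np.random.rand(30000)

  # Broadcasting shape (20000, 1) against (30000,) creates a 20000 x 30000
  # float64 temporary: roughly 4.8 GB allocated by one short expression.
  diffs = a[:, np.newaxis] - b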

I've found that these days I try my best to avoid loading datasets into memory. This is second nature for people who work with 'big data', but it's an m.o. that takes some getting used to. That is, blocking and/or streaming your data, and appropriately subdividing your problem for distributed computation. It's worth mentioning that this problem with python is under active research and development. The guys at continuum developed IOpro to deal with the issue of memory efficiency when loading data, and to make streaming data from flat files/S3/mongodb/whatever easier and more stable. Also, their (very young) project called Blaze is meant to be a drop-in replacement for numpy, but is designed for efficiency and specifically for dealing with out-of-core computation. We'll see...


Great advice, thanks.

I'm not experienced enough in using Matlab, but I did see a seminar given by some of the engineers and they seemed to have given a lot of thought to optimising the software for large datasets.

> I've found that these days I try my best to avoid loading datasets into memory. This is second nature for people who work with 'big data', but it's an m.o. that takes some getting used to.

Yes, couldn't agree more. I'm still getting used to working this way.


I also recently learned that pandas can talk to PyTables HDF5 data storage on disk. This should be fairly seamless and can help reduce some memory issues with larger datasets. Yves Hilpisch did a nice talk on "Performance Python" that goes into this: http://hilpisch.com/YH_Performance_Python_Slides.html
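
A minimal sketch of that pandas/PyTables combination (file, key, and column names are made up): chunks get appended to an on-disk HDF5 store and queried back without loading everything into memory.

  import pandas as pd

  df = pd.DataFrame({'user_id': [41, 42, 42], 'value': [1.0, 2.5, 3.0]})  # toy chunk

  store = pd.HDFStore('data.h5')                           # backed by PyTables on disk
  store.append('events', df, data_columns=['user_id'])     # appendable, queryable table
  subset = store.select('events', where='user_id == 42')   # filter runs on disk
  store.close()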


Yes, I almost exclusively use pytables+HDF5 when I'm permitted. As I commented on some thread earlier this week, it strikes a great balance between simplicity, performance and flexibility.


> ...I found it to be quite memory inefficient: data structures take up a lot of space...

Just for whatever it's worth, this is the whole point of numpy. Lists, dicts, etc are not memory-efficient, but numpy arrays are.

The main reason I switched from Matlab to Python is due to Matlab's excessive memory usage (Essentially every operation makes a copy). Using numpy, you have a lot more control over memory usage than you do with Matlab.
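
A quick, rough way to see the difference (exact numbers vary by platform and Python version): a NumPy array is one contiguous block of doubles, while a list holds pointers to individually boxed float objects.

  import sys
  import numpy as np

  n = 1000000
  arr = np.zeros(n)
  lst = [float(i) for i in range(n)]

  print(arr.nbytes)            # 8,000,000 bytes: one contiguous block of float64
  print(sys.getsizeof(lst))    # ~8 MB just for the list's pointer array...
  # ...and each Python float is a separate ~24-byte object on top of that.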


This is not a major problem for data science. We usually use numpy arrays (or scipy sparse arrays), which are memory efficient. There are also some other efficient data structures for python:

http://kmike.ru/python-data-structures/

For preprocessing you can usually work on a line-by-line basis (e.g. don't keep the entire dataset in memory).

This is a bigger problem for web apps, imo. But there you usually just throw money (e.g. more servers) at the problem.


Have you looked at pytables? It does a good job of presenting a fairly pythonic interface dealing with datasets that don't fit in memory. Many standard operations are basically the same as when dealing with in-memory numpy structures.


I had terrible memory problems using Python on larger graphs (1M+ V/E). Weird memory usage happened and even 64GB filled up fast.

For example, the everything-is-a-dictionary approach that the widely used NetworkX library utilizes was a horrible memory gobbler on large graphs. It sounds very nice in theory and works well on smaller graphs, but is near useless for larger ones.

The best graph tool that I found for larger graphs was graph-tool which is heavily NumPy based.

When I write in C, at least I know where my memory goes.


It's not.

R vs some Python alternatives in Google Trends: http://bit.ly/1fpOR57

Pandas is looking good, but then we have Julia, Matlab and probably some people still using SAS.

edit:

Comparing statistical packages by Kaggle.com usage, job posts, activity on blogs, mailing lists, Stack Overflow, Google Scholar, and a few more:

http://r4stats.com/articles/popularity/


I must be missing something...I see the Python options having a lot more volume than R. What am I supposed to be seeing in this graph that means Python isn't a (if not the) leading language in this space?


I guess there are better ways of comparing. I like the survey data and kaggle.com data from my other link.


Comparing statistical packages by Kaggle.com usage, job posts, activity on blogs, mailing lists, Stack Overflow, Google Scholar, and a few more:

http://r4stats.com/articles/popularity/

We have a winner?


2011 was ages ago


The article actually admits in the first sentence that the title is false: "... Python is probably the programming language of choice (besides R) ..."


I updated with a link to a post comparing statistical packages in a much better way.


It would be interesting to see more recent data on Kaggle.com usage. Anyone?


Octave, R, Julia, Python.. we have so many choices now.


I think that's the real benefit. In the past 5 years, there's been an explosion of available tools, which is good for everybody. I looked at Python a while back, and it wasn't there yet. Now it seems to have arrived.


Any sufficiently fast language that allows the user to focus on the problem rather than the syntax or semantics will do better than its peers. I for one am looking forward to Julia maturing.


I consider myself a data scientist in the bioinformatics field. I have to deal with data at the several-hundred-GB scale every day. IMO, the best toolkit so far is the combination of Perl and R, because these tools have the richest packages/modules; you can do almost everything with them. As for Python, I don't think it can deal with the data I have as efficiently as Perl.


PDL or Perl?


PDL is the Perl package for numerical science. Unless you do a lot of mathematical computation, it's not needed much; PDL is rarely used in my daily routine.


Both Octave/Matlab and R are more "convenient" or elegant in certain precise cases, like linear algebra, but Python provides a much more coherent experience due to syntax and the type system.

Julia... well... has its merits, but the ecosystem isn't comparable. Julia does not have Django, Guido or PyCon.


I think it's just a matter of time until it replaces Python in data science. (And I believe it's also the best designed general purpose dynamic language out there.)


It would be a matter of quite a long time then. Considering that it took the numpy/pandas combination several years to almost overtake Matlab and R, and that Python is a language and interpreter with two decades of history and development, Julia should take about five to ten years to become a serious competitor, and the inertia of the Python community will be hard to break.


Existing Python libraries can be called quite easily from Julia using PyCall (https://github.com/stevengj/PyCall.jl)

Regarding ecosystem development, one advantage that Julia has is the low learning curve between "user" and "contributor". Idiomatic code usually runs reasonably quickly, and can be made "fast" by tweaking some things within the same language (devectorizing, or turning off bounds checks, for example). These kinds of optimizations simply cannot be done in NumPy without dropping into C plus the CPython API, or Cython (each of which has impedance mismatches and language+toolchain hurdles).


Perhaps, it's hard to tell... Julia can be almost as fast as C, so it can potentially replace many uses of C/C++ for scientific computing.


Julia can be almost as fast as C and still not have Django, Guido or PyCon...


The lead developer on Elefant got poached by a high frequency trading shop in 2006. He was "discovered" showing his wares at PyCon that year (vectorized operations faster than anything else out there).

Sigh.


Hadoop? Java

HBase? Java

Hive? Java

RapidMiner? Java

Cassandra? Java

Neo4J? Java

Python 4 the win.


Python excels in the exploratory data analysis side of things, not so much on the computation side. Thus we get tools like Pandas, NumPy and Scikit-Learn.


Python, to me, is the all-around B+ language. It has good (but not great) performance, language semantics, and runtime technical detail. As history has shown us, being good at a lot of things enabled the growth of a large community (without corporate sponsorship) and an immense library ecosystem. So "all-around B+" isn't a knock against it. If anything, it's to be admired.

Unfortunately, this means that most data science findings, when transmuted into permanent production programs, don't stay in Python. Often Java or C++ are used instead.

Personally, that's why I think Clojure's got a real shot at being "the language of choice for data science" in 2023. It has the power of Lisp, it makes a lot of data-frame manipulations really easy, and because it sits on top of the JVM, it can be "productionized" pretty easily (you might have to write a couple functions in Java, for performance).


After spending at least 50+ hours programming in both Julia and Clojure I have to disagree. Julia is more readable, faster, great for both imperative and functional programming and generally I found it easier to get things done.


While it is very fun to write Clojure, it must be said that it is relatively difficult to read, which hinders collaboration.


Partly because it became the language of choice for teaching CS basics instead of Scheme in good schools like MIT.



