I'm confused. Is Python really so far ahead of the competition as to be considered the "language of choice for data science"?
My impression (and I am not a statistician/data scientist in my day job, so I would very much like to hear opposing perspectives) is that the R ecosystem is far more mature and widely adopted than scikit-learn for things like regression, classification, clustering etc.
The author also cites expensive MATLAB licenses as a driving force behind python adoption, but here too I'm skeptical. As a grad student, I get MATLAB for free. But I switched to R/pandas for data analysis because R has a native data structure for working with multidimensional datasets (i.e. data.frame).
To illustrate the utility of this, let's say you asked developers all over the US for their zipcode and salary and recorded the results in salary.poll.data. Here's an interesting question: what is the mean salary in each zipcode? In R, all sorts of libraries make this computation concise and highly readable. Using the excellent data.table package, you would do `salary.poll.data[, list(mean.salary.by.zip.code=mean(salary)), by=zipcode]`.
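For comparison, the pandas equivalent is about as concise. A sketch with made-up poll data (the column names and values here are assumptions, just to mirror the R example):

```python
import pandas as pd

# hypothetical poll results: one row per respondent
salary_poll_data = pd.DataFrame({
    "zipcode": ["02139", "02139", "94103"],
    "salary": [95000, 105000, 120000],
})

# mean salary per zipcode, analogous to the data.table one-liner above
mean_salary_by_zipcode = salary_poll_data.groupby("zipcode")["salary"].mean()
print(mean_salary_by_zipcode)
```

Same group-by semantics, same one-expression answer.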
No such libraries exist in popular usage for MATLAB. You'd have to roll your own, or more likely, write a lot of crufty loops and conditional statements. (Or use higher order functions with map/reduce/filter, which, by the way, you would have to implement yourself).
For me at least, just having the right data structures for working with data makes R/pandas a clear winner for doing statistical analysis of data.
My impression is that R and MATLAB (and Julia) certainly still have their advantages. But Python with pandas/scikit-learn/matplotlib is almost as good as R at data munging and exploration, and with numpy/scipy/Cython, as good as or better than MATLAB at complicated matrix calculations. Meanwhile, Python has its own unique features, like IPython notebooks. So it's at least competitive with R and MATLAB at a feature level, if maybe not for every possible use case.
The thing that draws me towards Python is that it's a well designed, general purpose programming language. The syntax is sensible, the language allows real OO and FP abstractions, you have easy access to basic data structures like lists and hashmaps, and there's a huge ecosystem of third-party libraries to build on. Things that are stupidly difficult in MATLAB and R, like file/network IO, string processing, or building a GUI or web interface, are straightforward in Python. If you've ever tried to run MATLAB on a cluster, it's an absolute nightmare. You would never, ever think about building a production system in MATLAB or R. But these things are easy in Python. There's something that's just really nice about having access to first-class analytics tools in the same language that you're building your systems in.
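To make the glue-code point concrete, here is the sort of file-IO-plus-string-processing chore that takes a few lines of standard-library Python (toy data, hypothetical filename):

```python
import tempfile
from pathlib import Path

# write and re-read a tiny CSV-style file, then munge it -- the kind of
# glue work that's painful in MATLAB or R but trivial here
path = Path(tempfile.gettempdir()) / "salaries.txt"
path.write_text("02139,95000\n94103,120000\n")

rows = [line.split(",") for line in path.read_text().splitlines()]
total = sum(int(salary) for _, salary in rows)
print(total)  # 215000
```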
> The syntax is sensible, the language allows real OO and FP abstractions, you have easy access to basic data structures like lists and hashmaps, and there's a huge ecosystem of third-party libraries to build on
Other than the last one, all of these are available in Matlab and R as well; and all three have a huge set of libraries -- the difference is about the focus of those libraries imo. Python does have more general-purpose libraries, just as R has more statistical packages.
I have certainly run a lot of Matlab on a cluster; in fact, I'd say safely that about 70% of the code I see running on clusters around here is Matlab. I've also seen a few production systems in (mostly) R -- I actually suspect that at least on Windows, deploying an R system may be easier, since you just install R and then use internal functionality to get the packages you need; Python (with all necessary extensions) seems relatively tricky to get running.
All that being said, I do agree that if you are building a production system where most of the code is related to interfacing with other systems, or GUIs, etc., and the data analysis is a small and non-interactive part (i.e. no data exploration), Python is a very reasonable choice if you can keep the whole project in it.
I think your point is very fair in that the author is exaggerating. But,
"highly readable"
Are we talking about the same language? The only reason R ever saw use is its adoption by the statistical community, but all other things considered, R is a shitty language.
-Awfully difficult syntax
-Extremely steep learning curve
-Nightmare to debug (error and warning messages are the most cryptic I've ever worked with)
-Some of the worst documentation I've ever encountered
-Many transformations are not visible to the user (how special characters are handled for example)...
-The largest set of statistical libraries...but have you ever tried using any? Good luck! Next to nonexistent documentation for most, and even with a background in mathematical statistics, I still have a hard time following examples
-More often than not, lack of backwards compatibility
I'm a statistician by training and data scientist by day job, and have used R for years. I really admire what some have tried to do with the language (Hadley). But it's a result of the statistical community being light years behind on computational training, and it is by no means a good language; it just happened to come along at the right time and had some really genius initial developers.
Frankly, the "language of choice for data science" simply doesn't exist yet. You just use what works best for the problem at the time, and more often than not that involves switching between many languages.
I just wanted to interject here that this limitation is no longer true in Matlab. Take a look at dataset[1] and grpstats[2] in the Statistics Toolbox. They make data frame-style computations much nicer. I was banging my head against Matlab control structures until I found those.
So you don't need a stats license for that anymore. The question is how many functions will accept this datatype, which is one of the key problems with some of the more advanced datatypes (timeseries has similar issues).
Oh, very nice! It's been about a year since I've worked in Matlab on a daily basis, so it's good to see they brought that into base.
I didn't have much trouble with datasets, because it's fairly painless to convert to a matrix with double(ds) if you set up your text columns as ordinals and nominals rather than cell arrays.
In the post it says "next to R", so I don't really see a contradiction here. Both have their pros and cons. I think any data scientist should know how to use both.
R:
- has more ML libraries.
- is more mature than pandas (look at how pandas handles categorical data).
Python:
- has more general purpose libraries (e.g. language detection, html boilerplate extraction, query apis)
- can handle out-of-core learning (datasets which don't fit into memory)
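On the out-of-core point: scikit-learn exposes `partial_fit` on several estimators, so you can stream batches instead of loading the full dataset at once. A minimal sketch with synthetic data (each loop iteration stands in for reading one chunk from disk):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # must be declared up front for partial_fit

# pretend each iteration reads one chunk from disk
for _ in range(20):
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] > 0).astype(int)  # a trivially separable target
    clf.partial_fit(X, y, classes=classes)

X_test = rng.normal(size=(500, 5))
y_test = (X_test[:, 0] > 0).astype(int)
print(round(clf.score(X_test, y_test), 2))
```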
> My impression (and I am not a statistician/data scientist in my day job, so I would very much like to hear opposing perspectives) is that the R ecosystem is far more mature and widely adopted than scikit-learn for things like regression, classification, clustering etc.
Python scores over R in one aspect - text processing.
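A tiny example of the kind of text munging that feels natural with Python's standard library (the input string here is made up):

```python
import re

text = "Contact us:  alice@example.com, bob@example.com  "
# pull out email-like tokens with a quick regex
emails = re.findall(r"[\w.]+@[\w.]+", text)
print(emails)  # ['alice@example.com', 'bob@example.com']
```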
Why did I get a "-1" on this? Has anyone compared text processing between Perl and Python? Based on my daily work, I really cannot agree that Python is superior to Perl at this task.
A language being popular doesn't mean it's good at everything.
Because in a discussion of "a or b?", a comment on the usefulness of c is irrelevant. If you were attempting to propose that perl is a better data science language for reasons including its text processing, then you needed to say that. If you weren't attempting to do that, then nobody cares about perl in this context.