I'm confused. Is Python really so far ahead of the competition as to be considered the "language of choice for data science"?
My impression (and I am not a statistician/data scientist in my day job, so I would very much like to hear opposing perspectives) is that the R ecosystem is far more mature and widely adopted than scikit-learn for things like regression, classification, clustering etc.
The author also cites expensive MATLAB licenses as a driving force behind python adoption, but here too I'm skeptical. As a grad student, I get MATLAB for free. But I switched to R/pandas for data analysis because R has a native data structure for working with multidimensional datasets (i.e. data.frame).
To illustrate the utility of this, let's say you asked developers all over the US for their zipcode and salary and recorded the results in salary.poll.data. Here's an interesting question: what is the mean salary in each zipcode? In R, all sorts of libraries make this computation concise and highly readable. Using the excellent data.table package, you would do `salary.poll.data[, list(mean.salary.by.zip.code=mean(salary)), by=zipcode]`.
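For comparison, the pandas equivalent is about as concise. A sketch with made-up poll data (the column names and values here are assumptions, just to mirror the R example):

```python
import pandas as pd

# hypothetical poll results: one row per respondent
salary_poll_data = pd.DataFrame({
    "zipcode": ["02139", "02139", "94103"],
    "salary": [95000, 105000, 120000],
})

# mean salary per zipcode, analogous to the data.table one-liner above
mean_salary_by_zipcode = salary_poll_data.groupby("zipcode")["salary"].mean()
print(mean_salary_by_zipcode)
```

Same group-by semantics, same one-expression answer.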
No such libraries exist in popular usage for MATLAB. You'd have to roll your own, or more likely, write a lot of crufty loops and conditional statements. (Or use higher order functions with map/reduce/filter, which, by the way, you would have to implement yourself).
For me at least, just having the right data structures for working with data makes R/pandas a clear winner for doing statistical analysis of data.
My impression is that R and MATLAB (and Julia) certainly still have their advantages. But Python with pandas/scikit-learn/matplotlib is almost as good as R at data munging and exploration, and with numpy/scipy/Cython, as good as or better than MATLAB at complicated matrix calculations. Meanwhile, Python has its own unique features, like IPython notebooks. So it's at least competitive with R and MATLAB at a feature level, if maybe not for every possible use case.
The thing that draws me towards Python is that it's a well designed, general purpose programming language. The syntax is sensible, the language allows real OO and FP abstractions, you have easy access to basic data structures like lists and hashmaps, and there's a huge ecosystem of third-party libraries to build on. Things that are stupidly difficult in MATLAB and R, like file/network IO, string processing, or building a GUI or web interface, are straightforward in Python. If you've ever tried to run MATLAB on a cluster, it's an absolute nightmare. You would never, ever think about building a production system in MATLAB or R. But these things are easy in Python. There's something that's just really nice about having access to first-class analytics tools in the same language that you're building your systems in.
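To make the glue-code point concrete, here is the sort of file-IO-plus-string-processing chore that takes a few lines of standard-library Python (toy data, hypothetical filename):

```python
import tempfile
from pathlib import Path

# write and re-read a tiny CSV-style file, then munge it -- the kind of
# glue work that's painful in MATLAB or R but trivial here
path = Path(tempfile.gettempdir()) / "salaries.txt"
path.write_text("02139,95000\n94103,120000\n")

rows = [line.split(",") for line in path.read_text().splitlines()]
total = sum(int(salary) for _, salary in rows)
print(total)  # 215000
```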
> The syntax is sensible, the language allows real OO and FP abstractions, you have easy access to basic data structures like lists and hashmaps, and there's a huge ecosystem of third-party libraries to build on
Other than the last one, all of these are available in Matlab and R as well; and all three have a huge set of libraries -- the difference is about the focus of those libraries imo. Python does have more general-purpose libraries, just as R has more statistical packages.
I have certainly run a lot of Matlab on a cluster; in fact, I'd say safely that about 70% of the code I see running on clusters around here is Matlab. I've also seen a few production systems in (mostly) R -- I actually suspect that at least on Windows, deploying an R system may be easier, since you just install R and then use internal functionality to get the packages you need; Python (with all necessary extensions) seems relatively tricky to get running.
All that being said, I do agree that if you are building a production system where most of the code is related to interfacing with other systems, or GUIs, etc., and the data analysis is a small and non-interactive part (i.e. no data exploration), Python is a very reasonable choice if you can keep the whole project in it.
I think your point is very fair in that the author is exaggerating. But,
"highly readable"
Are we talking about the same language? The only reason R ever saw use is its adoption by the statistical community, but all other things considered, R is a shitty language.
-Awfully difficult syntax
-Extremely steep learning curve
-Nightmare to debug (error and warning messages are the most cryptic I've ever worked with)
-Some of the worst documentation I've ever encountered
-Many transformations are not visible to the user (how special characters are handled for example)...
-The largest set of statistical libraries...but have you ever tried using any? Good luck! Next to nonexistent documentation for most, and even with a background in mathematical statistics, I still have a hard time following examples
-More often than not, lack of backwards compatibility
I'm a statistician by training and data scientist by day job, and have used R for years. I really admire what some have tried to do with the language (Hadley). But it's a result of the statistical community being light years behind on computational training, and it is by no means a good language; it just happened to come along at the right time and had some really genius initial developers.
Frankly, the "language of choice for data science" simply doesn't exist yet. You just use what works best for the problem at the time, and more often than not that involves switching between many languages.
I just wanted to interject here that this limitation is no longer true in Matlab. Take a look at dataset[1] and grpstats[2] in the Statistics Toolbox. They make data frame-style computations much nicer. I was banging my head against Matlab control structures until I found those.
So you don't need a stats license for that anymore. The question is how many functions will accept this datatype, which is one of the key problems with some of the more advanced datatypes (timeseries has similar issues).
Oh, very nice! It's been about a year since I've worked in Matlab on a daily basis, so it's good to see they brought that into base.
I didn't have much trouble with datasets, because it's fairly painless to convert to a matrix with double(ds) if you set up your text columns as ordinals and nominals rather than cell arrays.
In the post it says "next to R", so I don't really see a contradiction here. Both have their pros and cons. I think any data scientist should know how to use both.
R:
- has more ML libraries.
- is more mature than pandas (look at how pandas handles categorical data).
Python:
- has more general purpose libraries (e.g. language detection, html boilerplate extraction, query apis)
- can handle out-of-core learning (datasets which don't fit into memory)
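On the out-of-core point: scikit-learn exposes `partial_fit` on several estimators, so you can stream batches instead of loading the full dataset at once. A minimal sketch with synthetic data (each loop iteration stands in for reading one chunk from disk):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # must be declared up front for partial_fit

# pretend each iteration reads one chunk from disk
for _ in range(20):
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] > 0).astype(int)  # a trivially separable target
    clf.partial_fit(X, y, classes=classes)

X_test = rng.normal(size=(500, 5))
y_test = (X_test[:, 0] > 0).astype(int)
print(round(clf.score(X_test, y_test), 2))
```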
> My impression (and I am not a statistician/data scientist in my day job, so I would very much like to hear opposing perspectives) is that the R ecosystem is far more mature and widely adopted than scikit-learn for things like regression, classification, clustering etc.
Python scores over R in one aspect - text processing.
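A tiny example of the kind of text munging that feels natural with Python's standard library (the input string here is made up):

```python
import re

text = "Contact us:  alice@example.com, bob@example.com  "
# pull out email-like tokens with a quick regex
emails = re.findall(r"[\w.]+@[\w.]+", text)
print(emails)  # ['alice@example.com', 'bob@example.com']
```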
Why did I get a "-1" on this? Has anyone compared text processing between Perl and Python? Based on my daily work, I really cannot agree that Python is superior to Perl at this task.
A language being popular doesn't mean it's good at everything.
Because in a discussion of "a or b?", a comment on the usefulness of c is irrelevant. If you were attempting to propose that perl is a better data science language for reasons including its text processing, then you needed to say that. If you weren't attempting to do that, then nobody cares about perl in this context.