One Page R: A Survival Guide to Data Science with R (togaware.com)
121 points by sytelus on April 2, 2014 | 17 comments



I was surprised at the difficulties I had using R at my previous job. It's simultaneously a language I enjoyed and one I'd hesitate to recommend to a friend, because of the flood of complaints that would be directed at me as they learned it.

I love the concept of vectorized operations, but God almighty did I struggle with the difference between indexing a list with list[1] vs. list[[1]]. Other hang-ups included the differences between lapply, sapply, and apply, and the difficulty of applying operations to the rows of a data frame. Other frustrations, like the naming conventions, are apparently the result of legacy code.
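For anyone who hasn't hit these yet, here is a minimal sketch of the gotchas mentioned above, using only base R (the variable names are made up for illustration):

```r
# `[` returns a sub-list; `[[` extracts the element itself.
x <- list(a = 1:3, b = "hello")
class(x[1])    # "list"    -- still a list, of length 1
class(x[[1]])  # "integer" -- the underlying vector

# lapply always returns a list; sapply tries to simplify the result.
lapply(x["a"], sum)  # a list: list(a = 6)
sapply(x["a"], sum)  # a named numeric vector: c(a = 6)

# Row-wise operations on a data frame go through apply with MARGIN = 1,
# which coerces each row to a vector first -- a common source of surprise.
df <- data.frame(p = 1:3, q = 4:6)
apply(df, 1, sum)  # 5 7 9
```

The `[`/`[[` distinction is the classic stumbling block: `[` always preserves the container type, while `[[` reaches inside it.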

I'd strongly recommend the book The R Inferno[1] by Patrick Burns. It eases you through the nine circles of R hell that most newcomers seem to endure.


Does anybody with experience doing analysis in both R and python have any insight as to whether one can replace the other, or do they significantly supplement each other? I get the impression that python with pandas/scikit-learn/scipy/numpy/matplotlib can be used almost completely in lieu of both R and matlab, but I don't feel that I have enough experience in R or matlab to make such a claim.


I am a pretty die-hard Python person, having used it as my main language for over 10 years. I think there is a tendency to make that claim because we want it to be true :)

Python is such a nice language that we want to use it for everything. But I found I became more productive once I got out of that mindset. Right now, I have a very multilingual workflow including Python, R, shell scripts, C++, and a handful of other DSLs (various SQL dialects, HTML templating, etc.)

I feel like I am finally productive working with data; I was always astounded by how MUCH code you have to write to work with data. There is a pretty big cost to learning all of that, and I won't lie and say I learned R quickly, but it was worth it.

I use Python to interface with MANY systems, often to generate cleanly formatted CSV files. R reads the CSV files and does data slicing and dicing (using the data.table package, an enhanced data frame). And then ggplot2 for plotting. And C++ for the big data stuff, and shell scripts to glue it all together (with concurrent processes, importantly).
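The R side of that handoff might look like the sketch below (the file and column names are invented for illustration; the first fwrite() just stands in for the Python step that would normally produce the CSV):

```r
library(data.table)

# Stand-in for the Python step: a cleanly formatted CSV on disk.
fwrite(data.table(category = c("a", "a", "b"), value = c(1, 2, 10)),
       "cleaned.csv")

# fread() is data.table's fast CSV reader; it returns an enhanced
# data frame ready for slicing and dicing.
dt <- fread("cleaned.csv")
summary_dt <- dt[, .(mean_value = mean(value)), by = category]

# Hand the summary off to the next stage of the pipeline.
fwrite(summary_dt, "summary.csv")
```

Because each stage reads and writes plain CSV, any stage can be rerun or replaced independently, which is the Unix-style modularity the comment describes.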

I find that this Unix-style architecture is actually more maintainable in the long run. Parts of your analysis and data cleaning become more modular and reusable. There are a lot of reasons why a boatload of Python packages aren't very cleanly reusable. You also end up with MUCH less code using the multi-language strategy vs. trying to do it all in one language (I've seen people try to do that both with Python and R).

ggplot2 is unrivaled; Python and Julia people are busily trying to copy it. Python and Julia are both also copying R's data frame structure. The area is moving extremely quickly now, so I'll be interested to see what progress they make. But there are actually areas where R the language is more suited to data analysis (has more of a Scheme core than Python, more appropriate data structures, lazy expression evaluation, etc.)
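What makes ggplot2 hard to copy is its layered grammar: data, aesthetic mappings, and geoms compose with `+`. A minimal sketch, assuming the ggplot2 package is installed and using the built-in mtcars data set:

```r
library(ggplot2)

# Grammar of graphics: data + aesthetic mappings + layered geoms.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders")

ggsave("mpg_by_weight.png", p, width = 6, height = 4)
```

Each `+` adds an independent layer, so swapping a scatter plot for a box plot, or faceting by another variable, is a one-line change rather than a rewrite.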

In the end I am using R for the same reason I chose to use Python over C/C++ -- because it enables you to write 5x less code (in the domain of data analysis, compared with Python). And once you use other tools, you see Python's flaws, like bad package management (R's is actually better), and a relatively bad REPL. And Python's C++-inspired class system is often a hindrance for data analysis; R is a more data-oriented language.

R definitely has its problems, including a lot of horrible R code out there. But it also has a lot more books and so forth. IMO data.table + ggplot2 alone make it worth it. I recommend reading Hadley Wickham's papers on "tidy data" and the "split-apply-combine" style of analysis. In that sense, learning R actually helps you learn how to do data analysis properly.
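The split-apply-combine pattern is simple enough to sketch in base R with no packages at all, using the built-in mtcars data set:

```r
# Split-apply-combine: split rows into groups, apply a summary
# to each piece, then combine the results back together.
pieces   <- split(mtcars, mtcars$cyl)                # split by cylinder count
summ     <- lapply(pieces, function(d) mean(d$mpg))  # apply a summary
combined <- do.call(rbind, summ)                     # combine the pieces

# The same pattern in a single step, via the formula interface:
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```

data.table, plyr/dplyr, and pandas's groupby are all, in essence, faster and more ergonomic versions of this three-step pattern.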


Thanks for the "tidy data" reference, getting that now. Have you explored dplyr yet, since you mentioned "split-apply-combine" and you like data.table?


I'm familiar with dplyr but haven't really used it. I prefer using stuff that's mature, and I'm sure that dplyr is going to undergo a lot of evolution in the early days.

I independently came to the same conclusion as this guy: http://blog.datascienceretreat.com/post/69789735503/r-the-go...

That is, "use everything from Hadley, except use data.table instead of plyr". This was before dplyr came out, so maybe that could change. But I kind of like the syntax of data.table, although I don't understand all of it.

Of course plyr is ridiculously slow and can't be used for even modest-sized data sets.
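The data.table syntax the parent likes boils down to the general form DT[i, j, by] — filter rows with i, compute with j, grouped by by. A small sketch, assuming the data.table package is installed (the column names are made up):

```r
library(data.table)

DT <- data.table(grp = c("a", "a", "b"), val = c(1, 2, 10))

# DT[i, j, by]: row filter (i), computation (j), grouping (by).
# .N is the number of rows in each group.
res <- DT[val > 0, .(total = sum(val), n = .N), by = grp]

# := adds or updates a column by reference, without copying the table --
# a big part of why data.table handles large data sets well.
DT[, doubled := val * 2]
```

Here `res` has one row per group ("a" totals 3 over 2 rows, "b" totals 10 over 1 row), and DT itself gains a `doubled` column in place.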


> Of course plyr is ridiculously slow and can't be used for even modest-sized data sets.

Right and that's what dplyr is supposed to fix. The benchmarks so far mark it at roughly the same performance as data.table. But, as you said, it's early days for dplyr ;) Thanks for your comments!


It depends. R has many, many more packages, and those packages tend to be more extensively peer-reviewed. If the sort of analysis you are doing wanders out of the mainstream, you are less likely to have to code something up from scratch than is the case with python/*. I also find R a lot more enjoyable than python to use interactively and for exploring data. On the other hand, python is fantastic for larger pieces of code, data wrangling, and any tasks where you can make use of python's massive number of (non-statistical) modules.


> those packages tend to be more extensively peer-reviewed

This is the key as far as I can see. I am fortunate enough to work closely with some big names in statistics. They do everything in R, there's not a thought of using another language. All the leading research in statistical methods gets published in R, thus the "reference" implementations become those R packages. Even if you find an equivalent implementation somewhere else, you can't say it's been through the rigour of peer review that the R version has.


For data munging, sorting/organizing, and basic analysis, I would say that matlab/python easily match R's capabilities or even exceed them (especially performance-wise). The true strength of R is in the available third-party packages. There are over 5000, with pretty much any statistical algorithm you can think of implemented by at least one.


This discussion seems to happen pretty often on HN. (Or maybe that's just because I click on any frontpage story with R in the title)

Anyways, I found this to be a good discussion: https://news.ycombinator.com/item?id=7030097


My experience is that R more often has advanced statistical packages you might want, but Python (particularly pandas) can be much faster. R's ggplot is also way better than anything in the Python ecosystem.

IPython's notebook actually makes it very easy to go back and forth between R and Python: http://nbviewer.ipython.org/github/ipython/ipython/blob/3607...

I actually use this pretty routinely so I can do my data-analysis in Python and my plotting using R's ggplot. It's a little awkward, but it works.


It really depends on what you want to do. For a lot of machine learning tasks and bread-and-butter statistical tools, you can get away with it. For a lot of more advanced statistics, it's going to be much, much harder.

It's also a question of reliability - Python might have what you need, but if it's a statistical problem, R is much, much more likely to have it.


I don't have much of a background in maths or statistics, but from my experiences mucking about with R... I know that python is going to be a hell of a lot more accessible to me as a programmer.


The fact that you are a programmer is exactly the reason R feels funny. Its predecessor (S) was originally just supposed to be an interactive interface for some Fortran libraries. It turned out to be useful to folks doing mathematical applications and grew into a language.

Most introductions to R don't cover it in the same way other languages do. I found http://adv-r.had.co.nz/ to be very helpful for clarifying matters in a manner that is more familiar to programmers.


R is really kind of a frustrating and inconsistent language, but the wealth of obscure packages plus plyr/dplyr make it easy to just get stuff done.


Wow, this is a really nice resource, especially for someone like me just getting started. The PDFs appear, at first glance, to be quite in-depth. Wondering if they include using tools like plyr/dplyr.


This is definitely a good starting resource. Some of the tutorials on R can be a bit too "academic" for some tastes, so it's helpful to have a guide on working with specific data sets.



