ARrgh: a newcomer's angry guide to R (2013)

nkrumm · on Dec 15, 2014

I gave up on R long ago. On python, pandas, numpy, scipy, scikit-learn and statsmodels are often enough to replace basic R functionality. In the case of needing an actual R package, I've found it's almost always worth the time to build a wrapper on rpy2, or use ipython notebook's rmagic extension.

On the note of building wrappers-- it's still a good idea to rpy2-wrap basic statistical tests and present in both ecosystems. The R functions are battle-tested and have been looked over by far more statisticians and mathematicians than their python counterparts (or so it seems).

hadley · on Dec 15, 2014

I gave up on Python long ago. Anything I want to do in Python, I can do just as easily in R.

Rcpp provides a great environment for intermingling high performance C++ with expressive R code. R has all the features you'd want for a modern development environment: good IDE, unit testing, documentation conventions, ... It's easy to turn analyses into interactive apps with shiny. There's a package for every model you can think of. You can connect to databases, you can talk to web apis and scrape web pages.

jzwinck · on Dec 15, 2014

R makes it hard for teams to collaboratively develop code in libraries. No, R packages are not a solution, because each user has to "build" and "install" them every time another user checks in an update.

R doesn't have a debugger on par with Python's pdb. It also doesn't show you tracebacks by default when errors occur, which is a huge problem for mere mortals when something goes wrong.

If you're working alone, using well-tested libraries, none of the above will stop you. But if you work on a team, these are big problems (for which Python has solutions built in).

Since you mentioned connecting to databases, let's talk about using MySQL in R. I downloaded a popular package to do that, and found that the function to query the database is called "fetch()". Just "fetch()", not "mysql.fetch()", not "database_fetch()", not "connection$fetch()", but "fetch()". That kind of sloppy naming is par for the course in R, and it's a problem when your project becomes larger than yourself.

hadley · on Dec 15, 2014

Most of those problems have existing solutions.

1. Do you know about browser()? That gives you an interactive debugger on par with most programming languages. Also see the GUI wrapper to the debugger in RStudio.

2. Share a library (a collection of installed packages). I'm not sure how you're doing this with python libraries, but there's probably a straightforward equivalent in R.

3. Once MySQL implements the new DBI 0.3.0 interface, the function name will be dbFetch(). But at heart the function is named that way because R uses generic function style OO, rather than message-passing OO - it's nothing to do with sloppiness. I'd recommend reading up a bit on the advantages and disadvantages of each style of OO.
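A minimal sketch of what "generic function style OO" means in practice, using S3 for brevity (DBI itself is built on S4 generics). The generic `fetch_rows()` and the classes `mysql_result` and `sqlite_result` are hypothetical names invented for this illustration; they are not part of DBI or of any MySQL package.

```r
# The verb is an ordinary function; R picks the implementation from the
# class of the first argument instead of calling a method "on" an object.
fetch_rows <- function(res, ...) UseMethod("fetch_rows")

# Methods for two hypothetical result classes.
fetch_rows.mysql_result <- function(res, ...) {
  paste("MySQL rows from", res$query)
}
fetch_rows.sqlite_result <- function(res, ...) {
  paste("SQLite rows from", res$query)
}

# Hypothetical result objects: plain lists tagged with a class attribute.
mysql_res <- structure(list(query = "SELECT 1"), class = "mysql_result")
sqlite_res <- structure(list(query = "SELECT 1"), class = "sqlite_result")

out_mysql <- fetch_rows(mysql_res)
out_sqlite <- fetch_rows(sqlite_res)
print(out_mysql)
print(out_sqlite)
```

The same call, `fetch_rows(x)`, works for every backend that supplies a method, which is why the name carries no prefix such as `mysql.`.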

IndianAstronaut · on Dec 15, 2014

Shiny is incredible. My company might be delivering a customer facing product using Shiny.

joyofdata · on Dec 15, 2014

I do second that - http://www.joyofdata.de/blog/hierarchical-clustering-with-r/ :)

epistasis · on Dec 15, 2014

>The documentation is inanely bad. I can't explain it.

I'm surprised that the author is saying this as I've experienced exactly the opposite. R completely documents all the arguments and outputs of its functions, and documentation is easy to pull up by function, and this is almost universal both for distribution and community packages. Additionally the documentation often includes vignettes that show full examples.

In contrast, Python documentation is most often presented on long pages that mention functions, but do not describe arguments or the output. I've found almost no Python documentation to be adequate, outside of some of the core functions. And when it is adequate, it's exceedingly verbose, and lacking in examples, basically the worst of all worlds.

mapcar · on Dec 15, 2014

I agree, I've read other gripes about R function documentation but it's one of the better ones for community software. Python's documentation seems focused on implementation from a programmer's perspective, but often not as helpful for actual application of the function.

acqq · on Dec 15, 2014

Wow, the language used to code formulas in which it's dangerous to use the single-letter variables:

http://tim-smith.us/arrgh/atomic.html

"This also means that you shouldn't ever assign useful quantities to variables named T and F. Sorry. Other variable names that you cannot use are c, q, t (!), C, D, and I."

Note the contradiction of that limitation and the name of the language. Makes the name even more exceptional.

Is he right? What's with the scope? Can't I introduce a new T in my function thus just hiding the global one from it, but otherwise not disturbing anything? (I don't know R, I'm just asking, reading that the variables have the function scope)

rcthompson · on Dec 15, 2014

Yes, all those single-letter names are just ordinary variables that you can overwrite. Doing so is nearly always a terrible idea.

The article is being a little unclear when it says "cannot use". You can use literally any variable name in R if you really want to. If the name you want is already a reserved word (e.g. "for", "else", "function"), or if it is not a syntactically valid token (e.g. '@!":%$>"@;'), then you just have to enclose it in backquotes. So the following is valid R:

    `for` <- 1:5
    `function` <- 5:1
    `TRUE` <- `for` / `function`
    `@!":%$>"@;` <- `TRUE`^2
    print(`@!":%$>"@;`)

mapcar · on Dec 15, 2014

and if you overwrite variables like "c", you can always invoke the original concatenation function as "base::c".
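A short sketch of that behaviour, assuming a fresh session where `c` has not been redefined. It also shows a detail that softens the problem: when a name is used in call position, R skips bindings that are not functions.

```r
c <- 5                          # a plain, non-function variable named c
v1 <- c(1, 2, 3)                # still concatenates: non-function bindings are
                                # skipped when a name is used as a function

c <- function(...) "shadowed"   # masking c with a function does change c(...)
v2 <- c(1, 2, 3)
v3 <- base::c(1, 2, 3)          # the original stays reachable via its namespace

rm(c)                           # remove the mask again
print(v1)
print(v2)
print(v3)
```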

0942v8653 · on Dec 15, 2014

You don't have to declare a variable before you use it, so there's no such thing as "a new T".

acqq · on Dec 15, 2014

But can I declare the local t in the function? And is the t in the function parameter local?

hadley · on Dec 15, 2014

Yes and yes. R is lexically scoped
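A minimal sketch of both cases, assuming the global `T` still has its default value. The function name `scale_it` is made up for the example.

```r
scale_it <- function(t) {   # t is a parameter, local to each call
  T <- 10                   # local T; hides the global T only inside this call
  t * T
}

res <- scale_it(3)
global_T_intact <- isTRUE(T)   # the global T is untouched by the call
print(res)
print(global_T_intact)
```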

hadley · on Dec 15, 2014

It's not really as bad as all that. The reason to avoid using those names is to avoid confusing the reader, not because it causes problems for R

a_bonobo · on Dec 15, 2014

You can cause some mischief - if you write a package, or distribute a .Rprofile to colleagues, you can hide somewhere in there a

    temp <- F
    F <- T
    T <- temp

and if your colleagues are lazy and use T and F (instead of TRUE and FALSE) fun and very gnarly to debug things are going to happen!

Nichts ist wahr! Alles ist erlaubt!

hadley · on Dec 15, 2014

There are way worse things you can do than that! e.g.

    `(` <- function(x) {
      if (is.numeric(x) && runif(1) < 0.1) {
        x * 1.1
      } else {
        x
      }
    }

;)
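The trick works because operators and even parentheses are ordinary functions in R. Here is a deterministic sketch of the same idea, confined to a `local()` block so that nothing leaks into the session; the `+ 1` override is an arbitrary choice for illustration.

```r
plus <- `+`(1, 2)      # operators can be called like any other function
paren <- `(`(5)        # so can the parenthesis, which just returns its argument

bumped <- local({
  # Local override: only code evaluated inside this block sees it.
  `(` <- function(x) if (is.numeric(x)) x + 1 else x
  (1)                  # parsed as a call to `(`, so the override runs
})

unchanged <- (1)       # outside the block the built-in `(` is used
print(c(plus, paren, bumped, unchanged))
```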

a_bonobo · on Dec 16, 2014

That is beautiful, thank you very much! That's the kind of "bug" people don't even notice.

weissguy · on Dec 15, 2014

If I ever strike it rich, I swear to god I'm donating $5,000,000 to the cause of reaching total feature parity between the best of R's packages and NumPy/SciPy.

otoburb · on Dec 15, 2014

If you strike it rich, would you like to hedge a bit by betting a percentage of the $5M donation on a promising horse in the race called Julia?

_almosnow · on Dec 15, 2014

I'd really like Julia to become the winner of that race.

I've used everything but Octave (sorry Stallman :[) and coming from a CS background, no other language/platform made me feel more at home than Julia.

x0x0 · on Dec 15, 2014

you are underestimating the price of that by at least an order of magnitude. That is one of the biggest reasons people like R. R also has the notion of NA, which is different than NaN, built into the language.
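For readers who haven't met the distinction, a small sketch using only base R: `NA` marks a missing observation and exists for every atomic type, while `NaN` is the floating-point "not a number" value.

```r
x <- c(1, NA, 3)
m_default <- mean(x)               # NA: the missing value propagates
m_clean <- mean(x, na.rm = TRUE)   # 2: missing values dropped on request

nan_val <- 0 / 0                   # NaN, an undefined numeric result
nan_is_na <- is.na(nan_val)        # TRUE: NaN counts as missing
na_is_nan <- is.nan(NA)            # FALSE: NA is not NaN

words <- c("a", NA)                # NA also works in character vectors
print(is.na(words))
```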

The other huge reason for R adoption is it makes running stat analyses very simple, so for all the people who aren't programmers, and don't wish to be programmers, R is an awesome choice. The ability to, in 3 simple lines of R, load data from a csv, run a glm, and get a sophisticated report on the model is awesome.
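A sketch of that three-line workflow. To keep it self-contained, the built-in `mtcars` data set is first written to a temporary csv as a stand-in for a real data file; the model `am ~ wt` is just an arbitrary example, not something from the comment above.

```r
# Setup only: create a csv to load (stand-in for your own file).
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# The three lines: load, fit, report.
cars <- read.csv(csv_path)
fit <- glm(am ~ wt, data = cars, family = binomial)
print(summary(fit))
```

`summary()` prints coefficients, standard errors, p-values and deviance in one go, which is the "report" part.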

collyw · on Dec 15, 2014

Excel is awesome in a similar way, no?

IndianAstronaut · on Dec 15, 2014

Hadley Wickham's work has made the R environment much more tolerable and interesting. Because of it, R can be a pleasure to work with.

weissguy · on Dec 15, 2014

The battle between R and NumPy reminds me of the competition between American and Japanese auto manufacturers. Did American car companies successfully play QA catchup in the past 10 years? Probably, and God knows they needed to. Meanwhile, Japanese car manufacturers were building on 35 years of QA excellence the whole time.

I would be far more excited about the possibilities created by a cornucopia of new stats/dataviz functionality built into Python than I would be about some packages that make R a bit less terrible to write.

hadley · on Dec 15, 2014

The whole R sucks as a language thing is getting pretty old. There are some things that R is really good at, there are some things that Python is really good at. Just because R comes from a somewhat unusual ancestry (firmly rooted in functional programming with immutable data structures and generic function style OO) doesn't make it bad.

jzwinck · on Dec 15, 2014

As recently as a few months ago, R used a form of "reference counting" which only allowed three values in the counter: 0, 1, and 2 [1]. This might be caused by unusual ancestry or it might not, but it's hard to stop saying R sucks when it does that in 2014.

[1] http://stackoverflow.com/a/25384396/4323

hadley · on Dec 15, 2014

It still does that. But it's an optimization for editing in place, not related to gc
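The counter itself is internal, but the semantics it protects are easy to observe. This sketch only shows the visible behaviour (modifying one name never changes another); whether and when R copies under the hood is an implementation detail not demonstrated here. The function `bump_first` is a made-up name.

```r
a <- c(1, 2, 3)
b <- a            # b refers to the same values; no visible difference yet
b[1] <- 99        # modifying b must leave a alone

bump_first <- function(v) {
  v[1] <- v[1] + 1   # changes the function's own view of v only
  v
}
d <- bump_first(a)

print(a)   # the original vector
print(b)
print(d)
```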

joyofdata · on Dec 15, 2014

Actually having lots of people complain about a programming language is a good indicator for that language doing well:

- JavaScript

- PHP

- C++

- ...

On that note, Hadley, R IS awesome and a large reason for that is your contribution to it! Thanks and keep up the good work!

hadley · on Dec 15, 2014

Thanks!

alilja · on Dec 15, 2014

The worst part of R is that array indices begin at 1, and trying to get an array at 0 will fail silently by returning 0. I've spent many a night trying to figure out why all my data is wrong because my_array[0] * frame$column is returning the wrong numbers.

princeb · on Dec 15, 2014

this gets brought up again and again with no agreement and the only advice I know how to give you is that when working in any language that has mathematics as its primary focus (mathematica, matlab, Julia, R), you use 1-index, and other languages you use 0-index.

redacted · on Dec 15, 2014

alilja's issue isn't 1-indexing, it's R's completely insane decision to return usable numerical values for an array access error. In Mathematica or Matlab accidentally using 0-indexing would lead to obvious errors, while in R it often doesn't - especially if you have sparse/many zero data already

hadley · on Dec 16, 2014

That's not true. If you index with 0 you get a zero length vector back
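A quick sketch of what actually happens, using base R only. The last two lines are a guess at how the "returns 0" impression can arise: arithmetic on a zero-length vector yields another zero-length vector, and summing that gives 0.

```r
x <- c(10, 20, 30)

z <- x[0]                            # zero-length numeric vector, not 0
len_z <- length(z)                   # 0
first <- x[c(0, 1)]                  # zeros in an index are dropped: 10

prod_len <- length(z * c(1, 2, 3))   # 0: the result is also zero-length
total <- sum(z)                      # 0: the sum of nothing is zero
print(z)
print(total)
```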

kephra · on Dec 15, 2014

The worst point of R is debugging. R scripts often fail without reporting any line number, even if the script starts with "options(error=traceback)". And I have never seen line numbers for warnings.

So you get errors, crashes and warnings, and the only way to debug R is to inject message() statements all over the code.

hadley · on Dec 16, 2014

You probably want options(error = recover). If you're not getting line numbers you need to upgrade R. You might find http://adv-r.had.co.nz/Exceptions-Debugging.html helpful.
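`options(error = recover)` is meant for interactive sessions. For non-interactive scripts, one possible approach (a sketch, not the only way) is to capture the call stack at the moment of the error with `withCallingHandlers()` and `sys.calls()`. The functions `f` and `g` are toy stand-ins for real code.

```r
f <- function(x) g(x)
g <- function(x) stop("boom")

trace_log <- NULL
result <- tryCatch(
  withCallingHandlers(
    f(1),
    # Runs while the failing frames are still on the stack.
    error = function(e) trace_log <<- sys.calls()
  ),
  error = function(e) conditionMessage(e)   # then recover with the message
)

# First line of each recorded call, outermost first.
frames <- vapply(as.list(trace_log), function(cl) deparse(cl)[1], character(1))
print(result)
print(frames)
```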

joyofdata · on Dec 15, 2014

> The more you learn about the R language, the worse it will feel.

The opposite is true in my experience.

> R makes me want to kick things almost every time I use it.

Maybe R is not your biggest problem.

> The documentation is inanely bad. I can't explain it.

Good point!