"R makes it easy to create DSLs thanks to three features of the language:
* R code is first-class. That is, R code can be manipulated like any other object (see sym(), lang() and node() for creating such objects). These objects containing R code are also called expressions (see is_expr()).
* Scope is first-class. Scope is the lexical environment that associates values to symbols in expressions. Environments can be created (see env()) and manipulated as regular objects.
* Finally, functions can capture the expressions that were supplied as arguments instead of being passed the value of these expressions (see enquo() and enexpr())."
dplyr is a testament to how these nonstandard features of R can be combined to produce a language for interactive data analysis whose convenience and clarity cannot be matched by traditional languages.
The key feature here is tidy evaluation [1], which takes a little getting used to since it favors functional programming idioms. (pull() is neat for certain use cases as well.)
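As a quick illustration, a sketch of a tidy-eval wrapper in dplyr 0.7-era syntax (mean_of is a made-up helper):

library(dplyr)
library(rlang)

mean_of <- function(df, var) {
  var <- enquo(var)                    # capture the column as a quosure
  summarise(df, mean = mean(!!var))    # unquote it inside the verb
}

mean_of(mtcars, mpg)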
Would you mind sharing some reasons why you find dplyr better than Python/pandas?
I'm genuinely curious, because I just started using pandas in a new job a few weeks ago, and it seems robust enough so far. I glanced through your link and didn't see any key differences.
My gripes so far with pandas are that it can be a bit verbose, e.g. when doing groupbys.
And I haven't quite grokked the indexing. As in, I never use indices; I always reset them after groupbys to get the grouping keys back as columns. And I find multi-indexes a hassle: e.g., if I group by a column and want a sum and a count (or a max and a min) in one go, the resulting multi-index means I have to use a tuple to access each column afterwards.
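For example (made-up frame):

import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 2, 3]})
out = df.groupby("g").agg({"v": ["sum", "count"]})

out[("v", "sum")]   # multi-index columns force tuple access

# the usual workaround: flatten the columns, reset the index
out.columns = ["_".join(c) for c in out.columns]
out = out.reset_index()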
Oh, and one actually annoying one: grouping by a column that contains NaNs silently drops those rows. Not the behaviour I'd expect, and it means every groupby has to be preceded by a fillna, which adds to the verbosity.
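A small repro of that behaviour (made-up data):

import pandas as pd

df = pd.DataFrame({"g": ["a", None, "b"], "v": [1, 2, 3]})

df.groupby("g")["v"].sum()                            # the None row silently disappears
df.fillna({"g": "missing"}).groupby("g")["v"].sum()   # keeps all three rows
# newer pandas releases also accept df.groupby("g", dropna=False)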
And just thought of another annoyance: integer columns are silently turned into floats if any row has a NaN. So your column of integer IDs, now floats, won't join with another table expecting ints (I've had to work around it by converting to strings).
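Likewise for the upcasting (made-up data; the nullable Int64 dtype only exists in later pandas releases):

import pandas as pd

pd.Series([1, 2, None]).dtype                  # float64: the NaN forces an upcast
pd.Series([1, 2, None], dtype="Int64").dtype   # Int64: nullable integers, if available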
Besides that, pandas seems pretty reasonable. I've found its use of masks to be pretty powerful, for instance.
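For instance (made-up frame):

import pandas as pd

df = pd.DataFrame({"x": [1, 5, 10], "y": ["a", "b", "c"]})
mask = (df["x"] > 2) & (df["y"] != "c")   # plain boolean Series

df[mask]                 # filter rows with the mask
df.loc[mask, "x"] = 0    # or assign through it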
This example seems a little odd, since it takes a data frame and mutates the x column so that every entry equals the sum of the second element of the y column and the third element of the z column.
It is odd since the whole column turns into one value.
It is quite simple in pandas, but as I said in another comment, your example is a bit weird, since it turns all values of x into the same value. But this is how to do it:
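Assuming the dplyr example was mutate(df, x = y[2] + z[3]), roughly:

df["x"] = df["y"].iloc[1] + df["z"].iloc[2]   # 0-based, so R's y[2] is iloc[1]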
If you are not using indices and multi-indices, you are missing out on the awesome advantages of using pandas. If you come from an R background, indices (rownames in R) are a real hassle, and you always want to keep them in columns.
But in pandas they are highly optimized and battle-tested, and a breeze to work with once you get the hang of them. They make merging dataframes easy, pivoting easy, data tidying easy, and so on.
However, while you are still learning the API, they can be a pain to use.
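For instance, index-aligned joins need no key bookkeeping (made-up frames):

import pandas as pd

left = pd.DataFrame({"v": [1, 2]}, index=pd.Index(["a", "b"], name="id"))
right = pd.DataFrame({"w": [10, 20]}, index=pd.Index(["a", "b"], name="id"))

left.join(right)   # aligned on the shared index, no on= needed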
While I agree that the pipe can easily be abused, I quickly got the hang of it and found it makes a lot of my code more readable. One could think of it as a "then".
my_column %>%
gsub(" ","",.) %>% # Remove whitespace, then...
gsub("[a-zA-Z]+","",.) %>% # Remove letters, then...
strsplit("-") %>% # Split on dashes, then...
lapply(as.numeric) %>% # Make each vector in the list numeric, then...
lapply(mean) # Calculate the mean of each list element
The alternatives here would be gigantic function calls that need to be read from the inside out, or multiple variable assignments.
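For comparison, the same pipeline as one nested call:

lapply(
  lapply(
    strsplit(gsub("[a-zA-Z]+", "", gsub(" ", "", my_column)), "-"),
    as.numeric),
  mean)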
I don't know why, but looking at the traditional base-R functions really hurts my eyes. The inconsistent API (like in gsub, where the string comes last) together with the unclear naming (very Unixy when you think about it) made it very hard for me when getting started with R.
Nowadays I would do the same with the tidyverse equivalents:
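Perhaps something like this, using stringr and purrr:

library(stringr)
library(purrr)

my_column %>%
  str_replace_all(" ", "") %>%          # Remove whitespace, then...
  str_replace_all("[a-zA-Z]+", "") %>%  # Remove letters, then...
  str_split("-") %>%                    # Split on dashes, then...
  map(as.numeric) %>%                   # Make each vector numeric, then...
  map_dbl(mean)                         # Calculate the mean of each element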
Something that I do now and then that really screws up readability is when I want to join a dataframe with another but forget to manipulate the second one first. Then it turns into something like:
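Roughly like this (df2 and the cleanup steps are invented for illustration):

df1 %>%
  left_join(
    df2 %>%
      filter(!is.na(id)) %>%
      select(id, value),
    by = "id")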
"R makes it easy to create DSLs thanks to three features of the language:
* R code is first-class. That is, R code can be manipulated like any other object (see sym(), lang() and node() for creating such objects). We also call expressions these objects containing R code (see is_expr()).
* Scope is first-class. Scope is the lexical environment that associates values to symbols in expressions. Environments can be created (see env()) and manipulated as regular objects.
* Finally, functions can capture the expressions that were supplied as arguments instead of being passed the value of these expressions (see enquo() and enexpr()). "
dplyr is a testament to how these nonstandard features of R can be combined to produce a language for interactive data analysis whose convenience and clarity cannot be matched by traditional languages.
[1] http://rlang.tidyverse.org/articles/tidy-evaluation.html