"R makes it easy to create DSLs thanks to three features of the language:
* R code is first-class. That is, R code can be manipulated like any other object (see sym(), lang() and node() for creating such objects). These objects containing R code are also called expressions (see is_expr()).
* Scope is first-class. Scope is the lexical environment that associates values to symbols in expressions. Environments can be created (see env()) and manipulated as regular objects.
* Finally, functions can capture the expressions that were supplied as arguments instead of being passed the value of these expressions (see enquo() and enexpr())."
dplyr is a testament to how these nonstandard features of R can be combined to produce a language for interactive data analysis whose convenience and clarity cannot be matched by traditional languages.
The key feature here is tidy evaluation [1], which takes a little getting used to since it favors functional programming idioms. (pull() is neat for certain use cases as well.)
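As a quick illustration, a sketch of a tidy-eval wrapper in dplyr 0.7-era syntax (mean_of is a made-up helper):

library(dplyr)
library(rlang)

mean_of <- function(df, var) {
  var <- enquo(var)                    # capture the column as a quosure
  summarise(df, mean = mean(!!var))    # unquote it inside the verb
}

mean_of(mtcars, mpg)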
Would you mind sharing some reasons why you find dplyr better than Python/pandas?
I'm genuinely curious, because I just started using pandas in a new job a few weeks ago, and it seems robust enough so far. I glanced through your link and didn't see any key differences.
My gripes so far with pandas are that it can be a bit verbose, e.g. when doing groupbys.
And I haven't quite grokked the indexing. As in, I never use indices; I always reset them after groupbys to get the grouping keys back as columns. And I find multi-indexes a hassle: e.g., if I group by a column and want a sum and a count (or a max and a min) in one go, the resulting multi-index means I have to use a tuple to access each column afterwards.
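For example (made-up frame):

import pandas as pd

df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 2, 3]})
out = df.groupby("g").agg({"v": ["sum", "count"]})

out[("v", "sum")]   # multi-index columns force tuple access

# the usual workaround: flatten the columns, reset the index
out.columns = ["_".join(c) for c in out.columns]
out = out.reset_index()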
Oh, and one actually annoying one: grouping by a column that contains NaNs silently drops those rows. Not the behaviour I'd expect, and it means every groupby has to be preceded by a fillna, which adds to the verbosity.
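A small repro of that behaviour (made-up data):

import pandas as pd

df = pd.DataFrame({"g": ["a", None, "b"], "v": [1, 2, 3]})

df.groupby("g")["v"].sum()                            # the None row silently disappears
df.fillna({"g": "missing"}).groupby("g")["v"].sum()   # keeps all three rows
# newer pandas releases also accept df.groupby("g", dropna=False)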
And just thought of another annoyance: integer columns are silently turned into floats if any row has a NaN. So your column of integer IDs, now floats, won't join with another table expecting ints (I've had to work around it by converting to strings).
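Likewise for the upcasting (made-up data; the nullable Int64 dtype only exists in later pandas releases):

import pandas as pd

pd.Series([1, 2, None]).dtype                  # float64: the NaN forces an upcast
pd.Series([1, 2, None], dtype="Int64").dtype   # Int64: nullable integers, if available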
Besides that, pandas seems pretty reasonable. I've found its use of masks to be pretty powerful, for instance.
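For instance (made-up frame):

import pandas as pd

df = pd.DataFrame({"x": [1, 5, 10], "y": ["a", "b", "c"]})
mask = (df["x"] > 2) & (df["y"] != "c")   # plain boolean Series

df[mask]                 # filter rows with the mask
df.loc[mask, "x"] = 0    # or assign through it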
This example seems a little odd, since it takes a data frame and mutates the x column so that every entry equals the sum of the second element of the y column and the third element of the z column.
It is odd since the whole column turns into one value.
It is quite simple in pandas, but as I said in another comment, your example is a bit weird, since it turns all values of x into the same value. But this is how to do it:
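Assuming the dplyr example was mutate(df, x = y[2] + z[3]), roughly:

df["x"] = df["y"].iloc[1] + df["z"].iloc[2]   # 0-based, so R's y[2] is iloc[1]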
If you are not using indices and multi-indices, you are missing out on the awesome advantages of using pandas. If you come from an R background, indices (rownames in R) are a real hassle, and you always want to keep them in columns.
But in pandas they are highly optimized and battle-tested, and a breeze to work with once you get the hang of them. They make merging dataframes easy, pivoting easy, data tidying easy, and so on.
However, while you are still learning the API, they can be a pain to use.
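For instance, index-aligned joins need no key bookkeeping (made-up frames):

import pandas as pd

left = pd.DataFrame({"v": [1, 2]}, index=pd.Index(["a", "b"], name="id"))
right = pd.DataFrame({"w": [10, 20]}, index=pd.Index(["a", "b"], name="id"))

left.join(right)   # aligned on the shared index, no on= needed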
While I agree that the pipe can easily be abused, I quickly got the hang of it and found it makes a lot of my code more readable. One could think of it as a "then".
my_column %>%
gsub(" ","",.) %>% # Remove whitespace, then...
gsub("[a-zA-Z]+","",.) %>% # Remove letters, then...
strsplit("-") %>% # Split on dashes, then...
lapply(as.numeric) %>% # Make each vector in the list numeric, then...
lapply(mean) # Calculate the mean of each list element
The alternatives here would be gigantic function calls that need to be read from the inside out, or multiple variable assignments.
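For comparison, the same pipeline as one nested call:

lapply(
  lapply(
    strsplit(gsub("[a-zA-Z]+", "", gsub(" ", "", my_column)), "-"),
    as.numeric),
  mean)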
I don't know why, but looking at the traditional base-R functions really hurts my eyes. The inconsistent API (like in gsub, where the string comes last) together with the unclear naming (very Unixy when you think about it) made it very hard for me when getting started with R.
Nowadays I would do the same with the tidyverse equivalents:
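Perhaps something like this, using stringr and purrr:

library(stringr)
library(purrr)

my_column %>%
  str_replace_all(" ", "") %>%          # Remove whitespace, then...
  str_replace_all("[a-zA-Z]+", "") %>%  # Remove letters, then...
  str_split("-") %>%                    # Split on dashes, then...
  map(as.numeric) %>%                   # Make each vector numeric, then...
  map_dbl(mean)                         # Calculate the mean of each element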
Something that I do now and then that really screws up readability is when I want to join a dataframe with another but forget to manipulate the second one first. Then it turns into something like:
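Roughly like this (df2 and the cleanup steps are invented for illustration):

df1 %>%
  left_join(
    df2 %>%
      filter(!is.na(id)) %>%
      select(id, value),
    by = "id")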
"R makes it easy to create DSLs thanks to three features of the language:
* R code is first-class. That is, R code can be manipulated like any other object (see sym(), lang() and node() for creating such objects). We also call expressions these objects containing R code (see is_expr()).
* Scope is first-class. Scope is the lexical environment that associates values to symbols in expressions. Environments can be created (see env()) and manipulated as regular objects.
* Finally, functions can capture the expressions that were supplied as arguments instead of being passed the value of these expressions (see enquo() and enexpr()). "
dplyr is a testament to how these nonstandard features of R can be combined to produce a language for interactive data analysis whose convenience and clarity cannot be matched by traditional languages.
[1] http://rlang.tidyverse.org/articles/tidy-evaluation.html