Personally I've never liked pipes outside of scripting one-offs because I've found them difficult to read. This is because in languages like R, argument order can have meaning.
Even if the piped argument goes in a certain location by default, it still creates an exception to all the other arguments' syntax, which adds another layer of complexity.
I guess at heart Lisp syntax has always made the most intuitive sense to me personally. Algol/C does as well, but the more you mix them the more I tend to dislike it.
I guess pipes always feel like you're "proceduralizing" functional constructs, which makes things more murky (although maybe that reflects some deep equivalence between the two?)
This is just my personal preference, and in an R context; I could be convinced otherwise. Part of it, too, is that I feel like R is fragmenting into lots of de facto standards, and not in desirable ways. I increasingly find myself wanting to use other languages, even more so than in the past.
I agree. The goal of these efforts seems to be to make R less idiomatic, more "tidy".
They do this by introducing new idioms. Granted, they are idioms that, if every R user adopted them 100% and all older code were rewritten, would in fact make the language more tidy. But that will never happen. So it is ultimately counterproductive, IMO. If I want to write concise shell-style pipelines for data analysis, I could use awk. If I wanted readability, I could use Python. Don't try to cram all of those things into Posit/R.
According to the tidyverse style guide, magrittr's pipe-assignment operator %<>% (also mentioned in the blog post) should actually be avoided. I guess because it makes code harder to read.
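For anyone who hasn't seen it, the magrittr form is roughly:

library(magrittr)
x <- c(9, 1, 5)
x %<>% sort()   # same as: x <- x %>% sort()

so the reassignment of x happens as a side effect of the operator rather than as an explicit `<-`.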
It also makes testing the right-hand side of the pipe rather difficult.
a <- a |> mean()
This lets me test `a |> mean()` to check that it works before the result replaces `a`. Very useful when you have very long pipe commands. `<|>` wouldn't allow this.
It still annoys me that they had to invent another pipe instead of just recreating magrittr's in C to gain the speed it apparently lacks. And then they missed out on one of the nicer functions of magrittr…
I agree with the post that such an operator should exist, but I think it should simply be the magrittr one.
I use magrittr pipes pretty much exclusively because of the extra features relative to the native pipe, but I also respect the R devs' decision not to add those extra features to the core of the language. The magrittr pipe adds additional syntax beyond the pipe itself, namely the '.' variable to refer to the piped value (essentially it's just Perl's '$_'). It's incredibly useful, but it's not particularly intuitive the way the basic pipe is.
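To illustrate the difference (lm here is just a stand-in):

library(magrittr)
# magrittr: '.' can appear anywhere in the call
mtcars %>% lm(mpg ~ wt, data = .)
# native pipe: first-argument insertion only, or the '_' placeholder
# with a named argument (R >= 4.2)
mtcars |> lm(mpg ~ wt, data = _)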
In Common Lisp you put the print statement inline; it returns the result of the form you pass it, so it doesn't change the results. If you want to know what f is returning in the form (h (f (g x))), you'd write (h (print (f (g x)))) and it will print what f returns to the REPL.
Or use Stickers provided by Sly, which sort of log things as they change: https://joaotavora.github.io/sly/#Stickers In that case you'd highlight the form starting at f and turn a sticker on. Then run the code, and the sticker outputs into a window.
I always thought R was a vestigial artifact of the bad old days of computing that had no place in the Real World (tm), but then I had to use it for a graduate-level data analytics course for business students.
And then it all made sense. Non-coders equate R with RStudio to the point that they don't really get that R and its editor are separate things, and therein lies the power.
Everybody uses the same IDE and it always works. No more figuring out why the language server isn't offering completions or wondering why ESS has such awful defaults.
R (and the tight integration with RStudio) makes it very easy for non-programmer types to conduct their own sophisticated exploratory analysis and to produce good looking plots without mucking about in Excel.
I actually love that R uses '<-' for assignment. Assignment is an inherently asymmetric operation: it takes the RHS and assigns it to the LHS, and the arrow visually indicates this. It feels more correct to have a visually asymmetric operator for it.
When you do `f(x=5)`, x doesn't get set to 5 anywhere you can see (it happens inside the body of f, which you are treating as a black box). So logically, in the context of the code that's calling f, it's not an assignment operation.
Admittedly this kind of goes out the window with a lot of tidyverse non-standard eval stuff. E.g. if you do `df %>% mutate(x = 5)`, then x does get set to 5 in the data frame, so that is logically an assignment of x in a visible scope using '='. Going by my logic, functions like mutate should ideally use a different symbol for operations that alter the input like this, but we run up against syntactic limitations of the language.
I think the argument is that, to the uninitiated, x = x + 2 looks more like an invalid equality than "replace the value stored in x with its current value plus 2".
Personally, it seems to me that = for assignment is so ubiquitous that this argument has become stale.
I still use the arrow because it's the prevailing recommendation in R style guides.
I use R quite a lot for data extraction, manipulation, and visualisation; it is the most used tool in my area of academia. You are correct that RStudio seems to be the only IDE, but I'd say most users know that R is the language and RStudio the IDE.
That being said, many decent GUIs have been developed for a more traditional analysis feel but still use R under the hood; I like JASP the most out of the current options.
Doesn't 3..5 mean elements 3 and 4: the 3rd and the 4th? 0-based indexing might actually feel less intuitive here (element 3 is the 4th, and element 4 is the 5th). I don't know much R; I'm assuming that in the n..m notation, n is included and m is excluded (edit: m is also included, see replies).
I don't know about economics, but I think mathematicians use 0-indexed and 1-indexed sequences interchangeably, and they usually (have to) specify which is the first term (u0 or u1).
R's indexing is not only 1-based, it's inclusive on both ends. So indexing from 3 to 5 actually gets you elements 3, 4, and 5. I've programmed a bunch with both R and 0-based languages and I greatly prefer R's way of doing indexing.
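To make that concrete:

x <- c(10, 20, 30, 40, 50, 60)
x[3:5]   # 1-based and inclusive on both ends: 30 40 50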
I'm pretty sure it's like this in Pascal, too. Which makes sense, since it's the only way to reasonably index 1-based arrays. For example, you'd expect the following code to set arr to the set "1, 2, 3, 4, 5", not to leave the last element uninitialized.
var
  arr: array [1 .. 5] of integer;
  i: integer;
begin
  for i := 1 to 5 do
    arr[i] := i
end;
Normally you would write "for i := low(arr) to high(arr)". I just wrote the numbers "1 to 5" to be explicit here.
Oh, right! I parsed the last sentence in a weird way, I guess. Also, thanks rcthompson for the clarification about intervals being inclusive on both ends in R.
Because it has five assignment operators, a tendency to fail silently and keep going after something errors (producing bad results for people who aren't programmers and maybe aren't good at debugging), and it is only really usable if you install a third-party package (tidyverse).
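For anyone counting, the five are:

x <- 1    # leftward assignment
1 -> x    # rightward assignment
x = 1     # equals assignment
x <<- 1   # superassignment (walks up enclosing environments)
1 ->> x   # rightward superassignment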
I realise I'm in the minority, but I personally like using the right assignment operator (->) with pipes because it maintains the flow of the code when reading.
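Something like this (a toy example):

mtcars |>
  subset(cyl == 4) |>
  transform(kpl = mpg * 0.425) -> small_engines

The pipeline reads top to bottom, and the result lands at the end, where your eye already is.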
I like R a lot, but I am not a fan of this sort of operator. It seems idiosyncratic and also somewhat ugly. I think code should be readable above all else, and the examples with this operator do not look very nice.
Interesting. I am happy assigning the traditional way but don't mind the idea of an assignment pipe. Personally, I still use the magrittr pipe because I prefer the syntax (and love the keyboard shortcut) to the base pipe.
The evidence section kinda lost me (what is `wardmap@data` supposed to be? I'm not sure it's valid R code, and I can't spot it on Stack Overflow as the article suggests).
Small correction: under the bit about 'A simple syntax transformation converts .. into ..' the two should be the other way around.
An assignment pipe should be a relatively easy sell since it produces more concise code, and lets the developer work left to right, top to bottom (how we all want to work), rather than having to backtrack to place `df <- ` at the start.
I'd try to demo it on super common use cases (so the average R user can easily see how coding patterns are improved by it).
EDIT: came across a nice explanation of the S4 system [1] under the 'Overview' and 'Slots and accessor functions' sections. Seems it's commonly used in genetics and geospatial work.
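For what it's worth, the `@` in `wardmap@data` is S4 slot access; a minimal sketch (class and slot names invented here):

setClass("WardMap", slots = c(data = "data.frame"))
wardmap <- new("WardMap", data = data.frame(ward = 1:3, votes = c(120, 98, 143)))
wardmap@data   # '@' extracts an S4 slot, roughly what '$' does for lists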
I agree with you that this operator simplifies the specific (but not uncommon!) situation you describe. But I don't think it exposes a problem to be solved so much as an antipattern to be avoided. Consider the following example, in which mytransform is some function I want to apply to a part of the dataset:
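# apply mytransform to part of mydata, then write it back
mydata['score'] <- mydata['score'] |> mytransform()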
This can be simplified with your proposed operator:
mydata['score'] <|> mytransform()
An upgrade in elegance; I like it. But R is a language commonly used by switching between scripts and the REPL, with an IDE that by default captures your workspace so all your variables etc. are restored the next time you resume the session. In this environment I feel like variables should be used as constants as much as possible. Mutating a variable (as in my example) creates room for confusion: Does mydata contain the raw csv data or the transformed data? Did I already evaluate this line in the REPL or did I not? What happens if I evaluate this line twice? Many people I know tend to "jump around" in their scripts, not following the written order of operations. This creates potentially irreproducible environments.
My proposed solution is treating all variables (or at least as many as possible) like constants:
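# rough sketch: mydata keeps holding the raw data, the transformed version gets its own name
scored_data <- mydata
scored_data['score'] <- mydata['score'] |> mytransform()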
Now I always know what a variable contains, because each variable contains exactly the same value, independent of when I evaluate.
But nevertheless, I kinda went on a tangent here, and strictly speaking the problem I describe only arises from careless user behavior (which is quite prevalent in statistics, though). Aside from this kind of behavior, I think this operator is an elegant idea!
R has got great parts to it. It has the two best libraries for general data manipulation (dplyr and data.table), which blow equivalents in other languages away. It has the best plotting library (ggplot2). And the RMarkdown and Shiny ecosystems are incredibly powerful and useful tools.
R is used because it remains the single best language for data manipulation and reporting.
R scratches an itch no other language quite manages to reach, which is why it is used despite its peculiarities. Lots of languages exist[ed], more or less, for the same reason (php and javascript come to mind).
A visual approach with nodes and arrows gets around language syntax issues. However, it comes with its own challenges, e.g. descending into a spaghetti mess if you aren't careful.
I used magrittr's %<>% a lot back when I used R on a regular basis. Given that |> made it into the language, a pipe-assignment operator doesn't seem too outlandish.
I wouldn't be opposed to the addition of this new pipe. The author has a decent argument for it being helpful in some complex special cases. But I don't know if I would want to see this used extensively. IMHO the proliferation of new notation in a quest for ever-more concise code just leads to APL, which most people don't like. Shorter isn't always more readable.
The comparison in this example is misleading:
# before
names(data)[1:2] <- paste0(names(data)[1:2], "_suffix")
# after
names(data)[1:2] <|> paste0("_suffix")
Since we're talking about pipes, the first option should use pipes:
# before
names(data)[1:2] <- names(data)[1:2] |> paste0("_suffix")
# after
names(data)[1:2] <|> paste0("_suffix")
For pipe enthusiasts this is already pretty clear. The thing on the left and right side of the assignment is the same. Compressing this saves characters and avoiding repetition of the 1:2 part is nice, but I don't know if the cost in familiarity is worth it. In any case involving a data frame, I would prefer using the .cols parameter of the rename family of verbs to either of these base R approaches.
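I.e., something along the lines of:

library(dplyr)
data |> rename_with(\(nm) paste0(nm, "_suffix"), .cols = 1:2)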
Also, a linked post by this author [1] is materially outdated on the use of case_when and across (of course I sympathize - the tidyverse has moved very fast in the past few years). Thanks to the new backslash notation for anonymous functions and various other dplyr upgrades, it is very elegant to do assignment across multiple columns based on conditions evaluated against the entire data frame. Behold:
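# (illustrative sketch; column names are made up, needs dplyr >= 1.1 and R >= 4.1)
library(dplyr)
df |>
  mutate(across(
    c(score_a, score_b),
    \(x) case_when(
      group == "control" ~ x - baseline,   # condition and offset come from other columns
      .default = x
    )
  ))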
Over the past decade, one thing I have learned is to never, ever count Hadley Wickham and the tidyverse team out when it comes to optimizing their API. If there is a lack of expressiveness or orthogonality, they will fix it. The R community will take a while to absorb their ideas because there are so many idioms flying around (partly the fault of previous versions of the tidyverse), but the model they've landed on recently is incredible and should be a model for tabular data manipulation in any language.
The posited DRY-related benefit ultimately also comes back to conciseness, because (as the author does note) you can always avoid repeating yourself by assigning expressions to an intermediate variable instead of repeating expressions.
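E.g., reusing the names() example from upthread:

first_two <- names(data)[1:2]
names(data)[1:2] <- paste0(first_two, "_suffix")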
Not super related, but R also needs some way to handle lists in the pipe chain. Maybe something like `&>` to apply over all elements (and `1>`, `2>`,..., `N>`, for piping on individual elements), and a collector operator like `]>` (or `1,2-4,N]>` to collect subsets), so that one can do:
data <- data |>
  split_by(field) &>
  # mutate over all
  mutate(calc = ...) 1-5>
  # filter the first 5 elems
  filter(...) &]>
  # control back to parent
  collect() ]>
  # collect all elements and
  # implicit rbind them
  ...
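Today the closest I can get, base R only (R >= 4.1; column names are stand-ins), is something like:

data <- data |>
  split(~ field) |>                                    # list of data frames, one per level
  lapply(\(d) transform(d, calc = cumsum(value))) |>   # "mutate" over every element
  do.call(what = rbind)                                # collect + implicit rbind

which covers the "over all elements" part but not piping into subsets of the elements.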
I get it, but the syntax is surprising. This is more like += etc., so I would expect a notation like ()= (i.e. apply-equal) that is not related to piping, I guess.