This is really neat; I'm working with OCR output and hand-rolled something very similar to `ggpage_plot` about two weeks ago to recreate the original layouts with colored labels. Super helpful to get a sense of how to engineer some spatial features to feed into classification models. Having this around might have saved me some time!
Interesting project, and surely a great showcase of R's graphics capabilities. For somebody like me coming from python/matplotlib, which gives almost unlimited freedom in creating complex visualizations, should I focus on base R (the "graphics" package) or learn some ggplot2 when approaching R graphics?
At first it looked like base graphics is quite limited, but once I learned that I can draw rectangles and polygons I re-evaluated it considerably. OTOH I can't shake the feeling that all ggplot2 gives me is some canned styles and a very uncomfortable syntax, like "ggplot(...) + geom_bar(...) + theme(...)", where `+` means something I can't fully comprehend (because "The Grammar of Graphics").
Please help me change my mind if I'm being ill-informed; I do want to get the most I can out of R graphics. ggplot2 is hugely popular, so it must be doing something right.
For most everyday plotting tasks, freedom is overrated. What ggplot2 gives you is unparalleled productivity, combined with a 'grammar of graphics' API that makes it very pleasant to specify and experiment with visualizations. It's also easy to make ggplot2 output look publication-ready.
If you need something highly customized it can be quite a bit of work, but at least in my experience you almost never do.
The + should just be taken to mean "add this layer to your plot". It's building up a graph object, whose print() method happens to also be its plot() method.
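A minimal sketch (my own, not from the thread) of what that means in practice: the + calls just accumulate components on a plot object, and nothing is drawn until the object is printed; mtcars is used purely for illustration.

```r
library(ggplot2)

# Each + adds a component to the plot object; nothing is rendered yet.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal()

class(p)   # a "ggplot" object describing the plot, not a picture
p          # printing the object is what actually draws it
print(p)   # equivalent; explicit print() is needed inside loops and functions
```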
Base graphics are actually tremendously powerful and can give you essentially pixel-level control. However, the documentation is trash and many of the higher-level functions have hideous defaults.
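To illustrate the "pixel-level control" point (a hedged sketch of my own, not anything from the thread), base graphics will happily let you build a figure from bare primitives:

```r
# Build a figure from primitives: blank canvas, explicit coordinates, hand-placed shapes.
op <- par(mar = c(2, 2, 1, 1))                # direct control over margins
plot.new()
plot.window(xlim = c(0, 10), ylim = c(0, 10))
rect(1, 1, 4, 6, col = "grey80", border = NA)
polygon(c(5, 9, 7), c(2, 2, 8), col = "steelblue")
text(5, 9.5, "drawn by hand with base graphics")
axis(1); axis(2); box()
par(op)
```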
ggplot2 is great for fast iteration and exploratory data analysis, especially when "coloring by group" and "faceting" are involved.
The problem with base R graphics is that you spend all your time messing with margins and font sizes and the like. Not that there's no fiddling with ggplot2, but it's much better at not generating graphs with all your labels overlapping. Literally the only time I use base graphics is for heatmaps (all the major packages still seem to use base graphics for those, for some reason). ggplot2 may not be "intuitive", but it is definitely worth learning. And it isn't just an R thing anymore -- ggplot2-inspired graphics systems are being created for lots of other programming languages these days.
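To make the group-coloring and faceting point above concrete, here is a minimal sketch (my own example, using ggplot2's built-in mpg data): one aesthetic mapping handles the coloring, and one extra line splits the plot into facets.

```r
library(ggplot2)

# Color by group via an aesthetic mapping; facet with one extra layer.
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  facet_wrap(~ cyl)
```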
I find that a fascinating reaction given how rapidly %>% has been taken up across a large segment of the R universe, to great excitement! Personally, I find it far MORE legible than endlessly nested function calls.
It results in code that more closely resembles the executed order of operations (e.g. filter -> mutate -> group -> summarize). Context is also key: it's most often used for data-processing pipelines in specific analytical scripts or literate-code documents, and less so when defining generalizable/testable functions in packages (again, just a personal perspective - YMMV of course).
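A small sketch of that contrast (my own example on mtcars, not from the thread): the same filter -> mutate -> group -> summarize pipeline written nested and then piped.

```r
library(dplyr)

# Nested calls: read inside-out, last step first.
nested <- summarise(
  group_by(
    mutate(
      filter(mtcars, cyl > 4),
      kpl = mpg * 0.425
    ),
    gear
  ),
  mean_kpl = mean(kpl)
)

# Piped: reads top-to-bottom, in the order the steps actually run.
piped <- mtcars %>%
  filter(cyl > 4) %>%
  mutate(kpl = mpg * 0.425) %>%
  group_by(gear) %>%
  summarise(mean_kpl = mean(kpl))

identical(nested, piped)   # same result either way; only the readability differs
```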
You nailed it. dplyr is better the further you are from doing heavy-duty data analysis or creating production code. If you're writing some simple transforms to put data into a report, fine: someone is probably going to want to look at that at some point, and it's much, much easier to understand. But for anything else I stick with data.table.
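For a concrete (and hedged, mtcars-only) sense of the tradeoff being described, here is the same summary in both styles:

```r
library(dplyr)
library(data.table)

# dplyr: more verbose, reads like a sentence
mtcars %>%
  filter(cyl > 4) %>%
  group_by(gear) %>%
  summarise(mean_mpg = mean(mpg))

# data.table: terser, one bracket expression of the form dt[i, j, by]
dt <- as.data.table(mtcars)
dt[cyl > 4, .(mean_mpg = mean(mpg)), by = gear]
```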
Because combining an argument with a function call doesn't make sense. They have to do some voodoo under the hood to make it work and this reduces code understandability. The analogy with Unix pipes doesn't work either: one is passing an argument to a function while the other involves writing & reading a file. Finally, it's plain ugly and un-Lispy.
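For what it's worth, the "voodoo" largely amounts to magrittr rewriting the call so that the left-hand side becomes the first argument of the right-hand side. A rough sketch of my own (simplified; magrittr's actual evaluation rules have more cases):

```r
library(magrittr)

x <- c(1, 4, 9, NA)

# These two expressions are (roughly) equivalent: the pipe inserts the
# left-hand side as the first argument of the call on the right.
sqrt(mean(x, na.rm = TRUE))
x %>% mean(na.rm = TRUE) %>% sqrt()

# A bare dot places the left-hand side somewhere other than the first argument.
x %>% ifelse(is.na(.), 0, .)   # equivalent to ifelse(is.na(x), 0, x)
```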
The problem with data.table is that in practice your data gets converted to something else when you pass it through other packages -- many functions will return a data.frame or matrix, others in the Hadleyverse will return a tibble, and so on. So you have to constantly coerce your data back into a data.table. R has so many data types that basically represent a spreadsheet/database table.
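A small sketch of the round-tripping being described (my own example; aggregate() stands in for any function that hands back a plain data.frame):

```r
library(data.table)

dt <- as.data.table(mtcars)

# Many functions return a data.frame (or tibble), not a data.table...
res <- aggregate(mpg ~ cyl, data = dt, FUN = mean)
class(res)     # "data.frame"

# ...so you keep coercing back: as.data.table() copies, setDT() converts in place.
setDT(res)
class(res)     # "data.table" "data.frame"
```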
There also isn't any reason why other implementations of pipes have to do buffered byte reads/writes; passing objects is perfectly acceptable.
The structurally distinguishing aspect of a pipe-and-filter style is that the individual processing elements don't "return" to their "caller", but rather pass their result on to the next processing element, without involving the caller.
Maybe not a direct answer to your question, but using dynamic variable names is kind of tedious with dplyr. You have to work around it with paste() calls before passing the argument to a dplyr function, so it's not always elegant either.
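Roughly what that workaround looks like (an assumption on my part -- the exact incantation depends on the dplyr version; this uses the rlang-style injection available in dplyr >= 0.7, with mtcars just for illustration):

```r
library(dplyr)
library(rlang)

prefix  <- "mpg"
new_col <- paste0(prefix, "_centered")   # build the column name as a string first

# Injecting a string back into dplyr's non-standard evaluation takes extra ceremony
# (sym() + !! + := here; older dplyr relied on the now-deprecated *_() verbs instead).
mtcars %>%
  mutate(!!sym(new_col) := mpg - mean(mpg)) %>%
  select(mpg, all_of(new_col)) %>%
  head()
```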
I like it, but it makes it hard to take R seriously as a programming language (thankfully it's not in standard R), because where else would you actually use a coding pattern like:
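(The snippet that originally followed didn't survive here. Purely as a hypothetical reconstruction -- my guess, based on the reply below about using a different variable per transformation -- the pattern being objected to is presumably chaining every step with %>% instead of naming intermediates, something like the following; the step names are mine.)

```r
library(dplyr)

# Hypothetical reconstruction, not the original snippet:
# every transformation chained through %>%, with no intermediate names.
result <- mtcars %>%
  filter(cyl > 4) %>%
  mutate(kpl = mpg * 0.425) %>%
  summarise(mean_kpl = mean(kpl))

# The alternative the reply below refers to: a fresh variable per step.
step1  <- filter(mtcars, cyl > 4)
step2  <- mutate(step1, kpl = mpg * 0.425)
result <- summarise(step2, mean_kpl = mean(kpl))
```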
The alternative is to use a different variable for each data transformation, which has costs for both system memory and code readability. And modern data analysis has a lot of transformations.