Lets-plot: An interactive Python plotting library using ggplot's API (github.com/jetbrains)
96 points by oroszgy on Sept 9, 2020 | 59 comments


I've been using R professionally for 4 years. I've been a Python programmer for 6 years. I've shipped large scale applications built on both languages to production.

Findings:

I honestly don't get the hype around R's plotting capabilities. ggplot2 is nice, but far too magical to be easily understood by beginners.

Python offers more intuitive memory management than R (which is really more a statement on how bad R is at memory efficiency).

R blows Python out of the water when it comes to expressing concise and readable linear algebra/stats computation. Done right, base R looks like mathematical pseudo-code.

R's non-standard evaluation is an under-appreciated killer feature. I've seen complex and powerful DSLs implemented in base R. Python has some support for this kind of thing, but nowhere near the level of R.

I tried the Tidyverse and found it to be too magical and unstable. In general, the R package ecosystem is weird. Look at Stack Overflow and you'll find people touting all sorts of different packages to do simple operations. Needing to download a million random packages from grad students of the internet is a recipe for disaster.


But I want ggplot to do magic. Rendering things is extremely tedious and boring. I don't want to spend hours on Stack Overflow searching for the special obscure incantations that will shift the legend and margins this way because matplotlib did an ugly job, while constantly adapting boilerplate back and forth between the declarative and object-oriented API. I want it to just do what I mean!

Is there any coincidence that there are literally tens of different (and concurrent) rendering libraries for Python while the R world more or less settled for ggplot2? Writing matplotlib is such a pain in the ass that people go to great lengths to actually avoid using matplotlib (while still not having the expressivity, features and ease of use of ggplot2).

Magic is bad when you're shipping production code that's shared among multiple people who have to then spend a lot of time assimilating the mental model of the magic. It's perfect when you just want to plot stuff and draw nice figures.


Base R plot() can also be pretty powerful[1] and look great.[2] As an alternative to ggplot2 I would also propose that Vega Lite[3] could be a contender, with an excellent cross language ecosystem.

[1] http://karolis.koncevicius.lt/posts/r_base_plotting_without_...

[2] https://github.com/KKPMW/basetheme

[3] https://vegawidget.github.io/vlbuildr/


Same and I have to disagree with basically everything you’ve said, save the DSL bit and the R package bit.

In particular, I don’t see how the tidyverse is at all ‘magical’ - take the two most popular tidy libs, dplyr and ggplot2. The API for dplyr is very explicit, intuitive, and based on long-standing precedent (SQL). The API for ggplot2 is admittedly less intuitive, but is itself an implementation of a widely known framework (Grammar of Graphics). If by magical you mean in their abstractions, those are about as far from magical as one can get.

The tidyverse does get iffy sometimes when you need to dig into the rlang/tidy eval area, but that’s all well-documented.

I haven’t had to use R in a production environment for about a year now, but I always enjoyed the concise tidy api.

Base R is a total mess and largely inconsistent, but I guess that’s what you get when statisticians from different unis patchwork a language in their spare time.


>The tidyverse does get iffy sometimes when you need to dig into the rlang/tidy eval area, but that’s all well-documented.

Non-standard evaluation, in my opinion, does fine in the top layer (analysis scripts, DSLs). But it's a pain to build on top of magic without a more concrete layer between. Base R uses character vectors and name attributes to hide the magic. The Tidyverse uses symbols. I find character vectors much easier to understand and manipulate.

Formulas are the only scenarios in base R where symbol and expression manipulation is sometimes necessary. And I don't do that thinking, "Boy, I wish this was how I did 95% of my work."

Personally, I use data.table. It's been great even for packages. If it's an internal package dataset, I don't need to manipulate symbols. And if it's a user-supplied argument, there's usually a simple and fast way to do it with a character vector. Or I define an S3 class, give a helper function that produces a dataset with known names, and just write functions around that. The user can handle combining data however they want.

>Base R is a total mess and largely inconsistent

I wouldn't go so far as to say "total mess." Not even an annoyingly big mess. The few functions that make up most of analysis or package code are dead simple. Inconsistent argument names are hardly a problem with an IDE.


The dplyr team is genuinely working hard to make the tabular data manipulation tool. IIRC versions 0.5, 0.7, and 1.0 all brought major enhancements. The basic structure has remained consistent but the advanced API keeps getting refactored, usually for the better I think.

(I know some people swear by data.table, but I find it gets ugly fast if you want to do anything more than simple joins or split-apply-combine)


Data.table's join syntax is literal insanity to me. That said I used dtplyr to generate data.table code recently and it worked well, except for one edge case.


> Python offers more intuitive memory management than R (which is really more a statement on how bad R is at memory efficiency).

I haven't found that to be the case. Loading a 2 GB CSV file when you have 8 GB of RAM is touch and go with pandas, but with data.table it’s a breeze, not to mention that operations once loaded are a fair bit faster. The Python version has recently been released and already provides a serious speed bump over pandas, not to mention out-of-the-box memory-mapping for huge files. I recommend checking it out if you haven't heard of it: https://h2oai.github.io/db-benchmark/


The Tidyverse is under development and it can be tough to keep up with the changes or know where to start (e.g., plyr vs dplyr, melt vs pivot_longer). However, ggplot2 is based on a pretty solid theory of graphics (Bertin's visual variables and Wilkinson's grammar of graphics). These can take some time to get into but it's not magic once you get the idea that you're mapping data variables to visual variables.
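
To make the "mapping data variables to visual variables" idea concrete, here is a minimal pure-Python sketch. It is not how ggplot2 or Lets-Plot are implemented; the names `scale_linear`, `scale_discrete`, `aes` and the sample data are all made up for illustration. The point is only that a plot spec is a dictionary from visual channels to column names, plus one scale per channel.

```python
def scale_linear(values, lo, hi):
    """Map numeric values onto the output range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    span = (vmax - vmin) or 1  # avoid dividing by zero for a constant column
    return [lo + (v - vmin) / span * (hi - lo) for v in values]


def scale_discrete(values, palette):
    """Assign each distinct level a palette entry, cycling if needed."""
    levels = sorted(set(values))
    lookup = {level: palette[i % len(palette)] for i, level in enumerate(levels)}
    return [lookup[v] for v in values]


# Hypothetical data set: three observations, three variables.
data = {
    "weight": [1.0, 2.0, 3.0],
    "height": [10.0, 30.0, 20.0],
    "species": ["a", "b", "a"],
}

# The "aesthetic mapping": visual variable -> data variable.
aes = {"x": "weight", "y": "height", "colour": "species"}

# Apply one scale per visual variable to get drawable marks.
marks = list(zip(
    scale_linear(data[aes["x"]], 0, 100),   # x position in a 100-unit-wide panel
    scale_linear(data[aes["y"]], 0, 50),    # y position in a 50-unit-tall panel
    scale_discrete(data[aes["colour"]], ["red", "blue"]),
))

for mark in marks:
    print(mark)
```

Swapping `"colour": "species"` for another column changes the picture without touching the drawing code, which is roughly the separation the grammar of graphics is after.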


> Needing to download a million random packages from grad students of the internet is a recipe for disaster.

This is baloney. You don't need to rely on a million R packages. You really only need a good plotting library (the built-in one is fine, or use ggplot), a good data frame library (dplyr, data.table, or use the built-in one), and a few extras to handle some weird rough edges (lubridate, forcats).

This gets you 90% of the way to your analysis goals.

Only novices are 'using a million random packages', which is probably the case for python as well.


I also believe this is now being addressed by the pharmar[1] group with their riskmetric[2] package.

[1] https://www.pharmar.org/

[2] https://github.com/pharmaR/riskmetric


Indeed. `library(tidyverse)` and I’m set for like 70-80% of use cases for data manipulation, exploratory data analysis, and statistics.

Though to be complete, any ML project means Python and scikit-learn.


> R blows Python out of the water when it comes to expressing concise and readable linear algebra/stats computation. Done right, base R looks like mathematical pseudo-code.

Are you including NumPy as part of Python in this statement? I've found NumPy to be as mathematically expressive as R, although I'm not an R power user.


How about Julia? I’m watching that MIT class with Grant Sanderson that uses Julia. It seems like a reasonable language, and it’s fast.

Of course, I reserve the right to change my opinion after I build a few things. Dynamic typing... hmmm...


Genuinely curious: what do you mean with respect to "R's non-standard evaluation" and its relationship to DSLs? Thanks in advance.


R has some very lispy meta-programming features. Very briefly, if I'm a function (go with me here) `foo <- function(x,y){whatever}`, and someone calls me, `foo(bar,baz)`, I get to know that `baz` is "bound" to `y`. The caller doesn't even have to have a `baz` in scope; substituting `y` for `baz` is done lazily, so if the function never asks for the actual value of `y`, there's no problem. You can read more about this here: http://adv-r.had.co.nz/Computing-on-the-language.html
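
For Python readers, a rough analogue of the lazy part (not the full R mechanism): Python evaluates arguments eagerly, so the usual workaround is to pass a zero-argument callable and only call it when the value is needed. The function names below are hypothetical, and unlike R the callee here cannot see the caller's expression text, only a callable.

```python
def first_or_default(x, y):
    """Return x() unless it is None; y is a thunk that is forced only if needed."""
    value = x()
    return value if value is not None else y()


def explode():
    # Stands in for an argument whose evaluation would fail,
    # like referring to a `baz` that is not in scope.
    raise NameError("baz is not defined")


# The second argument is never forced, so the failure never happens.
safe = first_or_default(lambda: 42, explode)
print(safe)
```

If the first thunk returned `None`, the second one would be forced and the `NameError` would surface at that point, which mirrors what happens in R once a function actually touches the lazily bound argument.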


Great


I wrote a summary that goes into some detail:

http://blog.moertel.com/posts/2006-01-20-wondrous-oddities-r...


Fantastic


R gives you the ability to capture un-evaluated expressions, manipulate them, and then evaluate the manipulated versions. This ability lets you construct some fairly powerful DSLs and language extensions.
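
As a point of comparison, here is roughly what capture, manipulate, evaluate looks like in Python using the standard-library `ast` module. This is a sketch of the general idea, not a claim that it matches R's ergonomics: in Python the expression has to arrive as source text or a prebuilt tree, whereas R can capture it straight from the call. The variable names and the `RenameName` class are made up for the example; `ast.unparse` needs Python 3.9 or later.

```python
import ast

# Capture: parse source text into an expression tree without evaluating it.
tree = ast.parse("price * qty", mode="eval")


class RenameName(ast.NodeTransformer):
    """Swap one variable name for another inside a captured expression."""

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def visit_Name(self, node):
        if node.id == self.old:
            return ast.copy_location(ast.Name(id=self.new, ctx=node.ctx), node)
        return node


# Manipulate: rewrite `qty` to `units` in the tree.
rewritten = ast.fix_missing_locations(RenameName("qty", "units").visit(tree))

# Evaluate: compile the rewritten tree and run it against chosen bindings.
env = {"price": 2.5, "units": 4}
result = eval(compile(rewritten, "<captured>", "eval"), {}, env)

print(ast.unparse(rewritten), "=", result)
```

Note that `qty` never needs to exist anywhere; only the names left in the tree after rewriting have to be bound when it is finally evaluated.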


Thanks for the succinct explanation


Do you have any insights on Julia?


I am not the person you are asking, but having used Octave, MATLAB, R, and Python for many years and Julia for about a month, I think that for numerical/algebraic stuff, Julia is the way to go; the only thing holding me back is that communities using Octave, MATLAB, and R need to be convinced and their codebases need to be ported.


I've been a Python lover for decades. I dislike a lot of characteristics of the R language, but the ggplot2 package is superior to anything in Python visualization space. It is really excellent.


Wickham was really on to something with ggplot and its graphical grammar concept for plots. That and Tidyverse in general, I believe, has saved R from irrelevance.

I wish more projects would use that way of thinking, but it seems that in the Jupyter/Julia/Python world there are too many choices and they all attempt a "kitchen-sink" approach for visualization.


Honestly, R will be fine, even apart from tidyverse.

There's a huge shadow universe of scientists running particular R packages for these analyses, which never gets seen from HN/programmers.

While Python is definitely (sadly, IMO) a better language for building ML systems (because more people know it, and it's harder to write impenetrable code), R is definitely a better DSL for statistical analysis, modelling and graphics.

It's a shame that R takes its lineage from a language developed in the 70's (S), as that's a cause of many of the inconsistencies in the core language.


I agree R would have been "fine" without Tidyverse-- in the sense that it would keep a persistent survivable niche base.

I suspect, however, that it would not be anywhere near the top 10 of the TIOBE index like it is now, and that its userbase would consist mostly of statistics practitioners rather than huge swathes of people that need to perform any basic data analysis.


> Tidyverse in general, I believe, has saved R from irrelevance.

It’s interesting that you say this. I’ve been using R daily for almost 3 years now, and I was originally taught the Tidyverse.

However, I also tutored in biostats and have collaborated with many different faculty and students.

On one hand the Tidyverse created an opening for R learners especially, but it has led to some controversy as well.

Trust me when I say there are plenty of R users who have never heard of or used the Tidyverse.

It leads to some difficulty in teaching/collaborating because do I use grepl or str_detect? lapply or map? Do they know the magrittr syntax (%>%)?

Then there’s non-standard evaluation, which often forces some arbitrary meta-programming on novice programmers.

R is foremost the stats language, and tidyverse has attempted to make it more general purpose, for better or worse depending on who you ask.


Tidyverse is much, much harder to write functions for, which is pretty bad for beginners. The API is lovely, but the non-standard evaluation causes real problems for people.

True story: writing tidyverse functions is so hard that I once worked for a company with business critical code running in incredibly long R-scripts with almost no functions, and 100 line pipes.

While that may be an extreme example, base-R is much, much easier to get started writing functions for (like I've been using R for a decade now, and the "idiomatic" way to use NSE with ggplot and dplyr has changed multiple times over that time).

Meanwhile, base-R is fugly, but it's solid and backwards-compatible to a fault.


As a huge fan of tidyverse, I have to agree. Tidy eval is anything but.

However, once they get that right, then tidyverse is close to perfection

I think the newish “interpolation” syntax with double braces {{ }} is getting there.


See, it's changed again!

Damnit Hadley, why must you do this to me?

(Really I'm bitter because I know that I'll end up maintaining a bunch of important code in version N-2 of Hadley's NSE adventures at some point in the future).


Changed, yes, and finally in the right direction I would say!


... and the following is what I mean. I think this is as close to clean and understandable as it can get, for most use cases (variables in functions, including creating new variables):

https://www.tidyverse.org/blog/2020/02/glue-strings-and-tidy...


The biggest difficulty of NSE is not the syntax, but the requirement to hold parallel mental models while writing the code: symbolic expressions and value manipulation. Juggling will always be difficult.


R has always been relevant and isn't going away any time soon. The language is ugly and flawed and has a bunch of quirks and gotchas but there are just too many libs.

Also for a lot of people R is going to be the first and often only language they're going to learn because they have no use for programming outside of statistical analyses and it's damn practical for that single use. Right now generations of students are being taught in R, right now Bioconductor is booming and being added to every day. That stupid arrow assignment operator has plenty of good years ahead of it.


I was honestly blown out of the water discovering ggplot2 after years of matplotlib, and a bit infuriated that I didn't make the effort to get into R sooner.


This thing can become really big, if this is a JetBrains supported lib, instead of just an employee side project


Depends, it does not do interactive 3D-plots. For that, Plotly is the only performant lib I know.


Have a go with the Veusz package (I'm lead author). It does interactive 3D plots, has a GUI and Python interface. I think the 3D speed is reasonable (providing the BSP clipping isn't switched on).

http://veusz.github.io/


3D plots are usually bad :-) Your vision system can't compare volume very well.


Do try Altair. Here's a post from someone who comes from ggplot2 to Altair - http://fernandoi.cl/blog/posts/altair/

here's the path to Altair scrapping 20k lines of code to make the api simpler - https://twitter.com/jakevdp/status/1006929128119926786

Altair is really good. Probably as good as ggplot2


I love Altair but the practice of encoding data alongside the plots makes it unwieldy for sharing jupyter notebooks (by default a 5000 row limit). Github also fails to preview Altair plots in .ipynb right now



From a cursory glance at both projects, it looks like plotnine doesn't do interactivity, for a start?


I wonder why they decided to start from scratch in Kotlin rather than just add the interactivity feature to plotnine. Now we have two competing projects...


I'm a big fan of Plotnine and recommend it to anyone who will listen.

My (wildly speculative) guess for the motivations behind this library would be that Kotlin, as a base, is usable in the other JetBrains IDEs, e.g. reusing the same library and grammar when developing in Java or Ruby.


I'm pretty sure that there have been multiple attempts to recreate ggplot in python.

I can think of 3-4 off the top of my head.


AFAIK plotnine is the only one with substantial adoption (2.3k stars on github) and active development for several years.

Would be curious to hear which ones you're thinking of (since I wouldn't be surprised if there are others, but also am guessing no others hit both points)


Honestly, I've definitely seen at least 3-4 attempts hit HN over the past nine years or so.

Plotnine is the only one another human has told me about, though.


Altair does Grammar of Graphics and interactivity, and has ~6k stars.


Is this an official JetBrains project? Will it be supported?


Yes. You can see it in the url: https://github.com/JetBrains/lets-plot/


But does it mean that it is a corporate project? Something that won't die when its leader gets a better job?


Looks interesting but it feels like this space is a bit crowded with all of the plotting libraries out there for python.


And yet none has filled the “enjoyable to use” space ;)


Ya true. I usually end up jumping around depending on use case. Matplotlib, seaborn, plotly to name a few.


I don’t think ggplot2 has an API


Read beyond the headline.

> The Lets-Plot for Python library includes a native backend and a Python API, which was mostly based on the ggplot2 package well-known to data scientists who use R.

> R ggplot2 has extensive documentation and a multitude of examples and therefore is an excellent resource for those who want to learn the grammar of graphics.

> Note that the Python API being very similar yet is different in detail from R. Although we have not implemented the entire ggplot2 API in our Python package, we have added a few new features to our Python API.


Every library has an api, unless you're using some protocol-only definition of api.



