R: Introduction to Data Science (2019) (dfci.harvard.edu)
187 points by tosh 8 months ago | 133 comments



I'm an old R user, now fully migrated to Python.

For those of you who still use R, what is your use case?

We found R has a really hard time integrating into data pipelines and was best used as a standalone tool by individuals, which doesn't really work in our particular professional setup where everyone works collaboratively.

What we found was that R had a lot of packages, but most haven't been touched in years, and when you contact the owner you find they've often moved on to the Python/pandas/scikit ecosystem.


About half our team can wrangle and plot as fast as we can think of ideas. It creates an incredibly tight cycle time between us having ideas and getting answers; sometimes many (e.g. 10-20+) of those cycles in a single meeting. Before we used R, it would require someone jotting down things to investigate and reporting back in the next meeting. But we can do ~80% of whatever people can think of on the spot (more involved research questions can take more time).

The unique qualities of R that allow this are that it's so easy to use, extremely reliable for package installation (problems occur approximately never), and the tidyverse makes it incredibly easy to translate ideas into code, not only in its broad, easy-to-understand and powerful vocabulary, but in there being little 'nesting' required; instead you work left to right and top to bottom (via the magrittr pipe) - i.e. your code, for the most part, reads like a page in a book.
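To make the "left to right, top to bottom" point concrete, here is a minimal sketch of the kind of piped dplyr chain being described (the sales data frame and its columns are invented for illustration):

  library(dplyr)

  sales %>%
    filter(amount > 0) %>%                  # drop refunds and corrections
    group_by(region) %>%                    # one group per region
    summarise(total  = sum(amount),
              n_reps = n_distinct(rep_id)) %>%
    arrange(desc(total))                    # largest regions first

Each verb reads in the order you would say it out loud, which is a big part of why the idea-to-code cycle stays so short.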


For exploring data, plotting data, and fitting complex models like GLMs and GAMs, using R is essentially as fast as thought at this point.

So if you have tabular data, it’s a no brainer to use R.

Getting the data into table form is often better suited for python. Fitting models that leverage autograd is also better with python.


> Getting the data into table form is often better suited for python. Fitting models that leverage autograd is also better with python.

Totally agree. If it's in a dataframe, I prefer R. If I get involved pre-dataframe, I prefer Python.


If you tell me what makes R hard to integrate into data pipelines I will do my best to fix it :)


Wow, I really appreciate the reply. As I said in another comment here, I wish tidyverse was big when I was using R.

I was an R user from about 2003-2010.

We didn't have dplyr at the time, though ggplot2 was coming around about then, I think. That helped a lot for easy-to-develop visualizations.

But in our specific cases, the distributed libraries we used were written in python and integrated well with native python code. Pandas was just coming out around 2010, I think, and I think multi threading was also an issue then, but I can't really remember.

So our issue was partially that our infrastructure tooling was moving to Python, but also that we had a far easier time hiring people who were proficient in Python; it was harder to find the same for R.

And once you start writing more code in Python, it becomes harder to justify two separate code bases that do the same thing, so the R code got phased out and rewritten in Python so we could have a single code base and not duplicate functionality in two languages.

Also, a slight push for Python came from the programmers who thought it was a better language to know for their careers. Looking back, it does seem like Python is used more often these days in general.

So I guess there isn't much you could have done in this case.

And as a side note, thanks for all the work you've done with R!!


A few of the main issues I see, as an R user who built his company on Python:

- when we wanted to build a web app that processes data, it was a lot more straightforward to build both in python, so we can process data within the web servers instead of having to manage multiple stages of infrastructure and different languages. There's no Django for R.

- R will often do something instead of explicitly failing. This is the wrong tradeoff when running a production system: if you're returning the wrong results to users, you may not realize it unless there's an error.

- R reproducible builds are worse than Python's. That's saying something, because Python is a pretty low bar. But when running production systems you can't have builds suddenly fail week after week because one of a hundred packages was updated.


> one of a hundred packages was updated

There's renv that addresses that point already: https://rstudio.github.io/renv/articles/renv.html
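For anyone who hasn't tried it, the core renv workflow is just a few calls (a minimal sketch; see the linked docs for the details):

  renv::init()      # create a project-local library and an renv.lock file
  # ...install and use packages as normal...
  renv::snapshot()  # record the exact package versions into renv.lock
  renv::restore()   # on another machine or CI: reinstall exactly those versions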

> There's no Django for R.

Nowadays you can integrate R with WebR (WASM) in a web app: https://docs.r-wasm.org/webr/latest/


A lighter-weight alternative to renv is to use Posit Public Package Manager (https://packagemanager.posit.co/) with a pinned date. That doesn't help if you're installing packages from a mix of places, but if you're only using CRAN packages it lets you get everything as of a fixed date.
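Roughly, that looks like the following (the exact snapshot URL shape here is from memory, so double-check it against the Package Manager docs):

  # point CRAN installs at a frozen snapshot date
  options(repos = c(CRAN = "https://packagemanager.posit.co/cran/2024-01-02"))
  install.packages("dplyr")  # installs whatever version was current on that date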

And of course on the web side you have shiny (https://shiny.posit.co), which now also comes in a python flavour.


shiny is nice for one-off data dashboards and single-purpose mini-apps. I see the python equivalents are like dash/plotly. Shiny is not a full fledged web framework, and isn't a viable replacement for e.g. Django.

Aside -- we tried using dash in our production app and then had to remove it after a month, because these types of frameworks that spit out front-end code are almost never flexible enough to do what you actually need to do in a full app context, and you end up doing more work to fight the framework versus the time-savings from the initial prototype.


I'd highly encourage you to look into shiny more. No, it's not django, but it's a much richer framework than dash, and you can always bring your own HTML if what it generates for you isn't sufficient.


I'm not arguing that dash is better than shiny -- I think shiny is probably better!

But the fact that there's no Django for R means shiny's a dead-end for a production web app.


> there's no django for R

ambiorix might be what you're looking for.

check it out: https://ambiorix.dev/

it provides:

- routing
- API generation
- templating
- web sockets


> R will often do something instead of explicitly failing.

I mentioned exception handling above, but this is more specifically the problem.

I think it's a hard problem to solve, because the behaviour of older libraries is so varied.

I have sometimes thought that something like a try catch wrapper which pattern matched or tested the value returned would be useful.


I have noodled on this problem a bit in https://github.com/hadley/strict, which I'm contemplating bringing back to life over the coming year. It's certainly very difficult to cover 100% of all possible problems, but I suspect we can get good coverage of the most common failure points (specifically around recycling and coercion) with a decent amount of work.


OK, since you're here!

(this all prefaced with a massive thank you for tidyverse, without which R is very crusty).

I love R for interactive work and quick analyses, but I'm currently trying to integrate various bits of R code into a large document-building pipeline and wishing I could use Python for it:

- Exception handling and error processing seem a pain in R. Maybe I'm doing it wrong, but it feels like a mess and not nearly as ergonomic as Python. tryCatch seems to have gotchas related to scope because the error handling is in a function. The distinction between warning, stop etc seems odd. The option to stop on warnings isn't useful because older packages seem to abuse warnings as messages. I have just discovered `safely` which is helpful, but then you have to unwrap lists in pipelines which feels clunky.

- Related, I _really_ wish we could just drop model objects or other tibbles as single objects directly into a tibble cell rather than as list(df). Unpacking lists and checking objects inside them exist is much more of a pain (e.g. can't just do `filter(!is.na(df_col))`)

- I really miss defaultdict from python, and dictionaries generally.

- Passing variable names as strings to dynamically generate things seems clunky compared with Python. Again, it may be because I'm doing it wrong, but I end up having to wrap things in !!sym the whole time and the NSE semantics seem hard to remember (I only use R about 20% of the time). I liked cur_data() for passing a df row to a function but this now seems deprecated.

- String formatting -- fstrings are just great. Glue is OK, but escaping special characters seems more tricksy. Jinjar is OK, not quite jinja.

- purrr is nice, but furrr just isn't a drop-in replacement. Making http requests in parallel seems non-trivial compared to doing it with python. Is there an easy way to do it without creating multiple processes? Why can't I just do something like `. %>% mutate_parallel(response=GET(url), workers=10) %>% ...`?


Amen to that. Can I add the following:

- 5 different ways to do wide to long and long to wide over the years, even in the tidyverse.

- A lot of dependencies to connect to DBs and difficult programs. RStudio/Posit does have some premium libraries, but they should be made free and bundled with the tidyverse to really promote the ecosystem.

- Shiny support to save interactive charts and tables. This is a massive problem for me. If I have a heavily stylized HTML table with a bunch of CSS, I need to rely on webshot or webshot2, which are both alpha or beta versions and poorly documented. How can I evangelize R if my deployments cannot be used properly by my community?


What are the premium packages you're talking about? As far as I know all of our R packages are 100% open source.

I'd love to hear more about why you're using webshot etc. to take screenshots of your shiny app. A more typical workflow would be to generate a separate HTML/PDF with quarto/RMarkdown.


Thanks for responding and your amazing work with the tidyverse. I am the "R-guy" in my finservices company and we have a paid rconnect dev/qa/prod and rserver pro licences for a few hundred users.

The packages I think are the dependencies of some DB connectivity libraries. https://www.rstudio.com/tags/databases/ - these are the ones I was referring to.

Re webshot my use case is: I have a heavily modified DT table in a shiny app. Users log in, play around with the DT table, update ggplots etc and then download the snapshot and send it to a WORD file. I can't move away from word and use html or pdf because we need the word file formatted by editors for publication and they need to follow the corpo guidelines. So, I am having to use webshot to grab a screenshot of the tagged html instead of natively handling it. I tried using officedown and a few other methods and it just didn't work.

ps: I hope the rebrand goes great and I am rooting for you.


Oh, you mean the pro drivers? Unfortunately we can't give those away because we have to pay several $100k a year just to get access for our customers. Most of the pro drivers do have equivalent open source versions that you should be able to use instead.

Hmmm, I'd still try generating the table with quarto (since you can output word documents), or try gt (https://gt.rstudio.com), which I know has much greater control over output, and supports RTF output (https://gt.rstudio.com/reference/as_rtf.html) which should import cleanly into word.


PDF in knitr is tied to TeX. Webshot and other capture is better because CSS styles work without translation to TeX.


> The distinction between warning, stop etc seems odd. The option to stop on warnings isn't useful because older packages seem to abuse warnings as messages.

Use suppressWarnings() to silence misbehaving functions or withCallingHandlers() to stop or handle specific conditions.
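A small sketch of both approaches, with legacy_fn() standing in for a hypothetical chatty older function:

  # blanket approach: silence every warning the call raises
  res <- suppressWarnings(legacy_fn(x))

  # targeted approach: promote matching warnings to errors, let the rest through
  res <- withCallingHandlers(
    legacy_fn(x),
    warning = function(w) {
      if (grepl("coercion", conditionMessage(w))) stop(w)
    }
  )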

> Passing variable names as strings to dynamically generate things seems clunky compared with python.

Can you give me an elegant example in Python? Because I don't understand what you want to generate dynamically.

That said, I dislike the tidyverse solution as well. Too much abstraction for not enough benefit over a base solution with substitute()


For the most common cases, the tidyverse now only requires {{ }}. This allows you to tell tidyeval functions that you have the name of a df-var stored in an env-var. Do you have specific cases that you find frustrating?
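For anyone following along, a minimal example of the {{ }} ("embrace") pattern with a made-up wrapper function:

  library(dplyr)

  group_mean <- function(data, group_var, value_var) {
    data %>%
      group_by({{ group_var }}) %>%
      summarise(mean_value = mean({{ value_var }}, na.rm = TRUE))
  }

  group_mean(mtcars, cyl, mpg)  # no strings or !!sym() needed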


(1) The big problem I have is transitioning from RStudio to a pipeline (so I end up not using RStudio). A traditional pipeline is going to be a script with some set of arguments -- parameter values, fitting functions, and data file names, that I put into a shell script and say:

my_plot_script.R --plot_col=g_max --output_type=pub_quality data_file1 data_file2 data_file3

It's possible to use optparse/OptionParser() to get that information (though you have to define an option for every argument; there's no --param1 X --param2 Y file1 file2 file3), but it is much more difficult to fit those arguments into the RStudio environment. I want RStudio to be able to emulate reading command line arguments (since they do not exist in RStudio). Right now, I have to check to see if there are commandArgs() and, if not, do something else to get the information to the RStudio script.
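For what it's worth, the fallback ends up looking roughly like this (a sketch, with made-up default arguments):

  args <- commandArgs(trailingOnly = TRUE)
  if (length(args) == 0) {
    # no real command line (e.g. running interactively in RStudio): use test values
    args <- c("--plot_col=g_max", "--output_type=pub_quality", "data_file1")
  }

It works, but it's exactly the kind of boilerplate I'd rather the tooling handled.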

(2) There needs to be an option that says STOP if something doesn't make sense. I have dozens of beautiful data plots that look great, but in fact do not plot what I think they do, because factors have not been properly assigned to colors, shapes, or linetypes. (And it can be really hard to recognize that the data has not been plotted properly.) Give me an option that says, if I did not explicitly declare a column a factor, and I did not specifically associate colors/shapes/lines with factors, then the data will not be plotted.


(1) You might want to check out https://github.com/t-kalinowski/Rapp by my colleague Tomasz

(2) I think part of that is in scope for strict (https://github.com/hadley/strict). You might also be well served by adopting some more data validation tooling, e.g. pointblank (https://rstudio.github.io/pointblank/).


On point two, can’t you just use stopifnot(condition)? Then log it etc?
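Something like this as a guard before plotting, presumably (a sketch; dat and its columns are hypothetical):

  library(ggplot2)

  # fail loudly if the grouping column was never made an explicit factor
  stopifnot(is.factor(dat$group))

  ggplot(dat, aes(x, y, colour = group)) +
    geom_line()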


Hey Hadley!! Personally the only issue for me with integrating R is making renv play nice in multi-stage Docker builds. I found that I need to have my other pipeline software built in the same stage as my R env setup (building a specific version from the archive, system dependencies, then R package dependencies via renv).


It's been more than a few years since I worked in an R shop. While I loved wrangling and plotting data in the tidyverse, I did find the dependency management story in R to be even worse than Python's.

Maybe that’s the problem?


This guy is the man to ask ^^^^^^


> We found R has a really hard time integrating into data pipelines and was best used as a standalone tool by individuals

In my org we have several 100% R teams (including mine) that have been developing and maintaining business-critical, data-intensive applications for a decade now. We don't find R difficult to integrate into data pipelines. We write our data pipelines in R, and we find it very efficient to do so. They talk to databases, APIs, command line tools, etc without issue.

Doing what we do in Python is unimaginable, especially if pandas is the tabular lingua franca in the team. I vehemently agree with this article on the clunkiness of pandas from a sister comment: https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-.... Compared to dplyr and the tidyverse, pandas very noticeably gets in your way rather than being a tool of thought. (For what it's worth, there are other teams in my org that use Python for entirely justified reasons, and they use polars these days, not pandas.)

If I had to complain about anything in R these days, it would be the increasing complexity and illegibility of error messages. Tidyverse tracebacks are often dozens or hundreds of lines. This is made much worse if you have a web app in the Shiny framework, as Shiny seems to mangle and garble what little useful information you can get (my kingdom for an error with a file name and line number). Even outside of advanced packages like Shiny, the reporting of error messages suffers from some clunkiness and irregularity.

As an expert user, I can usually squint at the error barrage and infer what is really going on, but it's probably quite confusing and off-putting to newer users.

Overall though, I'm not seeing any competition for R in our space. My fondest hope is that in the coming decades there arises a new, thoughtfully designed language with the Lispy flexibility of R, but also optional type safety and static analysis affordances. I'm not sure if that's even possible, but I hope the computer science geniuses figure out a way.


If you have specific issues around error messages and tracebacks please feel free to let me know directly or to file issues on GitHub. We really do care about the legibility of errors and tracebacks, and my team and I have put a lot of effort into them over the last few years. But there's always room to do better and I'd love to know where the pain points are.

(The intersection of tidyverse and shiny tracebacks is a known pain point that's hard to resolve. Unfortunately shiny and tidyverse did a bunch of parallel work that took us in slightly different directions, and now it's hard to re-align.)

One thing we are missing is a guide to reading tracebacks for newer users. Often experts can get a good sense of where the problem is, but we've failed to teach newer users how to get the most value from a traceback.


There's clearly been a ton of progress in this area; the only issue is that feature development is even faster :) I'll keep an eye out for specific issues that seem helpful to raise.

The biggest one I have right now is a little niche, but probably useful to address. Moderately complex dbplyr pipelines on wide tables have a tendency to generate very long queries, and if there's an error, the generated SQL returned tends to overflow some text or line limit allotted to show the error at the command prompt. My workaround is to use sink() to dump the error to a file, which is a little painful as the sink() API and documentation are not the most straightforward or intuitive. (Hmm, I wonder if a withr wrapper would help me make something simpler to use...)
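For reference, the workaround looks something like this (a sketch; my_query stands for a hypothetical failing dbplyr pipeline, and note that the message stream can only be diverted to a connection, not a file name):

  con <- file("last_error.txt", open = "wt")
  sink(con, type = "message")    # divert stderr (messages, warnings, errors)
  try(dplyr::collect(my_query))  # the failing query; the full SQL lands in the file
  sink(type = "message")         # restore the message stream
  close(con)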


Hmmmm, I think that's something we could probably help with in dbplyr by providing something like `last_sql()` that would return the most recent SQL sent to the database. (By analogy to ggplot2::last_plot() and httr2::last_request()/last_response()).

I filed an issue so I don't forget about this: https://github.com/tidyverse/dbplyr/issues/1471


> My fondest hope is that in the coming decades there arises a new, thoughtfully designed language with the Lispy flexibility of R, but also optional type safety and static analysis affordances.

I think many of us saw Julia as the successor to R. Unfortunately, the package ecosystem---one of R's strongest points---still has a long way to go.


I was excited about Julia too but it now seems to be a relatively niche HPC language. It's about saving CPU time more than user time.

My sniff test for a successor language to R is whether it can replicate the tidyverse API with 100% fidelity. The API is already optimal for tabular data analysis, especially the dplyr core. It can be thought of as a specification for other languages to implement.

There is a great deal about how R works that is negotiable. But if the language can't implement dplyr to spec, or somehow doesn't "want to", it's not the language for the audience served by the tidyverse.


Follow-up: I did a little legwork and found some Julia folks who were also inspired to implement tidyverse as faithfully as possible: https://github.com/TidierOrg

Here's how dplyr-style chains look in their system:

  using TidierData
  using RDatasets

  movies = dataset("ggplot2", "movies");

  @chain movies begin
      @mutate(Budget = Budget / 1_000_000)
      @filter(Budget >= mean(skipmissing(Budget)))
      @select(Title, Budget)
      @slice(1:5)
  end
Not a character-for-character match to dplyr, but gets much closer than most other attempts!


I love this framing :)


Why Polars and not Dask?


Basically with tidyverse, R can let you write less code and keep it readable: https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-...

Can't speak to abandonment, but it seems a lot of recent development is occurring inside the tidyverse, which is deprecating a whole bunch of other stuff.


I will agree that I left just as tidyverse was coming of age, and I'm sometimes jealous I never got to use it.

What Hadley Wickham has done is very impressive.


R is the better EDA language by far. Python has caught up a lot. Notebook diffs are now readable in git with the right tooling, that's huge.

The drum about not fitting into data pipelines... if you're literally using a bash pipe, it's true that most R programmers have no idea how to do that. Otherwise, that is where Docker and k8s shine.

On packaging. R's package authority runs tests and ensures that all packages work with the latest version of their peers. The dependency heck is much less deep as a result.

We use R at my employer still because we put statistical data science into production. Our experts come to us comfortable with R. Reimplementation would be absurd.


I think it's more intuitive for statistical applications, where Python is grossly under-represented. This includes things like the design and analysis of experiments, but also lots of domain-specific statistics and algorithms, such as in bioinformatics, chemistry, and so on.

Typically those applications are not the sort of line-of-business enhancements ML in Python is more tuned to. I.e. recommender systems, NN models, and so on.


The consistency of model specification across multiple libraries is really helpful (base lm, lme4, brms etc). Even though the syntax is sometimes extended, it seems consistent enough to mostly be comprehensible/guessable.
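A rough illustration of that consistency (the data frame and columns are made up; lme4 and brms extend the same formula language with (1 | group) terms for random effects):

  # base R
  fit_lm   <- lm(score ~ age + treatment, data = dat)

  # lme4: same formula, plus a random intercept per subject
  fit_lmer <- lme4::lmer(score ~ age + treatment + (1 | subject), data = dat)

  # brms: identical extended syntax, fitted with Stan under the hood
  fit_brm  <- brms::brm(score ~ age + treatment + (1 | subject), data = dat)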


Why I still use R for analysis at work:

- R Markdown is just great for static reports. We use PowerBI or ArcGIS for interactive stuff.

- GIS is a breeze. My work provides licenses for ArcGIS, which has a Python library for scripting. Despite that, it is so much easier to do stuff in R, which can read and create ArcGIS shapefiles (a short sketch follows this list).

- Exploratory data analysis is easy. Often, before meetings, I'll connect to the database in R and make a few basic tables. Then I can query, aggregate, or plot data while sitting in the meeting. I have custom ggplot themes in a package, so even my hastily created plots look nice.

- RStudio is amazing. What it lacks in editing tricks, it more than makes up for in simplifying R-specific tasks. Showing plots is automatic, rendering and viewing markdown reports (of any type) is two buttons, testing and building a package are each two buttons.

- I spent a lot of time evangelizing R (team-wide presentations, being the "R guy" for troubleshooting, organizing an R User Group with members from different teams, creating an internal package repository). Some became happy converts, the rest begrudgingly accepted it as a tool we would use. I don't know if I could do it again with another language.
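On the GIS point, the sf package is one common route; a minimal sketch (the shapefile name is made up):

  library(sf)

  stations <- st_read("stations.shp")         # read an Esri shapefile
  stations <- st_transform(stations, 4326)    # reproject to WGS84 (EPSG:4326)
  st_write(stations, "stations_wgs84.shp")    # write a shapefile back out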

I'll admit my work doesn't get incorporated into pipelines. We get the data, analyze it, create reports, and share the reports by email or on our public website. The statisticians are segregated from the developers here. State government resists change, especially role changes that don't match grants' or laws' wording.


> R Markdown is just great for static reports.

Quarto (also supporting R) is a good replacement for rmarkdown (with a saner syntax) and I say this as someone who has extensively used rmarkdown over the years.


> What we found was that R had a lot of packages, but most haven't been touched in years, and when you contact the owner you find they've often moved on to the Python/pandas/scikit ecosystem

As a "bilingual" R & Python user, I've found this to be true for the latter language as well :)

I don't have much to add on top of what other useRs have mentioned, except another testimonial that our company has successfully used R in production for 6+ years, from data "pipeline" stuff you mentioned to dozens upon dozens of predictive models of varying complexities.

When faced with a new data analysis ask, 99%+ of the time I reach for R (although without the tidyverse, that number would be much lower). Like another commenter said, the ease by which you can plot in R blows Python away. Seaborn seems like a decent compromise in my limited experience, but plotting in "base" matplotlib makes me want to die.


Plotnine is a pretty rocking ggplot clone in Python. Just import star and you're golden.


We (Posit) have hired Hassan (the maintainer of plotnine) so this is great to hear :)


It definitely is! If you could hurry up and destroy Jupyter notebooks that would be sweet ;)


I’ve transitioned a lot of my work over to Julia, but R is still the most intuitive language I’ve used for scripting out data collection, cleaning, aggregation, and analysis cases.

The ecosystem is simply better. The folks who maintain CRAN do a fantastic job. I can’t remember the last time a library incompatibility led to a show stopper. This is a weekly occurrence in Python.


> I can’t remember the last time a library incompatibility led to a show stopper.

Oh, it’s very common unless you basically only use < 5 packages that are completely stable and no longer actively developed: packages break backwards compatibility all the time, in small and in big ways, and version pinning in R categorically does not work as well as in Python, despite all the issues with the latter. People joke about the complex packaging ecosystem in Python but at least there is such a thing. R has no equivalent. In Python, if you have a versioned lockfile, anybody can redeploy your code unless a system dependency broke. In R, even with an ‘renv’ lockfile, installing the correct packages version is a crapshoot, and will frequently fail. Don’t get me wrong, ‘renv’ has made things much better (and ‘rig’ and PPM also help in small but important ways). But it’s still dire. At work we are facing these issues every other week on some code base.


I'd love to hear more about this because from my perspective renv does seem to solve 95% of the challenges the folks face in practice. I wonder what makes your situation different? What are we missing in renv?


Oh, I totally agree that ‘renv’ probably solves 95% of problems. But those pesky 5%…

I think that most problems are ultimately caused by the fact that R packages cannot really declare versioned dependencies (most packages only declare `>=` dependency, even though they could also give upper bounds [1]; and that is woefully insufficient), and installing a package’s dependencies will (almost?) always install the latest versions, which may be incompatible with other packages. But at any rate ‘renv’ currently seems to ignore upper bounds: e.g. if I specify `Imports: dplyr (>= 0.8), dplyr (< 1.0)` it will blithely install v1.1.3.

The single one thing that causes most issues for us at work is a binary package compilation issue: the `configure` file for ‘httpuv’ clashes with our environment configuration, which is based on Gentoo Prefix and environment modules. Even though the `configure` file doesn’t hard-code any paths, it consistently finds the wrong paths for some system dependencies (including autotools). According to the system administrators of our compute cluster this is a bug in ‘httpuv’ (I don’t understand the details, and the configuration files look superficially correct to me, but I haven’t tried debugging them in detail, due to their complexity). But even if it were fixed, the issue would obviously persist for ‘renv’ projects requiring old versions.

(We are in the process of introducing a shared ‘renv’ package cache; once that’s done, the particular issue with ‘httpuv’ will be alleviated, since we can manually add precompiled versions of ‘httpuv’, built using our workaround, to that cache.)

Another issue is that ‘renv’ attempts to infer dependencies rather than having the user declare them explicitly (a la pyproject.toml dependencies), and this is inherently error-prone. I know this behaviour can be changed via `settings$snapshot.type("explicit")` but I think some of the issues we’re having are exacerbated by this default, since `renv::status()` doesn’t show which ones are direct and which are transitive dependencies.

Lastly, we’ve had to deactivate ‘renv’ sandboxing since our default library is rather beefy and resides on NFS, and initialising the sandbox makes loading ‘renv’ projects prohibitively slow — every R start takes well over a minute. Of course this is really a configuration issue: as far as I am concerned, the default R library should only include base and recommended packages. But it in my experience it is incredibly common for shared compute environments to push lots of packages into the default library. :-(

---

[1] R-exts: “A package or ‘R’ can appear more than once in the ‘Depends’ field, for example to give upper and lower bounds on acceptable versions.”


Agree with this. I am pretty agnostic on the pandas vs R stuff (I prefer base R to tidyverse, and I like pandas, but realize I am old and probably not in the majority based on comments online). But many of the "R adherent" teams I talk to are not deploying software in varying environments so much as running reporting shops doing ad-hoc analytics.

For those who want to use both R and Python, I have notes on using conda for R environments: https://andrewpwheeler.com/2022/04/08/managing-r-environment....


Can you not just build your own code as a package and specify exact dependencies?

It's a bit of faff but that seems like it should work (but maybe I'm missing something).


I basically don’t use anything outside of tidyverse or base R because of the package dependency issues.


At my old job we snapshotted CRAN and pinned versions of package dependencies _against_ CRAN.


We now provide snapshotted CRAN binaries (for many platforms) at https://packagemanager.posit.co.


R's biggest moat in my opinion is its much saner package management system and lower propensity to curb stomp existing libraries and projects with breaking changes.

As a SWE I much rather inherit and maintain R services than Python services.


Bioinformatics, particularly genetic mapping and population genomics. There’s an entire ecosystem of very mature tools actively maintained by labs to add analyses pertaining to advancements in the field, without breaking pipelines or silently changing the results of a given analysis from version to version.

Take something like adegenet, where the manual itself is approaching 200 pages:

https://cran.r-project.org/web/packages/adegenet/adegenet.pd...


PK/PD work for pharmaceutical data analysis. I didn't like using R at first, but I've come to appreciate the speed that comes with months of experience.

It's a language which feels like it has a lot of magical incantations you need to remember - the default namespace is much more crowded. Functions like sapply vs mapply are tricky to reason about from the documentation alone. The values NA vs NULL vs integer(0) are all used as stand-ins for real thrown errors and knowing which one to check for after calling a function can be tough.

But after using it for a few hundred hours to do data processing and statistical regression it's hard to imagine Python or Julia being faster to use. But in all honesty for the pharmaceutical industry it's mostly momentum that keeps R on top, the same reason they use a lot of Fortran 90.


> But in all honesty for the pharmaceutical industry it’s mostly momentum that keeps R on top

I can’t agree with this: especially in PK/PD, R is only just now taking over from the previous (closed-source) systems. Momentum would keep R out, not in.


> Functions like sapply vs mapply are tricky to reason about from the documentation alone.

Could you please expand on that? It's unclear what you're referring to.

> The values NA vs NULL vs integer(0) are all used as stand-ins for real thrown errors and knowing which one to check for after calling a function can be tough.

`checkmate::assert_numeric()` (or similar)

with base R you want isTRUE():

`stopifnot(isTRUE(is.finite(x)))` (or is.na or anything else) will error on empty values.
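To spell out why the isTRUE() wrapper helps with those "silent" return values (a minimal illustration for a scalar result):

  x <- integer(0)                   # e.g. a lookup that matched nothing
  is.finite(x)                      # logical(0)
  isTRUE(is.finite(x))              # FALSE
  stopifnot(isTRUE(is.finite(x)))   # errors instead of passing silently

  isTRUE(is.finite(NA))             # FALSE as well
  isTRUE(is.finite(NULL))           # FALSE as well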


I'll check out those assert functions. I only meant that having apply, lapply, sapply, and mapply took some getting used to; most programming languages I had used prior include only one map/apply function, and it's still not clicking for me when there are performance penalties for choosing one over another, although I've read in the documentation that there are pros and cons.


Bayesian stats

Traditional stats

Very fast iteration for data exploration in REPL (vs code or R studio).

Prefer pipeline workflows (tidyverse/magrittr).

Prefer functional

Prefer array based.

Prefer 1-indexed arrays (yes there are some of us).


There is currently no Python equivalent for both the ease of use and output quality of ggplot2 for data visualization. Many have tried over the past decade, but none have gotten close. (Plotnine was the closest: per Hadley in another comment his company hired the maintainer)


My main use case is making high quality visualizations for quick data exploration and sharing with a team. It is easy to guarantee that fonts are large enough, style is minimalist and clean, and filtering, transforming views or facets iteratively is only a couple characters change.


Exactly what you said, R is easy to get started for individuals in social science fields. Most people I know who want to dive deeper end up learning Python anyway.


This is true. I've found that teaching R to the point where people can get things done, writing fast and readable code, is fully achievable in a semester, compared to teaching pandas (and also having to teach how to program in Python).


This is kinda funny to me. Practically every social scientist I've known who had to use R was appalled by how cryptic and unintuitive it was.

I always understood that it was its open source nature, as a successor to the S language, that gave it traction.


And people wishing to delve even deeper switch back to R? I don't use it, but I understood that most advanced techniques in statistics are implemented first, and sometimes only, as R packages, no?


R is much better for REPL style development and functional programming.

Python could be so much better with some minor syntax extensions.


I find that with vscode and the immediate window I get a decent repl.

What about R's language makes it better for REPL-driven development?


If you haven't used Emacs ESS it may be hard to explain what you're missing with a true REPL, but if you have used it in the past and add the tidyverse to it, you basically have super smooth interactive editing with the ability to quickly pull text from other REPLs, past notes, or scripts. Contrary to the similar Python REPL, you can easily pull and edit chains of multiple commands via R's piping.


For one thing, R code can be written more concisely, due to the fact that the language is vector-based and functionally-oriented.


RStudio is a nice tool for making some quick graphs on data, descriptive analysis, and quickly exploring a dataset. Building some reports, or manipulating small datasets for beautiful graphs.

For anything else, we use Python.


two reasons for me

1) tidyverse makes prodding and plotting my data faster and more enjoyable. when I am prototyping a model I'll sometimes do the groundwork in R and then migrate the production version to python

2) I can't seem to write data wrangling code in py that is as aesthetically pleasing and easy to reinterpret later. could just be that I started in R, but while the methods in pandas "work" I don't always totally understand why they work the way they do. with tidy it works the way I expect and feels easier to read back and iterate on


I quit R a while ago - before data science became a thing - and switched to Julia for such tasks. R has lots of stats packages, but it is too esoteric and specialised a language to be useful IMO.


I'm an old R user forced to mostly use python because that's what the team uses.

R is so much better than Python in many areas concerning data pipelines: connecting with external database systems through a unified API, superior data munging utilities, plotting, and a more comprehensive (obviously) statistical analysis toolset.

I even find rmarkdown vastly superior to jupyter.

But IMO the best reason to use R rather than Python is that its tools will make you approach the problem as a statistician rather than a programmer.


>what is your use case?

if you are doing Bayesian stats, fitting hierarchical models, or using Stan in any serious capacity, R/Stan is so much more ergonomic than Pystan. Here’s a long list of pros-cons:

https://discourse.mc-stan.org/t/various-observations-on-rsta...


What Python learning material do you recommend, focused on data science?


I like this and they have a video course on o'reilly https://www.amazon.com/Python-Programmers-Artificial-Intelli...


I stopped using it in 2015, when I began to learn how to code.

At my FAANG company, there are teams that use it for econometrics. I think that's R's sweet spot, still in 2024.


I'm somewhat of an armchair data scientist myself.

However - I'd love to learn your ways; specifically - what are your best recommendations for python over R?

Specifically, even though my R skills are weak - I think that RStudio is pretty darn amazing - what do you recommend over RStudio?

I'd truly like to hear what a good toolbox looks like from your perspective these days (especially now this little GPT toddler is bonking into everything in my domain)


>For those of you who us R still what is your use case?

Still the best replacement for EDA and reproducible analysis that used to be done in Excel.


I had thought this book was redone by someone in Python. Does anyone remember seeing that?


tidymodels is miles ahead of the toys you have in Python for traditional machine learning. Of course Python is much better in other areas, but that is a big reason to use R, together with the super powerful tidyverse syntax.

and package management is much, much more reliable in R than in python.


Is tidymodels better than sklearn? Honestly, sklearn is one of the few things I was jealous of from the Python ecosystem, historically.


Did you work with Rstudio Server and still found it not collaborative enough?


Was there any performance difference between R and Python in your case?


In Google "data science" circa 2009 (although we didn't call it that), R was the weapon of choice.

I consider it a bad relic of the 70's. It doesn't have a "learning curve" -- it has a "learning straight line." Even when you're experienced and semi-competent at it, it's still difficult and surprising.


R is the PHP of data science. It is productive, it has a large ecosystem and lots of functionality, but it grew fast and organically and not in a well-planned manner, making it inconsistent and a bit messy to work with.

If you have to use R, use the tidyverse.

https://www.tidyverse.org/

I like R and use it often as I find it more concise to work with than Python for simple statistical purposes. I forced myself to use R instead of spreadsheets and don't regret it.

This is one of the reasons why (thanks, Zed Shaw): https://web.archive.org/web/20110702162929/https://zedshaw.c...


I took a two or three day on-site intro to R class that my employer put together. Perhaps it was not a great class, but as a seasoned software developer familiar with a number of imperative and functional languages I was baffled by R. It felt like a bunch of little functions that had been developed by different people with no consistent framework, and thrown together in some kind of big wrapper. I know it's popular among statisticians and researchers, so I think a prerequisite must be a good fluency with statistics (I don't have that). Maybe it makes more sense if you think like a statistician. As a programmer I felt like nothing I learned about R contributed to developing an intuitive understanding of any of the rest of it.


R definitely has its warts, but I strongly believe that underneath them lies a beautiful and quite elegant language that's extremely well suited to the challenges of data analysis. If you're already a programmer, you might find something like Advanced R (https://adv-r.hadley.nz) to be useful to get a sense of what R really is as a programming language.


I think of R as a programming language designed by people who’d heard about programming languages but never actually used one before. It’s great for ad-hoc analysis without having to think about production systems.


I get a similar impression, but to contextualize: in terms of statistical programming, what you're saying is even more true of what came before R, but a thousandfold worse. In that context R is fantastic.

For example SAS makes R look beautiful and consistent. And that’s more a comment on SAS than R. And this isn’t to say python is perfect either, but I prefer it.


I'm looking at R seriously for the first time.

I've got a decade in with Python numeric computing, and I'm interested in Julia and all of the cutting-edge stuff.

I've only dabbled with R until now, and I haven't researched it enough to know if rumors of its inevitable demise have any substance.

There are a lot of interesting math problems other than training gigantic neural networks on NVIDIA gear, and I've got some Computer Algebra System / ergonomic linear modeling needs on a current project:

I need the best tool for someone who is messing with Black-Scholes type stuff, who is still building the fidelity with tricky antiderivatives by hand, but I have enough fundamentals to check the computer's work.

What role should R play here?


I love R. You could do it in R. But a lot of the derivations and math finance stuff you can and should be able to do in C/C++. R packages mostly depend on those as well for heavy-duty calcs.

So, if I wanted to dabble I'd easily use R and if I was in the quant developer world I'd be doing C/C++


I work with a trading team that manages $1B, exclusively with R.


Second this, seen lots of funds use whatever language their lead QR/QT feels comfortable with. At the end of the day, if you aren't running a strategy that requires colocation on the exchange, whatever speed improvement you get from the language will usually disappear from the network latency.

Something like intraday momentum/sector rotations can easily be done entirely in Python/R, from what I've seen.


Likewise interested if a pro has any consulting hours to spare :)


Sorry, unfortunately I do not do consulting.

In your other comment, you said you are looking to price "weird derivatives". How weird are we talking? If it's OTC I won't be able to help anyway; if it's standard then I can at least try to point you in the right direction. The fact you mention Black-Scholes makes me think it might be something closer to "vanilla" than the other way around.


It’s looking like the goal will be to create downward pressure on derivative beta (especially in the case of a rapidly increasing underlying: big pools of Hopper cards basically).

I have a vague intuition that transaction costs will be sort of cumulatively symmetric: participants who get in quickly will pay a lot per unit time, but conversely, people who VWAP in will get zero-rated on the way out.

There’s a legitimate underlying switching cost, there’s a stability premium thereby, making that equitable for all participants is an interesting problem.


Am I correct in understanding that you have the spot price of Hopper card compute as your underlying and then come up with a pricing equation for some derivative instruments for that?


In a friction free scenario it would be a standard future, yes.

The reality is closer to an option on an FX forward, with a very nasty empirical MC as Q* for the payoff equivalence.

I’m not fancy enough, I know when to sub-contract!


Have you tried looking at SABR?

If what you have is FX-like I wouldn't be able to help beyond that anyway; FX modelling is its own thing and I haven't done anything there since the obligatory uni courses (I'm in equity space myself). AFAIK the general way to do things in rates/FX is SABR for vanilla and then PDE/Monte Carlo for exotics, but I was never on an FX desk so don't want to point you in the wrong direction.


As I’m sure you can tell, pricing exotic derivatives isn’t my day job.

But your reminder to think of SABR/implied-vol is useful: I think there’s a convexity argument that can be made around how fat the tails would need to be.

I’m not sure anyone is going to be thrilled at “anywhere between one hundred dollars and one hundred million dollars”, but my job is to figure out the bounds.


I have to price some weird derivatives.

Do you do any consulting in non-adjacent areas?


I've done some work for scientists where they used C++ extensions to R for heavy number crunching. For their workflow, R is really nice. Don't know how common this is though.


Rcpp is pretty common in major performance sensitive packages. The CppCast did an interview with Dirk Eddelbuettel about it in 2022:

https://cppcast.com/rcpp/


Yeah, that's what it was. I was taking apart and documenting an Rcpp module they had, but for which they no longer had access to the coder. It was pretty cool work; I would be happy to do more with Rcpp.


Quite easy to price derivatives with R. I have a degree in finmath from UChicago, where derivative pricing was taught using Matlab and R. But in the last semester we were told - oh yeah, when you go out there into the real world and start working for the banks you can't use civilized tools like R and Matlab, so you have to take this mandatory class on cpp. There once was a guy named Stroustrup and this shit here is called a makefile... After graduation I worked for BofA and yes, the quant world is completely C++. But there are small funds (a few billion dollars) that do their own shit in R, Haskell, Q/kdb, others. Very doable in R.


CS50 will also be available with R starting this summer: https://www.edx.org/learn/r-programming/harvard-university-c...


I love the power of R, especially when used for "stupid" stuff [0]

+ extra points for using quarto

[0]: https://gist.github.com/mine-cetinkaya-rundel/03d7516dea1e5f...


I dipped my feet into R a few years back, but eventually stopped it because of the way it handles integers. At the time it treated all integers internally as signed 32-bit and if the number is too large for that it converted it to a float.

I don't know what R does now, but this was a deal breaker for me at the time because I was dealing with really large integers that regularly broke this limit.


Integers are still only 32 bits. There’s a class which effectively represents 64-bit integers (https://www.rdocumentation.org/packages/csvread/versions/1.2...) as well as arbitrary-sized (https://cran.r-project.org/web/packages/gmp/index.html, https://www.rdocumentation.org/packages/gmp/versions/0.7-4/t...). I will say there are a few pitfalls where the integer bits are unexpectedly converted to something else, but it’s workable.
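The 32-bit behaviour is easy to demonstrate from a clean session (the add-on packages linked above are the usual workaround when you genuinely need big integers):

  .Machine$integer.max    # 2147483647, i.e. signed 32-bit
  2147483647L + 1L        # NA, with an integer overflow warning
  2^53 + 1 == 2^53        # TRUE: beyond 2^53, doubles silently lose exactness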


Finally a real reason.

A lot of the stuff above was complaining about issues where Python is a lot worse than R, about non-issues or with a fundamental misunderstanding of the language. I'd given up hope of seeing a real weakness named as such :)

There is bit64, and doubles can be used as 53-bit pseudo-integers - but if I needed 64-bit integers, R wouldn't be my first choice, definitely.


What makes this book different from R for Data Science by Hadley Wickham, Mine Çetinkaya-Rundel, & Garrett Grolemund?


Unfortunately there are 56 other data science with R books, so what is the differentiating factor here?


It's the Harvardx course


In my case, I found R a better tool for learning DS, as it is, more or less, a DSL for statistics, and it feels more low-level and forces you to learn more fundamentals than Python. For production it is probably worse than Python, true.


It's not a DSL.


It is de facto.


Why do you think that? (I'm legitimately curious.)


Not the parent, and I also consider R to be a surprisingly solid general-purpose PL, but the one reason I can think of is that formulas are part of the language. Formulas are extremely well-suited for data analysis but odd outside of that.


Good point. I'd add complex numbers and "everything is a vector" to that.

Of course that doesn't make it a DSL. It simply means that R was designed with a particular application in mind. So was Perl (regular expressions as first-class citizens) or Javascript (DOM manipulation). Not to mention PHP.


That is true. Also Erlang with the actor model, and Go with its goroutines for network services.


This is why I said "de facto", not "by design". Yes, you can use it outside the stats domain, but it will suck big time for that. PHP used to be like that as well, but later versions look more usable for general scripting.


No, R is not a "surprisingly solid general purpose PL". Not in comparison to better options such as Ruby, Python, Perl, etc.


But yes, in comparison to Stata, which is its main competitor in the space where I am using it. This does make it possible to write much more modular, legible, and well-tested R code than Stata do files.


That's a comprehensive guide. If anyone wants a similar introduction, with interactive exercises to try while they study this is also a good resource: https://www.codecademy.com/learn/learn-r


No better tool for EDA and data analysis than R and RStudio. Fell in love with it in Stat 133 at Cal, and now, while I am doing software engineering, I have very fond memories of writing R and the tidyverse.


See also:

"R Programming for Data Science" https://leanpub.com/rprogramming



