I'm the author, and I'm happy to answer any questions.
The book should be in print by (hopefully) the end of this year, or definitely by Jan 2017. The content will not change significantly, but there will be minor fixes and a lot of proofreading.
Great book, I'm getting a lot out of the site and I'm looking forward to the release. Thanks!
I understand there is always one more library or topic that could be included...
.. but with that acknowledged, what do you think of sqldf as an alternative to dplyr? You mention that dplyr is a bit easier (within the context of being specialized for data analysis). I'd have trouble weighing in because I don't use R all that much, but I do really like the python "equivalent" pandasql.
Also, I've used SQL for a long time, so I'd have trouble at this point really knowing what's "easier" for someone new to both, but I do often find it easier to use SQL than do data frame operations in pandas. dplyr seems to be a closer cousin to standard SQL, so the difference might not be quite as great.
I wondered something similar about sqldf, because at one point my brain just seemed to work better "in" SQL.
The biggest issue I found was that sqldf was significantly slower than dplyr and other alternatives.
I started trying to mess about with something I was calling sqldf2. Didn't get very far, but there is some possibly useful benchmarking in the R script here:
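The gist of that comparison looks something like this (a minimal sketch rather than the actual script; the data here is made up, and you need sqldf and dplyr installed):

    library(sqldf)
    library(dplyr)

    # Toy data: 100k rows, then the same grouped aggregation done both ways
    df <- data.frame(g = sample(letters, 1e5, replace = TRUE), x = runif(1e5))

    system.time(sqldf("SELECT g, AVG(x) AS mean_x FROM df GROUP BY g"))
    system.time(df %>% group_by(g) %>% summarise(mean_x = mean(x)))

In my runs the sqldf version was noticeably slower, which is what pushed me toward dplyr.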
I have a related technical question. Why couldn't something highly embeddable like SQLite be the default underlying implementation for a data frame in something like Python or R? It seems like Pandas and R data frames have a great deal redundant functionality.
SQLite seems like it has the guts to be the standard libdataframe.c for R, Python, Julia, etc. As a side benefit it already has a super consistent API (a.k.a. SQL).
Because it's designed to support typical relational DB workloads (i.e. lots of changes), not data analysis workloads. Data frames in R, pandas, etc. are column-oriented, which leads to better trade-offs for analysis.
Also SQL is a substantially inferior API for data analysis. (Not because it's a bad language, but again because that's not what it's designed for)
I agree completely that SQL is not the language for the kind of data analysis you're discussing in this book - to me, the question is whether it's useful to do querying and filtering through SQL and data analysis through Python and R on the resulting datasets. I think pretty much everything you've written here would continue to be useful if you used sqldf to generate data frames in R, but I don't know R well enough to be sure of that.
Because pandasql returns a data frame from a data frame (not sure if this is the case with R), I find it relatively easy to do data things with sql and data analysis with python. However, that's not a huge surprise since I've been using SQL for a while but don't know the pandas or R data frame syntax especially well.
I'm not sure why SQLite was chosen - could it have to do with the in-memory nature of data frames? So far, my use of SQL with data frames has been pretty generic, so I haven't bumped up against any implementation-specific SQL issues.
I think teaching multiple languages would make life much harder for new learners.
Also window functions are really useful for data analysis, and they are much easier to express in dplyr than they are in SQL (at the cost of being slightly less general).
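For example, a grouped ranking is a single mutate() call in dplyr. A rough sketch using the nycflights13 data from the book:

    library(dplyr)
    library(nycflights13)

    # Rank flights by departure delay within each carrier; in SQL this needs
    # a window clause like RANK() OVER (PARTITION BY carrier ORDER BY ...).
    flights %>%
      group_by(carrier) %>%
      mutate(delay_rank = min_rank(desc(dep_delay))) %>%
      select(carrier, flight, dep_delay, delay_rank)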
Thank you for the reply--I immediately started Googling for more info about column-oriented data stores to see if there was something analogous to SQLite in this space. It looks like there's an embedded MonetDBLite package for R now that I'll need to check out.
If you don't already know SQL or dplyr, I think you would find dplyr significantly easier to learn. Some people who do know SQL well have commented that they too find dplyr easier. I think this is because the scope of dplyr is much smaller than SQL and it is designed specifically to facilitate data analysis.
Trivial, self-serving question: is there a library for generating the diagram of table relationships here (13.2 nycflights13) http://r4ds.had.co.nz/relational-data.html
And of course, thanks for another great book, it's helpful for learning R but I'm always enlightened by how thoroughly you explain the general concepts (e.g. Relational data and joins). Have heard a few people on faculty speak enthusiastically about the book even as I hold out for more adoption of Python :)
It won't be as pretty, but I deal with large data models all the time, and like to use SchemaSpy [1] which generates an interactive page in HTML and can be used on the command line (I guess you could always modify the CSS to make it pretty). It's literally one of the most useful tools in my life, and the output is good enough to show to clients.
If I'm designing a DB or even just an SQL example, I'll run the code on my local machine (psql + the Postgres app [2]) or if I'm lucky, the client already has a server running Postgres and I can run it there instead. All SchemaSpy then needs is access to the DB and voila, interactive example.
Lucidchart has the ability to generate SQL for you to run and it'll generate a schema for you. I used it to figure out the schema for a particularly poorly designed DB I had to get data out of.
Hey Hadley. Huge fan of your work! Many of the libraries you have authored or co-authored have had a big influence on how I think about building tools. I look forward to getting a hard copy of the book!
I have a bit of a nitpick about chapter 13 on "relational data", in which I believe you are consistently misusing the technical term "relation" to refer to the relationship between two data sets. In the context of relational database theory, "relation" is just another word for "table" (although it connotes more mathematical formalism).
I think it is worth respecting the precise technical usage in this case: Consider a student who might read your book and be told that "relations are always defined between a pair of tables," and that "a primary key and the corresponding foreign key in another table form a relation." The same student might also stumble across the wikipedia page for the relational model and learn that "a relation is defined as a set of tuples that have the same attributes," and that the relational model "organizes data into one or more tables (or relations)."
Technically, the relation (value) is the set of tuples (i.e. the data itself, "the true statements", the rows in the table), the relation variable is what is defined by CREATE TABLE (i.e. how the data is constrained, "what can be true") and what most people are trying to model.
The relational data model - the set of relation variables - is thus the "equation that defines what is going on in your company" and what the DBMS puts in - the set of relation values, usually abbreviated to relations - can be seen as "the history of what happened at your company".
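A concrete way to see the distinction (a minimal sketch using DBI and RSQLite; the table and rows are made up):

    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), ":memory:")

    # The relation variable: what CREATE TABLE defines (what can be true)
    dbExecute(con, "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")

    # The relation value: the set of tuples currently held in that variable
    dbExecute(con, "INSERT INTO employee VALUES (1, 'Ada'), (2, 'Grace')")
    dbGetQuery(con, "SELECT * FROM employee")

    dbDisconnect(con)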
I'm a software engineer who is already quite comfortable with Python and has more of an interest in machine learning than data science (as I understand it), is there any reason for me to learn R?
I'm in bioinformatics. R is huge, Python is a distant second. So I am forced to use both.
R is a terrible programming language. It's slow and syntactically inconsistent. For interactive statistical analysis it can be OK, but anything beyond a small program becomes unmanageable quickly. In particular, if you want to manipulate strings or hierarchical data structures quickly in R, good luck.
For ML, scikit-learn is almost always sufficient. For statistics, statsmodels is OK but very underdeveloped compared to what is available in R. IMO plotting is equally painful in both Python and R (seaborn is a good Python library to ease the pain if you haven't seen it).
Personally, I write everything in Python and call out to R as infrequently as possible using rpy2, usually only for specific statistical routines or bioinformatics-specific libraries.
Probably not any strong reasons. That said, if you're a software engineer, you shouldn't find it too hard to pick up enough R to be useful. You might enjoy <http://adv-r.had.co.nz> which describes R from more of a programming language perspective.
I use both R and python quite a bit. I prefer python as a programming language. Here's my take on 'Why learn R?':
(1) R/ggplot is hands-down better for plotting than anything in python. I also think that R is better for EDA generally.
(2) Many smart, knowledgeable people use R and publish their code. To learn from it, you need to know enough R to read and modify it.
(3) R has better package support than python in several common data analysis domains. For example, in forecasting and in graph analysis, the best R packages available are much better than the best python packages.
I've always had a hard time understanding pipes, so it may just be that the concept will take more time for me to grok. I thought that the little bunny foo foo example in section 18.2 was really hard to grasp. I think that a similarly in-depth example with numerical data, while more boring, may make the concept easier to understand.
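Something like this is what I have in mind: the same computation written nested and then piped (the numbers are made up):

    library(magrittr)  # provides %>% (also loaded with dplyr/tidyverse)

    x <- c(2.1, 4.77, 3.0, 8.6)

    # Nested version: read inside-out
    round(exp(diff(log(x))), 1)

    # Piped version: read left-to-right, one step per line
    x %>%
      log() %>%
      diff() %>%
      exp() %>%
      round(1)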
I've been using r4ds for the past couple weeks. This is the first time I've really understood how everything in R and the tidyverse fits together. I am really enjoying the book. It has already helped me immensely.
I don't see how there would be a need for that. The code will work in Jupyter Notebooks just as it will work in a different IDE.
I came from Python using iPython. I missed them for the first few weeks, but now I can't switch from RStudio. It really is just such a great tool for data science.
The advantage of Jupyter notebooks over RStudio is having text and code in the same place. Personally, I find that being able to run+modify the code in a textbook is much more informative than simply reading the syntax (for example - https://github.com/CamDavidsonPilon/Probabilistic-Programmin...). Sure, I could just copy and paste from the website into an IDE, but notebooks are a more natural way of communicating code and prose, IMO.
In R we also have notebooks that do this; it's just slightly different. RMarkdown is how we do the same thing. At first I missed the separate code and text blocks, but it's just easier to work with when it's all one text file. You can write RMarkdown reports and then just run them from the command line and never have to open RStudio or R. Saves me a ton of time.
You use back ticks to make your code chunks.
For example:
Some random Markdown text here is treated as a text block, like in Jupyter.
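And a chunk fenced with backticks runs as R code when you knit. Roughly (the chunk contents are made up):

    ```{r}
    # This chunk is executed as R code when the document is knit
    summary(cars)
    ```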
I agree, though I don't know if I'm just missing something when working in Jupyter, or if Python has an equivalent of RMarkdown?
Jupyter has some conveniences, but the tradeoffs aren't worth it for me. Working in a web browser gives you much less keyboarding power than Atom/Sublime. And I generally don't need to interact with my data; I know what I'm outputting, I just want to show the results to readers alongside my code. I don't use RStudio, but RMarkdown is easy to run from the command line.
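For example (the file name is made up), a single call renders a report without opening RStudio:

    # Render an R Markdown report from a plain R session or via Rscript
    rmarkdown::render("analysis.Rmd")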
By contrast, Jupyter requires (AFAIK) working within the browser and, when you save the file, you get a huge jumble of JSON, which is how the notebook is serialized. I tend to write a lot of vignettes/explorations and the need to full-text grep them is important to me and is not feasible when the text content is saved as JSON.
>Jupyter, or if Python has an equivalent of RMarkdown?
You can convert the notebooks to different formats, though they tend to need some fixing up to work as a script (which I might also need to do in R to turn it into a script).
The source is available in https://github.com/hadley/r4ds, and works exactly as you describe when using the preview edition of RStudio (and indeed that's how I write the book)
I just saw this the other day! Is this a wrapper around Pandoc to make our lives easier (tidyr)? I just went through the process of turning a markdown file into an epub before I saw this; wish I'd seen it earlier.