I have a related technical question. Why couldn't something highly embeddable like SQLite be the default underlying implementation for a data frame in something like Python or R? It seems like Pandas and R data frames have a great deal redundant functionality.
SQLite seems like it has the guts to be the standard libdataframe.c for R, Python, Julia, etc. As a side benefit it already has a super consistent API (a.k.a. SQL).
Because it's designed to support typically relational db workloads (i.e. Lots of changes) not data analysis workloads. Dataframes in R, pandas etc, are column oriented, which leads to better trade offs for analysis.
Also SQL is a substantially inferior API for data analysis. (Not because it's a bad language, but again because that's not what it's designed for)
I agree completely that SQL is not the language for the kind of data analysis you're discussing in this book - to me, the question is whether it's useful to do querying and filtering through SQL and data analysis through python and R on the resulting datasets. I think pretty much everything you've written here would be continue useful if you used sqldf to generate data frames in R, but I don't know R well enough to be sure of that.
Because pandasql returns a data frame from a data frame (not sure if this is the case with R), I find it relatively easy to do data things with sql and data analysis with python. However, that's not a huge surprise since I've been using SQL for a while but don't know the pandas or R data frame syntax especially well.
I'm not sure why sqlite was chosen - could it have to do with the in-memory nature of dataframes? So far, my use of sql with data frames has been pretty generic, so I haven't bumped up any implementation specific SQL issues.
I think teaching multiple languages would make life much harder for new learners.
Also window functions are really useful for data analysis, and they are much easier to express in dplyr than they are in SQL (at the cost of being slightly less general).
Thank you for the reply--I immediately started Googling for more info about column-oriented data stores to see if there was something analogous to SQLite in this space. It looks like there's an embedded MoneDBLite package for R now that I'll need to check out.
SQLite seems like it has the guts to be the standard libdataframe.c for R, Python, Julia, etc. As a side benefit it already has a super consistent API (a.k.a. SQL).