Tables and data frames are both leaves in the far more fundamental flow that beginners don't pick up because it is too simple - the relational model of data. The real problem is the the basic normal forms are so obvious and simple it is difficult to tell if people designed around them on purpose or stumbled onto the right path.
I suspect the distinction between tables and data frames (and arguing about query languages) is from people who have lost sight of the important things when dealing with data - the relational model and a system that supports relational algebra. Beyond that, features and optimisations need to be justified in terms of today's (and tomorrow's) needs which are specific to the data. The only major issue with SQL is a lack of support for something like R's tidyverse gather() and spread() operations which could be considered as missing operations in relational algebra.
I think one limitation of what you're saying is that you need to process the data before they become relational model. You need to integrate and normalize, and therefore, at some point, you need to support datasets that are not in normal form. Or perhaps you don't even need to normalize.
But I also agree with you. I think we are teaching programming wrong, we start with imperative programming, but perhaps we should start with data modelling. How the data look like? What are the constraints? Can we test the constraints? How many bits I need to store it?
"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious." -- Fred Brooks
But that was exactly how programming was taught 30 years ago. Relational data modeling was your starting point. Then object orientation came along and we suffered the dreaded object relational impedance mismatch when we couldn’t translate invoice header to invoice line-items in an object oriented way. Yet somehow the majority of development continued down this design philosophy, abstracting the data model further and further away (it was only heresy for a short while that data models were being generated by tools of process models). Then NoSql. Then graphQL dragging the lamppost of software development further away from understanding data first.
The central design premise for relational data modeling was “if your model could potentially allow inconsistencies, assume the inconsistencies”. Today that premise is easily brushed aside with “you’ll never get anyone manipulating this table without going through this layer” or “we then group by these columns to show unique records to the user”.
I think NoSQL is only as popular as it is because a lot engineers don’t have a proper understanding of the relational model (which isn’t hard to understand at all, it just seems to have fallen out of popularity). I certainly think there are completely valid use cases for denormalized datastores, but I don’t think those use cases are what’s driving their adoption. You can see this in how so many mongo/dynamo... apps end up just being semi-normalized, relational-ish databases.
I also think another reason is that relational database interfaces don’t really fit in with the architecture people want to use these days. Products like dynamo have secure and operable HTTP interfaces and SDKs that fit in really nicely with the ‘serverless’ stuff. To run a relational database you pretty much need to run a network, which isn’t particularly compatible with such architectures.
I think a factor in this is that NoSql databases have nice API's that programmers can use to setup tables, do simple queries etc, which makes it much easier to get started.
For RDBMS's you have to muck around with connections and SQL, which is more powerful but requires much more ceremony. (Connection pooling, prepared statements etc)
The lack of understanding of the relational model is not the limiting factor in my experience, the developer experience is just much worse.
> but requires much more ceremony. (Connection pooling, prepared statements etc)
NoSQL databases need you to go through this ceremony too. It may be less obvious if you're just passing JSONs over HTTP all the way, but something still needs to keep connections and sanitize untrusted input.
It's not just ORM. Many NoSQL databases allow for real-time events (query subscriptions), a simple security model, built-in data versioning, built-in sharding. You rarely get any of that out-of-the-box for a traditional RDS.
Judging by my previous managers way of describing things ("give me if x then y"), many people understand the relational model on a basic level, but can't think in sets when describing the output they are looking for ("give me x where Y"). While you can get "if" statemnts in SQL, it's not the way you should be thinkng when doing any sort of non-trivial query.
The dreaded generic object table* is something I think people have put in SQL databases forever and will continue to do so, so it seems superficially logical to me to say "why not just use an object store of some sort?"
*I have experience with a system, not designed by me, that had one, and we were always going to redo/split it but never did.
"Serverless" is just "someone else's servers". I might be missing something big here, but I don't see why - in principle - a "serverless", SQL-over-API-as-a-Service couldn't work.
It’s also an abstraction on top of those servers, that separates your business logic from the underlying architecture to a degree that a lot of people find really appealing.
> but I don't see why - in principle - a "serverless", SQL-over-API-as-a-Service couldn't work.
It can, but there’s just not really any good ones. Where as there are products like Dynamo which are amazing from an operability standpoint (as long as your use case doesn’t run up too hard against any of its constraints). AWS Serverless RDS is pretty terrible for example, the engine choices are limited, the scaling story is terrible, it’s expensive, it doesn’t actually have anything close to the “on-demand” functionality described in the marketing material, and the interface is just a mechanism for passing SQL queries around (so you’d probably want to use yet another abstraction layer for constructing your queries). Spanner is pretty good, but it’s really expensive, and isn’t amazing enough on its own to justify moving to GCP unless you’re already there. You can also run an HTTP interface yourself for you RDBMS, but there isn’t really a mature product in that space. There’s no show stopping technical limitations there, it’s just not a well or widely supported feature.
I've never understood the myth of the "object relational impedance mismatch" (ORIM).
Objects are just a collection of attributes. Table rows are just a collection of attributes. There are some things that objects can do that the relational model can't but if you are designing a database schema why would you care about those things? There is no rule that you have to use every feature of a programming language everywhere. So just use the least common denominator of features and you're good to go.
Complaining about ORIM is like complaining about the universal serial impedance mismatch. You can't plug in USB 3 (OOP) in USB 1.1 (relational) but you can plug in USB 1.1 into USB 3.0 and since we know the old model is still good enough for a lot of use cases (think keyboard and mouse) we still use it even though we have a completely different standard installed into our computers.
What this means in practice is that your Domain/Entity classes look exactly like your database schema, not the other way. You still have to write queries but your ORM makes it ten times easier by offering very convenient query builders that let you build dynamic queries without string concatenation.
This misunderstands the relational model -- it doesn't mean "my app's data model". Ie., it's not that there's fields and rows.
It's that the data is represented with sets, and there's an algebra over those that provides (strong) guarantees and principled way of composing operations.
Yes, you use objects as mere key-value pairs and provide a bizarre semantics for the relational algebra over sets of these objects -- but! -- this isn't object orientation.
Eg., in OO objects compose -- in the relational algebra rows dont "compose", eg., even having person.address.street breaks the semantics of 'SELECT'
The interpretation of p.a.s has to be as a subset on a product of relations (ie., tables P and A filtered on a join of p.id to a.id, etc...)
This is one of the key impedance mismatches in OO<->Relations -- composition =/= join. Hence awkward and and ugly workarounds in all ORMs.
I always felt like the relational model, SQL, and the extensions and add-ons like PL/SQL come from people with totally different worldviews and make quite a mess together. I couldn't design something better than PL/SQL, but it just seems so, so wrong to me.
On the other hand, the purists* who rant about nulls, I think have missed something as well.
But on the whole, I would like to see something related to SQL, that tries to improve it without grafting things on, that has an overall vision and an approach of simplifying and making it more coherent. Not being committed to the syntax, and wanting to close the loop more between manipulating data and manipulating the language itself.
I don't know, if you say SQL is a functional programming language, ok, and you take some common functional language that has nice syntax and make it consistent with the relational model, what do you get?
*Of which my impression was formed by (IIRC):
"Relational Database Writings, 1989-1991 by C. J. Date"
(which Amazon seems to want over $1,000 for right now, in paperback)
I suspect the distinction between tables and data frames (and arguing about query languages) is from people who have lost sight of the important things when dealing with data - the relational model and a system that supports relational algebra. Beyond that, features and optimisations need to be justified in terms of today's (and tomorrow's) needs which are specific to the data. The only major issue with SQL is a lack of support for something like R's tidyverse gather() and spread() operations which could be considered as missing operations in relational algebra.