Hacker News
Why Are There No Relational DBMSs? [pdf] (2015) (warwick.ac.uk)
143 points by todsacerdoti on March 13, 2023 | 111 comments



For the benefit of those looking to understand the premise of the question, the author’s complaints as to why SQL-based DBMSs are not ‘truly relational’ are enumerated as follows:

> Why did they neglect the obvious need for each of a table’s columns to have a name? Why did they allow a table to have two or more columns with the same name? Why did they decide to allow more than one copy of the same row to appear in a table? Why did they invent NULL and introduce that third truth value, UNKNOWN?

Where by ‘table’ he means ‘result of a query’.

So the complaint amounts to SQL’s failure to fully ‘close’, such that the result of any query would automatically be an indexed relation exactly equivalent in capability to a table. Which I guess is something every SQL user has run into from time to time, and is a source of annoyance, but doesn’t seem to have prevented people from actually using SQL to… manage… relational data.

Also, much of this paper seems to be taken up with complaining that Codd didn’t think through all the details someone would need to iron out when actually implementing relational calculus, which seems like an odd place to lay the complaint. Codd demonstrated that a system based on relations was, in certain ways, complete and comprehensive (ie it could let you answer ‘any question’ about some data); that it didn’t also serve as an unambiguous spec for how to design a query language and engine seems to ask a bit much of the initial theoretical underpinnings of the entire field. Like complaining that Turing’s paper on universal computation doesn’t actually nail down how to design a CPU.


"So the complaint amounts to SQLs failure to fully ‘close’ such that the result of any query would automatically be an indexed relation exactly equivalent in capability to a table. Which I guess is something every SQL user has run into from time to time, and is a source of annoyance, but doesn’t seem to have prevented people from actually using SQL to… manage… relational data."

Load up the debate in your mind about the importance of generics and how important Optional is and how important sum types are, regardless of your personal feelings about them.

Now. Once upon a time, people wrote programs with nothing but global variables and GOTO statements. They didn't even have subroutines. They wrote code nevertheless. It was just bad code, for a much greater effort than they would need to write similar code today.

The fact that people solve problems doesn't prove they're doing it optimally and that there can't be any improvement.

The good news is, this problem continues to get better and better. I've been doing some Postgres work lately, and unlike when I tried this several years ago, you can now pretty much switch freely between something being a table, or an array of arrays, or a select statement, etc.

The bad news is that every single pathway was added ad hoc, so they all have different names and even different naming schemes ("unnest" to turn arrays into a table, ARRAY to turn a table into an array), aren't coherently linked together, and the documentation is still fuzzy on the types of these things, so it takes some experimentation to work out the correct "spelling" of all these conversions. Optimization behavior is even more opaque. I'm sure that will also get better over time. But it's a work in progress, not a completed job.


The result of a query isn't a set but a multiset/bag (a set that allows multiple instances of the same element). One popular take is that this "breaks" the relational algebra, which is certainly true if sets are the thing you want to base everything on, as in most of mathematics today. However, multisets are in and of themselves a very interesting structure, with properties that could be helpful in understanding distributed systems and general relativity.

Things like sum types and the option monad would be powerful additions to an RDBMS, but I wish that people would not be so quick to dismiss anything because it is not a set. Sets can be extremely difficult to work with; even simple things like adding numbers together as described in Peano arithmetic are bonkers compared with multisets. Everyone knows how Hilbert's dream of grounding mathematics in logic/set theory failed, but somehow everyone keeps wanting to use them for everything. In my mind this is a shame, and I don't think that removing multisets from the one place where they have a use in society at large is a good idea.


I imagine if we were writing a new RDBMS that wasn't worried about backwards compatibility, we could reduce the number of types flying around internally. I think there's a lot of historical accident in which exact variants of set are available at which point. A refinement of relational logic based on a more pragmatic base data structure would probably also be helpful. (Codd's logic is close to pragmatic, but does still have a bit of ivory towerism in it; no criticism intended, it was a huge advance, but I think we could tweak it a bit in modern times. And in his defense, there were bits of Codd's work that took a long time to make it out into the pragmatic systems, to their detriment.)

But in the meantime, back here on the ground, it is nice to at least be working in an RDBMS that can convert back and forth between these things, even if it's still clunky. I remember when I just plain couldn't, and the circumlocutions to do what I wanted to do were bigger than the business logic I needed.


The reason sets are important is that they correspond to (Boolean-valued) properties. To each set, there corresponds the property of belonging to that set. To each property, there corresponds the set of all things with that property. I think this is the key reason why the relational algebra is a good foundation for a query language. When I'm writing a query, I'm thinking of some property P, such that I want the results of the query to be all records with property P. By utilizing the correspondence between properties and sets I can translate that property fairly directly into an expression in the relational algebra, and then the magic of a relational query language is that that expression is all I need to write to carry out the query. That's the sense in which relational query languages are declarative. I just write down the property of the results I want, and I get those results automatically without having to specify how to collect those results.

Having queries return multisets rather than sets "breaks" the relational algebra in the sense that it breaks this correspondence. Results of queries no longer correspond one-to-one to properties, since properties have no multiplicity. To be fair, you can identify the property being true or the element belonging to the set with having multiplicity > 0, and the property being false or the element not belonging to the set with having multiplicity 0, and by doing this you can think of SQL queries as corresponding to sets/properties most of the time. But if you're going to think of them that way, you might as well just have them be sets in the first place. The multisets are just a needless complication. Thinking about SQL queries in terms of multisets seems to only be compatible with a more imperative, non-relational approach to the language, where you still have to think algorithmically about how to assemble the collection of results that you want, rather than just directly characterizing the results in terms of a property.
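
The set/bag distinction above is easy to see in practice. Here is a small sketch (SQLite driven from Python; the orders table is made up for illustration): plain projection returns a bag with duplicates, and set semantics must be requested explicitly with DISTINCT.

```python
import sqlite3

# Illustrative table: projecting away a column can leave duplicate rows,
# so a query result is a bag, not a set.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer TEXT, item TEXT)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("ann", "book"), ("ann", "pen"), ("bob", "book")])

bag = con.execute("SELECT customer FROM orders").fetchall()
print(bag)      # three rows: ('ann',) appears twice

as_set = con.execute("SELECT DISTINCT customer FROM orders").fetchall()
print(as_set)   # two rows: set semantics has to be asked for explicitly
```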


> It was just bad code,

To the point of this thread, it wasn't "bad" then, any more than Newton's ideas about satellites were bad. It was state of the art, and had room for improvement over time.


Fair point; that was some clumsy phrasing on my part. By the same standard even our best code is bad today in some way, I'm sure. (If only for being more verbose and overspecified than "CodeAI, write an optimized routine to losslessly compress video".)


in the relational model the construct that corresponds to optional is a second table that is missing some rows

for example, every foo here has a bar, but not every foo has a baz

    foo  bar
    1    2
    3    4
    5    6

    foo  baz
    1    7
    5    8
you can think of this as baz being an optional attribute of whatever entities the foo values identify. if the second relation is not unique on foo (maybe its primary key is both columns!) you could think of it as being a set-valued relationship, associating any number of bazzes with each foo rather than just 0 or 1
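That two-table pattern can be sketched directly (SQLite via Python; the table names foo_bar and foo_baz are just made up for this example). Optionality shows up as absent rows, and an inner join answers "which foos have a baz?" without any NULLs appearing:

```python
import sqlite3

# The foo/bar and foo/baz relations from the tables above: baz is
# "optional" because some foos simply have no row in the second table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE foo_bar (foo INTEGER PRIMARY KEY, bar INTEGER)")
con.execute("CREATE TABLE foo_baz (foo INTEGER PRIMARY KEY, baz INTEGER)")
con.executemany("INSERT INTO foo_bar VALUES (?, ?)", [(1, 2), (3, 4), (5, 6)])
con.executemany("INSERT INTO foo_baz VALUES (?, ?)", [(1, 7), (5, 8)])

# An inner join keeps only foos that do have a baz; foo 3 is simply absent.
with_baz = con.execute(
    "SELECT b.foo, b.bar, z.baz FROM foo_bar b JOIN foo_baz z ON b.foo = z.foo"
).fetchall()
print(with_baz)
```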

from darwen and date's perspective, having null is an error-prone construct analogous to global variables and goto statements

as a historical note, subroutines predate goto statements by about ten years; subroutines predate not only high-level languages but even symbolic assembler. it is true that many people have written programs with global variables and goto statements and no subroutines, but that wasn't because subroutines hadn't been invented yet, or even publicized; wilkes explained how to use subroutines in his 01951 book "The Preparation of Programs for an Electronic Digital Computer" https://archive.org/details/programsforelect00wilk but they were presumably invented somewhat earlier, possibly by david wheeler

the whole array construct in postgres seems like it was probably a huge mistake


"subroutines predate goto statements by about ten years;"

In academia, perhaps. Subroutines were 'expensive' in the real world for a long time. Many things we take for granted today were too expensive when 16KB was a lot of RAM. I was referring to common practice rather than academia.

"the whole array construct in postgres seems like it was probably a huge mistake"

In a cousin comment where I talk about how we might do things differently if we were starting from scratch, this is one of the biggest things I had in mind. As a column type it's plausible enough: a generic sequence type you can wrap around any other type. It seems a basic enough primitive that we should have it around. But probably everywhere else I'm using it, I really want a more native "table" type. But I can't really have that. I can kind of see it forming, I think, as the versions march on. But at least as of the version I've been using lately, it's not quite here yet.

In general SQL struggles with type recursion. While it's not hard to see why from a performance perspective you want an engine that doesn't have to deal with arbitrarily-sized values, it gets more and more limiting every year. I've gotten a distressing amount of mileage out of using the JSON types the major engines have been forced to incorporate to stay competitive just as "Look, I just need a little tree here... I can promise it'll stay bounded in size, but I don't want to encode that as a crapton of rows" or "hey, honestly, I've got a modestly complicated structure here that my clients will decode and breaking it out the Official SQL Way is just crazy". This is the thing that really gives me that 1970s Tech feel when I'm working with SQL, and while I appreciate the slow and steady lifting of the restrictions I would definitely like to use a tech that lifted them in a thoughtful, principled manner instead of a multi-decade drunkard's walk across the design landscape.
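
A minimal sketch of that "little tree" pattern, with an invented widgets table: the bounded JSON document lives in a single TEXT column, and the client, not SQL, decodes and walks it.

```python
import json
import sqlite3

# Stash a small, bounded JSON tree in a TEXT column instead of
# normalizing it into many rows; the client decodes it.
# Table and field names here are purely illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE widgets (id INTEGER PRIMARY KEY, config TEXT)")
config = {"theme": {"color": "blue", "font": "mono"}, "retries": 3}
con.execute("INSERT INTO widgets VALUES (1, ?)", (json.dumps(config),))

row = con.execute("SELECT config FROM widgets WHERE id = 1").fetchone()
tree = json.loads(row[0])
print(tree["theme"]["color"])   # the client walks the tree; SQL never sees its shape
```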


using subroutines was common practice when 16 kibibytes was a lot of core, it was common practice before fortran existed, it was common practice before people used assemblers, and it wasn't just common practice in academia but also in places like ibm, cdc, hp, burroughs, and librascope general precision

(it's true, though, that they were expensive)

from my point of view the problem with arrays, json, etc., is not the performance but the failures of orthogonality which require you to memorize a zillion special cases, which i think is what you're saying about 'a thoughtful, principled manner'

some of my own thoughts on what that might look like are at http://canonical.org/~kragen/binary-relations


I'm glad to see Chris Date in the paper.

Chris Date has been railing about no RDBMS being an RDBMS for decades.

They aren't "complaining" about Codd, any more than Einstein "complained" about Newton. They are making valid academic criticisms, discussing solutions, and lamenting that in the past 50 years those solutions have not been implemented in practice.


It does read kind of like a complaint to me.

The question the paper poses is ‘why’ are there no relational DBMSs. And it places a bunch of the blame, it seems, on the fact that Codd left some things ambiguous and the people behind SQL resolved that ambiguity in a way that weakened the ‘relationality’ of the result.

But you’re right, it’s possible to read this more as ‘blameless post-mortem’ five-whying.

Why aren’t there any RDBMSs?

- Because SQL won, and SQL is not relational

Why isn’t SQL relational?

- because its creators misunderstood Codd’s paper in these ways

Why was Codd’s paper open to misinterpretation?

- because Codd didn’t think of these particular complexities…

Which I guess is one way to follow the ‘why’ thread.

But it does slightly ignore the important ‘why did SQL win?’ branch…


As far as I know, SQL won because it was used by the first commercially successful relational database systems, and then other systems entering the market needed to support it.

It is a bit like JavaScript. Everyone knows it has its flaws and it is easy to imagine a better alternative, but it is "good enough".


sql won, as i understand the story, because ibm thought cobol programmers were too dumb to handle the relational algebra

i have to admit that i find sql a lot easier to read than the relational algebra, even though the latter is more aesthetically appealing

sql is kind of like golang, it makes your code a bit long-winded and tedious compared to a more crystalline minimal design, but this helps you avoid the temptation of being too clever and thus writing code you will have to debug


Which points to the blame for SQL not implementing Codd’s ideas right lying not with Codd being insufficiently precise, as claimed in the linked paper, but with a deliberate choice to simplify the relational model.


SQL tables are multisets (sets allowing duplicates) rather than sets as defined by the relational model. This was likely a deliberate decision to make implementations simpler and faster, since it avoids some uniqueness checks. For example, it is much simpler to implement UNION if you don't have to check for duplicates. But arguably this pushes some complexity onto the user, since duplicates lead to all kinds of problems.

On the other hand, three-valued logic is obviously more complex than regular Boolean logic. But I don't agree with the author that three-valued logic is in conflict with the relational model.
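
For what it's worth, SQL's UNION does deduplicate; it is UNION ALL, the cheaper bag version, that skips the uniqueness check. A quick illustration in SQLite (driven from Python):

```python
import sqlite3

# UNION follows set semantics (deduplicates); UNION ALL keeps the bag.
con = sqlite3.connect(":memory:")
a = "SELECT 1 AS x UNION ALL SELECT 2"   # relation with rows 1, 2
b = "SELECT 2 AS x UNION ALL SELECT 3"   # relation with rows 2, 3

set_union = con.execute(f"SELECT x FROM ({a}) UNION SELECT x FROM ({b})").fetchall()
bag_union = con.execute(f"SELECT x FROM ({a}) UNION ALL SELECT x FROM ({b})").fetchall()
print(set_union)   # three rows: 1, 2, 3
print(bag_union)   # four rows: 2 appears twice
```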


i would rather say to make the relational model more complex


SQL is the perfect example of "worse is better". SQL was an amazing invention for its time, but it has not grown and developed the way other languages have. There is basically no industry velocity in upgrading SQL the way there was with procedural languages.

SQL and C were both invented in the '70s. While C may not have grown as much as SQL has, new major languages that handle similar tasks to C have come into their own, willing to throw out old syntax or ideas where needed.

Now, one could argue that SQLite is the JavaScript of SQLs, and that T-SQL is the C# of SQL, and that PL/SQL is the COBOL of SQL, and that Postgres is the Java of SQL... but in all cases those C-derived languages were far more willing to experiment and break out of the narrow bounds set by their antecedents than the SQL variants were.

Obviously there are small and interesting experiments in the DB space, but none of those are the kind of industry-standardized improvements to the formula we see in the space of C-derived languages.


> Why did they invent NULL and introduce that third truth value, UNKNOWN?

I fundamentally don't understand how you can have all of the kinds of JOINs SQL has without a NULL.
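
Concretely, an outer join has to pad unmatched rows with something, and that something is NULL. A toy example in SQLite (via Python), with invented tables:

```python
import sqlite3

# A LEFT JOIN keeps every row from the left table; rows with no match
# on the right side get padded with NULL (None in Python).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people (name TEXT)")
con.execute("CREATE TABLE postnomials (name TEXT, suffix TEXT)")
con.executemany("INSERT INTO people VALUES (?)", [("Ada",), ("Bob",)])
con.execute("INSERT INTO postnomials VALUES ('Ada', 'PhD')")

rows = con.execute(
    "SELECT p.name, q.suffix FROM people p "
    "LEFT JOIN postnomials q ON p.name = q.name"
).fetchall()
print(rows)   # Bob has no postnomial, so SQL invents a NULL for him
```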


You can by using “narrow” tables (key value and key key). Download the slide notes from this presentation of how Relational.ai is doing it: https://www.slideshare.net/maxdemarzi/developer-intro-deckpo...


It seems like narrow tables solve having NULLs in the tables you store, but they do nothing about NULLs in the tables you create using, say, a LEFT JOIN. Like, if you create a database with Name, Postnomials, and Prenomials, some people don't have Postnomials or Prenomials, so even if you create three narrow tables, when you JOIN them all to form the full polite addresses, you'll end up with NULLs in the result of that JOIN.


It works a little different in “Rel” (the query language Relational.ai uses). You would create multiple definitions of what a “full polite address” is for each “case” of valid arguments/empty columns and use that going forward. A bit like a UNION without the same column width requirements.


That seems clumsy.

I would think the right approach to "SQL without LEFT JOIN" would be just to focus on making pulling down multiple related tables as distinct resultsets in a single query easier and have the client code work with a graph instead of hammering everything into a single tabular layout. Or leave the concept of "connect these two tables together and make them NULL where not applicable" as an exercise for the client.


Quite the opposite. The idea is to move as much of the business logic into the database. “Rel” definitions are meant to be written once and reused everywhere. Instead of letting the client decide different business logic every time, you capture and control it in one place.


Thank you so much. I was starting to read through it without understanding what the core problem was.


I'm pretty sure you can load a query result into an in-memory table [1] and query that, including generated indexes.

[1] https://dev.mysql.com/doc/refman/8.0/en/memory-storage-engin...


SQL is a wildly successful invention from nearly 50 years ago that remains extremely relevant today - whatever sort of coder you are, you're probably going to use it at some point. And if you get beyond the surface-level SQL stuff, you'll probably still find it pretty impressive in its expressiveness, despite its age. Is there anything else programming-wise from the 70s that's still as relevant today?

The OP paper is interesting but the complaint that SQL is not 'pure' enough reminds me of Project Xanadu's vision of 'pure' hypertext. https://en.wikipedia.org/wiki/Project_Xanadu

HTML won out over something like Xanadu and SQL won out over the 'pure relational algebra' language we never got because making practical compromises is very important.

My takeaway: designing something with the right mix of principled design and practical shortcuts around those principles is more important than rigidly sticking to principles.


I switched to EdgeDB a couple of years ago, and after writing EdgeQL I never want to go back to writing SQL queries again. SQL now feels very unexpressive and clumsy, but I didn't have that feeling before I had a chance to compare the two in practice.

Also, "HTML won" not because it made practical compromises, but because it was literally the only first-class option available in the browser, and as web platform has exploded everyone wanted to move to the web, so they had no other choice but to build their tools and products with what they got.


I really wish people would learn to accept that whatever won wasn't best. It was what won.

Things win for plenty of reasons other than being best. It's so hard for so many engineers to understand. And as a result they do all of these post-hoc justifications of lesser solutions because they've convinced themselves they must be the best "because they won".


True, but SQL has been around for 50 years; that must mean something.


s/something/network effects/


"good enough" isn't equal to "best"


What specifically do you like about EdgeQL? From my experience it seems to be popular only with people who don't ever care about the actual query plans and calculus underneath.


The number one thing I like is that EdgeQL as a language lets me convey clearly what exactly I want from the database. That's kind of the definition of "expressiveness" for me.

> that don't ever care about the actual query plans and calculus underneath

Of course, if you're being paid for this specifically, or the nature of your problem domain requires optimizing query plans, you should care. But in the vast majority of projects I've participated in, database query plans and performance are not even remotely a bottleneck. So between "unexpressive and low-level" and "expressive and high-level" it's reasonable to choose the latter.

Also, I have a belief that the EdgeDB engine produces much more efficient SQL than I would write myself, but I can neither confirm nor deny that.


Why on earth should any software developer care about the calculus underneath their database? That is surely only the domain of database developers, researchers, and hobbyists?


Because in the real world abstractions are leaky, and small changes in how the database is used can cause 10× performance changes.


I think it's important for people who take their profession seriously to know a little bit about how things work at different levels than just the one they are used to.


I guess almost every engineer I've known must be a database developer or a researcher or a hobbyist, then.


EdgeDB seems very promising. What I wonder is: does it run on top of Postgres only? Since the main innovation is a better query language than SQL, wouldn't it be great if it could be used on top of any database engine?


Right, my thoughts as well. I'd love to have file-only database like SQLite but with EdgeQL instead. But at the moment EdgeDB is built on top of Postgres. Not sure if there were any discussions about abstracting it away.


> Is there anything else programming-wise from the 70s thats still as relevant today?

Lisp is still relevant (for some definitions of relevant [I work in Clojure so it's very relevant for me]) and it's from 1960.


This is an example of a general (imho) fallacy pattern that also shows up with FP:

There's some branch of mathematics. Some people become familiar with that and like it for psychological reasons. Makes them feel good that they grokked it. Now, some subset of those people end up in positions where they can write software and academic papers. They notice that some problem space relevant to computing is kind of like that branch of mathematics they liked. They embark on a zealous mission to apply that branch of mathematics strictly in the context of the problem space.

The reason I think this is a fallacy is that there's nothing god-given about some branch of mathematics that makes it somehow pristine and golden and "better" than some ad-hoc solution hacked together by a hacker. Mathematics was also "hacked together"; it was just done before computers existed.

So these people are really on a kind of circular journey -- they're taking someone's definition of a VM that is called Mathematics, that typically was derived from ordinary principles present in the natural world, and having the bright idea that they can implement that VM in code and that they will be much more "proper" than if someone else had instead begun with those same ordinary principles and gone straight to software.

Put another way: SQL doesn't need relational algebra. You could just as well have derived relational algebra from SQL as the other way around. They both come from the ordinary need to filter and reason about collections of records about things.


The useful thing about mathematics is that it gives you tools to prove that your solution will work for all problems of a certain class. Hacked together code usually doesn't have that property, and if it does then it just is mathematics, whether you want to call it that or not.


The problem with SQL is it is based on relational algebra, just not consistently.

SQL allows you to insert a duplicate row, or to update a row such that it becomes identical to another row. But then it is not possible to delete or change that row again without changing or deleting all the duplicates, because the query language assumes rows are unique. This is obviously a footgun and not useful behavior.
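
A quick demonstration of the footgun in SQLite (via Python). Note that SQLite's hidden rowid can in fact single out one copy, but portable SQL that only addresses rows by their values cannot:

```python
import sqlite3

# Once two identical rows exist, a value-based DELETE (or UPDATE)
# can only address both at once.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a INTEGER, b INTEGER)")  # no key: duplicates allowed
con.execute("INSERT INTO t VALUES (1, 2)")
con.execute("INSERT INTO t VALUES (1, 2)")            # exact duplicate

con.execute("DELETE FROM t WHERE a = 1 AND b = 2")    # meant to remove "one" row
remaining = con.execute("SELECT count(*) FROM t").fetchone()[0]
print(remaining)   # 0 -- both copies are gone; nothing distinguished them
```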


Yes. After shooting off nine of my toes, I decided to save the last one by always adding a unique something to every table, an auto_increment if nothing else seemed appropriate.


Sure, you don't need a mathematical theory of your software, if you don't care if your results are correct, and only want your software to meet the bar of "all reported bugs have been fixed and tests passed."

Meanwhile, most of the stuff people actually love (and don't want to understand) about their software was invented by people pursuing the mathematical approach.

And those people get away with hacky code because most of the software core is rock-solid and mathematically valid, leaving an "error budget" for sloppy hacks.

But if it were just sloppy hacks all the way down, your gigabytes of source code would not work at all.

You have the luxury of not knowing what I know; that algebraic rigor, while frustrating, probably prevented bugs. And mathematics, while arcane and incomprehensible to you, prevents bugs. You don't want the truth because in places you don't talk about at standup, you want mathematicians writing on that chalkboard. You need them at that chalkboard. We use words like monad, monoid, endofunctor. We use these words as the backbone of a life spent proving something. You use them as a punchline.


"A Few Good Monads" -- coming to theaters soon!

Starring Jack Nicholson as Kernel Jessup


the problem with SQL isn't that it doesn't hew to this or that mathematical framework. the problem with SQL is that it's irregular, doesn't compose well, and has a very baroque semantics due to the inclusion of NULLs, among other things.

using a well-thought out model eases or eliminates some of these very real human problems.

you've got it backwards


we who use these databases for real problems need LEFT OUTER JOIN and to use LEFT OUTER JOIN you need NULL.


There are other ways, like having a null with more regular semantics and better composability (like Rust's Option type).

Or imagine this data:

Students:

    Name: Bob
    Name: Alice

StudentsClasses:

    Student: Bob, class: cryptography
    Student: Bob, class: databases

If you can have arrays (or relations) as a field in your result set, then

    Students LEFT OUTER JOIN StudentsClasses

could return

    Student: Bob, classes: [cryptography, databases]
    Student: Alice, classes: []
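
That nested shape can be approximated today with an aggregate: Postgres would use array_agg; a rough SQLite stand-in (via Python, using group_concat) shows both the idea and how NULL sneaks back in for Alice:

```python
import sqlite3

# group_concat as a stand-in for a real array/relation-valued column.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students (name TEXT)")
con.execute("CREATE TABLE student_classes (student TEXT, class TEXT)")
con.executemany("INSERT INTO students VALUES (?)", [("Bob",), ("Alice",)])
con.executemany("INSERT INTO student_classes VALUES (?, ?)",
                [("Bob", "cryptography"), ("Bob", "databases")])

rows = con.execute(
    "SELECT s.name, group_concat(c.class) FROM students s "
    "LEFT JOIN student_classes c ON s.name = c.student "
    "GROUP BY s.name ORDER BY s.name"
).fetchall()
print(rows)   # Alice's "array" comes back as NULL rather than [] -- NULL sneaks in anyway
```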


This violates first normal form. So if you want to query this result further you need a more complex language than relational algebra because you need to be able to query nested structures.


1nf is surprisingly tricky to define.

If you consider it as a single atomic list then it is still in 1nf.

It's only not in 1nf with respect to future queries IF you want to subselect from within the list values.


1NF is defined as eliminating nested relations. I assumed "[ cryptography, databases]" in the example represent a nested relation.

1NF is not defined in terms of what you want to do with the data; it is defined in terms of domains, i.e. data types. If a column allows nested tables, it violates 1NF.

"Atomic" in the context of 1NF means atomic wrt. the relational algebra.

Eg. if we have:

   Student: Bob, classes: [ cryptography, databases]
   Student: Alice, classes: [] 

Is it possible to select from this all students who have cryptography as a class, using the standard relational operators (project, filter, join)?


I think we're mostly in agreement!

You're right, it's not possible to select from this all students who have cryptography as a class. No more than if we did concat('cryptography', ',', 'databases').


The normal forms are about how data is stored, rather than how it looks after a query. After all, most joins will violate second normal form.

But yes, SQL doesn't work well with this solution.


what if I gave you a nice little language where you could easily define transforms on tables and explicitly specify 'a where a.key != b.key'. and a type system that would let you define null and all its various arithmetics.

sure, you'd say, but SQL already does all that

but you'd have well-thought-out type construction, and a standard way of defining abstractions and transformations on data. maybe it would be general enough that you would be tempted to go full-on stored procedure and forget about having to juggle two languages and their interface

still no?


Paraphrasing the complaint doesn't negate its criticisms.


Tradeoffs were made, as always. Although it is fun to remind people that since SQL isn't relational that SQL databases are NoSQL databases.



I don't know specifically how Dated it is, but https://people.csail.mit.edu/jaffer/slib/Relational-Database... might also be of interest.


> how Dated it is

Is this… a Chris Date pun?

You should have said it “Codd also be of interest”.


Finally, Chris Date's name is mentioned in this thread. It was curious that Darwen's pdf mentions him only in asides. Did they have a falling out?

https://cs.stackexchange.com/questions/99350/did-date-and-da...


He is listed in the acknowledgements ...

> ... Chris Date saw early drafts and suggested several improvements.


Ah. Thanks, didn't see that.


Also mentioned in the paper as producing a corresponding paper at the same time.


Funny coincidence. I just picked up Database in Depth from AbeBooks a week ago... ChatGPT brought his book up in a fever dream QA session regarding database normalization and I figured I'd give it a look.


I've come to believe that relational is the wrong choice for most apps.

Most people choose a relational db without even realizing what relational means and what the tradeoffs are.

Almost every app dev would prefer that their app react to changes in the entities and update its UI accordingly, incrementally computing new query results. This is non-trivial in the relational model.

The relational model is a bunch of constraints such that relational algebra can be used by the system to re-arrange the query plan and get the same results.

Your queries are optimized, yes, but if you want to stream changes, then you probably want a different query plan altogether. And it's not a matter of just making SQL cacheable and streamable, as with incremental view maintenance - you probably want to run different queries altogether. Most people aren't building analytical reporting apps with SQL. The database size for a single user or team in most apps is small enough that aggregations can just be done client-side.

I think a better way would be to have a database that knows about all queries in your app, and optimizes collectively. And instead of relying on the database to do it, give users the tools to build query plans themselves. And also, throw away the idea of relations and just make all query results reference the underlying entities. Instead of complaining about the "ORM mismatch" with relations, get rid of the relations, and embrace objects.


The relational model (and generally working at the level of sets/collections, instead of the level of individual values/objects) actually makes it easier to have this kind of incremental computation in a consistent way, I think.

There's a bunch of work being done on making relational systems work this way. Some interesting reading:

- https://www.scattered-thoughts.net/writing/an-opinionated-ma...

- https://materialize.com/ which is built on https://timelydataflow.github.io/differential-dataflow/, which has a lot of research behind it

- Which also can be a compilation target for Datalog: https://github.com/vmware/differential-datalog

- Some prototype work on building UI systems in exactly the way you describe using a relational approach: https://riffle.systems/essays/prelude/ (and HN discussion: https://news.ycombinator.com/item?id=30530120)

(There's a lot more too -- I have a hobby interest in this space, so I have a small collection of links)


Man, this is such a great list of links. I have had these ideas floating around my head for nearly a decade now and have somewhat of a radical goal in mind. This has given me some much needed food for thought. Thank you, truly -- and add my name to the list of people who would love to see your other links


Thanks great links. I've been following some of those projects for a while.

I haven't dived into the differential dataflow stuff yet. I think a lot of these projects may be handicapped though by trying to work with existing SQL queries and user SQL knowledge which makes things more complex than they need to be.


> I think a lot of these projects may be handicapped though by trying to work with existing SQL queries and user SQL knowledge which makes things more complex than they need to be.

I could not agree more!

I'm sorry to report I actually peeked through your HN submissions and I think you and I are seeing the exact same problem and trying to come up with a solution... from markup languages to visual programming to database internals to issues with electron, these are all things I've been spending time on! It's uncanny how much our interests overlap

Personally I'm debating dedicating my time to this exclusively as I think it's (a) incredibly exciting and (b) a massive opportunity. It will probably be 6-9 months before I pull the trigger and start a company (depending on how things develop between now and then), but while I have your attention, if you would be interested in reconnecting at that point, I'd love to get your contact info--my HN email alias is on my profile


I'd love to see the rest of your links. I'm interested in this space as well.


> have a database that knows about all queries in your app, and optimizes collectively

I’ve had this idea floating around in my head for a while: if you could take all the queries and the schema, you should be able to compile that to a specialized binary that only does those queries. I have no idea how to accomplish that but I bet you could do a lot of nifty optimization. Migrations get hard but I wonder how good it would be


This is called DBToaster

https://dbtoaster.github.io/

"DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views"

http://vldb.org/pvldb/vol5/p968_yanifahmad_vldb2012.pdf


Thanks for the link! This looks super cool

edit: so this only does the read side right? It doesn't look like this supports mutations


This has been prototyped several times over the decades. Migration is one significant limitation as you note, but the other is that the metaprogramming required to code gen a highly optimized database for a specific data model and workload is insanely complex, and someone has to write and maintain that with the very low defect rate people expect of databases.

The performance is excellent but modern databases tend to be bandwidth-bound for many workloads anyway, so the benefit will be significantly limited.


Another limitation is debuggability. I frequently run all kinds of queries against databases to figure out what's going on, and compiling in the set of queries in advance could significantly limit your visibility into the data.


it's only a performance problem. there wouldn't be any reason to forbid arbitrary queries - they just might not have pre-built indices that make them go as fast as they possibly could


> there wouldn't be any reason to forbid arbitrary queries

There is an argument for this, it's just probably not worth the tradeoff. The security benefit seems somewhat reasonable. Feels like the same kind of tradeoff as a unikernel.


Makes me think of the current frontend trend of compiled reactive ui frameworks like Svelte.

Also the SQLite bytecode engine. https://www.sqlite.org/opcode.html


Perhaps a JIT for that bytecode isn't really that different


> Almost every app dev would prefer that their app reacts to changes in the entities, and updates it UI accordingly.

Absolutely. I’ve been going through this tech research process now as I need near-real-time or at least reactivity and I just scratch my head and think “why is this so hard? doesn’t everyone need this stuff these days?”

> This is non-trivial in the relational model.

Yes, but I don’t believe that it has almost anything to do with the model, and almost everything to do with choices in RDBMSs like Postgres, which are slow moving tankers.

All the abstractions and procedures for materializing views (ie store and run the queries in a push instead of pull fashion) are both theoretically and practically well understood, and half of the support is already there.

WAL + CDC is virtually already the standard, it’s just clunky to use. The missing pieces seem to be more about multiplexing subscriptions and the implications for downstream protocols (probably needs wire protocol changes?).

In either case, I very much think the relational model should not be considered mutually exclusive from reactivity, and the implications of assuming that may be throwing out millions of holy babies with a little bit of bath water.


Incremental view maintenance is the general problem and there was recently an extension or feature released to help with this: https://wiki.postgresql.org/wiki/Incremental_View_Maintenanc...

The problem is that I think a lot of people will do these big joins and serve them via graphql or a deep json object...they get all the data they need...but then to update this query may be inefficient depending on the query plan. The query plan won't optimize for fine-grained reactivity of certain parts of the result, nor does it know about other queries that will be run which prevents intelligent caching of sub-queries.


> Almost every app dev would prefer that their app reacts to changes in the entities, and updates it UI accordingly. Incrementally computing new query results. This is non-trivial in the relational model.

I assume you mean detecting data changes and not so much broadcasting (messaging). Views are designed to capture what entity changes you care about. Not all RDBMSs support indexed views, and Postgres unfortunately falls into this category, but views are a best practice.

Perhaps the missing piece is that, based on the views and the frequency of view calls, RDBMSs automatically curate and maintain table indexes.


> I've come to believe that relational is the wrong choice for most apps.

The benefit of the relational model is that every operation can return a table. Right? That makes it easy to sequence all operations together, and easy to understand what they are doing, and thus avoid errors.


SQL’s syntax is really ugly, and rather obviously inspired by COBOL, which is unsurprising for something invented by IBM in the 1970s and developed on IBM mainframes. Yet so many people who think poorly of COBOL’s syntax don’t apply the same judgement to its offspring SQL.


I don't know of anybody that actually _likes_ SQL's syntax, it's just that the value of it being mostly universal exceeds the syntactic downsides.


SQL syntax is awful, but I do think many SQL programmers are aware of that.

One problem is that the SQL standard is very loosely followed by implementors, so the syntax and semantics of each implementation can be wildly different.

For instance, major implementations do not agree on whether double quotes denote string literals or identifiers: "test" is either the literal string 'test' or it is (e.g.) the value of the column called "test".
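This ambiguity is easy to see from Python, since SQLite (in its default build, including the one bundled with CPython) keeps a documented compatibility fallback: double quotes name an identifier when one exists, and silently become a string literal when one doesn't. A small sketch, with table and column names made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x TEXT)")
con.execute("INSERT INTO t VALUES ('hello')")

# Standard SQL: double quotes are identifiers, so "x" means the column x.
assert con.execute('SELECT "x" FROM t').fetchone() == ("hello",)

# SQLite's fallback: "nosuch" names no column, so it silently becomes
# the string literal 'nosuch' instead of raising an error.
assert con.execute('SELECT "nosuch" FROM t').fetchone() == ("nosuch",)
```

(SQLite's own documentation calls this a misfeature kept for MySQL compatibility; builds compiled with SQLITE_DQS=0 reject the second query instead.)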

I think this contributes in part to the difficulty of writing down a revised language to replace SQL.

Compare this to e.g. the C language, standardised at roughly the same time as SQL. In C implementations there is much stronger agreement about what the language looks like. I'm not sure why this is the case.


> Compare this to e.g. the C language, standardised at roughly the same time as SQL. In C implementations there is much stronger agreement about what the language looks like. I'm not sure why this is the case.

Some of this is its COBOL heritage - in COBOL, every vendor adds umpteen keywords to the core language for all their vendor-specific extensions - and even the standard itself is full of optional features, all of which add new keywords too. SQL vendors adding lots of new keywords for all their extensions is just being true to the COBOL heritage.

The quotes issue is a bit different though. I speculate that what happened there is that a lot of people who came to SQL from other languages got confused by the whole “double quotes mean identifiers” thing, and so there was pressure on vendors to deviate from the standard and make them mean strings instead (like in most other languages), and invent some other identifier quote syntax. Some vendors succumbed to this pressure, others didn’t, yet others did at first but later changed their minds - at which point you end up with some config option (often changeable at runtime on a per-session basis) controlling what double quotes mean


Libraries are the main driver for standardization — end-logic doesn’t move between systems very often anyways, and especially with databases where you typically don’t ship SQL to random environments and hope it works (they are largely designed as centralized systems after all).

But the language has no real extension system defined, so everything new gets defined as new language keywords instead of eg a new library/function, which arbitrarily get added as RDBMS’s try to differentiate their feature set. What extension systems do exist typically hook directly into the engine itself, are unique to the DB and have little to no hope of being “shareable”

So the standardization inevitably is less a standard and more of a vague guideline.


We have Stockholm Syndrome because we don't have modern languages to replace it (except ORMs, which address some issues, but not the big relational ideas). No one thinks SQL is the best possible way to write a query. It takes challenging but elegant ideas about sets and expresses them in a language that demos well in the shop window, but obfuscates and interferes in non-toy queries.


To the extent SQL gets a pass, it's because it packs a lot more power into that ugly syntax than COBOL does.


tl;dr The chief complaint is that SQL DBMSs support more features than a strict relational model, though you don't have to use those features.

1. Column names. SQL is very flexible so you can do things like SELECT * from joined tables. Perhaps a bit sloppy, but not bizarre IMO.

2. NULL. This is purely a performance optimization. You can have an optional foreign key to another table without introducing an entirely new table.

3. ISBL and BS12 were obscure and had proprietary IBM baggage.

If you want to use a SQL DBMS with unique column names and no NULLs, you can.

If I were going to write an article, I would include a legit deficiency, like no convenient anti-join.
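On point 2 (NULL as an optimization), here's a hedged sketch with Python's sqlite3 (schema and names invented) contrasting a nullable foreign key with the NULL-free alternative of a separate relationship table:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# With NULL: one table; unmarried people simply have spouse_id = NULL.
con.execute("""CREATE TABLE person (
    id INTEGER PRIMARY KEY,
    spouse_id INTEGER REFERENCES person(id))""")
con.execute("INSERT INTO person VALUES (1, NULL), (2, 3), (3, 2)")

# Without NULL: the optional relationship moves into its own table,
# at the cost of an extra table (and joins) to get the same answers.
con.execute("CREATE TABLE marriage (person_id INTEGER, spouse_id INTEGER)")
con.execute("INSERT INTO marriage VALUES (2, 3), (3, 2)")

nullable = con.execute(
    "SELECT id FROM person WHERE spouse_id IS NOT NULL ORDER BY id").fetchall()
separate = con.execute(
    "SELECT person_id FROM marriage ORDER BY person_id").fetchall()
assert nullable == separate == [(2,), (3,)]
```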


Happy to hear Darwen dislikes 3-value logic. A language where x=x can return something equivalent to false in a common use-case is wrong.


> A language where x=x can return something equivalent to false in a common use-case is wrong.

Then almost all common languages are wrong, if you consider floating point NaNs to be a “common use-case” - and surely they must be, since it is easy to produce them using ordinary arithmetic operations.
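The parallel is easy to check from Python: IEEE 754 NaN breaks reflexivity in the host language, and SQL's NULL does the same thing via three-valued logic (sketch uses the stdlib sqlite3 module):

```python
import math
import sqlite3

# IEEE 754: NaN is not equal to itself in ordinary languages.
nan = float("nan")
assert nan != nan
assert math.isnan(nan)

# SQL: NULL = NULL is not false, it is NULL (UNKNOWN) - three-valued logic.
con = sqlite3.connect(":memory:")
assert con.execute("SELECT NULL = NULL").fetchone()[0] is None

# Most host-language nulls, by contrast, do compare equal to themselves.
assert (None == None)
```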


Maybe you do different programming from me but I find NULLs far more common than NaNs.


At my day-job, I almost never use floating point, so NaNs don’t come up. The one exception to that is JavaScript, where NaNs pop up significantly more than in most other languages - parseInt() will return NaN if you give it an invalid number, whereas in most other languages a function called “parseInt” wouldn’t return floating point.

I work on a lot of code that uses SQL NULLs, but most of the time I don’t encounter them directly, because I’m using some ORM which translates them to Java nulls (or Go nil or whatever), and many query generation frameworks handle the weirdness around querying for nulls (IS NULL vs = NULL) automatically. Most programming languages don’t give their nulls the weird equality semantics that SQL nulls have


> The one exception to that is JavaScript, where NaNs pop up significantly more than in most other languages - parseInt() will return a NaN if you give it an invalid number, most other languages a function called “parseInt” wouldn’t return floating point.

Ahh, every time this conversation comes up about how SQL NULLs violate the reflexive property of equality, people are like "so do NaNs" and I'm wondering "who are these people who are dealing with NaNs all the time?".

I'm mostly a back-end and DevOps guy, so I don't write that much JS. I didn't know that JS used NaN the way other languages would use Null or exceptions for numerical operations.

...

Also that's terrible.


I know pretty much nothing about relational algebra, but I have written and read a reasonable amount of SQL. And I find 3-value logic very handy, even if I occasionally see these arguments that it is somehow wrong. Can you ELI5 why I should dislike NULL, and what should I do with my database when I do not know the value of an attribute or the attribute is not applicable for the row? (The answer to the latter might be to design a "better" schema? To that my response is that there are other measures of "goodness" that may be more important than ideological purity. Say, you may want to have all your data in one table instead of creating a new table for a thousand different subcases.)


The complaints about NULL are because its semantics are inconsistent. It is used for a number of different purposes:

– a missing value (existing but unknown), for example a person's birth date that happens to be unknown

– the non-applicability of a value, for example the spouse of a person who never married

– a combination of the above, for example the death date of a person where it is unknown whether they have died yet (but if it was known that they are alive, the death date would be NULL all the same)

– an empty set (see for example https://dbfiddle.uk/UJJgVCZ_)

Depending on the context in an SQL expression, NULL sometimes behaves like an unknown value (propagates upwards in expressions), sometimes like a non-value (doesn't match any conditions), sometimes like an empty set. This makes it unintuitive and difficult to reason about.

You are asking about what else to use. In principle you could use placeholder values (like using the date value 9999-12-31 to mean "hasn't died yet"), but that has its issues as well. The bottom line is, there is really no good alternative in SQL. And that's what people dislike about SQL. Because you could imagine a database language with a standard NULL placeholder value with consistent and straightforward semantics.
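The inconsistency can be seen directly; a small sketch with Python's sqlite3 (table and data invented for illustration) showing NULL acting as an unknown value, as a non-value, and as something aggregates simply skip:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE person (name TEXT, death_date TEXT)")
cur.executemany("INSERT INTO person VALUES (?, ?)",
                [("alice", "2020-01-01"), ("bob", None)])

# 1. In expressions, NULL propagates like an unknown value:
assert cur.execute("SELECT NULL = NULL").fetchone()[0] is None

# 2. In WHERE, it acts like a non-value: bob matches *neither* filter.
eq = cur.execute("SELECT count(*) FROM person "
                 "WHERE death_date = '2020-01-01'").fetchone()[0]
ne = cur.execute("SELECT count(*) FROM person "
                 "WHERE death_date <> '2020-01-01'").fetchone()[0]
assert (eq, ne) == (1, 0)

# 3. Aggregates silently skip it, like an empty set:
assert cur.execute("SELECT count(death_date) FROM person").fetchone()[0] == 1
```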


Right. If they needed NULLs, then imho full algebraic datatype tagged unions would be the natural extension of these. Now, obviously you can implement those in SQL using one column per type and one column as your discriminator tag, but that would be excruciating, especially considering how weak SQL generally is about standard libraries and reusability outside of the very fixed concepts like Views and Functions. Then you could properly define the semantics of your "missing value" as appropriate to the data.

Various SQL servers have accepted the need to support stuffing multiple data types into a single column - MSSQL has SQL_VARIANT, Oracle has ANYDATA, SQLite has... well, everything in SQLite. So between that and the existence of NULL, the ship has already sailed. May as well make it formal and schema-defined behavior using tagged unions.

As the rest of the software industry has figured out, `Maybe<T>`s are better than NULLs.


Isn't a placeholder value clearly worse? If I want to find a list of people that died the same day, I don't want to include people whose death dates are unknown in one big group. And using a placeholder value for people who haven't died yet is almost asking for the world to end on that date.


The placeholder value would not be a valid value for those operations, and you would get a parse-time error when attempting to group or otherwise match on values where you didn't exclude the possibility of the placeholder value. Like how in some programming languages you get a compile-time error on `foo.bar()` when `foo` may be null, but for example `if (foo != null) { foo.bar(); }` compiles, because within the `if` body `foo` is now non-optional.

Similarly, in a database language the column/field references would be typed as optional or non-optional, and you are only allowed to perform value-related operations on non-optional references. And an optional reference can be made non-optional by first passing the data to a condition equivalent to SQL's `IS NOT NULL`, for example. Or for operations that can support NULL (like maybe sorting), those would have a different form for supporting NULLs. For example, plain `ORDER BY` would cause a parse-time error when applied to a nullable column where NULL has not been filtered away, but `ORDER BY ... NULLS FIRST` (or `LAST`) would be applicable to nullable columns. Similarly for grouping, there would be options if you do want to include a NULL group, but by default it wouldn't type-check.

But you wouldn't get the nonintuitive behavior of SQL where NULLs sometimes act like NaNs, for example `NULL not in ('foo', 'bar')` evaluating to false, and `NULL in (NULL, 'foo', 'bar')` also evaluating to false.
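Those two examples can be checked directly; a quick sketch with Python's sqlite3, showing that both expressions actually come back as NULL (UNKNOWN), which a WHERE clause then treats as false:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Both expressions evaluate to NULL (UNKNOWN), not false:
row = con.execute(
    "SELECT NULL NOT IN ('foo', 'bar'), NULL IN (NULL, 'foo', 'bar')"
).fetchone()
assert row == (None, None)

# ...and WHERE treats UNKNOWN as false, so neither condition matches anything:
assert con.execute(
    "SELECT count(*) FROM (SELECT 1) WHERE NULL NOT IN ('foo', 'bar')"
).fetchone()[0] == 0
```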


You have another column which is a flag giving the state of the attribute. Note that "not known" and "not valid for the row" are two different states. If you set the attribute to NULL, which of the two do you mean?

Actually you give the answer for "attribute not applicable" yourself - and you don't get a thousand different subcases. The measure of goodness is whether your db model fits the real-world data.


So I end up having values in the actual column that may be complete junk and I need to check another column to know that? I would expect that to cause trouble sooner or later.


Right. Attempting that workflow would be crazy without language support.

But at the same time, proper languages that have operator overloading and custom types make that kind of thing trivial - look at how many languages have a Maybe<T> these days, which is basically automating that process.

The fact that doing it in SQL would be suicidal shows that, while SQL was a brilliant language for the '70s... it has not evolved nearly enough.


Actually, systems that don't crutch on the "Law of Excluded Middle" are more robust and realistic than ones that do.


May interest you https://www.dbdebunk.com/


I poked around this site and it's mostly Fabian Pascal ranting about how everybody else is doing things wrong and are also idiots.


Pascal is Codd&Date's unwanted overeager hype man. I appreciate that he cares, and he is smart, but he needs someone to make him presentable.




