Hacker News
Unreasonable Effectiveness of SQL (couchbase.com)
166 points by mariuz on April 4, 2019 | 72 comments



The relational model is awesome. SQL is one of the worst languages I have ever encountered. The syntax is awkward and unwieldy. The only unit of abstraction in SQL itself is the VIEW. I end up with so many repeated patterns in SQL it's mind-numbing.

I would love to see lisp-style (AST) macros applied to the relational model. Here's an example that might have an obvious solution (please share), though nothing comes to mind off the top of my head. Find the id of every row having the minimum value per group-by aggregation. I have to write a subquery to do this, but the transformation is quite regular and predictable. Where's my macro?
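For concreteness, the subquery version alluded to might look like this (a sketch with a hypothetical table t(id, grp, val)):

    select id
    from t x
    where val = (select min(val) from t where grp = x.grp);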


I believe some of the window functions[0] (specifically, row_number) will do what you want. Honestly though, these might not be a readability improvement over subqueries.

[0] https://www.postgresql.org/docs/11/functions-window.html

Edit: Just tested it, it works but I wouldn't call it pretty.

    create table testing(id int, grp int, other varchar2(255));
    
    insert into testing values (1, 1, '');
    insert into testing values (2, 1, '');
    insert into testing values (3, 1, '');
    insert into testing values (9, 2, '');
    insert into testing values (10, 2, '');
    
    
    with cte as (
      select grp, id, row_number() over (partition by grp order by id) as rn from testing
    )
    select grp, id from cte where rn = 1;


Another solution for PG which i've used many times:

    select distinct on (grp) grp, id
    from testing
    order by grp, id;


Neat! It doesn't work on Oracle, but you implied that it's PostgreSQL-specific so I somewhat expected that. It's a lot cleaner than mine and, even though it's not something I've seen before, it's fairly obvious what it's going to do.

When I did my testing I did it on Oracle since I have it available at work. For my non-work projects I use PostgreSQL and, for the most part, find that I can flip between the two databases without too much trouble. Of course, this also means that I don't know about some PostgreSQL niceties like what you showed.


Very true, it's a nice PG-specific extension of the SQL standard. It would be nice if some of these types of extensions made it back into the standard to increase the likelihood that other DBs you use would also support them.

It's syntactic sugar, but it is a whole lot more readable in some cases than a window function. It's nice to not need a subquery or CTE to express it.


Holy shit. Thank you!


I think what paddy_m complains about is SQL’s tendency to require copy-paste programming.

For example, a phrase like row_number() over (partition by grp order by id) might recur in many, many of your stored procedures, but there's no way to write it only once in SQL.

The same thing effectively applies to that common table expression. There, you can create a view, and just write

  select grp, id from grouped_tests where rn = 1;
but many SQL query planners won't optimize queries that well across views, so you end up copy-pasting the CTE, or, worse, not even doing that because you have to write an even uglier query to get decent performance.
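For reference, a sketch of what that hypothetical grouped_tests view could look like, built on the testing table from upthread:

    create view grouped_tests as
    select grp, id,
           row_number() over (partition by grp order by id) as rn
    from testing;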


You can also express this example using the first_value window function:

    select distinct grp, first_value(id) over (partition by grp order by id) as rn from testing;


The only challenge here is that you can't bring out other fields from the table.

Building on my previous example:

    insert into testing values (8, 2, 'stuff');
    select distinct grp, first_value(id) over (partition by grp order by id) as rn, other from testing;

This returns two rows for grp = 2, whereas adding the 'other' column to my first attempt (both in the CTE and the main query) works as expected and only returns one row for each grp.

This would work properly within the CTE, but then you're joining to the CTE instead of just querying it. There are many ways to solve this problem.

My ideal solution would be something like this that gets rid of the CTE:

    select grp, id, row_number() over (partition by grp order by id) as rn, other from testing where rn = 1

But that doesn't work: Oracle doesn't allow window functions in the WHERE clause, and you can't reference the column alias there anyway.
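As an aside, some engines (Teradata, Snowflake, BigQuery, DuckDB; not Oracle or PostgreSQL) added a QUALIFY clause that expresses exactly this without a CTE; a sketch:

    select grp, id, other
    from testing
    qualify row_number() over (partition by grp order by id) = 1;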


Bear in mind that CTEs act as an optimization fence, so the filter (rn = 1) won't be pushed into the CTE. The CTE will be materialized and only then will the filter be applied. I believe pg12 will change this behaviour.


That is only in Postgres. The optimization fence is not part of the standard; it was only added in Postgres to simplify the implementation, and is likely to be removed soon (version 12).

https://www.depesz.com/2019/02/19/waiting-for-postgresql-12-...
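Per that post, PostgreSQL 12 makes the fence opt-in; a sketch on the testing example from upthread:

    with cte as not materialized (
      select grp, id, row_number() over (partition by grp order by id) as rn
      from testing
    )
    select grp, id from cte where rn = 1;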


I don't have time today to try to build it, but that challenge seems like a textbook use of window partitioning.

https://www.postgresql.org/docs/9.1/tutorial-window.html


Thanks for the responses, looks like I have some digging to do. My point remains, though, that I don't know of a way to write user-extensible syntax in SQL. I can get by fine in most languages without user-extensible syntax (especially Python), but the missing feature really hurts in SQL.


This isn't ideal but SQL got a lot less mind-numbing for me when I got decently good at vim. Maybe SQL doesn't have macros but vim macros, dot, and column mode reduced my repetitive typing by a lot. Starting with a tool-generated list of column names often helped too.


The downside of that is that you end up with heaps and heaps of buggy queries and SQL code resulting in a maintenance nightmare.


Are you saying that about SQL in general? I end up with the same queries as if I'd typed out the SQL any other way.

Or are you saying that efficient editing makes people forget about writing clean, concise code? I don't find that to be the case.


There are other abstraction ideas like synonyms and functions in most engines.

    select top(1) with ties
      id
    from
      table
    order by
      row_number() over (order by value asc)

In a meeting, but is that the gist?


Depending on the version, you might be served by MIN_BY(x, y) or MAX_BY(x, y), which returns the value of x associated with the minimum (maximum) value of y.
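A sketch in an engine that has them (e.g. Trino or DuckDB), with a hypothetical val column to minimize over:

    select grp, min_by(id, val) as id_at_min_val
    from testing
    group by grp;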


Not sure I understand properly, but maybe this:

    select groupbycol, array_agg(id)
    from table
    group by groupbycol
    having avg(minavgcol) > T


SQL has a lot to recommend it; we've basically built a company around the effectiveness of SQL as a way to interact with data. I will say that it's pretty bad at code re-use, and `with` clauses are pretty awkward for breaking down a complicated analysis into small stages.

Kusto/Azure Data Explorer [1] is a really interesting new query language, focused on analytics. It fixes a lot of the things that are awkward about SQL. A simple example:

  StormEvents 
  | where StartTime >= datetime(2007-11-01) and StartTime < datetime(2007-12-01)
  | where State == "FLORIDA"  
  | count 
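
For comparison, a rough standard-SQL rendering of the same query, assuming the same StormEvents table:

    SELECT count(*)
    FROM StormEvents
    WHERE StartTime >= '2007-11-01'
      AND StartTime < '2007-12-01'
      AND State = 'FLORIDA';
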
[1] https://docs.microsoft.com/en-us/azure/kusto/query/


It's the unreasonable effectiveness of a language and an approach that has some real mathematical backing behind it (relational algebra).

Lisps are unreasonably effective because they have lambda calculus behind them. The Hindley-Milner type system and linear types, both mathematically sound, are a large part of Rust's effectiveness and success, in my eyes.

I still wish SQL read a bit more set-theoretic, even if less English-like. It would make it less quirky and more composable.


> "Lisps are unreasonably effective because they have lambda calculus behind them."

I don't really buy that. CL remains unreasonably effective even when used by somebody using it in an imperative manner. And I think the majority of CL code out there is truly multi-paradigm, not functional. I think most of the unreasonable effectiveness comes from the relationship between s-expressions and cons.


I could see the unreasonable effectiveness of a Lisp (CL or Scheme) beginning with the elegance and programmability that comes from being based on the lambda calculus. But to me it feels like the power of CL comes from having acquired a variety of features that can exactly solve day-to-day problems and edge cases.

Take the condition system: it first struck me as weird and unnecessarily complicated until I encountered the use-case that conditions exactly solve.

I think that comes from engineers using the language being the same engineers who can add features to the language. It's an extreme form of dog-fooding.

It's like a beautiful mathematical core that grew a lot of scar tissue to handle worldly edge cases.

EDIT: because I used Slack formatting, not Hackernews formatting.


Being imperative / functional is sort of orthogonal. The s-expressions that represent everything give you an immense power of composability, both in functions and in macros. You can define your own language in terms of Lisp in a way you can't in an Algol derivative.

I suppose that s-expressions, and the ways you transform them when they represent a function application (that is, basically always), critically depend on lambda calculus. I may be wrong, though.


As I understand it, lambda calculus is strictly functional. I also think the relationship between s-expressions and lambda calculus is tenuous at best. If anything, it's s-expressions and lambda calculus that are orthogonal. You could use s-expressions as your notation for lambda calculus, but you could just as well not.


You are correct. S-expressions are a serialization format for trees and lambda calculus is a computational formalism. There is no inherent connection between them. You can have either one without the other.

However, the two together make a really magical combination because each one is a sort of local maximum in its respective design space.


Of course. Haskell is even more faithful to lambda calculus than Lisp (because it has automatic currying and no side effects) but it doesn't use s-expressions at all.


So they have inherited their unreasonable effectiveness from the unreasonable effectiveness of Mathematics? Sounds about right.


> I still would like that SQL read a bit more set-theoretic, if less English-like. It would make it less quirky and more composable.

I wish that it didn't introduce an incomprehensible ternary logic for null handling. SQL with algebraic data types would be amazing.


Nulls are facts of life. I wonder how they could be succinctly represented in a query language. Something like the `?` operator in C# / Kotlin?


That would be an improvement. I'd rather have proper algebraic sum types, though (Option or Optional in languages like Haskell, OCaml, et al). They're really useful for all kinds of stuff. Perfectly representing missing values is almost a fringe benefit.


Not only would algebraic sum types be awesome, but if user-definable they could eliminate so much of the ambiguity in how nulls are practically used in SQL. People often say that null just means unknown, but that is a bug-causing simplification. In reality, null could mean unknown, but it could also mean not yet known, uninitialized, missing, impossible, uncomputable, join conditions not met, or hundreds of other things. And sum types would be extremely useful even outside of null handling...there are thousands of reasons why more and more modern languages are designed with sum types.

Not only would sum types be extremely helpful, but it would also be awesome to have a standard for product types. As it stands, some implementations have them but they are clunky, and in others you have to resort to just using multiple columns, which is bad for composability, bug-ridden, and leads to something almost like Hungarian notation in order to maintain unique column naming. For example:

    create table foo(
      start zoneddatetime,
      end zoneddatetime
      ...
    );

versus

    create table foo(
      start_time timestamp,
      start_tz timezone,
      end_time timestamp,
      end_tz timezone
      ...
    );

(timestamptz is not an acceptable substitute...it doesn't store timezone)


I assume you mean how null has to be compared with the IS operator instead of equality? I've wondered the same thing, but would making Null == Null have some strange consequences for left join operations? You could pretty easily and unintentionally return the Cartesian product of the two tables. Requiring a different operator for comparing to null prevents that.


Having optional types (or algebraic sum types) would eliminate this as well as hundreds of other sources of bugs, and would eliminate the three-valued logic that few users truly understand.


SQL's null really isn't incomprehensible.

It just means "unknown". It's broadly like "undefined" in js.
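For the record, the "unknown" semantics in practice (e.g. in PostgreSQL):

    select null = null;          -- null ("unknown"), not true
    select null is null;         -- true
    select 1 where null = null;  -- zero rows: WHERE keeps only true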


3VL stuffed into a boolean logic system is what’s incomprehensible.


It seems to me SQL is, paradoxically, too low-level, much like assembly language, in the sense that it makes you specify the joins. Over and over and over again. Why doesn't it infer them from the structure of the database (the relations between tables)? If the answer is performance, well, that means we are in agreement, and SQL is not high-level.

In summary I think a QL should produce a table from a database when provided with 1) the set of field names desired, and 2) the set of constraints on the desired rows. Anything that goes beyond this is in need of justification.

It is because of this impedance mismatch (of understanding) that I have an aversion to SQL (and am therefore not well-versed in it). I basically see SQL as an exercise in complexity, resulting from the constraints prevalent in the seventies. Please prove me wrong!


"It seems to me SQL is, paradoxically, too low level, much like assembly language, in the sense that it makes you specify the joins. Over and over and over again. Why doesn't it infer them from the structure of the database (the relations between tables)? If the answer is performance, well, that means we are in agreement, and SQL is not high-level."

First of all, it absolutely can and does infer joins if you use a natural join.[1]

However, a lot of the time you don't want the database to infer the joins, because how you join two tables depends a lot on the specifics of your query. Not every join is along a clean primary key/foreign key axis. Sometimes you have datasets from different sources. Other times, you have more novel relationships between your tables. The whole point of the relational model is to give the end user a lot of flexibility in querying the database, which means you need to specify the joins a lot of the time.

1. https://docs.oracle.com/javadb/10.8.3.0/ref/rrefsqljnaturalj...


> However, a lot of the time you don't want the database to infer the joins because how you join two tables depends a lot on the specifics of your query.

Some 90%+ of the time, you want exactly the join the database would infer, but natural join is basically useless, so you must repeat the basic join in full every time.

SQL did it backwards. The simple join keyword should mean natural join over a foreign key. A cross product should be created with the `product` keyword or something similar.


> Why doesn't it infer them from the structure of the database (the relations between tables)?

That's because the structure doesn't always have the feature you need to rely on this particular time. And that's a feature, not a bug: it allows flexibility between the data structure and the data queries.

If you have a relation, the DB engine maintains it, ensuring the integrity defined there. However, you are free to query non-integral (in this sense) references, which you describe in the query itself. If you couldn't specify joins, you'd always have to express relations in the structure. Since queries in SQL can be arbitrary, to reach parity in flexibility you'd have to declare all possible relations, and then you'd have the problem of choosing the one you need at the moment. In other words, some relations are not known before query time.


> in the sense that it makes you specify the joins. Over and over and over again. Why doesn't it infer them from the structure of the database (the relations between tables)?

The answer is that there are different kinds of joins (you may object that this is part of the "set of constraints on the desired rows"; however, then you are just specifying joins explicitly), and that, generally speaking, this is a non-unique property, i.e. you could usually infer several joins that represent dramatically different result sets which still have the same tuple archetype.

I should point out that these issues aren't brought on by the added expressiveness of SQL, but rather, these issues are already present in basic relational algebra; I'd argue they're fairly fundamental.


I take the opposite view: SQL is very high-level.

It's a declarative language that in theory abstracts you away from the underlying execution environment. It's up to the engine to work out how to optimally execute the query; you as a programmer should focus on the data you require.


These are all layers of abstraction. I think the GP is saying it's low level in the frame of reference of the data and relationships, which is true because you have to explicitly join everything together 'by hand' even though you could imagine abstractions that could be built on top, making this unnecessary.

In the frame of reference of how the queries get executed, yes, it's high level. But the same could be said for, say, C: it's high level in the frame of reference of CPU instructions, low level in the frame of reference of memory management.


> I think the GP is saying it's low level in the frame of reference of the data and relationships, which is true because you have to explicitly join everything together 'by hand' even though you could imagine abstractions that could be built on top, making this unnecessary.

Data abstractions in SQL are accomplished with views:

https://www.postgresql.org/docs/current/sql-createview.html

Model your most common joins once into a view, then use the view instead of repeating the joins everywhere.
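A minimal sketch, assuming hypothetical customer and address tables:

    create view customer_with_address as
    select c.*, a.street, a.zip, a.city
    from customer c
    join address a using (address_id);

    -- then query the view as if the join were built in:
    select first_name, city from customer_with_address;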


You're missing the forest for the trees. SQL & RDBMSs support use cases that might as well be cheats, given how easy they make non-trivial queries.

> In summary I think a QL should produce a table from a database

In SQL we call these projections: just whatever columns are returned from the amalgamation of tables used in your query. This could include aggregations or window functions.

Things are possible in SQL that are simply more difficult with anything else, and my time and my company's are more important than figuring out the perfect approach to every problem. SQL makes 80% of problems trivial, and the remaining 20% I can easily solve somewhere else (like Elasticsearch or SQLite or even custom serdes to disk).


1. Postgres does provide natural joins:

http://www.postgresqltutorial.com/postgresql-natural-join/

2. Not all joins can be inferred like this:

    FROM People AS p
    JOIN Age_Groups AS ag
      ON ag.min_age <= p.age AND p.age < ag.max_age


You might be interested in this article:

https://blog.jooq.org/2018/02/20/type-safe-implicit-join-thr...

With some more UX work around foreign key naming/declaration, and careful attention from the developer (maybe helped by the IDE) to the caveats mentioned, this impedance mismatch could be reduced.


You can just create a view for each table with all other useful tables joined and query that.

It would indeed be nice to do that automatically based on foreign keys; I suppose one could write a PostgreSQL extension for that (or maybe one exists?).

Of course you still need the ability to perform joins that are not implied by the database structure.


ORMs like Hibernate / jOOQ work around this by providing a convenient "dot notation" to navigate a foreign key path. This is particularly useful when keys are composite. See for example: https://blog.jooq.org/2018/02/20/type-safe-implicit-join-thr...

Indeed, this should be available in the SQL language as well. I can't think of a performance problem, having implemented this in jOOQ. There's always a very obvious mapping between the notation and inner/left joins, which are reusable for common paths and subpaths.

At least for navigating to-one relationships. To-many relationships raise more questions.

Some RDBMS support a T1 JOIN T2 ON KEY syntax, which is nice as well, but the path notation is much more powerful:

    SELECT
      c.first_name,
      c.last_name,
      c.address.street,
      c.address.zip,
      c.address.city
    FROM customer c

Instead of

    SELECT
      c.first_name,
      c.last_name,
      a.street,
      a.zip,
      a.city
    FROM customer c
    JOIN address a USING (address_id)


It's not at all clear to me how implicit joins could have the same expressive power as explicit joins. How could the system infer if you wanted a LEFT JOIN, INNER JOIN, etc?


Given a schema, you can generate an infinite number of queries, so starting from the business question is a better choice. There is some interesting work being done in this area, but still a long way to go. Here's an example: https://einstein.ai/static/images/pages/research/seq2sql/seq...


There might be some value in an "INFER JOIN" keyword that tried to guess what relationship you wanted to use for the join, either based on defined relationships between the tables or based on statistics about the most common join conditions actually used.



> makes you specify the joins

It does not.

    SELECT * FROM "document", "line";

That is valid. But what does it do?

A CROSS JOIN, i.e. a Cartesian product. That is not what you want most of the time.

Why Cartesian? Because this is the most fundamental join of all. It is like "fold" in functional languages (i.e. you implement map, filter, etc. on top of fold).
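For instance, the old pre-JOIN-keyword style builds an inner join out of the product plus a filter (document_id is an assumed key column):

    SELECT *
    FROM "document", "line"
    WHERE "document".id = "line".document_id;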

Consider the JOIN clause an extra help. It was not always present:

https://stackoverflow.com/questions/25161747/how-to-implemen...

In old FoxPro, JOIN was not available.

----

Ok, but why not infer that if I join FROM "document", "line", it should use the PKs? That could be nice, but SQL is kind of EXPLICIT, with few implicit exceptions. And you don't want that behavior all the time.

This is in contrast to OO, where you can only use one kind of "join" but lack the capability to do much more without extra work.

In fact, writing JOINs doesn't bother me much. It fits well. Other aspects of SQL's syntax are more annoying.

----

> If the answer is performance... SQL is not high-level

So, if I'm in Python and use a generator to pipe several inputs instead of materializing each step into a big list and passing it along, because I'm aware the list is not memory-efficient, is Python suddenly not high-level?

> I basically see SQL as an exercise in complexity, resulting from the constraints prevalent in the seventies. Please prove me wrong!

I'd say it is the TOTAL OPPOSITE. SQL is TOO SIMPLE!

You are correct that it is a product of the era. More exactly, the original plan was for SQL to be the END USER's interface to the database: for writing one-off queries, not for being a proper programming interface.

That is why SELECT comes first instead of last. It is also why SQL can't be composed, why you don't have variables, can't express loops or recursive functions, and many, many other really annoying things. SQL is like CSS: too simple. Not even Lisp-level capable.

P.S.: Of course, a lot of extensions exist to cover for this (including WITH clauses), but the fact is, SQL was not made to be a proper database language, and it shows.

Why is this relevant? Because SQL with a bit more power could abstract things away and fix any small shortcoming more naturally.

----

I'm on the hunt to build a relational language, not tied to a particular RDBMS, with full programming power. It does not follow SQL syntax, but is closer in style to LINQ:

http://tablam.org

Still in research/alpha, and looking for help!


Prolog has always made more sense to me as a query language than SQL, but SQL is more popular because it's "English-like." Which is exactly the reason SQL bugs me.


I've never managed to wrap my head around prolog but I'd say SQL bugs me for the same reason. Care to throw out some examples?


I need to create some. In the meantime this is not a bad writeup:

https://stackoverflow.com/questions/2117651/comparing-sql-an...

To clarify, I think Prolog is a lousy data storage language but a fantastic query language. What I've done in the past is use Prolog to query a Common Lisp object-oriented database (after modifying the Prolog to produce composable lazy streams, which is not usually included in the Prolog box but is very handy for database querying). What I want to try next is the above with standard PostgreSQL as the backend repository. That will make an apples-to-apples SQL comparison possible.


> ... used Prolog to query a Common Lisp object-oriented database ...

Could you share more details about this? I'd be very interested in exploring something like this.



Interesting that the article mentions spatial data and Oracle Spatial, but not PostGIS. The latter, afaik, is the state of the art for spatial data sets.


Datalog is much more reasonable than SQL. Every time I have to write more than 100 lines of SQL I go crazy.


Do you use Datalog with Datomic and Clojure or another database?


I don't use Datomic since it's closed source. But the design is really, really cool.

Actually, Datalog is a subset of Prolog, and it can be used on any kind of data, not necessarily a database. Nowadays it's implemented in different languages, like the Clojure version.

So, like SQL, Datalog is a declarative language implemented on different kinds of platforms, but IMO it's much more reasonable than SQL, since SQL is much worse at composition.


Great article and ref links.

Most people ignore the fact that recursive common table expressions (CTEs) make SQL Turing-complete.
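For a minimal taste of the recursion (standard SQL:1999 syntax; runs in PostgreSQL):

    with recursive t(n) as (
      select 1
      union all
      select n + 1 from t where n < 100
    )
    select sum(n) from t;  -- 5050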


I have always used RDBMSs through an ORM, usually Django's. I understand basic SQL syntax, but nothing beyond that, and I cannot think in SQL, if that makes sense. Is there a good book or course for someone like me to dive into SQL, preferably Postgres?



Philip Greenspun occasionally (every couple of years) teaches a 3-day deep dive on SQL at MIT. It's free and open to everyone. I have gone to TA the class twice and I learned a bunch walking students through the exercises. Highly recommended.


This course is a wonderful start

https://lagunita.stanford.edu/courses/DB/2014/SelfPaced/abou...

After this I would have a look at the postgres documentation. There's so many great examples between there and stack overflow.


This brings back memories. I remember going through the course many years ago, but I got busy with other projects and quit about 25% through. I'm glad it is still out there and still considered a good source.


Yeah if you've gone through a good chunk maybe complete it.

I really hate knowing how easy it is to do some queries and struggling to use the Django ORM to achieve what I could write in minutes - but for maintainability I hate dropping down to SQL in that context.

For a fun advanced tutorial, have a go at pivot tables with crosstab using the docs. I've used it for analytics queries to chart categories as headings, with complex aggregations etc.

https://www.postgresql.org/docs/9.1/tablefunc.html
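A minimal sketch of crosstab from the tablefunc extension, assuming a hypothetical grades(student, subject, score) table with two subjects (math, science):

    create extension if not exists tablefunc;

    -- the source query must return (row_name, category, value);
    -- you declare the pivoted output shape in the AS clause:
    select *
    from crosstab(
      'select student, subject, score from grades order by 1, 2'
    ) as pivoted(student text, math int, science int);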

Window functions are another classic

https://www.postgresql.org/docs/9.1/tutorial-window.html

In general, working with SQL and gradually solving the problems you have, learning as you go, is excellent. If you want to go faster than that, there are probably courses, but I really do rate the docs. Postgres is great.


I had the same issue and can highly recommend this course: https://classroom.udacity.com/courses/ud198

It includes short videos, quizzes, articles, and exercises.



