
SQL has many problems: "from" should come first to help auto-complete; null handling is complicated; the syntax in general is weird. Compatibility is a problem (I have implemented some relational databases in Java: Hypersonic SQL, H2; compatibility with other databases is hard).

There is little innovation in the database space: there are hundreds of general-purpose programming languages, but very few "database" languages. I also tried to specify a new language (https://newsql.sourceforge.net/) many years ago. There's GraphQL, but more innovation in this area would be great.

Many developers end up writing huge SQL statements (one statement that is hundreds of lines). You wouldn't do that in a "proper" programming language; but in languages like SQL (or maybe Excel) this will happen.

Another problem is proper encapsulation. Views can be used, but often developers have access to all tables, and if you have many components this becomes a mess. Regular programming languages have module systems and good encapsulation. SQL statements often don't use indexes, for one reason or another. With regular programming languages, it is harder to end up in this position (as you have to define the API properly: encapsulation). Regular programming languages don't require "query optimizers".
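As a sketch of the view-based encapsulation mentioned above (table, column, and view names are invented; checked with Python's sqlite3 for illustration):

```python
# Sketch: a view as a narrow, stable API over a table, so other
# components never touch the base table directly.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id INTEGER PRIMARY KEY,
        name TEXT,
        salary REAL          -- sensitive: should not be exposed broadly
    );
    INSERT INTO employees VALUES (1, 'Ada', 120000), (2, 'Grace', 130000);

    -- The "public API": other components query the view, not the table.
    CREATE VIEW employee_directory AS
        SELECT id, name FROM employees;
""")
print(conn.execute("SELECT * FROM employee_directory").fetchall())
# -> [(1, 'Ada'), (2, 'Grace')]
```

With GRANTs restricted to the view, the base table's columns can change without breaking callers.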

SQL is used to reduce network roundtrips. Clients send SQL (or GraphQL) statements to the server, which are actually small programs. Those programs are executed on the server, and the response is sent to the client. (Stored procedures and for GraphQL persisted queries can be used - but that's often happening afterwards.) Possibly it would be better to send small programs written in a "better" language (something like Lua) to the server?



> Regular programming languages don't require "query optimizers".

Isn't that what LLVM does? Or GCC? Or a JIT?

> Possibly it would be better to send small programs written in a "better" language (something like Lua) to the server?

I'd certainly like something better than SQL, but going for an imperative, Turing complete language seems like the opposite direction of where things should be going in this scenario.


Most tools allow you to jump down to low level stuff. Isn’t it weird that I can’t just write the query plan for Postgres?

My theory is people would understand indexes and the like way more if they were writing up the lookups


> Most tools allow you to jump down to low level stuff

Eeehhh, kinda. Sure, a compiler will let you just write some assembly by hand, but it won't let you write an optimization step, or even directly configure it. You have to tweak your code or the compiler flags to give the right hints so the compiler generates the binary code you want. That doesn't look much different from a query planner.


That's what I want too.

To make a loose metaphor, if SQL is like high level opengl, then I want the low level Vulkan and shader APIs for hand optimization.

I would like this especially for Apache Spark sql, since a bad plan that isn't aware of the data shape can blow gigabytes over the network unnecessarily.


Spark has Datasets, which I think let you do this, although they're only supported in Java and Scala, not Python.


Some databases like ArangoDB (https://www.arangodb.com/) allow you to use Javascript instead of SQL.

However, using a type-unsafe (read: weakly typed), Turing-complete language introduces the usual problems we know and love, such as infinite loops, runtime type errors, exceptions, and the like.

Personally, I'm looking forward to a WASM runtime for databases -- so we can run webassembly on the database. This COULD be carefully designed to be statically checked and, possibly, make it really hard to write runaway loops.


Many databases support other languages as well (eg. PostgreSQL supports many including Python, by default). One challenge is lack of standardization. (SQL is a weak standard, but at least a standard).

Weak typing: what about TypeScript?

Slow loops: yes, this is a problem. However, SQL (and even more so, GraphQL) also has a problem of large results / operations spanning too many entries. During development, the number of entries is fine, but not in production. Especially if indexes are missing, this is also a problem in SQL. (Endless loops are actually less of a problem than slow loops: it's easier to detect endless loops during development.)

To process large results in a good way, often, pagination is needed; best would be keyset pagination. What if a statement returns a "continuation" statement in addition to the result set? If the client wants to get more results (or process more entries), then he would need to run the "continuation" statement.
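A minimal sketch of keyset pagination with a "continuation" key, run with Python's sqlite3 (table name and the `page` helper are invented for illustration):

```python
# Keyset pagination sketch: each page's last key seeds the next query,
# avoiding the O(offset) cost of OFFSET-based paging.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"item{i}") for i in range(1, 11)])

def page(after_id, size=3):
    rows = conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, size)).fetchall()
    # The last id returned acts as the "continuation" for the next page.
    return rows, (rows[-1][0] if rows else None)

rows, cont = page(0)      # first page: ids 1..3, continuation 3
rows, cont = page(cont)   # next page:  ids 4..6
```

The continuation value is all the client needs to resume; unlike OFFSET, the query stays cheap no matter how deep the client pages.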

Say a database doesn't provide SQL, but instead a set of low-level APIs (plus a ton of tools). Developers would then (be forced to) write properly modularized, versioned code on top of those APIs.


These are all really great complaints; and yet I still prefer writing plain SQL to using ORMs and their nifty tricks. I think the main reason is that I don't ever want to have a lot of SQL logic in my code. Short queries (say, 10 lines or less) are okay, if they're one-off and pertain to the function's logic and are never repeated by other pieces of code. But writing complex queries through an ORM layer means they're going to be in code -- and harder to maintain -- than if they were where they should be: in views, functions and procedures. ORMs also obscure the execution plan and make optimization more difficult, which in production isn't something I really want to worry about.

Writing SQL is like uploading a program; just like a shader to a GPU. And if the network stopped being the bottleneck, the DB server's load would become the problem. Just like with a GPU, you want to upload your programs and data as infrequently as possible. But unlike a GPU, you have full flexibility to permanently store your logic right there in your SQL DB. If it's well-documented, it's much more streamlined to write views and views-on-views, than to keep all that DB logic in your code. I don't find that it becomes a mess at all, because properly crafted views and procedures strongly encourage forced indexing and checking execution plans when you refactor them; as opposed to in-code, ad-hoc queries where it's hard to know if someone ever ran tests to optimize them.

Some statements do need to be very large, for efficiency. Others are better being broken up into sub-procedures, temp tables or even roundtrips. A great query is a work of art where each part of it is optimized not just for execution plan but also for read/write, network state and expected load. What I like about SQL is the flexibility to decide which way to go, and prioritize based on how often something will be used and what kind of strain it'll place on each of the layers in the stack. As with anything powerful, there are a million ways it can be abused. But what I've found in the ORM space and also with noSQL solutions is that simplicity of use comes at the price of flexibility and control.

[edit] >> null handling is complicated

This would probably also be my biggest complaint about the language itself; the existence of IS NULL / IS NOT NULL and both null-safe and -unsafe comparisons gets confusing (and especially painful if you're trying to write a parser). I think it's one of those things like `undefined` vs `null` vs `false` in Javascript where, on second thought, it's just too late to fix it without breaking half a billion websites.
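The three-valued logic behind that confusion is easy to see in any SQL engine; here checked with Python's sqlite3, purely for illustration:

```python
# SQL's three-valued logic: `=` on NULL yields NULL (None in Python),
# while IS NULL is the null-safe test.
import sqlite3

conn = sqlite3.connect(":memory:")
print(conn.execute("SELECT NULL = NULL").fetchone()[0])   # None, not true
print(conn.execute("SELECT NULL IS NULL").fetchone()[0])  # 1 (true)

# A WHERE clause silently drops rows where the comparison is NULL:
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,)])
print(conn.execute("SELECT COUNT(*) FROM t WHERE x = x").fetchone()[0])  # 1
```

Note the last query: even the tautology `x = x` excludes the NULL row, which is exactly the kind of thing that trips up parsers and people alike.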


> I don't find that it becomes a mess at all, because properly crafted views and procedures strongly encourage forced indexing and checking execution plans when you refactor them; as opposed to in-code, ad-hoc queries where it's hard to know if someone ever ran tests to optimize them.

I have to provide caution here, when you create views and stored procedures, I've always found it a mess to maintain:

1) VCS for stored procedures and views is usually non existent. Good luck understanding how these change over time and who changed them.

2) Deploying application logic alongside these systems is very painful, because unless you've versioned your views/stored procedures, when you make a breaking change you need to stop the entire application, make the change in the DB, and restart the entire application again on the new version. Zero-downtime deploys would require versioned views/stored procedures.

3) It quickly becomes a challenge to answer the question "what happens when I do X?" reliably, where X is insert, delete, or even select a row. Once you have complex enough stored procedures, modifying the data changes things in unexpected or unintended ways. A dev unaware of how the system works might assume that inserting a row simply inserts a row, but instead, due to stored procedures hidden from application logic, it might cascade into inserting rows elsewhere, deleting rows, or modifying other rows in other tables. Discovering these issues without knowing that they could exist is often done around midnight during a production outage, because a new feature was released that did something in the DB that was presumed safe, but wasn't. If the code for the business logic were in one place, the application, it would have been much easier to see what the change would actually do.

I understand entirely that performance gains from good DB use are astronomical, but in my humble experience, I've found them to be more trouble in the long run than they are worth.

e: spelling


> VCS for stored procedures and views is usually non existent

This is a mindset problem, not a technology problem. Treat your stored procedures/functions/views like code, not like data. Keep a separate "code" schema, put your procedures/functions/views definitions in files, and store your files in Git, just like Java or Ruby code. Deployment then becomes an automated drop and recreate of the code schema in a single transaction with zero downtime.
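A minimal sketch of the drop-and-recreate idea. The real target is a separate code schema in PostgreSQL; since SQLite has no schemas, a view stands in here (all names invented), and SQLite's transactional DDL makes the swap atomic:

```python
# Sketch: deploy database "code" (here, one view) as an atomic
# drop-and-recreate inside a single transaction.
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manual txns
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 9.5)")

# "Deployment": old definitions out, new ones in, one transaction.
conn.execute("BEGIN")
conn.execute("DROP VIEW IF EXISTS order_totals")
conn.execute("CREATE VIEW order_totals AS SELECT id, total FROM orders")
conn.execute("COMMIT")

print(conn.execute("SELECT * FROM order_totals").fetchall())  # [(1, 9.5)]
```

Because the view definitions live in plain files, they diff and version in Git exactly like application code.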

> Deploying application logic alongside these systems is very painful

This is not my experience at all. The stored procedures form an API. If changing the API breaks your application, you are doing something fundamentally wrong. API versioning should only be necessary if third parties depend on your API, but I wouldn't recommend giving third parties direct access to the database anyway.

> Once you have complex enough stored procedures, modifying the data changes things in unexpected or unintended ways

I assume you mean triggers because stored procedures don't happen as side effects, you have to call them explicitly. Regarding triggers, I agree with everything you say.


Re: VCS, I’ve found that with a small amount of setup, tools like Liquibase[0] allow you to maintain functions and views (with “CREATE OR REPLACE” or equivalent) as SQL files in a file system with full VCS, diffing, etc. just like any other code.

[0] https://www.liquibase.org/


You still need to sequentially define your changes (migrations), which isn't exactly the VCS experience people are used to. If all changes to your db require new file entries, rarely do you need to diff, let alone bisect, to find the history. It does make it harder to see the current shape of your app, though, with all the ALTER statements to a single table strewn across multiple files. I wonder if it's feasible to write a tool that generates sequential migrations from DDL organized like traditional code (modules, with items dealing with the same domain in the same file), after diffing it against the sequential migrations already in the codebase.


No, you don't have to structure your project that way. You can have all your logic (views/functions) stay in a single file each, and use git like usual with the standard diff you are used to. Your changelog XML just needs to be set up for it.


So, after digging about in the docs, I found the Best Practices[0] page, which lays out what you describe. I was disappointed to find it is just using idempotent creation statements for procedures and views, putting them all in files that are re-run on each migration step (along with each new roll-forward migration) so that new versions of these logic objects replace any old ones. This is not exactly something that Liquibase provides; it should be easily replicable, and I was hoping it'd do diffs to provide a similar experience for tables as well.

After some web searching, I came across a project[1] that's trying to do what I describe, but it appears to be dead. I'm surprised that the migration tools (that I looked through) don't already support such a basic approach, so I suspect I'm missing something obvious. Maybe there's some fundamental block in how SQL works, or maybe it just doesn't work that well in practice, because in concept it sounds easy to script using what they call a shadow db and `pg_diff`.

[0]: https://docs.liquibase.com/concepts/bestpractices.html

[1]: https://github.com/cbowdon/ddl-diff


Flyway solves 1.

As for 3, I would prefer to "go to" procedure rather than "come from".


> prefer writing plain SQL to using ORMs

Oh, ORMs are not the solution. I agree, ORMs often bring new issues (like, loading too many entries in memory), and don't solve many problems (encapsulation).

> I don't ever want to have a lot of SQL logic in my code.

That works if you are a small team. With big teams / multiple teams, the small problems become huge: someone / some team will write huge statements, will use weird syntax, and so on. (C++ has similar problems, btw.) Especially the lack of proper encapsulation can become a huge problem.

> Some statements do need to be very large, for efficiency.

Well, then the problem is that SQL doesn't give you a good way to avoid large statements.

> there are a million ways it can be abused

Exactly. Other languages (like Java) make it harder to abuse the language. If you really want, you still can; but it's easier to refactor and maintain.


> Writing SQL is like uploading a program; just like a shader to a GPU... But unlike a GPU, you have full flexibility to permanently store your logic right there in your SQL DB.

I agree - this is a really interesting way to put this.


Here're little things that would make my life better when I'm generating SQL queries:

1. Support trailing commas.

2. Add operator `is` which works like `==` but for nulls it returns `true`. Could also remove automatic type casting, I'd be happy about that, never liked it.

3. Support trailing `or`-s and `and`-s.

So query could look like

    select a, b,
    from t
    where (a.x is :x and a.y is :y and) or
          (b.x is :x and) or
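For what it's worth, point 2 already exists in some dialects: SQLite's `IS` operator is a null-safe equality on any operands (standard SQL spells it `IS NOT DISTINCT FROM`; MySQL has `<=>`). A quick check via Python's sqlite3:

```python
# SQLite's IS works like = except it treats two NULLs as equal,
# which is exactly the proposed null-safe operator.
import sqlite3

conn = sqlite3.connect(":memory:")
print(conn.execute("SELECT 1 IS 1, NULL IS NULL, 1 IS NULL").fetchone())
# -> (1, 1, 0)
```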


Trailing "and"s and "or"s looks so wrong to me; it reads so strongly like a syntax error. How about Boolean functions instead, like so:

    select
      a,
      b,
    from t
    where any(
      all(a.x is :x, a.y is :y,),
      b.x is :x,
    )
This way you get the uniformity you want from trailing commas, without the syntax looking quite so weird.


That's even better.


As an aside, avoid OR as much as possible in your WHERE clauses. Adjust your relations if necessary. ORs can really blow the planner up and slow things down. ANDs are reductive wrt indexes. ORs make multiple passes on the indexes necessary.

It's not always possible to avoid, and it makes perfect sense sometimes, but it'd be better if folks didn't lean into OR so much.
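One common rewrite (not always a win, so measure first) is splitting the OR into a UNION, letting each branch use its own index. A minimal sketch with an invented schema, run via Python's sqlite3:

```python
# Rewriting OR as UNION: the results are identical; the plans often
# differ, since each UNION branch can pick its own index.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 9), (2, 5), (3, 5)])
conn.execute("CREATE INDEX idx_a ON t(a)")
conn.execute("CREATE INDEX idx_b ON t(b)")

with_or = conn.execute(
    "SELECT a, b FROM t WHERE a = 1 OR b = 5 ORDER BY a").fetchall()
with_union = conn.execute(
    "SELECT a, b FROM t WHERE a = 1 "
    "UNION SELECT a, b FROM t WHERE b = 5 ORDER BY a").fetchall()
assert with_or == with_union   # same rows either way
```

Note that plain UNION (not UNION ALL) is the semantically matching form, since it deduplicates rows that satisfy both branches, just as OR would return them once.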


What is the reason for trailing commas and ands/ors?


So that you can generate SQL more easily, and (for hand-written statements) so that you can re-order lines more easily. Many programming languages support trailing commas.


> There is little innovation in the database space: there are hundreds of general-purpose programming languages, but very few "database" languages.

How many of those are "innovative"? How many of those have an ecosystem around them that can be taken seriously?

There's maybe one innovative language per decade. Minor improvements to an existing language (with a load of unnecessary syntax changes) don't make up for programming in a barren ecosystem. And a bunch of wrappers around C libraries don't make an ecosystem, they don't allow the language to materialize any actual new paradigms and just result in a C dialect.


> How many of those are "innovative"?

In programming languages we have imperative and functional languages, type-safe / untyped, memory managed / unmanaged, low-level vs high level, interpreted / JIT / ahead-of-time. So much innovation! And SQL is almost 50 years old, and basically still the same (plus adding features). And now maybe GraphQL. Very little innovation.


> "from" should come first to help auto-complete

I may be in a minority, but I like the projection list being first in simple queries. To help with auto-complete, just type the other parts first, then fill in the select list after? This falls apart with CTEs, though, as they have to come first syntactically; now that they exist, maybe the end is the best place for the output list, so it is easy to find in a complex statement.

> Many developers end up writing huge SQL statements

Composability can definitely be a significant problem with SQL. There are structures that can help significantly (views, CTEs, etc.), but they all have either potential maintainability issues or potential performance issues.

Partly, from a performance PoV, this is an issue with the query optimisers. On the maintainability matter I think the syntax for CTEs and sub-queries generally could have been much better thought out – I've seen a few attempts to beautify (or replace) SQL that essentially amount to moving parts around so you can make CTEs & sub-queries look more like functions/procedures in imperative languages (transpiling the result back to SQL for submission to the database after).
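A small illustration of CTEs as named, function-like steps (invented data, run with Python's sqlite3):

```python
# Each CTE names one step of the pipeline, a bit like a local function,
# instead of nesting sub-queries inline.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10), ("north", 20), ("south", 5)])

total_by_region = conn.execute("""
    WITH region_totals AS (
        SELECT region, SUM(amount) AS total FROM sales GROUP BY region
    ),
    big_regions AS (
        SELECT * FROM region_totals WHERE total > 10
    )
    SELECT region, total FROM big_regions ORDER BY region
""").fetchall()
print(total_by_region)  # [('north', 30.0)]
```

The readability gain is real, but as noted, whether the optimiser treats each step as inlinable or as a fence varies by engine and version.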

> Regular programming languages don't require "query optimizers".

Because regular programming languages are “lower level” in that regard and require you to do that optimisation yourself. The point of a language like SQL is that you describe the shape and content of what you want and don't have to worry about the most efficient way to retrieve that information.

And come to think of it, many regular languages do have optimisers. Think of all the work a compiler is doing or the JIT compiler in JS engines and their ilk – they don't just blindly interpret each line of C/other into a distinct selection of CPU instructions. Consider declarative frameworks for those languages too, like many used for SPAs which deal with DOM manipulation optimisation for you, and the many attempts at no-code or low-code development solutions which seek to abstract such concerns away from the solving of problems that require programming (by doing what SQL attempts: trying to translate what you want, stated as simply as possible, into what is needed to do that efficiently).


> regular programming languages are “lower level” in that regard and require you to do that optimisation yourself. The point of a language like SQL is that you describe the shape and content of what you want and don't have to worry about the most efficient way to retrieve that information.

Yes! But I wonder: wouldn't it be better if the developer had to specify how he wants to retrieve the information? In a regular programming language, typically people don't complain too much about this "missing feature" of a query optimizer. It's about modularization / abstraction. In SQL, the data model is king; in a regular programming language, the API is king. Yes, compilers have optimizers, but they optimize the machine instructions, not (so much) the algorithms that are used (scan vs. index, and which index).

Maybe I have just seen too many cases where developers didn't create the necessary indexes, or databases didn't use the (obviously) best indexes, or developers writing 10,000-line SQL statements (that I then re-wrote in Visual Basic, and the resulting program was much faster). Or a 1,000-line GraphQL query that returns 50 MB of JSON. Yes, I have also seen Java developers writing exponential algorithms or worse, and running into memory issues - but I personally found that easier to analyze and optimize than SQL. Maybe I have some bias though.


> Yes! But I wonder: wouldn't it be better if the developer had to specify how he wants to retrieve the information? In a regular programming language, typically people don't complain too much about this "missing feature" of a query optimizer.

On the other hand, as you noted, SQL is forever while programs and programming languages come and go. Maybe having a declarative data mindset is appropriate since data is more valuable and durable than code.


> > regular programming languages are “lower level” in that regard and require you to do that optimisation yourself.

> Yes! But I wonder: wouldn't it be better if the developer had to specify how he wants to retrieve the information?

I'd be against making SQL lower level in that sense – part of its purpose is to be back-end agnostic, and once you start letting the dev get too detailed you are locking code to one implementation. Of course this isn't how things generally work anyway beyond the basics, with engine-specific hints and such, but let us not intentionally make that more of an issue!

Though if we do go deliberately back-end specific, perhaps a way of describing a query plan more directly and saying “use this” is what you are wanting. Something like plan forcing in SQL Server's query store, but with you actually dictating the plan not picking one the engine has created from SQL. That way SQL stays purer but you have the option of something more direct, like mixing bits of (obviously platform specific) assembler into your C or other higher-level code. There will still be terrible code created that way though – possibly much worse.
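SQLite offers a small taste of this with its `INDEXED BY` clause, which pins an access path and raises an error rather than silently falling back if the named index can't be used (table and index names below are invented):

```python
# INDEXED BY pins the access path for one statement; SQLite errors out
# instead of choosing another plan if idx_x cannot satisfy the query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER, y INTEGER)")
conn.execute("CREATE INDEX idx_x ON t(x)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, i * i) for i in range(5)])

rows = conn.execute(
    "SELECT y FROM t INDEXED BY idx_x WHERE x = 3").fetchall()
print(rows)  # [(9,)]
```

It dictates far less than a full hand-written plan would, but it shows the failure mode the parent describes: the directive is engine-specific and the statement breaks if the schema drifts.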


> part of its purpose is to be back-end agnostic

Low-level languages can be backend agnostic. For example, C or Rust allow you to write low-level code that is back-end agnostic and doesn't cause vendor or technology lock-in.

What I'd like to have are guarantees similar to what an RTOS (real-time operating system) provides. Maybe using hints, or something like that. This could be standardized. Languages like C or Rust have such guarantees naturally: statements have some kind of guaranteed time constraints (that's why RTOSes can be mostly written in C). Sure, loops can be endless, but loops, e.g. in Rust, can't suddenly become 1 million times slower because the backend changed.


> But I wonder: wouldn't it be better if the developer had to specify how he wants to retrieve the information? In a regular programming language, typically people don't complain too much about this "missing feature" of a query optimizer.

Why not use a key–value store or just a file system, if you want to write everything yourself?


I don't want to write everything myself. (Well, I did write 3 relational database engines, and 4 query optimizers... Hypersonic SQL, PointBase Micro, H2 database, and the Apache Jackrabbit Oak query engine.)

I want a good abstraction (API) that allows me to write programs that access and manipulate data in an efficient and simple way. But I argue it shouldn't be a declarative set-manipulation language like SQL, but a language that is more low-level and makes it hard to (by mistake) retrieve or update millions of rows, hard to write statements that are very inefficient, and hard to build systems that lack encapsulation, where you end up with hundreds of tables entangled in a big mess.

I do like the fact that network roundtrips can be reduced: you can send SQL (or GraphQL) statements from the client to the server, where the statements are processed and the result is returned. But it's too easy for the result to become large, or statements too slow to be processed. With a key-value store (like Redis) you can't do that. Or can you send a list of commands to Redis (or some other key-value store) that are executed in a safe and efficient way (with guaranteed runtime) and the result is returned? That would be what I look for.


Existing query optimisers are very weak. You can write multiple semantically equivalent queries but some will be orders of magnitude slower. I would pay good money for a Postgres that I could tell to spend several minutes analysing a query before caching the query plan and reusing it for the future, instead of having to spend programmer time tweaking things.


query plans are data-dependent. mssql does cache query plans in some cases and it's a major caveat in using e.g. stored procedures (you have to make sure a good-enough plan is cached.)


What I’m saying is the plans have a tiny budget devoted to their optimisation. For something like a view, I would happily let the optimiser run for hours finding a speedup if it were then reliably delivered. As it is, none of that happens and you often have to rewrite to a functionally identical but faster query. The furthest you can go on something like Postgres is tweaking some config and turning off stuff like GEQO in favour of brute force etc.


> if it were then reliably delivered

This is the limiting factor there. The best plan is going to vary depending on data patterns and (if variables are involved) input parameters. You would have to spend a potentially infinite amount of time testing for different patterns and store which plan is best for each, or do the analysis for every possible parameter each time the referenced data changes.

> plans have a tiny budget devoted to their optimisation

The reason planners get minimal time for planning is that in the general case you quickly hit the point of diminishing returns. Their job is to find something good enough and find it quickly. They are helped by pre-computed hints like index stats, so you could perhaps spend more time building other helpful pre-computed data, but you would need to reassess this data regularly (as index stats are reconsidered under certain conditions), and again you hit a point of diminishing returns in the amount of time you spend on this as data is modified, compared to the potential benefits from better plans.

> For something like a view

Databases do not create query plans for views when the view is created. They create query plans for the overall statements that use the views when those statements are executed (or choose a cached one if the statement is sufficiently similar to a previous one for which there is a cached plan).


This is a very pessimistic argument, undermined by the fact that every single day, humans rewrite queries - in the supposedly declarative language SQL - to be identical but faster. I want a query planner that can be told to go away and try to do better with more resources. I don't know why this is unthinkable.


Yes, just like in browsers where one API is faster in Chrome and slower in Firefox or vice versa. Then Safari drops in and completely upends the notion of "best strategy".

The underlying implementation will always affect top level access performance. "Best" will always be implementation dependent.

That said, the DBs I use most (Postgres, MySQL, SQLite, and their derivatives) have pretty consistently improved performance from version to version with surprisingly few regressions.

Also there have been more "knobs and dials" added to get what you want.

  WITH foo AS MATERIALIZED (…)
springs immediately to mind. The ability to set or remove an optimization fence does wonders while still maintaining SQL's declarative roots.


Cache multiple and choose the one that fits the expected data distribution based on the inputs.


> SQL has many problems: "from" should come first to help auto-complete;

Agreed

> null handling is complicated;

Mostly the issue is that null in dbs has different semantics than in all programming languages. I wish programming languages used the db style of handling nulls. In dbs, null means that the value is unknown and can't be reasoned about. DBs use null the way people reason about unknowns: "Joe's age is unknown, Jane is 25 years old. Are they the same age? We can't say whether their ages are equal or unequal, because we don't know."

> the syntax in general is weird.

I find the syntax to be like any switch between languages: Python is weird coming from C++. At least with SQL, I find the usage docs more understandable than those of many programming languages.

> Compatibility is a problem

This sounds like a complaint that two different programming languages are not providing same features. Or two operating systems have different UI toolkits (and different menu conventions, like where should the 'preferences' be under).

I wouldn't expect sqlite to be Oracle / MSSQL / PGSQL compatible.


> In dbs, null means that the value is unknown and can't be reasoned about.

Except it can also mean it's `null` in that the field is optional and the record doesn't have it.


Not really. A null value in a certain column does not indicate that the field is optional; instead, the table definition having a nullable column indicates that the field is optional. A null value in a nullable column indicates only that the value is unknown, nothing else. This is database modeling 101:

A table ("relation") describes one entity. The attributes of the entity are reflected in the table column set, and each table row represents a different instance of said entity. Marking a column as nullable means that not knowing that attribute is a recoverable situation (for example, because it is only stored for informational purposes and not for processing).

If you're using the same table to store multiple entities, using nullable columns to indicate if a row represents entity A or entity B, you're doing it wrong.


An optional value just means that the value is unknown.

For example, it is optional to enter the colour of the car I own. So my car colour is null. The database can't confirm if my car is Red or isn't Red, because it is an unknown value.


> There is little innovation in the database space: there are hundreds of general-purpose programming languages, but very few "database" languages.

That's one way to look at it, but another way is to consider that, as a sort of assembly, SQL is as good a language to express relational algebra as any, given that many devs don't usually write a lot of raw SQL themselves.

When you look at the ORM and query builder space, there are a lot of different solutions - you don't have to like them (for example, I don't generally like ORMs), but they definitely exist, and they have all sorts of capabilities for type-safety, composability, and so on.


I've been playing around with EdgeDB lately and it addresses a lot of the concerns you laid out. Might be worth a look.


>SQL is used to reduce network roundtrips.

Actually, SQL is used to hide DBMS implementation complexity and diversity. It was a great abstraction when RDBMS tech progressed rapidly. Now it has matured to stability, and it makes sense to try a better language.


Yes, it's also used to hide implementation complexity and diversity: the advantage of a standard (even a "weak" one).

> SQL is used to reduce network roundtrips.

What I mean is: the database has an API that allows the client to specify (almost) arbitrary complex queries and operations. Regular REST APIs don't support this. GraphQL also supports it. The client sending queries to the database is a great way to reduce network roundtrips. Key-value stores that only support get/put/remove/list require a lot more network roundtrips for slightly complicated operations.

> it makes sense to try a better language.

I argue that the language itself (the syntax) isn't the main problem: part of the problem might be that SQL (even with a better syntax) would _still_ have very similar problems: missing indexes / wrong query plans, missing encapsulation which results in a complex mess of entangled code, things that work fine with small data sets but are extremely slow with large data sets,...


RDBMS has matured? I feel like I'm seeing a new database pop up every quarter.


>> RDBMS has matured? I feel like I'm seeing a new database pop up every quarter.

RDBMSs have been in use for more than 50 years:

https://en.wikipedia.org/wiki/Relational_database

There are hundreds of implementations, some of which are decades old:

https://en.wikipedia.org/wiki/List_of_relational_database_ma...

RDBMS software was the primary product of Oracle and several other companies (some of which are now defunct or were acquired).

New implementations do not indicate a lack of maturity, but rather the wide availability of tools and libraries for building RDBMSs, and their utility.

Perhaps you are referring to database systems in general and not relational databases?


Right, my bad. I was thinking of storage engines in general. And of course maturity doesn't mean lack of innovation, which is still happening in the relational space.


At least with the substrait[0] project there is some push for standardization and interoperability when it comes to relational algebra and query plans.

I think this can unlock a lot more experimentation with SQL-alternative "languages", or approaches where plain-text languages are skipped altogether and replaced by good libraries that act directly on query plans.

[0]: https://substrait.io/


Well, I'd argue that most of the time SQL is treated as a glorified CSV.

No one really treats it as a real language, which it actually is.

As for optimization, I'd say that it might be too efficient for its own good: one can now have one's ORM generate a huge query and then blame the DB when the SQL is slow.

If you write in any other language, you usually don't complain that the processor is badly executing the asm; you blame the compiler. ORMs, on the other hand, are mostly never blamed, just pitied when they fail.

Having Wasm inside the RDBMS might be a much better approach. But then it would be up to the Wasm program to be efficient.


> There is little innovation in the database space

Have you even used JetBrains' DataGrip IDE?


> There is little innovation in the database space: there are hundreds of general-purpose programming languages, but very few "database" languages.

What we have is generally good enough for the purposes of set manipulation. SQL is not a programming language, and should not be approached as a programming language. It's a way to express relational algebra.

This is similar to complaining about any other system of algebra having issues... and wanting to come up with a New Way. We don't really need a new way to describe linear algebra; what we have works. And it may seem unintelligible and a mess from the outside, but if you take the time to learn it, you'll understand why and how it's laid out the way it is, and learn to get over it.

The other point being: there is a fixed set of operations you can perform on sets, and SQL is as low-level as you're going to get in that regard. Any new language would either change the labeling of the underlying operations (e.g. renaming joins, which label Cartesian products, to something else), abstract the low-level operations away into something completely different (foolish, in my view), or tweak the syntax in minor ways to suit individual preferences (again, not worth the effort in my view; just learn the syntax). All of these are inefficient, and all endeavors to do better have failed (as far as I can tell), primarily because the people working on such projects (such as yourself) do not have sufficient maturity in this space to understand why things are done the way they are, i.e. "Chesterton's Fence."
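The labeling point can be seen concretely: an inner join is literally a filtered Cartesian product, so renaming it buys nothing. A minimal sqlite3 sketch (table and column names made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a(x INT); INSERT INTO a VALUES (1), (2);
    CREATE TABLE b(y INT); INSERT INTO b VALUES (2), (3);
""")

# An inner join...
inner = con.execute("SELECT * FROM a JOIN b ON a.x = b.y").fetchall()
# ...is the same operation as a Cartesian product plus a filter.
cross = con.execute("SELECT * FROM a CROSS JOIN b WHERE a.x = b.y").fetchall()
print(inner, cross)  # both [(2, 2)]
```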

> Many developers end up writing huge SQL statements (one statement that is hundreds of lines). You wouldn't do that in a "proper" programming language; but in languages like SQL (or maybe Excel) this will happen.

Programming languages and set-manipulation languages are incomparable. The first describes procedural instructions; the second describes set transformations. It's not wise to apply the same standards to both. It's like complaining that a proof is tediously long when the underlying operations are simple: you're missing the point. SQL has to be long and detailed when you want to be sure the data you're receiving is exactly as you specified (i.e. no compiler shenanigans turning your rough intent into concrete instructions).

However, I will concede that I've seen some monstrous SQL in the wild... mostly written by people who don't really know SQL, and which could have been greatly shortened with the little database-specific tricks (and similarly, I've seen enough people pull down a dataset and process it server-side in herculean efforts, when the work could have been done faster, and written quicker, in the database).

> Another problem is proper encapsulation. Views can be used, but often developers have access to all tables, and if you have many components this becomes a mess. Regular programming languages have module system and good encapsulation.

I don't understand. Permissions can be tweaked however which way you want. Am I missing something?
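For example, the classic view-based boundary looks like this (a sketch in sqlite3 with invented table names; note sqlite has no permission system, so in a server DBMS like Postgres or MySQL you would additionally GRANT SELECT on the view while revoking direct access to the table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employees(id INTEGER PRIMARY KEY, name TEXT, salary REAL);
    INSERT INTO employees VALUES (1, 'alice', 90000), (2, 'bob', 80000);
    -- Expose only the non-sensitive columns through a view.
    CREATE VIEW employee_directory AS SELECT id, name FROM employees;
""")

rows = con.execute("SELECT * FROM employee_directory").fetchall()
print(rows)  # [(1, 'alice'), (2, 'bob')] -- salary is not exposed
```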

> SQL statements often don't use indexes, for one reason or the other. With regular programming languages, it is harder to end up in this position (as you have to define the API properly: encapsulation). Regular programming languages don't require "query optimizers".

This is a trivial problem, but if you don't know why it happens, I can understand being befuddled by it. Query optimizers exist so that your access to the underlying data, and the manipulations on it, are done in the least costly fashion (roughly: reducing the algorithmic complexity of your SQL to its minimum) given the hardware your cluster runs on and the various access statistics involved. Regular programming languages do not have "query optimizers", but various libraries have their own optimizations to reach the same end (as do compilers... compilers are nothing but optimizations upon optimizations in the same vein).
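You can watch the optimizer make this choice. A sketch using sqlite3's EXPLAIN QUERY PLAN (the exact plan wording varies by SQLite version, so the comments are approximate):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(k INT, v TEXT)")

# No index yet: the optimizer has no choice but a full table scan.
plan_before = con.execute(
    "EXPLAIN QUERY PLAN SELECT v FROM t WHERE k = 5").fetchone()[3]
print(plan_before)  # something like 'SCAN t'

con.execute("CREATE INDEX t_k ON t(k)")

# Same query; the optimizer now picks the index automatically.
plan_after = con.execute(
    "EXPLAIN QUERY PLAN SELECT v FROM t WHERE k = 5").fetchone()[3]
print(plan_after)   # something like 'SEARCH t USING INDEX t_k (k=?)'
```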

> SQL is used to reduce network roundtrips. Clients send SQL (or GraphQL) statements to the server, which are actually small programs. Those programs are executed on the server, and the response is sent to the client. (Stored procedures and for GraphQL persisted queries can be used - but that's often happening afterwards.) Possibly it would be better to send small programs written in a "better" language (something like Lua) to the server?

It would be very inefficient for general-purpose use. In most cases, the query optimizer can take your SQL and bring back your result sets in the most efficient way possible (relative to the SQL you've written).

If you want to do the underlying operations yourself, instead of relying on the query optimizer, you can use a compiled language like C and write your own extensions, though it would take much more effort to do that properly than to simply learn how your database works and how to work with the query optimizer (rather than against it).


SQL is a programming language. It's simply one of, if not the only, successful fourth-generation programming language in a world replete with third-generation ones.

The error is conflating a fourth-generation DSL for set theory with a third-generation general-purpose application programming language. There's a really good reason ORMs have their impedance mismatch with relational databases.

Aside from this minor nit, your comment is spot on!


> What we have is generally good enough

Well, that can be said about any programming language (assembler, C,...) and yet there is still innovation in those areas. And that's good. But in the database space, there is little innovation. And no, I don't think what we have (SQL, GraphQL) is good enough.

> SQL is not a programming language

Yes and no. SQL is an API that allows changing the state. But (same as programming languages such as C) it has advantages and disadvantages.

> if you take the time to learn it

Again, that can be said about assembler as well :-) Or Excel. You can do a lot in Excel.

> SQL is as low level as you're going to get in regards to that.

Absolutely not: SQL is a high-level language. And I argue you don't need a declarative set-manipulation language at all to store and retrieve data. Computers are not set-manipulation machines: they are instruction-based machines. The popular programming languages reflect that: the most popular ones (Python, C, C++, Java, C#, Visual Basic, JavaScript) are _all_ imperative, which is close to how computers operate, and relatively easy for people (developers) to understand. Functional languages are not that popular. So SQL is an exception here.
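The imperative/declarative split is easy to see side by side. A sketch (invented data, sqlite3 standing in for the database) computing the same per-name totals both ways:

```python
import sqlite3

orders = [("alice", 5.0), ("bob", 3.0), ("alice", 7.5)]

# Imperative: you spell out HOW to compute the totals, step by step.
totals = {}
for name, amount in orders:
    totals[name] = totals.get(name, 0.0) + amount

# Declarative (SQL): you state WHAT you want; the engine decides how.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders(name TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)", orders)
sql_totals = dict(con.execute(
    "SELECT name, SUM(amount) FROM orders GROUP BY name"))

print(totals, sql_totals)  # both {'alice': 12.5, 'bob': 3.0}
```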

> SQL has to be long and detailed

I argue the reason is that SQL is not a very good abstraction if you need extremely large statements that are hard to understand and almost impossible to maintain. Same for Excel.

> I will concede that I've seen some monstrous SQL in the wild... mostly written by people who don't really know SQL

Yes. I argue that SQL _favors_ (or at least allows for) writing complicated, messy code that lacks encapsulation. In a larger company, you will have some percentage of people who are not all that great, and you don't want to pick a language that will easily result in a complete mess. Languages such as Java favor modular, well-encapsulated code: it is hard to make a complete mess in Java.

> Permissions can be tweaked

Permissions are not an encapsulation mechanism in the way "private" fields in Java / C++ are: those protect internals from being accessed by other code at all.

>> SQL statements often don't use indexes

> This is a trivial problem

Well, for the experts, yes. But not for the novice programmer, and you will have some percentage of novice programmers. In a larger shop, you typically (still) need a "database administrator", or a group of experienced people who ensure indexes are created correctly.

> various libraries do have their own optimizations

Libraries typically don't have optimizers. Compilers do, yes, but those don't pick indexes or anything else that can make the runtime a million times slower or faster. Compilers can typically speed things up by a factor of 10 at most (a constant factor). Query optimizers are very different. (Again, I wrote 3 relational databases, and 4 query optimizers.)



