Open Source SQL Parsers

tehlike · on April 22, 2022

Not related to parsers, but i find sql syntax so backwards.

Listing columns first Then table Then join Then filter Then group bys Then limit

The order of operations are out of whack and makes pipeline ing a little hard.

I found this to be closer to LINQ way

I hope in near future databases will come with better query languages...

tfigment · on April 22, 2022

This comes up for me in autocompleting editors. Have to start with "select * from table t;" then go back and then remove * and t.<ctrl-space> to get the hints to populate and then continue with joins which may require going back to select. Update and delete at least start with tables and joins.

vkazanov · on April 22, 2022

...and there's more to sql trouble than just troublesome syntax, typical for languages of its generation!

1. Sql is non-modular

2. Non-orthogonal

3. Its parroting of the (beautiful) relational data model is... opinionated at best.

The standard is unreadable and not really implementable, one always has to resort to implementation docs.

Realistically there are no alternatives, but... sql is no good.

ErwinSmout · on April 22, 2022

Imagine you need to write a query to answer "what are the cities with more than 4000 inhabitants".

What more natural could there be than to write this as (old 1992 correlated query style)

SELECT cityName FROM Cities C WHERE ((SELECT COUNT() FROM Inhabitants I WHERE I.cityName = C.cityName) > 4000)

Agreed ???

YOU CAN'T DO THIS IN SQL [standard version] !!! You must write

SELECT cityName FROM Cities C WHERE (4000 < (SELECT COUNT() FROM Inhabitants I WHERE I.cityName = C.cityName))

This makes SQL about the only language in the world that allows you to write "a<b" in certain places but forbids you to write "b>a" in those very same places. Some professional piece of language design.

(Disclaimer : I don't know how matters are in SQL:2016 wrt to this issue, but it drove a former IBM employee who was at the time a member of the SQL standards committee to (a) start work on TTM and (b) leave IBM as soon as he could.)

rzmmm · on April 22, 2022

What do you think about using SQL views and functions to make composable, modular SQL code?

tbrownaw · on April 22, 2022

It doesn't work very well.

Table Out1 has translated versions of fields AA..AZ from table In1.

Table Out2 has translated versions of fields BA..BZ from table In1.

table Out3 has translated versions of fields X..Z from table In2, and the translation depends on some of the AA..BZ fields, in both translated and untranslated forms.

We'd end up with views that depended on other views and after expanding had like a dozen self-joins to add more fields that weren't included in the base view. And views that were selecting a big list of columns unchanged, plus adding a couple.

But somehow despite being an absolute ^%#@!^&$! it still worked better than the pointy-clicky ETL tooling we'd been using previously. Mostly because of the run->debug->modify cycle time being faster, and only a little because of avoiding pointy-clicky stuff.

vkazanov · on April 22, 2022

1. Views are fine. They somewhat help with abstracting things away here and there, yes. But these are just macros, right?

2. What SQL functions are you talking about? There are builtins, e.g. avg, sum, and then there are User Defined Functions, which have nothing to do with SQL, mostly just a random external language or one of the non-portable PL/SQL flavours.

Practically speaking, one cannot really write portable SQL outside of a very limited intersection of various dialects. And within this limit users just cannot build a reasonably portable code library.

There is a reason why analysts working with SQL just copy massive query templates again and again and again...

nickcox · on April 22, 2022

> The order of operations are out of whack and makes pipeline ing a little hard

It makes it really hard to offer good autocompletions too.

technion · on April 22, 2022

This feels a lot like why Kusto was invented. It's very close to SQL, but written the other way around with the table name first, then the 'where', and finally the columns.

https://docs.microsoft.com/en-us/azure/data-explorer/kusto/q...

tehlike · on April 22, 2022

Correct looks like it's trying to be the linq of sql. Looks good. MSFT really knows how to make programming tools...

cryptonector · on April 22, 2022

I don't mind SQL, really, but I would prefer a query construction API that has the full power of SQL (and more even). At the very least such a thing would not have a SQL injection problem, but also could be used to generate ASTs.

  Query q = db.from("foo").join("bar").using("id")
    .select("id","foothing","barthing");

However, the moment you want any non-trivial SQL expression, such an API becomes unwieldy. And yet the need for non-trivial SQL expressions in queries can't be avoided, nor can it be left to the host language.

Also, I'd really like a SQL mode where no literal constants are allowed, as a mechanism to force the use of query parameters and prevent SQL injection.

ErwinSmout · on April 22, 2022

That "the API becomes unwieldy" is in fact inevitable. In order to support .where(<boolean expression>) as well as .select(<any expression>, ascolname) you need a way to pass a parameter of type something-like-expression to the .where() and .select() methods, and lambdas or at least something like them appear to be the right way to do that (e.g. .where((foothing) -> (foothing<0)), but languages offer no means to "introspect into the lambda" which is what is needed to translate it into SQL.

Plus (assume my .where() example was Java), the host language compiler is never going to find a way to cope with the declaration of the 'input' of the 'lambda', *precisely* because [the definitions of] those things are outside the scope of what the host language compiler knows about. So you need a way to do something like "import my-db-definition;" And then you need to put machinery in place to verify at runtime that the definitions as they are for the db, are still the same (or compatible) with the definitions that the program was compiled with. Etc. etc. etc.

cryptonector · on April 22, 2022

> lambdas

Lambdas in the host language won't play well with remote RDBMSes -- you'd have to be able to serialize the lambda and make the serialized form reasonably efficient. I'm skeptical of lambdas for expressions in DB query APIs.

An expression like `foo + bar` has to become `.expr(plus("foo", "bar"))`, except, if you take this to its limit you'll want to use non-string objects to identify column names and other such things, and, again, it quickly becomes unwieldy.

You cover some of this in your comment, so I think we're in agreement:

> That "the API becomes unwieldy" is in fact inevitable.

And yet an API is kinda desirable.

Ultimately I think a query language with no constant literals, only query parameters, compiling to a standard AST that can also be constructed by APIs, may be the best way forward. One could then write in a QL to start with, compile to an AST, serialize into a host language if desired, modify and use that, or always use the QL, or even always use the API, unwieldy though it would be. Then lambdas could actually be in the QL and compiled on the fly as needed:

  Query q = db.query();
  
  q = q.from(...).join(...).using(...)
       .select(...);
  q = q.where(q.expr("foo + bar < baz"));

Ok, that snippet has a problem if we want `q` to be a unique pointer. Let's fix it up a bit:

  QueryScope qs;
  Query q = db.query();
  
  q = q.from(...).join(...).using(...)
       .select(...);
  
  /* Get a context of in-scope identifiers, types, ... */
  qs = q.getWhereScope();
  
  /* Now we can compile an expression string */
  q = q.where(qc.expr("foo + bar < baz"));

And a hybrid QL + API might look like:

  QueryScope qs;
  Query q = db.parse("SELECT ... FROM ... JOIN ... USING (...)");
  
  /* Get a context of in-scope identifiers, types, ... */
  qs = q.getWhereScope();
  
  /* Now we can compile an expression string */
  q = q.where(qc.expr("foo + bar < baz"));

ErwinSmout · on April 22, 2022

"And yet an API is kinda desirable."

Yeah. however I may not live to see the day when that desire, which I agree is felt by 99.99% of the developer community (hell, even by 99.99% of the end user community because don't come and tell me that what that community is feeling isn't some sort of sense that "the developers just can't offer us any answers"), actually gets to be fulfilled.

"compiling to a standard AST"

I think that what you might be failing to appreciate is that achieving that requires a "standard algebra", and that no such thing exists at this present day.

cryptonector · on April 23, 2022

> I think that what you might be failing to appreciate is that achieving that requires a "standard algebra", and that no such thing exists at this present day.

Oh I appreciate that. What I've in mind is something like the SQL standard (which isn't exactly universally adhered to) of QL ASTs.

tehlike · on April 22, 2022

Linq in c# fixes that.

ErwinSmout · on April 22, 2022

I doubt that it does. There is way more to "integrating [host] language and queries" than the MIN() of what Micro$oft engineers are (a) capable of understanding and (b) allowed by their own management to put in the products they come up with.

antonvs · on April 24, 2022

Your comment may be true of many things that Microsoft does, but not LINQ.

The creator of LINQ was Erik Meijer (https://en.wikipedia.org/wiki/Erik_Meijer_(computer_scientis...), who was a leading researcher in programming languages and functional programming before shifting into industry. He's well known in the Haskell world for e.g. his work on bananas and lenses, and many other things.

LINQ (https://en.wikipedia.org/wiki/Language_Integrated_Query) is basically an embedding of a general monadic framework, augmented with a set of query-specific operators. It can process arbitrary data sources - quoting from the link: "arrays, enumerable classes, XML documents, relational databases, and third-party data sources."

In addition, because it's fundamentally monadic, it can be used to express all sorts of computations, by: "utilizing query expressions as a general framework for readably composing arbitrary computations, include the construction of event handlers or monadic parsers."

LINQ has been ported to PHP, JS & TS, and even Actionscript (although the ports aren't necessarily as capable as the version on .NET, which has language-level support for the features.)

If you're interested in this general subject, I guarantee you you have a lot to learn from LINQ and the research work behind it.

ErwinSmout · on April 27, 2022

Dude, please, give me a break.

Erik Meijer is that, eurhm, person, I remember from publicly stating that his favourite resarch method is "throwing things at the wall and see what sticks". WOW. That's the way of the true academic. Of course it might have been the case that he was being really truly utterly facetious/cynical, but I certainly don't recall having any sense of that, not even remotely, upon reading that remark. (And if it means anything to you, I don't recall someone like Dijkstra ever writing anything like that anywhere. But of course Dijkstra was one of the last *TRUE* academics.)

And he is also that, eurhm, person, I remember from "All your databases are belong to us", which drove Chris Date to writing the response he did (published both in the ACM and in his own book "Stating the obvious") in which he publicly shamed the ACM itself for lending its pages to such sheer utter nonsense.

You might want to read the substance. If it shows anything at all, it's the absolute absence of value there is to be found in having a PhD these days.

And there is very, VERY, little in your reply that does not fall either into the category of "argumentum ad verecundiam" or else into that of superficial handwaving.

antonvs · on April 28, 2022

You can't really complain about argument from authority when using another authority to rebut the first one.

Date rather missed the point though - in that article, Meijer clearly identified the duality of the "data modeler" vs. "programmer" perspective - "duality" implying that these were two different perspectives on the same problem. He was saying that the programmer perspective had needs, created by the new industry landscape at the time, that weren't being served by existing relational databases. And he was correct on that point, as the rise of NoSQL solutions and databases such as BigQuery have demonstrated. If Meijer was wrong, then the entire industry, including Google, AWS, etc. have been wrong for a couple of decades now. But really, Date was partly indulging in a shooting of the messenger, and partly a nitpick at what he saw as a mischaracterizations. Date's responses were largely irrelevant to Meijer's main points, and he should have known better.

Anyway, I notice that none of your excuses to avoid learning are actually technical objections to what I was recommending. Why fight so hard to find reasons not to learn?

ErwinSmout · on April 28, 2022

No, Date did not "miss the point". He properly identified that all of Meijer's arguments he used to make his point were in fact complete bunk. And he clearly and factually answered why that was so, on a blow-by-blow basis. You call that a "nitpick at what he saw as mischaracterizations", I call that "identifying the blatantly ignorant elephant in the room".

The only perspective should be "how can we get information from users (or any other form of information-capturing device in fact), make records of that information and keep those records as long as the information may be needed [and as long as it pays off to keep them], so that we can later get that information back to same [or other] users". Let's call that perspective "HG" for "Holy Grail".

Codd's intent has always been to achieve HG. Date's intent has always been to achieve HG. Both properly identified the core asset we work with : data. It's always ultimately about the data. Both consequentially properly identified the first step needed to make achieving HG possible : a model of data built on foundations of mathematics. (To quote Codd : "database design is not going to be possible if the only concepts available are bits and bytes".) The model of data they came up with is the relational one. And achieving HG is not going to be done by ditching the model of data. On the contrary. It might be achievable by replacing it with some other model of data, but that is going to require exhaustive demonstration of how and why the replacing model of data is better than the relational one, and delivering such exhaustive demonstration is going to require understanding the relational one in the first place. Of the entire mob that is into what you call the "programmer perspective", there is not a single individual that does.

And Date's point in his reply was : neither does Meijer. Backed by evidence, so it was neither "shooting the messenger" nor "a nitpick" but unfortunately it might require understanding the RM to understand the evidence.

To close, I have some hints for you : (1) there ARE NO "existing relational databases" (2) "BigQuery" is not a "database" but a "DBMS" (there's a Dijkstra quote somewhere about how mastery of mathematics and mastery of language fortuitously tend to always come hand in hand - and there's a corollary to that) and (3) yes, the entire industry has indeed been "wrong for decades" because the RM has never been properly understood. See the McGoveran quote on www.dbdebunk.com. Ironically, this grave mistake might be due exactly to "too much programmer perspective" at the time those pseudo-relational systems were built.

cryptonector · on April 22, 2022

I also doubt it, though I know nothing about Linkq so I'm loathe comment except in so far as what makes the problem hard (see comments above).

tehlike · on April 22, 2022

Microsoft is surprisingly good for creating good developer tools and infrastructure.

ErwinSmout · on April 22, 2022

In the perception of the Micro$oft users who are brainwashed with the idea that what Micro$oft does is good for the developers.

(In fact, it might even be outright true. But that's not a guarantee that what the Micro$oft users do with the Micro$oft tools is necessarily also the *BEST* thing for the user of that software product that the Micro$oft users produce.)

tehlike · on April 25, 2022

You are being unnecessarily cynical. I left microsoft ecosystem about 10 years ago, but still miss the developer environment they had in 2012 today. Nothing still comes close.

ErwinSmout · on April 28, 2022

I invite you to inspect, even if only cursorily, my above exchange with mr. antonvs. My point is that Micro$oft was created by programmers (in fact in its earliest days it was mostly a compilers company), that to this very day the vast majority of what it does falls into the category "by programmers, for programmers", and it may indeed be the case that they are extremely good at that. And if you are yourself a programmer, then it is obviously only self-evident that you are then going to praise/applaud Micro$oft for what they do. It's just that this "programmer perspective" may yield a hopelessly narrowed down tunnel vision on what the real goal should be. That real goal being : making complete sense to the final end-users. And as an example of "not making complete sense" : I open an xls file with multiple sheets in it. I browse through the contents, paging up and down through the sheets. I'm done browsing and I close the file. Microsoft asks me if I want to save my changes. The reason of course (or presumably) being that somewhere hidden inside, there's a "current sheet" indicator and it is considered part of the data because it gets saved alongside the actual sheet contents, so the file can open up with the last sheet displayed already active. But you show me one single user (and especially one not coming from an IT background) who never once in his life reacted to that with "WHAT FRIKKING CHANGES ARE YOU TALKING ABOUT ? I CHANGED NOTHING.".

zmxz · on April 22, 2022

In my case it's the opposite - SQL is perfectly readable and fits my mental model while the project you linked feels awkward and hard to read.

However, it's good to have options, I wonder if prql will get traction. It would be great if it did.

7952 · on April 22, 2022

A lot of the time people's mental model will be derived from the language though.

ErwinSmout · on April 22, 2022

And in the case of SQL, that is precisely the problem.

zmxz · on April 29, 2022

So the problem is that SQL can be picked up fairly easily by English speaking population?

How's that the problem and not advantage?

ErwinSmout · on May 1, 2022

Because "speaking SQL" [even before SQL as such was even invented, hence the scare quotes] was intended as a skill to be practiced only by those who also grasped the [mathematical] logic of what they were doing when "speaking SQL". None of the people who "speak SQL" these days fit that bill.

And FWIW, what I wrote was in reply to "mental model". It was *you* who apparently equated that with "speaking English". The "mental model" I described as "exactly the problem" is the "mental model" *of the corresponding logic*, which, though mathematically consistent, is hopelessly unintuitive. Being English-speaking or not has nothing to do with understanding a particular logic underpinning the behaviour of a particular Data Manipulation Language.

zmxz · on May 3, 2022

Having a language that already educated and skilled group of people can pick up fast is advantage. Targeting and working with people who grasp logic is good. Not bad. There's ever-repeating notion present that we need to invent tools that people, who are far from skilled, should be able to use to produce high quality results. Having a tool like SQL that's perfectly usable for a group of highly skilled people is excellent, not bad.

> None of the people who "speak SQL" these days fit that bill.

You can't prove this. This is what you believe is true.

> And FWIW, what I wrote was in reply to "mental model". It was you who apparently equated that with "speaking English"

Isn't it odd how we have the gift of being able to communicate, yet all we do is think we can wield telepathy and not use words to express ourselves. Unless you accurately express what it is you're thinking about, I can't fill in the gaps since I'm not telepathically enabled.

> Being English-speaking or not has nothing to do with understanding a particular logic underpinning the behaviour of a particular Data Manipulation Language.

Of course it does.

ako · on April 22, 2022

Nice example of: always begin with the end in mind - first specify what data you want to see and then specify where it should come from.

I think it helps you focus on the outcome required, and makes you more efficient deciding how to implement it.

I use the same approach for methods in normal code: first decide on the signature of the method, and potentially a unit test to test it, and then think about the implementation.

tehlike · on April 22, 2022

I like to think of data transformation a chaining problem. Linq lambda syntax is kind of like that.

db.Table.where(x=> x.name == 'abc').skip(10).take(20).groupBy()

This is really hard in sql especially with temporary tables and nesting you need to do as soon as you need to do things a little out of order...

Etc

ErwinSmout · on April 22, 2022

That's because the true foundation is that relational algebra is (a) an algebra and (b) closed over relations. (a) means you should have an expressions system that allows arbitrary nesting and (b) means you should be able to arbitrarily nest relational expressions in particular. SQL fails heavily on (b) (it fails less so than it used to do with the SQL:1992 standard but even so, it's still far from where we should have been and SQL itself is still an abomination).

Moreover, your very examples illustrate how "thinking SQL" is *NOT* the path to finding proper solutions. .skip() and .take() clearly originate from the world of ranking problems and it drips off of every word that you were thinking of finding a "modern" way to specify [things like] "TOP 10", but those solutions are themselves a fundamentally crippled way to attack the problem.

tehlike · on April 22, 2022

Some discussion on prql https://news.ycombinator.com/item?id=30060784#30062329

userbinator · on April 22, 2022

It was probably influenced by COBOL, i.e. designed to read like English.

manigandham · on April 22, 2022

In fact, SQL (structured query language) was originally called SEQUEL (structured English query language).

tehlike · on April 22, 2022

I find prql more English for some reason...

ErwinSmout · on April 23, 2022

It was probably also influenced by Codd's own preference for the data language in calculus-based style (Codd's roots as a mathematician are to "blame" here), as opposed to algebra-based (the style typically preferred by developers).

summarity · on April 22, 2022

KQL is a more LINQy language.

tehlike · on April 22, 2022

Looks interesting.

eatonphil · on April 21, 2022

I wrote a similar post recently and not all the libraries overlap interestingly. My post includes sample code for each library and evaluates a few various complexity SELECT queries for correctness.

I may have to write a second post now to include some from this one that I missed!

https://datastation.multiprocess.io/blog/2022-04-11-sql-pars...

1egg0myegg0 · on April 21, 2022

Since DuckDB is so portable, it can be a good choice for adding a SQL parser to a new system. Have a look here for an example that uses the parser separately from the execution engine: https://github.com/duckdb/duckdb/blob/master/examples/standa...

eatonphil · on April 21, 2022

Duckdb itself uses a variation of pganalyze's library-ification of postgres's parser so this seems a little indirect.

That said, I've heard the duckdb folks have improved on the postgres parser a bit.

throwaway858 · on April 22, 2022

There is also a Haskell port of the original PostgreSQL parser, implemented using megaparsec, which makes it highly flexible and hackable: https://github.com/nikita-volkov/hasql-th#implementation

m-novikov · on April 22, 2022

There are also multiple ongoing projects attempting to implement incremental parsers(1, 2, 3) e.g. for tree-sitter. To make editing SQL code more convenient.

[1] https://github.com/m-novikov/tree-sitter-sql

[2] https://github.com/DerekStride/tree-sitter-sql

[3] https://github.com/Snowflake-Labs/lezer-snowsql

cm2187 · on April 22, 2022

I believe antlr has grammars for other sql syntaxes (sqlite, tsql, etc): https://github.com/antlr/grammars-v4/tree/master/sql

userbinator · on April 22, 2022

Much like with other things, recursive-descent is probably the best way to write a SQL parser as it's easy to debug, modify, and extend. There's some great hyperlinked grammars at https://ronsavage.github.io/SQL/ for those who want to try writing one.

Also, is it just me or does the text of this article have an almost SEO-spam-like "texture" to it? It reads very unnaturally.

lmwnshn · on April 22, 2022

On the other hand, you're probably writing your SQL parser to parse SQL queries that are being fed into some DBMS. I found pg_query's argument convincing: "Our conclusion: The only way to correctly parse all valid SQL queries that PostgreSQL understands, now and in the future, is to use PostgreSQL itself." [0]

[0] https://pganalyze.com/blog/parse-postgresql-queries-in-ruby

conaclos · on April 22, 2022

SQL is a mess. It is difficult to find a parser that fulfills your needs.

A month ago, I needed a parser to parse SQLite schemas. SQLite has some edge case and I needed a lossless parsing to detect these edge cases.

I finally wrote my own parser [1] to fullfill my needs and to have some fun. This is for a POC project.

[1] https://github.com/coast-team/sqlschm

cryptonector · on April 22, 2022

If I needed to parse SQLite3 schemas, I might just fork the Lemon grammar for SQLite3 from SQLite3 and make it produce an AST.

anitil · on April 22, 2022

I'd be interested to hear people's use cases for parsing SQL. The link talks about exploring sql history, but has anyone else got some interesting uses?

A couple of times where I've needed to parse SQL I would typically write a module for sqlite3 and get it to do the parsing for me. But annoyingly I can't remember _why_ I did this or what I was trying to achieve.

lmwnshn · on April 22, 2022

In Andy Pavlo's current seminar class 15-799 [0], everyone parsed SQL to extract which columns were being accessed in a workload's queries. This was used to build an automatic index tuning tool.

[0] https://15799.courses.cs.cmu.edu/spring2022/project1.html

crorella · on April 22, 2022

haha, we did just that in my company, but aimed at distributed tables (like HDFS) for their partition and bucketing schemata.

pbowyer · on April 22, 2022

A couple of years ago I inherited an old, badly written web app doing up to 800 queries per page. It was slow and hammering the server.

Being read-heavy the app was an ideal candidate for output caching. But - there were no hooks in the admin code to add cache invalidation. And I wasn't going to crawl through 10s of thousands of badly written lines and add cache invalidation calls manually, because I would miss some.

So I hooked into the database layer, parsed the SQL queries and extracted the table names. The read-heavy pages on the frontend were tagged in the cache with the names of the tables they read data from. In the backend, I'd collect the table names in all the write SQL queries and then clear the cache that was tagged with these table names.

Working at the table level rather than row level it cleared more data than was needed, but it was simple and effective.

It worked really well but never went into production - in the end we forced a rewrite of the app. One day I'd like to revisit the idea.

anitil · on April 27, 2022

That is a fantastic use case! I'd have never have thought of doing this

i_s · on April 22, 2022

At my work, we often parse and rewrite a query before handing it off to SQL Server, because there are a lot of cases where Microsoft misses obvious optimizations. Sometimes there are also optimizations we can do because of things we know at compile time, but don't fit in the type system of SQL. The impact varies all the way from just shaving off 10% of the execution time, to changing some queries from timing out in a web request to executing in a few hundred milliseconds.

A few examples:

- Inlining scalar function calls (less impactful now with Sql Server 2019)

- Removing joins from a query when we know it won't impact the number of records

- Deepening where conditions against derived tables

- Killing "branches" of union queries when they can be determined to not matter statically

Yes, we could write the queries that way in the first place, but it would make them harder to compose, more verbose, and harder for the programmer to communicate intent. Not to mention, often the user can impact what the query will be, making it less feasible.

zasdffaa · on April 22, 2022

Hmm? This doesn't make much sense. MSSQL does these already I think (except the first)

.

- Inlining scalar function calls (less impactful now with Sql Server 2019)

write them as table valued funcs and it'll inline them for you (ugly but it works and is easy)

.

- Removing joins from a query when we know it won't impact the number of records

That's just tree pruning. It does that

.

- Deepening where conditions against derived tables

If I understand you that's predicate pushdown. MSSQL does it well.

.

- Killing "branches" of union queries when they can be determined to not matter statically

Not sure what you mean. Can you give an example?

i_s · on April 22, 2022

> This doesn't make much sense. MSSQL does these already I think (except the first)

We tested all of these before taking the time to do the rewriting, and no, either they don't, or they way they did it isn't good enough. You can say I'm lying if you want, I'm not going to spend time arguing about it.

zasdffaa · on April 23, 2022

You could post some working examples and show them failing.

lmwnshn · on April 22, 2022

> less impactful now with Sql Server 2019

Could I ask what your process is for detecting whether a rewrite rule is still useful in subsequent versions of the DBMS? Do you read the release notes and test things out manually, or do you have an automated A/B test thing going on?

Additionally, have you ever had a rewrite rule change from being beneficial to being detrimental after upgrading versions? If so, how did you detect that?

Thanks!

Edit: Also, what considerations do you have for rewrite rule order? Do you find that it makes a significant difference in practice?

i_s · on April 22, 2022

To check if rewrite rules are still helpful, yea - we just read release notes and test manually.

We've never had a rule change from being beneficial to detrimental. For most of them, I don't think that would be possible, because they just involve giving Sql less irrelevant things to chew on. For a few of them, like the manual scalar function inlining, I could see that being possible, so we will just need to keep checking.

For rewrite rule order, I guess we just do the ones that can enable other optimizations first. So far, that has been pretty simple to determine. For example, when we trim joins from inner queries, we first trim their selections (depending on what is actually used in the outer queries).

zasdffaa · on April 22, 2022

> We've never had a rule change from being beneficial to detrimental

Whatever you're doing is really confusing me (see my prev post). A multi-thousand line piece of the SQL I wrote was tested on mssql 2019 and there was a blatant perf. fuckup from it's previous home on mssql 2016 (or was it 2014). Poss. down to the new cardinality estimator, I dunno.

> For example, when we trim joins from inner queries, we first trim their selections (depending on what is actually used in the outer queries).

If you looked at the query plans you will see this happens automatically. And very reliably because it is easy (indeed, quite straightforward) to do automatically.

Also... 'we just read release notes' - these optimisations are not documented (except maybe in one place which they carefully undocumented after the 1st release) because these are trade secrets. The optimiser is one of the most important and carefully guarded parts of mssql - it's not in the notes and never will be.

No offence but... seriously, wat???

lmwnshn · on April 22, 2022

Thanks! Super interesting.

eatonphil · on April 22, 2022

Building a SQL IDE and want to do introspection on queries. Building a SQL frontend for some API or CLI. Building a SQL-aware proxy. Just a few ideas. :)

achanda358 · on April 22, 2022

Some more: SQL linter, alternate DB implementations (Cockroach DB, TiDB etc), SQL optimisers

anitil · on April 27, 2022

Interesting. Out of interest, what would an SQL-aware proxy give you? On-the-fly rewrites? Automatic caching?

eatonphil · on April 27, 2022

Have you heard of Vitess? Or the company Planet Scale? Vitess is a SQL-aware proxy that helps you scale MySQL.

https://vitess.io/docs/13.0/overview/whatisvitess/

dmfay · on April 22, 2022

I'm using a tree-sitter grammar^1, which category the OP doesn't mention, to index database object references in data access code and process schema migrations^2. The idea is early detection of potentially-dangerous database changes that modify or drop tables/views still used elsewhere, across the entire organization's code. It's already saved my bacon a few times.

^1 specifically https://github.com/DerekStride/tree-sitter-sql , but there are a few others around too

^2 https://ectomigo.com

crorella · on April 22, 2022

We created a custom purpose parser to get some interesting facts about the SQL statements running in our warehouse, a couple of millions of queries per day. With that information, we created a tool that finds the best partition and bucketing schema for each table, based on how the query patterns of downstream pipelines access them.

Since some of the tables are multi-petabyte with thousand of downstream consumers, the savings have been in the millions of USD, because of CPU savings mostly, but also there is the benefit of improved wall time and data arriving much earlier to dashboards and reports.

lmwnshn · on April 22, 2022

Would you be able to provide examples of interesting facts, and is this tool (or anything similar) open-source or described publicly? Thanks!

crorella · on April 22, 2022

Interesting facts about the tool, we created the parser with speed and efficiency in mind, it is able to process ~5M queries in less than an hour, since many of the SQL statements are associated with scheduled pipelines, I came up with a way of normalizing them and generating a signature we use to skip them if they have not changed, that saved a lot of processing when parsing.

We also created some other datasets that tell you how the tables are normally joined and another team created a ML model that now we use in an internal tool that automatically recommends the best join keys when you join 2 or more tables (since we have lots of historical info on that already stored).

We also had to create a very efficient table profiler to evaluate the candidates, because in some cases there are columns that are widely used in equi-where conditions, but their cardinality is very high, making them bad partition columns.

One guy created a parser that actually gets the most common values used to filter each column, I guess we could use that in the future to materialize some views; my original vision of the project was to continue with partial aggregations for common computations. The thing is there are several pieces of code that are pretty much copy/pasted and reused in many pipelines, so why not materialize those and rewrite the SQL of the subsequent pipelines to leverage the materialized version? huge savings there.

We are planning to present it in VLDB or a similar forum, there are some aspects of it that we need to 'clean' if we want to open source it.

Other parts of the system include the candidate evaluation and the module that computes the expected savings for the best candidate selected during evaluation; this system in particular has a lot of specific Presto and Spark logic that might need to get more general if we want to open source it.

lmwnshn · on April 22, 2022

Thanks for the detailed response, looking forward to the VLDB paper when it happens! Your vision sounds cool. I wonder if you could get most of the way there by exposing the workload (across different pipeline stages) to a materialized view recommender.

In the class project mentioned elsewhere, I found normalizing queries to be pretty slow in practice (naive standardized formatting + query templatization, tried various Python libraries, settled on pglast). I didn't think about trying "skip if fingerprint matches", which may help considerably. Fast normalization is nice! :)

crorella · on April 22, 2022

> I wonder if you could get most of the way there by exposing the workload (across different pipeline stages) to a materialized view recommender

yes! that's something we are trying to do, since we have a way to create signatures for SQL statements and subqueries are just SQL statements then we can get all the "signatures" a query use and compare if other queries are using the same signatures. Then just sort those queries by number of times used and put some other perf metrics like IO/CPU needed to compute it and you get a good starting point.

Microsoft did something similar with Azure, using bipartite graphs, their solution was more advanced as they also baked in constraints like "the materialized view can't be more than X GB in size" but the end result is the same. (https://www.microsoft.com/en-us/research/uploads/prod/2018/0...)

cryptonector · on April 22, 2022

Here's a few:

  - syntax highlighting
  - algebraic manipulations
  - optimization analysis of queries seen
  - analysis of equivalence of queries
  - query rewriting (for style, performance, etc.)
  - RDBMS portability layer (implement a common
    subset of SQL, port queries to different
    engines; see previous item)

I can probably think of more.

madebyherzblut · on April 22, 2022

Slightly off-topic, but I appreciate any pointers: In a project, I want to offer an SQL interface for data analysts similar to Stripe Sigma[1]. The tricky thing is not to parse the query but to map the public schema to the private schema, add authorization, and distribute the query across data stores. So, I am looking for a customizable query parser, planner, and execution engine.

I briefly looked into Apache Calcite and DataFusion[2], but I am unsure if I am on the right track. If someone has any ideas about where to look, please let me know.

[1] https://stripe.com/en-gb/sigma [2] https://github.com/apache/arrow-datafusion

haliax · on April 22, 2022

For a similar use case, I’ve been considering a combination of s3 exports of db views (mapping private schema to public) + lakeformation governed tables (which allow table, row, and cell level security via iam) + redshift or athena for querying + sts/cognito for authorization to give logged in users a temporary access key id and secret access key. Admittedly an AWS heavy setup, but in my use case that’s an advantage :)

zasdffaa · on April 22, 2022

> So, I am looking for a customizable query parser, planner, and execution engine

That looks like an entire DB engine, isn't it?

> map the public schema to the private schema

sounds like view.

It seems like a pretty major project as you describe it.

ErwinSmout · on April 22, 2022

And if you're set on building an "entire DB engine" then you can just as well go the full mile and go for relational. As opposed to SQL.

zasdffaa · on April 22, 2022

I know about TTM etc, but what business value would that bring to be fully relational? Specifically, thanks.

ErwinSmout · on April 22, 2022

See the "business case for SIRA_PRISE". Imagine how many years of codeshitter-hours you [or, if you're in the DBMS market, your customers] would no longer have to pay for if you could have *ALL* of your [strictly data-related] business rules enforced by a mere *declaration* made by the business analyst c.q. the data administrator (note the absence of 'base' in that term), as opposed to having to rely on those very same codeshitters because there simply ain't no other way for you to enforce same set of said business rules other than to pay that mob for coding all that application-enforced integrity (which they are bound to get wrong because it is fundamentally beyond their ability) or the stored procedures that achieve the same but are still procedural (and is still bound to fail for essentially the same reason).

If you know about TTM, you might also know about "Applied Mathematics for Database Professionals". What I'm referring to is their "execution model 6" becoming possible *WITHOUT* any [procedural form of] coding [by mere programmers].

Another way of saying this is "You can have CREATE ASSERTION if you want to".

zasdffaa · on April 22, 2022

Am aware of sira-prise but not very well. Thanks, will read up (any disclaimer needed here; has sira-prise any link to you?)

I'd appreciate you omitting the codeshitter ad-homs, it undermines your case.

Pretty sure no declarative statement can be made efficient automatically so that remains a dream (though one I will need to look at) so it will kill performance. I too hate procedural enforcements but there seems to be no way round them. Have you got a reliable statement anywhere that says efficient 'create assertion' in sql is possible in general?

AMfDbP book - it's on my reading list already. Thanks for the pointer.

ErwinSmout · on April 22, 2022

Link between me and SIRA_PRISE : I am the author.

codeshitter ad-homs : yeah well I know they are. The fact of the matter is the history between SIRA_PRISE and me (and why I did it in the first place) is now almost 20 yrs old, and I know how it's been received, and that's primarily due to (a) how the codeshitters (and the way how they are subject to the Dunning-Kruger effect) have come to dominate the entire industry and (b) how that economic system I too am bound to operate/survive in prevents the managers who control the system from taking "too much" (say, > 0.01%) risks.

And about "Pretty sure no declarative statement can be made efficient automatically so that remains a dream" : there fucking sure ain't no better way to piss me off, not just to the other side of the planet, but to plain outright Mars or Jupiter. You're just too pretty damn sure of yourself. AM4DP makes +- the same statement as you although the authors there still did manage to *NOT* make the same logically flawed inference from "not anybody knowing today how it can be done" to "cannot be done". You apparently fall into the category of people who do make that inference. The article exposing the exact opposite of your convictions was published in Oracle Users Magazine somewhere around 2013, IIRC. I'd need to look up the exact details by now, but it was already an incredible surprise Oracle User Group even wanted to publish on the solution of a problem that Oracle Corporation itself wasn'y (and still isn't, probably due to the power of certain people within that company who have been labeled 'bean keepers') prepared to invest in.

In fact, the only reason I wrote that article in the first place was Toon Koppelaars using that very word 'dream' (which you also used in your reply) to associate the idea with SQL's CREATE ASSERTION, when I already knew it needn't be a 'dream' no longer ... Well, if anyone just wanted to believe, which they don't, you inluded, apparently ... To Toon's credit, although he also used the adjective 'impossible' to qualify that 'dream', he did manage to end his title phrase with a *question* mark. You ended yours with an exclamation mark.

And just for the record, this statement is *NOT* to be interpreted as "every thinkable constraint can be implemented with sub-millisecond violation detection". Sunt certi denique fines, quos ultra citraque nequit consistere rectum. The "fines" in case being the bounds of what algorithmics as it is known today, can do for us. Constraints like "all the people who obtained a degree in mathematics must be paid in the top 5% of salaries" (if anyone ever wanted to formally declare and enforce any rule like that) inevitably takes us into realms of 2nd-order logic that no known algorithm today can guarantee us execution times like the ones we have grown accustomed to from what is supported by SQL systems these days).

As for "do you have a reliable statement" ... I have two answers but neither have the quality of being dependable on the academic level of meaning you might probably want to attach to the words "reliable" and "dependable". The first answer is "No, because the only other person in the entire world that I'm aware of achieving results in the same area is professor Davide Martinenghi and his PhD thesis on the subject, of which I don't even consider myself capable enough to academically assess the equivalence between his results and mine" and the second answer is "No, because that 'reliable' statement would have to be made by me myself and I have nothing(*) but my own implementation to 'prove' it and I'm very well aware that on an academic level, a seemingly working implementation is not a proof".

(*) I did write a paper on the subject and it's been seen by Chris Date (who by proxy answered that he was 'impressed'), by Hugh Darwen, and by Adrian Hudnott. And I will rather take it to my grave with me than disclose it any further.

And about the *scrutinously detailed subject* of your question whether "efficient 'create assertion' in sql is possible in general" : I have grown convinced that *IN SQL*, it isn't. But the reasons are *ENTIRELY* related to SQL's depending on 3VL. I have grown convinced (in fact, for my own conviction, I think I have sufficiently demonstrated) that in some *other* DMl language that *embraces 2VL*, it is *perfectly feasible*. And in fact, I think SIRA_PRISE itself is its own proof, in that respect : *ALL* the rules that SIRA_PRISE imposes as business rules on its users are implemented as mere declared database constraints on the catalog, and are enforced using the *exact* same machinery that also enforces the *user* declared business rules on the *user* databases. When I showed my system to a former colleague of mine who made his entire career in data [base] administration backed by a degree in mathematics, he merely responded that there just ain't no better POC than that.

zasdffaa · on April 23, 2022

Jesus. I feel there may be a touch of dunning-kruger about you. From that book (page xvi)

" Standard SQL is relationally complete in its support for declarative constraints by permitting the inclusion of query expressions in the CHECK clause of a constraint declaration.

We do not yet know—and it is an important and interesting research topic—how to do the kind of optimization that would be needed for the DBMS to work out efficient evaluation strategies along the lines of the authors’ custom-written solutions. "

So it can't. But then you say.

> But the reasons are ENTIRELY related to SQL's depending on 3VL.

Easy to fix. Simply require the assertion to be defined on tables (ok, ok, relvars) with 'not null' on every column, also require a PK to ensure uniqueness, and you're away. Except you aren't.

I'm not going to argue with you. Your knowledge is clearly considerable but it comes with something extra I don't need.

ErwinSmout · on April 23, 2022

The basic text of the book was written before Lex died (must have been +- 2006), my work in which I cracked that problem saw the light of day after that time. So "we do not yet know ..." was technically correct at the time they wrote it, it no longer would be now.

That other snippet "Standard SQL is relationally complete ... CHECK clause ..." is technically correct, but the standard allows subqueries referencing other tables than the one the CHECK clause is on, but as far as I'm aware no product supports that (and the ones that do leave the user exposed to risk of faulty behaviour). Sadly, such a feature is necessary if we'd want to write, say, an FK constraint in the form of CHECK clause on the referencing table : CHECK (EXISTS (SELECT 1 FROM PARENT WHERE <FK equality tests here> ) ).

ErwinSmout · on April 28, 2022

With one single exception : you cannot use a CHECK clause to check that a table must be nonempty, because the semantics of the CHECK clause are that it must be satisfied by all rows in the table, and therefore the empty table trivially satisfies any CHECK clause.

zasdffaa · on April 30, 2022

Huh! That makes sense, thanks! It's standard univ. quantifier behaviour, I hadn't made the link.

Also that's bloody odd because date or darwen (or both) explicitly called out this behaviour as flawed in SQL when in fact it's obviously logically consistent, and they know logic and SQL, so... why are they critising it? How very strange.

Again, thanks for pointing this out. How did I never realise this?

ErwinSmout · on May 1, 2022

What they "called out as flawed" was the impossibility to declare any such rule with SQL as it stood. What they were "criticizing" was precisely the fact that a data language that was supposed to be "expressively complete" was *unable* to express such a rule. You can't claim "expressive completeness" for a language if there is demonstrably a case where said language's expressiveness fails to meet the mark. It's not the "behaviour of the language" they were criticizing in all/any of those use cases that were supported by the language as it stood. They were criticizing the fact that there was a use case [and a relatively reasonable one on the face of it] that the language couldn't support.

cgio · on April 22, 2022

Harvesting lineage information could be a good use case.

crorella · on April 22, 2022

Yes! you can then put a UI on top of that and answer stuff like "who will get affected by my change on this table?" or.. "there is a bug in my pipeline and now I need to notify downstream consumers"

anitil · on April 27, 2022

Where can I sign up?

anitil · on April 27, 2022

This would be so useful to me right now. So often I'm flying blind making edits and just praying I don't ruin someone else's day

gunnarmorling · on April 22, 2022

For Debezium (a change data capture solution) we parse SQL for some of the connectors in order to keep track of the schema of captured tables as DDL statements are executed.

Events we receive e.g. from MySQL aren't fully self-descriptive, you need to know the schema they adhere to in order to interpret them. As table schemas can change over time and as -- due to the async nature of streaming the transaction log -- we may receive change events produced from before a table's schema was changed, we cannot simply query the current table schema. Parsing incoming DDL events is the only way for keeping the connector's view on the schema in sync.

technion · on April 22, 2022

I'm still waiting for a very simple proxy where I can write rules like "DELETE statements must come with a WHERE".

orangepanda · on April 22, 2022

Oh this doesnt need a condition, let me just WHERE 1=1

technion · on April 22, 2022

Yes, but deliberate malfeasance isn't the usual problem when this occurs, as much as some variant of "the BI person hit run before the query was finished".

anitil · on April 27, 2022

Maybe you could add that the where condition needs the name of the authorising user? Start with "WHERE stmt AND AUTHORIZED_BY='Anitil'" then strip the AUTHORIZED_BY. If they're smart enough to do "WHERE 1=1" they might be smart enough to realise that signing off on it is a bad idea

matteote · on April 22, 2022

I use ANTLR4 and its rewriter to translate between two similar SQL dialects, for migration purposes.

newusertoday · on April 22, 2022

auto generating api for frontend and backend that uses sql query and metadata.

password4321 · on April 22, 2022

Yes! Crossing the boundary between code + DB allows things like showing "this url could read/write these table columns", down to "hover over data to see the source" levels. This can track what has been accessed in the past year vs. no longer used, etc.

You can also reverse engineer missing foreign key relationships by parsing all of an apps' stored procedures (for relationship diagrams, etc.). Or just what external entities are referenced from "other" databases to get an idea of minimal schemas needed for automated testing.

bogomipz · on April 22, 2022

>"Parsing SQL queries provides superpowers for monitoring data health"

Can someone say how this works exactly? What does parsing offer that things like logs and other metrics don't in terms of monitoring?