The syntax comparison section will likely sour a lot of viewers who already know SQL. You try too hard to highlight how easy and terse PRQL is by putting comma-separated items on their own lines but in SQL you put each item on separate lines.
It may be typical of many SQL users and formatters, but it leaves a poor taste in the mouth that you aren't interested in an actual comparison but in marketing.
For those who already know SQL, the real question is: will it make my queries faster? Putting the FROM first isn't sufficiently compelling on its own. Having a processing pipeline, though marginally more elegant to look at, doesn't actually improve upon CTEs.
When you say you can use it with any database, how do you handle functions, stored procedures, jsonpath, and the massive differences in functionality between Oracle, MS SQL Server, Postgres, DB2, MySQL, MariaDB, H2, SQLite, etc.? Lowest common denominator?
After 49 years of SQL, more than syntax has to change; you need an engine that supports this natively and can actually improve planner behavior over existing engines.
I will grant that if you are limiting your target audience to primarily analytics, it's probably sufficient. The marketing of PRQL doesn't always appear to do this however.
Yeah the syntax comparison is deliberately misleading.
They style it as "4 lines vs 10 lines!" when it's actually 4 lines vs 4 lines.
# PRQL
from employees
select {id, first_name, age}
sort age
take 10
# Misleading SQL
SELECT
id,
first_name,
age
FROM
employees
ORDER BY
age
LIMIT
10
# Actual SQL
SELECT id, first_name, age
FROM employees
ORDER BY age
LIMIT 10
The join example is similarly deceptive:
# PRQL
from employees
join b=benefits (==employee_id)
join side:left p=positions (p.id==employees.employee_id)
select {employees.employee_id, p.role, b.vision_coverage}
# Misleading SQL
SELECT
employees.employee_id,
p.role,
b.vision_coverage
FROM
employees
JOIN benefits AS b ON employees.employee_id = b.employee_id
LEFT JOIN positions AS p ON p.id = employees.employee_id
# Actual SQL
SELECT employees.employee_id, p.role, b.vision_coverage
FROM employees
JOIN benefits b USING employee_id
LEFT JOIN positions p USING employee_id
It is also (for me) quite a bit noisier and less readable than SQL. The join example has multiple different syntactic symbols, for no clear intuitive reason (= vs ==, {} vs (), side:left vs p=positions).
I actually like it; it makes sense at a glance, without having read a single line of the official documentation. But I think people should be free to use either syntax -- method or query -- as they see fit, without judgment of which one is better.
I strongly think we should have the best examples of SQL to compare against. I've ironically made this complaint for other libraries, so I'm alarmed that folks think we might have done the same.
We would take a PR for the first example if folks think that's better spacing. (I think the spacing is the only difference?)
For the second — `USING` isn't fully equivalent to `ON`. There are discussions on GH (https://github.com/PRQL/prql/issues/723) as to which we should compile to. In the meantime, we'd take a PR on anything that is equivalent.
> For the second — `USING` isn’t fully equivalent to `ON`.
In terms of portability because its not supported by, e.g., MSSQL, sure.
In terms of its semantics, though:
t1 [LEFT/RIGHT/INNER] JOIN t2 USING col
Is fully equivalent to:
t1 [LEFT/RIGHT/INNER] JOIN t2 ON (t1.col == t2.col)
So for a comparison to SQL as used by most RDBMSs (rather than MSSQL specifically), rather than “what should PRQL compile to”, USING is quite appropriate.
It may be that the intent of the homepage pairing is to highlight the actual compilation result and not provide a comparison to SQL-as-it-would-manually-be-written, but the presentation doesn’t make it clear that that’s the purpose.
USING just presumes the columns have the same name on both ends. There is no logical "combining", only a lexical one. It's why many DB admins/engineers use "foo_id" instead of "id" in their schemas as a rule. (Notably, also one of the reasons why many DB admins/engineers hate ORMs and other table generators that name id columns generically as "id".)
The person you are responding to is likely in this cohort. You're correct that they aren't exactly equivalent, but that's an artifact of your table definition, not the language. It likely seemed perfectly reasonable to the commenter that this naming was a trivial detail in the scope of the greater conversation.
WITH t AS (SELECT 1 x)
SELECT x FROM t JOIN t u ...;
That breaks if you fill in the ... with "ON t.x = u.x" because there are two columns called "x", but works with "USING (x)" because they get collapsed into a single column. I think it makes sense to say those aren't fully equivalent.
> We would take a PR for the first example if folks think that's better spacing. ... we'd take a PR on anything that is equivalent.
"We'd take a PR" is a great line for someone who's already sold on the concept, but it's not super helpful for flaws in your marketing. Reading OP, it doesn't sound like they trust you enough to find it to be worth their time fixing your examples for you.
Though I'm not sure how to find consensus on what is the best representation of SQL without asking folks to make a reviewable suggestion.
I had thought the examples are currently in their clearest state — I definitely don't think it's obvious that removing line-breaks makes it look clearer (which is the full extent of the feedback IIUC...)
Also there's no need to capitalise, so the 4 lines vs. 4 lines is actually just 'order by & limit' vs. 'sort & take'. NBD really, but it's probably nicer in more complex examples.
Then your experience doesn't extend to dynamic queries composed of query fragments, which is almost universal to all e-commerce sites.
Views are a degenerate, second-class form of query abstraction, as are common table expressions and the numerous other features added to SQL over the years to paper over the lack of proper first-class query abstractions.
It's like saying a programming language doesn't need first-class functions because we have disjoint unions and we can perform defunctionalization by hand. Strictly true, but really missing the point.
You can in Postgres. It's not often a good idea since it would constitute an optimization fence and potentially buffer large values in memory, but you can absolutely do this in a function variable.
record[]
some_table_name[]
jsonb containing an array of records
And of course the most obvious shared set variable of all, supported by every popular RDBMS in existence: the temporary table. Something scoped to the current session, of essentially unlimited width and length, and accessible to other definitions and queries.
We get it. You want something closer to a general purpose programming language. But don't confuse your preferences with actual missing features.
Temp tables are another bandaid. Like views, they are another second-class abstraction intended to address the lack of generality of relations. As you hint, adding second-class features is intended to handle pain points while keeping optimization simple, but it's the wrong way to do it IMO. No fixed number of second class features can make up for the lack of relations as first-class values, so this is not just a matter of preference.
I actually don't want a general purpose programming language as that's not suitable for data access, I just want a query language that doesn't impose artificial limitations on relations.
Edit: although I will admit that programming language theory is a hobby, so that's why I understand the source of the pain every time I have to use SQL. LINQ almost gets this right, but is still a huge improvement on SQL.
Does it SELECT * by default if I never define a SELECT below my FROM? ... Continuing to encourage folks by allowing them to SELECT * easier is would not be fun for me... I could be wrong?
Agreed, just parsing out the formatting so its "fewer lines" than traditional SQL soured me.
The expressions example is ridiculous, in Redshift I can do this all day??
SELECT 1 + 2 AS num1
, num1 * 2 AS num2 -- Literally no difference
I know SQL, and I imagine the authors of PRQL know it better than I do.
Doesn't it seem weird that dozens of application languages have become popular since the 1970s, but we're still using dialects of the same old database query language? If it had a really elegant syntax, perhaps it wouldn't, but SQL's syntax is anything but. Some of the semantics can be awkward as well.
I, for one welcome attempts to move things forward (which is different from saying I'm going to run out and use PRQL in production tomorrow).
> Doesn't it seem weird that dozens of application languages have become popular since the 1970s, but we're still using dialects of the same old database query language?
Indeed. Do you honestly believe that a half-century of data storage professionals and vendors are blindly moving forward with a hobbled tool?
Or maybe there are aspects of SQL as a set-oriented 4th generation programming language that aren't apparent to folks who are intimately tied to an imperative or functional programming paradigm as opposed to a declarative DSL for set theory.
Within popular application languages, for any given paradigm, there are almost always several languages that aren't merely dialects of each other. C# isn't a dialect of Java. Ruby isn't a dialect of Python. Rust isn't a dialect of C++.
PRQL demonstrates that a set-oriented declarative language need not be a dialect of SQL and isn't the first language to do so (QUEL appeared in the 1970s). It seems odd to me these alternatives haven't gained much popularity.
> Putting the FROM first isn't sufficiently compelling on its own.
Personally I see that as not even neutral, it's a downside. Optimizing for autocomplete is an antipattern, code is read far more often that it's written and the SELECT clause is the interface to the following code. It should be easy to find when skimming, not buried in the query.
The SELECT clause is also akin to an assignment and it's extremely rare I see anyone advocating flipping the order of those to match what they say they want in SQL.
Edit: Since I'm sure someone is going to jump on it, yes, I'm conflicted about the WITH clause: It's extremely useful and I like what it does, so I do use it, but I don't like where it's positioned. I've been toying with indentation to work around it so SELECT is still just as visible as otherwise.
You might find Malloy interesting as it makes a greater departure from SQL syntax. Queries are first class objects which can be nested within each other in order to do trellising. It still compiles to SQL because that is the only language accepted by DBMSs today; however it will automatically write symmetric aggregate calculations and do those nestings that are hard for a human to write.
I strongly think we should have the best examples of SQL to compare against. I've ironically made this complaint for other libraries, so I'm alarmed that folks think we might have done the same.
We would take PRs for any improvements to the SQL that make it a better comparison.
The point of being able to write "FROM" before "SELECT" has become moot to me since many DBs support that SQL "convention" already.
The newline/indentation is something I would do for readability in the PRQL too.
At the end of the day, I'm becoming less concerned about these various query language syntaxes and more concerned about the logical and physical plans generated by the respective query engines.
In my experience, what ends up most problematic is that each query-engine/optimizer sometimes/usually requires SQL tweaks/nuances to perform as efficiently as possible. That's where reading/writing/maintaining queries can get really confusing/difficult. An abstraction such as PRQL, dataframes, etc might add value here, but only if the abstraction (or runtime/evn built around it) can tune the output SQL to each query engine as it is needed.
There are a lot of projects using frameworks/libraries such as Substrait and SQLGlot to accomplish this.
It does make a difference if the resulting SQL is unperformant.
Someone who is good at SQL can look at a query and see where the query planner might go wrong, then make subtle tweaks to get better performance. Optimizing queries in a compile-to-SQL language basically has to be left to the compiler, which may not have the context needed to write performant SQL.
Compiling a query is different than compiling a whole program to assembly, because the query makes assumptions about schemas and indexes that cannot be encapsulated in what you're feeding to the compiler at that moment.
So I agree that unperformant SQL is unperformant, but PRQL can _reduce_ the chance of making mistakes there.
Here's an example [1] of someone reporting that a query engine was far more performant with one SQL construction, and then PRQL changing the SQL we output to use that construction.
GCC & Clang are much better at compiling to assembly than any person! PRQL isn't there yet, but each improvement scales to everyone who uses it.
I don't know where this aggression is coming from — we're an open-source project, only volunteers, trying to do something that makes it easier for folks to work with data.
Critical & constructive feedback would be really appreciated, but that's not this — why all the snark?
Agreed. Got off on the wrong foot. I and others have noted what appeared to be overzealous marketing, and I previously asked questions regarding DB engine-specific features like jsonpath, unnest for arrays, etc.
The responses from the PQRL appeared evasive to me, asking for PRs to address obvious syntax mismatches, and never answering (what I believed to be) substantive questions regarding extensibility and lowest common SQL denominators. "We don't support that yet," is a perfectly reasonable response. But what I'm seeing now is just youthful exuberance untempered by abundance of experience.
Almost anyone working intimately with DBs beyond just MySQL 5.x will know that each DB engine has features that can reduce the amount of SQL required, improve data integrity, substantially speed up operations, etc. Not allowing for escape hatches to native functionality is often a deal breaker. Does PRQL have "escape hatches" where you can use the vendor-specific keywords, operators, and types of the underlying engine? (Not just calling single-return-value function, but actual substantive syntactic differences.)
The issue of NULL is a tricky one, and I totally see why folks would want to == and != it away, but unfortunately in the SQL world, it's not just a matter of IS NULL vs IS NOT NULL (leaving aside IS [NOT] DISTINCT FROM). NULL has profound effects in the model stemming from its VERY different meaning from most programming languages. NULL means "unknown" in SQL, not just "no value" as it is in most general purpose languages. It's subtle but absolutely can't be hand-waved away. How does PRQL handle "IS DISTINCT FROM"? What about set-returning functions that were passed a NULL parameter?
Which brings me to why I think I was harsher than intended: I don't think any of you actually like SQL, and I think that's a dangerous place to start from. It leaves "weird" NULL behavior cascading through your query while the user has used == and != just as they always have with any other value. Using EXPLAIN becomes more difficult due to indirection, and it's not clear to me how you'd alter the query without hacking the PRQL engine in a one-off. I fear that it is indeed "lowest common denominator" (which these days typically means MySQL 5.x compatibility), which is (in my personal opinion) wholly unacceptable for a non-trivial data access conduit.
I'd likely have been much less snarky and supportive of PRQL if it was presented as an idea factory examining the shortcomings of SQL—of which there are many—and trying out solutions on an experimental basis.
But that isn't the tone of the project at all. Its messaging (intended or not) goes at SQL with both barrels while touting itself as a production-ready alternative standard to SQL. It isn't about "trying it out for analytics to see if it's right for your team," which I could totally get behind.
Instead, your site has statements like these:
• PRQL is a stable foundation to build on; we’re open-source and will never have a commercial product
• PRQL is a single secular standard which tools can target
A lack of commercial support isn't in any way synonymous with a stable foundation. Stability is like trust: it is earned over time through action, not just asserted to be true with good intentions. I'm not saying you can't get there eventually, but PRQL is just a year old! The project and team simply haven't earned the right to assert that yet!
All implementations are standards until they diverge. Calling PRQL a standard at this point is like the first baseball World Series where only the US played (and just white men from the US at that, but I digress). I'm not saying it can't ever be a standard, but you're skipping several steps in that process and already declaring, "Done!"
To my knowledge (and quick internet searches) there is no major standards organization behind it. Is there even more than one implementation of this standard? (Usable in multiple programming languages is not the same thing.)
I appreciate that you're volunteers and an open source project. I really do. I also appreciate that your hearts are in the right place with regard to making data accessible. But this isn't about snark. You're trying to bite off a huge chunk of a 49-year-old leviathan without recognizing how it got to be almost half a century old (in the computer industry!!!) and without noticing that massive leviathan is covered in a long lifetime of battle scars. Those are the waters you're trying to swim in (or at least seeming to from your public website messaging and source repository).
"Those who don't understand UNIX are doomed to reinvent it, poorly." – Henry Spencer (1987)
This is not just an idle piece of snark from Unix folks to any upstart that crossed their paths. It was already a large, battle-scarred leviathan as well by that point, and is far more so today!
The same totally applies to SQL. Show some respect to that, keep up the good fight, but don't let hubris cloud your judgement about how far you and your team really are in this journey.
Other commenters in this discussion have noted inconsistencies in PRQL syntax. Those conversations won't get any easier by stating you'll never address them because it's "a stable foundation" or making breaking changes to make things better, thereby breaking that promise. Your project is experimental. Own it. Run with it. Improve on it. But if it's presented as more than this—as it is now—expect vehement pushback from both haters and other who honestly hold you no ill will but recognize several missing fundamentals as I have.
I won't address every point here, but a couple of thoughts:
- To close the syntax point — I opened https://github.com/PRQL/prql/issues/3518. We're currently using the output of the rust's canonical SQL formatter. (so any claim we're obfuscating SQL's syntax is incredulous — do you agree?). I'm open to hand-written examples if SQL doesn't have sufficient auto-formatting tools available.
- "never answering (what I believed to be) substantive questions regarding extensibility and lowest common SQL denominators" — some of the features are implemented, some aren't, there are issues on GH for by-and-large all of them. Feel free to open other issues. I don't think there's a duty to respond to every question in every comment on HN, and I certainly don't think losing your composure is an appropriate response to others not answering every question.
- Many of the answers to your questions are in the docs — for example escape hatches. You don't have to read them to engage, but claims about a project's "youthful exuberance" are less credulous from those who haven't done so.
- Stepping back, are these are informed & constructive criticisms, or are they grasps for straw-men as part of a reactionary response? For example, the response takes two phrases from the website out-of-context to load a whole argument on the words "stable" & "standard". But stability & standards can each mean multiple things, and taking the least generous interpretation of a word doesn't make for a reasoned critique. Does a point such as:
touting itself as a production-ready alternative standard to SQL
...have any grounding in fact? Or does it come through this aggrieved reactionary lens? For context, PRQL's Readme specifically states:
PRQL still has some bugs and some missing features, and is probably only ready to be rolled out to non-technical teams for fairly simple queries.
Be honest. If a different project touted itself as a standard and a firm foundation on its public website but had this disclaimer on the source repo's README file, what would your thoughts be?
While you may categorize my responses as reactionary—and they very well may be—can you really claim unbiased objectivity on your part?
For example, JS and C# have ECMA. SQL, C, and C++ have ISO. POSIX has IEEE. All have multiple implementations by different organizations. The term "standard" has a clear, well-defined meaning in computing with a long history. Your response was to handwave it away as "meaning multiple things." Whether de facto or de jure, the appellation in PRQL's case simply does not apply.
I appreciate that you are not obligated to read and respond to every internet comment. When you do respond however, bear in mind that dismissing clearly defined definitions of industry terms like "stable" and "standard" is not an unbiased position.
I am not aggrieved. You asked for critical and constructive feedback. I laid out plainly where I believed the messaging was incongruous from the point of view of someone not intimately involved with PRQL and how I felt it should be changed to better fit its role in the data ecosystem. I didn't call for the project's elimination or even call into question the need for improvement over the existing state of SQL. Take of that what you will.
(I couldn't help but notice you didn't comment on the difference in formatting in the examples.)
Do you have examples of PRQL working with jsonpath? Generating JSON? Unnesting arrays? Returning ids from an INSERT or UPDATE without making a separate read query?
Thanks, that's a great question. You're right in that so far we haven't highlighted working with JSON and I hadn't actually tried until this point. IMHO the true power of PRQL comes from the fact that it allows you to define functions and with that you get the power of composability which is the true power of almost every programming language (and which is for the most part completely lacking in SQL).
So with that said, I tried the following POC (remember that PRQL is just a SQL generator so the JSON capabilities depend on your underlying RDBMS):
```sh
> prqlc compile <<EOF
let get = path obj -> s"""{obj} -> {path}"""
let getstr = path obj -> s"""{obj} ->> {path}"""
let extract = obj path -> s"""json_extract({obj}, {path})"""
from [{data='{"duck": [1, 2, 3]}'}]
select { data | get '$.duck[0]', data | getstr '$.duck[1]', extract data '$.duck[2]' }
EOF
WITH table_0 AS (
SELECT
'{"duck": [1, 2, 3]}' AS data
)
SELECT
data -> '$.duck[0]',
data ->> '$.duck[1]',
json_extract(data, '$.duck[2]')
FROM
table_0
-- Generated by PRQL compiler version:0.9.4 (https://prql-lang.org)
```
What's going on here is that I used [s-strings](https://prql-lang.org/book/reference/syntax/s-strings.html) to define custom PRQL functions `get`, `getstr` and `extract` which translate into the underlying `->`, `->>` and `json_extract` SQL constructs.
You could then for example pipe that query to DuckDB (the example is taken from the following DuckDB blogpost [Shredding Deeply Nested JSON, One Vector at a Time](https://duckdb.org/2023/03/03/json.html)):
```sh
> prqlc compile <<EOF - | duckdb
let get = path obj -> s"""{obj} -> {path}"""
let getstr = path obj -> s"""{obj} ->> {path}"""
let extract = obj path -> s"""json_extract({obj}, {path})"""
from [{data='{"duck": [1, 2, 3]}'}]
select { data | get '$.duck[0]', data | getstr '$.duck[1]', extract data '$.duck[2]'}
EOF
┌───────────────────────┬──────────────────────────┬───────────────────────────────────┐
│ "data" -> '$.duck[0]' │ ("data" ->> '$.duck[1]') │ json_extract("data", '$.duck[2]') │
│ json │ varchar │ json │
├───────────────────────┼──────────────────────────┼───────────────────────────────────┤
│ 1 │ 2 │ 3 │
└───────────────────────┴──────────────────────────┴───────────────────────────────────┘
```
It may be typical of many SQL users and formatters, but it leaves a poor taste in the mouth that you aren't interested in an actual comparison but in marketing.
For those who already know SQL, the real question is: will it make my queries faster? Putting the FROM first isn't sufficiently compelling on its own. Having a processing pipeline, though marginally more elegant to look at, doesn't actually improve upon CTEs.
When you say you can use it with any database, how do you handle functions, stored procedures, jsonpath, and the massive differences in functionality between Oracle, MS SQL Server, Postgres, DB2, MySQL, MariaDB, H2, SQLite, etc.? Lowest common denominator?
After 49 years of SQL, more than syntax has to change; you need an engine that supports this natively and can actually improve planner behavior over existing engines.
I will grant that if you are limiting your target audience to primarily analytics, it's probably sufficient. The marketing of PRQL doesn't always appear to do this however.