Challenges students face when learning to work with relational databases and SQL (growkudos.com)
138 points by gousiosg on Aug 29, 2021 | 126 comments



SQL is hard not because of terrible syntax, but because the underlying logic of defining exactly what you want is difficult.

SQL, while not perfect, is very compact and direct. It allows you to express what you want succinctly and without boilerplate. No classes, no variable declarations (sure you can DECLARE a variable, but it is rarely needed), no dependencies or imports.

There's a reason why, despite the promises of every BI tool that it will "simplify" your database and "empower users", none of them have toppled SQL or even added anything useful that SQL could incorporate.

Graphical query designers are nice but have limited capabilities. SQL could occasionally be less verbose and IDEs could probably do better in reducing keystrokes (better autocomplete), but SQL itself, overall, is pure and brilliant.


Yeah - my class covered the language and meanings of the terms in a couple days. It's wonderfully simple. Which makes it hard in the same ways programming is hard - arrays are trivial! People still screw up bounds checks routinely!

Figuring out what you want, and understanding your data well enough to know what's viable and what's nonsense, is infinitely harder. And it changes every time.

Getting good at that part is "expertise" in a nutshell - gradually learning what strategies work and when, and getting better and better at your guesses. That takes more than a few weeks; that's an entire career.


> There's a reason why, despite the promises of every BI tool that it will "simplify" your database and "empower users", none of them have toppled SQL or even added anything useful that SQL could incorporate.

Pivoting data is still extremely painful using raw SQL


That's a semi-valid objection, but I'd argue that's not really within the scope of SQL (and the PIVOT keyword is not an official part of the language, I believe). SQL is a language to interact with databases; SELECT queries are specifically for extracting data out of a database. Pivoting is typically something you do with data you've already extracted, which currently resides in memory -- that's why it is fast: there's no additional data retrieval each time you change the aggregations or filters in a pivot table. Put it this way -- a SQL-based pivot table engine that re-queries the database with every change would perform awfully compared to a simple Excel PivotTable.
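
That said, when you do need a pivot in plain SQL, conditional aggregation is the portable way to fake it. A quick sqlite sketch -- the sales table and its values are made up for illustration:

```python
import sqlite3

# Pivoting via conditional aggregation: one output column per quarter,
# each computed by aggregating a CASE expression. No PIVOT keyword needed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, quarter TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('north', 'Q1', 100), ('north', 'Q2', 150),
        ('south', 'Q1', 80),  ('south', 'Q2', 120);
""")

rows = conn.execute("""
    SELECT region,
           SUM(CASE WHEN quarter = 'Q1' THEN amount ELSE 0 END) AS q1,
           SUM(CASE WHEN quarter = 'Q2' THEN amount ELSE 0 END) AS q2
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()

print(rows)  # [('north', 100, 150), ('south', 80, 120)]
```

The obvious downside versus a real pivot engine: you have to know the column values ('Q1', 'Q2') up front.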

Similarly, you can't use SQL to apply color formatting to any of the result rows or values -- but that was never a goal SQL intended to achieve.


> and the PIVOT keyword is not an official part of the language

Yeah, that's probably a large part of the reason why pivoting data is such a pain in the backside.


> PIVOT

The postgres dialect has it IIRC.


and also because you have to pay to use Q and kdb+


It takes days/weeks to pick up the core of SQL:

- create tables, update the schema, insert rows, add an index

- select, filters, joins, order by, limit, inner queries

It takes forever to be comfortable with:

- anything that involves summarizing, grouping, having, min, max, windows


I think one tip here is to always work with CTEs ("WITH") rather than nested queries. That way you can always go back and check "what is it I join with what?" by querying the individual steps with LIMIT 10.
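
To illustrate with a quick sqlite sketch (the orders table and its values are invented): the same query nested versus flattened with a CTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, total INTEGER);
    INSERT INTO orders VALUES (1, 'ann', 10), (2, 'ann', 20), (3, 'bob', 5);
""")

# Nested version: has to be read inside-out.
nested = conn.execute("""
    SELECT customer, COUNT(*) AS n
    FROM (SELECT * FROM orders WHERE total >= 10)
    GROUP BY customer
""").fetchall()

# CTE version: reads top to bottom, and each step can be inspected on
# its own, e.g. with SELECT * FROM big_orders LIMIT 10.
flat = conn.execute("""
    WITH big_orders AS (SELECT * FROM orders WHERE total >= 10)
    SELECT customer, COUNT(*) AS n
    FROM big_orders
    GROUP BY customer
""").fetchall()

print(nested, flat)  # [('ann', 2)] [('ann', 2)]
```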

The other tip is to sketch the problem in excel/google sheets when it gets hairy. Not the actual code (I don't have a clue how to do that, others have), just the values in the different steps. In the end it is only about rows and columns.

But that said, these days a lot of it happens intuitively for me, I pretty much know the solution before I can spell it out. It certainly was not like that when I started.

When you begin, "programming without for loops" feels like programming with your right hand tied behind your back. But in hindsight you get a lot of exercise in the immutable paradigms of functional programming; working with comprehensions, sets, maps and folds comes very naturally.


Be careful with blanket CTE recommendations. They tend to deoptimize with writes in really surprising ways. If you’re just reading data, they’re great.


I would still say they are not great, as most engines just... inline your SQL at each reference, instead of saving/memoizing anything.


All of my experience here is with Postgres, where (in recent versions) read-only CTEs are basically temporary views with all the built in optimizations that come from that.


This is what I thought would be a good tip when I first learned about CTEs. Turns out I use them sparingly and much prefer to create views to represent steps of data refinement. The benefit of doing it this way is that I can inspect intermediate results. Often I use *.sql files for this so I can keep the SQL formulations for later; also, as long as stuff is still experimental / under development, I start out with `begin transaction;` because then I'm guaranteed I do not permanently change anything in the DB I'm working on and also I can repeat all steps without having to care about `create or replace view ...`.

Oh and lest I forget you can't just re-use a CTE in another query. But you can of course re-use a view. Also given what another user here remarked, Postgres might internally treat (and optimize) CTEs like views, so to me that makes views superior to CTEs in more or less all respects.


Do you have some links on using CTEs to replace nested queries and how that helps?

I've used SQL enough to have to write nested queries, but haven't dived deeper than that.


Here's an article with some examples:

https://learnsql.com/blog/sql-subquery-cte-difference/

You can think of CTEs as a way to save an intermediate query as a sort of temp variable/table that can be used in the final statement following `WITH something AS (...)`. They are great for flattening your queries and giving descriptive names to subqueries, and you can chain multiple WITHs together as well.


This is good advice but sometimes you will need to nest things depending on database and optimizer and how it handles CTEs, especially if you ever mess with recursive things. An additional problem here is that there’s no standard way to represent EXPLAIN queries across systems so that makes an additional barrier to entry unfortunately.


> An additional problem here is that there’s no standard way to represent EXPLAIN queries across systems so that makes an additional barrier to entry unfortunately.

What do you mean by this?


The outputs from each DBMS are often specific to their implementation, even when there is a lot of overlap. The good news is they are usually similar enough and user friendly enough that an advanced user will have no issue but for beginners it can be daunting.


Anyway, beginners learn by just using a single RDBMS, don't they?


Not necessarily. I often see people start learning with SQLite because it's easy/preinstalled (and quite a few tutorials start with it, especially for Python), and then move to MySQL/Postgres -- but that is N=2, so it's not too bad.


Yep, CTEs are a huge boon to structure your SQL - use them where you can.


I've always found it hard to articulate the problems that i have with SQL and with the "WITH" CTEs, but let me try anyways.

For starters, i can never actually test parts of those queries without rewriting the query up to the part that i want to test, for example:

  WITH 
    query_one AS (SELECT ...),
    query_two AS (SELECT ...),
    query_three AS (SELECT ...)
    SELECT ... /* main query */
If i want to test the second query, i need to take the first and second ones, copy them into a new worksheet and then rewrite the second one not to have the alias but instead be the main query. This is annoying when you have 5-10 CTEs and you need to test something in the middle.

Then, working with SQL and CTEs feels like going back from a language where functions are first class citizens to one where no such thing exists, just in regards to querying data. It would be nice if i could store parts of queries under packages, to be able to write dynamic SQL more easily, instead of having to use tools like myBatis for this purpose: https://mybatis.org/mybatis-3/sqlmap-xml.html (see the bit about SQL fragments)

So i'd like to do the following:

  PACKAGE my_snippets BODY IS
    SNIPPET query_one
      SELECT ... /* probably 500 lines long but often used snippet */
    END query_one;
  END my_snippets;
  
  /* and then, somewhere in code */
  WITH
    query_one AS my_snippets.query_one,
    query_two AS (SELECT ...),
    query_three AS (SELECT ...)
    SELECT ... /* main query */
Now, you might suggest that using views works for this intent, but what about most DBMSes out there having silly naming rules and restrictions? I don't want to work with v_mtz_wg_priv_prod_attr because someone thought that having just a few dozen characters makes sense as a restriction. Furthermore, you really can't group views into logical packages based on their intent, now can you? So, with views you end up with something that's very much like your cluttered list of tables, which gets really hard to get a good overview of when you have about 300 of them.

Next up, debugging in databases is just really bad. How am i supposed to put logging in the queries, without mixing the logging code with the other triggers and tables? What about debugging long running processes? What about adding breakpoints that i can trigger when a particular view or table is accessed? What about doing this on the server while i have a local app instance connected to the DB, or maybe even another app server? Why can't i step through the query execution and see how the filtered record count changes with each "step"?

Apart from that, my problems are largely with the tooling around databases. There are relatively few universal (cross-language) DB migration solutions out there (dbmate, for example); every framework seems to have its own approach.

There seems to be this odd division between procedural SQL and regular SQL statements, where what you can do differs based on context, which is inconsistent. Procedural languages as a whole vary wildly in what they can do -- you won't be doing complex logic with custom types on MySQL/MariaDB anytime soon, whereas Oracle or PostgreSQL will suffice. But even those two have different dialects; it's never "just SQL".

There are oddities with selecting certain kinds of data: only pgAdmin seems to work nicely with geospatial data, and i've also seen problems with using lower level JDBC logic, which you can't really test outside of the app in something like SQL Developer.

But even apart from that, as much as we like ER diagrams, MySQL Workbench is the only tool that i've seen which allows you to actually do model driven development properly, synchronize schemas and do forward/backward engineering -- even pgAdmin fails at this. Oh, and the tools themselves are really inconsistent -- you'll see a world of difference between MySQL Workbench, pgAdmin, SQL Developer, JetBrains DataGrip and others.

And now those DBMSes are attempting to add more functionality, such as exposing REST interfaces, instead of fixing the underlying and dated problems, because people out there are relying on those and therefore the logic is set into stone. It's no wonder that every year there's a new product or two that attempt to improve upon these, even if most of the time those products die out.

Perhaps the above is a stream of consciousness with some annoying things that i've dealt with over the years, but personally, relational databases are something that i use because they're often the least horrible tool for the job, even if they are not pleasant or easy to use -- at least not as easy as they should be. That's where i think the main problem lies -- tools should be good for solving the problems they'll be used on, and these ones aren't.

Someone with 20 years of experience might have a different outlook, but personally i'd suggest that you utilize DBMSes for what they're good for -- storing, retrieving and manipulating data -- and don't get too carried away with in-database processing otherwise, since doing certain things within the app code scales horizontally far more easily in some situations, and has better auditability, debugging, etc.


Long reply for a workday, but a quick reaction from your first example (with a slight risk that I misunderstood your problem).

What you do here is to also wrap the main query into a CTE and then end everything with SELECT * from main_query.

Then you can easily change that last clause to SELECT * FROM query_two while you keep everything else as is, even the CTE you called main_query.
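
A quick sqlite sketch of the trick (table and values invented): the whole pipeline, final step included, lives in CTEs, and only the last SELECT's target changes while debugging.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (uid TEXT, kind TEXT);
    INSERT INTO events VALUES ('ann', 'click'), ('ann', 'buy'), ('bob', 'click');
""")

# Every step, including the final query, is a named CTE; only the
# trailing SELECT decides which step we look at.
pipeline = """
    WITH
    query_one AS (SELECT * FROM events WHERE kind = 'click'),
    main_query AS (SELECT uid, COUNT(*) AS clicks FROM query_one GROUP BY uid)
    SELECT * FROM {target} ORDER BY 1
"""

result = conn.execute(pipeline.format(target="main_query")).fetchall()
# To debug an intermediate step, swap only the target:
step = conn.execute(pipeline.format(target="query_one")).fetchall()

print(result)  # [('ann', 1), ('bob', 1)]
print(step)    # [('ann', 'click'), ('bob', 'click')]
```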


That's indeed a useful trick, but wouldn't it be better to be able to execute pieces of SQL without altering the actual code?

For example, selecting the CTEs up to a certain point and using a particular button or keyboard shortcut within a development tool, much like we can already do with executing selection?

Of course, adding "clever" functionality like that might as well create some risks and inconsistencies, so i'm not entirely sure about that.


Yes, this is only a quick trick and does not solve the general question about unit/integration testing that you are used to in other types of programming.

This is an entirely different subject and deserves a long discussion, and one which SQL is not ideal for.

As you said, breaking CTEs into separate views is the start, and then you can use a tool like getdbt.com to turn your references into parameters. And then you also need to create the mock data (which you can do, for example, by writing it into CSVs).

From the role of an analyst I must say, though, that once you have done all that work, the risk is that you forgot to worry about the biggest risk here. What is actually inside that data you are querying? Maybe your biggest problem is not the logic of the query, but rather how dirty your input data is? Or, for that matter, that you made completely incorrect assumptions about your input data -- like that column X contains distinct values when it contains duplicates. That type of error wreaks havoc on your end result, with a big chance you'll never notice.


I have to agree with most of what you're saying. Another commenter also voiced concerns about how SQL is sometimes a bit difficult to work with because the underlying domains and concerns that are handled by it are also inherently complex in many ways.

In regards to the correctness of the data and the quality, there are at least constraints that you can put into the DB, but i've personally seen plenty of cases where that isn't even considered and is forgotten about, without even getting into the OTLT and EAV anti-patterns and the implications that they may have: https://tonyandrews.blogspot.com/2004/10/otlt-and-eav-two-bi...

In every case where i've seen someone attempt to introduce polymorphic foreign keys, a lot of those consistency guarantees and control mechanisms have gone out the window, since you can't really have conditional constraints or complex logic in there either. Though thankfully not everyone has to deal with things like that in their projects.

One cannot overstate how important modelling your data is, to the point where schema-first is a strong and reliable approach most of the time.


Postgres gives you named views, named schemas, and transactions that undo even table and view creations, so that's what I use to implement an iterative dev cycle.


Not so sure about aggregation, but what definitely takes forever is performance.

To make a non-trivial SQL query scale to non-trivial amounts of data, you have to understand the physical data organization and how query optimizer is likely to use it, which is kind of contradictory to the idea of the SQL as a "declarative" language where you just say what you want, and let the query optimizer figure out how to get it.

Instead, you have to design your indexes carefully to coax the optimizer into choosing a reasonable access path for your particular query. And do the same for all queries where performance is important.

Indexes are fundamentally not about data, but about access patterns. Which is what the developers are responsible for. That's why physical database design is a development task, not database administration task.


I think the RDB is one of the leakiest abstractions. Despite that, it's still very useful.


Something else that's hard is writing performant queries -- statements with subqueries/IN syntax, for example.

And I always forget which join does what.


Remembering which join does what is easy: inner joins strictly join the tables, left takes all rows on the left (first table), right takes all on the right (second table), and outer (or cross) join are so rarely used you don't need to memorize.


> outer (or cross) join are so rarely used you don't need to memorize.

I’d say they are more important to know than right joins.


same with the joins. on paper it makes sense, in practice it does not. if it's more than a "select * from dbo.whatever where column abc = 'thing'" i have to refer back to notes and play with it.


Basically inner/left/right is a choice on how you want your NULLs: do you want all the rows that match, all the rows on the left, or all the rows on the right?
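
A quick sqlite sketch of that framing (tables invented): the same join condition, but the NULLs land differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (cid INTEGER, name TEXT);
    CREATE TABLE orders (cid INTEGER, total INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (1, 30);
""")

# INNER: only rows that match on both sides.
inner = conn.execute("""
    SELECT c.name, o.total FROM customers c
    INNER JOIN orders o ON o.cid = c.cid
""").fetchall()

# LEFT: every customer survives; missing matches become NULL.
left = conn.execute("""
    SELECT c.name, o.total FROM customers c
    LEFT JOIN orders o ON o.cid = c.cid ORDER BY c.cid
""").fetchall()

print(inner)  # [('ann', 30)]
print(left)   # [('ann', 30), ('bob', None)]
```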


Meh, summarizing, grouping, etc aren't that hard.

However, WINDOW queries definitely have a learning curve -- not least because useful examples almost always require you to use a nested query.


I learned SQL on a need to know basis. For me, recursive queries were the ones that needed the most time to click.

Another one that caught me by surprise was NULL vs unknown[1]. That bit me in a couple of queries.

[1]: https://learnsql.com/blog/understanding-use-null-sql/


The problem with NULL in SQL is that the semantics are inconsistent in complex ways. See for example https://vettabase.com/blog/what-does-null-mean-in-sql/.
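
The basic three-valued-logic surprise, as a quick sqlite sketch (table invented): a NULL row matches neither a condition nor its negation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (x INTEGER);
    INSERT INTO t VALUES (1), (2), (NULL);
""")

# NULL = 1 and NULL <> 1 both evaluate to UNKNOWN, not TRUE,
# so the NULL row drops out of both filters.
eq      = conn.execute("SELECT COUNT(*) FROM t WHERE x = 1").fetchone()[0]
neq     = conn.execute("SELECT COUNT(*) FROM t WHERE x <> 1").fetchone()[0]
is_null = conn.execute("SELECT COUNT(*) FROM t WHERE x IS NULL").fetchone()[0]

# Three rows total, but eq + neq only accounts for two of them.
print(eq, neq, is_null)  # 1 1 1
```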


Recursive queries are definitely top of the complexity pile IMO. However, I usually discount them because I have yet to actually need them in a production environment. Window functions are super useful, on the other hand.


Heh, perhaps it's more accurate to say I've abused recursive queries.

One instance was to turn a column containing comma separated values into rows[1], so I could join on them. Wasn't for a query that needed performance of course.

[1]: https://news.ycombinator.com/item?id=28020321
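
For anyone curious, one way to do that kind of split (a sketch, not necessarily the approach from the link) is a recursive CTE that peels off one value per iteration -- here in sqlite, with an invented table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, csv TEXT);
    INSERT INTO t VALUES (1, 'a,b,c'), (2, 'x');
""")

rows = conn.execute("""
    WITH RECURSIVE split(id, val, rest) AS (
        SELECT id, '', csv || ',' FROM t
        UNION ALL
        SELECT id,
               substr(rest, 1, instr(rest, ',') - 1),  -- next value
               substr(rest, instr(rest, ',') + 1)      -- remainder
        FROM split WHERE rest <> ''
    )
    SELECT id, val FROM split WHERE val <> '' ORDER BY id, val
""").fetchall()

print(rows)  # [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'x')]
```

The rows can then be joined on like any other table -- though, as said above, not something for a query that needs performance.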


I had this same issue in Redshift and ended up populating a table with values 1 to the maximum number of commas found (e.g. using max(regexp_count(...)) or something), then cross joining on the table with the csv column and calling split_part on the corresponding column and index (with the index coming from the numbers table). The cross join ensures that you index every value of the csv column.


What I think you'd do here (in modern implementations of SQL at least) is to split the values into an array using some convenient split function and self join on that array to pivot the arrays into rows.


> some convenient split function

If only the DB we're using had one of those :)


A lot of people try to do stuff that doesn't actually make sense when it comes to groups. Like selecting a column which isn't contained in the group by. And they are confused by the error but when you talk them through it "What did you actually want to see? There are multiple values for this column now" it starts to become clear to them.
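
That conversation, as a quick sqlite sketch (table invented): one group has several values for the non-grouped column, so you have to pick an aggregate that answers your actual question. (Strict engines reject a bare non-grouped column outright; SQLite happens to return an arbitrary row, which is arguably worse.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (name TEXT, city TEXT);
    INSERT INTO people VALUES ('ann', 'oslo'), ('bob', 'oslo'), ('cat', 'rome');
""")

# "What did you actually want to see?" -- the count, or the names?
rows = conn.execute("""
    SELECT city,
           COUNT(*)                AS n,      -- how many per city?
           GROUP_CONCAT(name, ',') AS names   -- or: which ones?
    FROM people
    GROUP BY city
    ORDER BY city
""").fetchall()

print(rows)  # e.g. [('oslo', 2, 'ann,bob'), ('rome', 1, 'cat')]
```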


Not really. I taught SQL until not long ago, and about 1/4 of the trainees were getting up to speed fast, about half in a reasonable time; the rest were there only because they were sent there by their managers.

I found that the most important success factors in learning SQL are the analytical thinking of the trainee and the way the trainer explains the concepts -- in what order, and with what examples (the best examples are the ones the trainees meet in their regular work).

The functions are simple; the only difficulty is remembering the ones that are not used often enough (e.g. some window functions). Even then, a quick check of the documentation is enough to get up to speed. The major difficulty with SQL is writing efficient queries on large data volumes, covered by the right indexes. This is very specific to each RDBMS, especially because the tools that help with the work are specific (e.g. SSMS, SQL Sentry Plan Explorer, statistics parser, etc.).


If all you do is SQL, anybody can learn it quick. If you do full stack, you're only gonna care about what gives you the data you need at that time.


No offence, but a full-stack developer's job is not to write good SQL; it's to write enough SQL to get what is needed -- then the development DBA's job is to make it fast and efficient. You don't need to be a great car mechanic to drive to the office and back.


I have never worked somewhere that has a dedicated database person. The SQL goes in the application code so it is the developers job to commit stuff which is performant and safe.

There is no point having developers write sql at all if someone else has to come in and redo it after.


I have a team of 4 people doing part time exactly this: improving performance of queries from various apps. My team does not know the business logic (many apps, not enough people), the developers don't have the knowledge to build good SQL code for databases with tables of hundreds of GB each. If you write small web apps it is not a problem, but if you have hundreds of servers with that size of databases, development DBAs are a must.


You assume a large company with lots of specialised roles. But lots of full-stack developers work in small companies where they are the only DBAs. I, for example, have never worked at a company that employed DBAs.


At most places I've worked, there are no DBAs, and the devs are responsible for optimising all the SQL.

On the occasion that there have been DBAs at the company I worked for, they always refused to help with any SQL, on the grounds that all SQL is "application level", and insisting that they were only responsible for configuring / deploying / monitoring the DB infrastructure.


The configuring/deploying/monitoring the DB infrastructure DBAs are production DBAs. The ones tuning query performance are developer DBAs. If you put the large queries in stored procedures instead of the application, they can be easily touched and improved by DBAs.

No need for DBAs if your database is up to a few GB; you cannot live without them if you exceed 100GB. I have several hundred SQL servers with databases exceeding 1 TB, on average several hundred GB each. This is where performance tuning is essential.


Personally, I feel that it took me a while to get really comfortable with more complex joins. There's a problem they used in the study that required joining a table with itself, and honestly it would probably take me a while to come up with that answer, if I got there at all.


A basic approach (probably what they are going for in a basics study) would be something like this, off the top of my head:

    select c.cid, c2.cid
    from customer as c
    inner join customer as c2 on c.street = c2.street
    where c.city <> c2.city
though that has symmetric duplicates -- say (1, 5) would also have (5, 1) in the output. So I'm not sure if that's "allowed"


There are lots of cases where a join with yourself is applicable, although they are mostly superseded by window functions these days.

For example normalisation (join with a group-by/sum of yourself) or rank (join each row with all rows that have a lower value than yours and count those rows).

But as I mentioned above, a good start is to sketch that out in Excel. You will realize that what you need is another column (e.g. the total sum for this id), and from that you can work backwards to figure out what table you need to join with to create that column.
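
The rank example sketched both ways in sqlite (table invented; here ranking the highest score first, so the self-join counts strictly higher rows):

```python
import sqlite3  # window functions need SQLite >= 3.25

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scores (name TEXT, score INTEGER);
    INSERT INTO scores VALUES ('ann', 50), ('bob', 70), ('cat', 60);
""")

# Self-join formulation: rank = 1 + number of rows with a higher score.
self_join = conn.execute("""
    SELECT s.name, 1 + COUNT(h.name) AS rnk
    FROM scores s LEFT JOIN scores h ON h.score > s.score
    GROUP BY s.name ORDER BY rnk
""").fetchall()

# Window-function formulation: same result, no join.
window = conn.execute("""
    SELECT name, RANK() OVER (ORDER BY score DESC) AS rnk
    FROM scores ORDER BY rnk
""").fetchall()

print(self_join)  # [('bob', 1), ('cat', 2), ('ann', 3)]
print(window)     # [('bob', 1), ('cat', 2), ('ann', 3)]
```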


I'm not sure if you are replying to the wrong person but the question has nothing to do with a total sum of ids...

the question was: "List all pairs of customer IDs who live on a street with the same name but in a different city." listed under self-join

that said i haven't wrangled with raw sql in a spell so the reading on window functions is interesting.


1. GP mentioned hairy self joins. 2. You replied with an example. 3. I pointed out that there are other examples.

That said, I have a feeling your duplicates can be fixed by adding the requirement that c.cid < c2.cid

Not sure a window function would help in this particular situation, but they are there to help in more mundane examples.


I see how it can be read that way, yeah. they mentioned the study used a self join problem that they likely couldn't figure out, so I looked at the paper and quickly did it off the top of my head. the less-than predicate does indeed do the trick.


and worst of all, anything that involves vendor-specific keywords…


My first encounter with 'SQL' was a course on relational algebra that was taught at my university.

It started out with defining relations as a mathematical construct, and continued with various operators on such relations. Then they explained the various normal forms, up to the fifth normal form. I was completely out of my depth, but at least it was good and solid theory that could be learned.

What really messed with my head is that they then introduced SQL as a 'practical' implementation of relational algebra. I'm still having nightmares where I try to understand the nested HAVING statements that were asked on the exams.

Hey relations don't contain duplicates! But that's OK. We should call (modern) SQL 'BAG ALGEBRA'.


Yes, SQL is about bags, not sets.

And anything having to do with aggregation cannot be expressed in relational algebra.


Don't forget loops


In my experience, inexperienced database developers pick up SQL fairly quickly under the guidance of an experienced mentor.


This.

Usually I see people struggling to formulate questions. They know what they want, but don’t understand how to get there. Left to their own devices, they hack up some nightmare in Excel.

I worked with a summer intern on creating reports and learning SQL. She was a really smart business major who ended up with the wrong work assignment. I was getting 5-7 questions a day from her in June, 1-2 a week in July and by the time I got back from vacation in August, she had basically done about 90% of a project that was going to be hired out and was showing me some features of the database we were using that I didn’t know!

It inspired her to switch majors and she is a fancy data scientist somewhere! Awesome mentor experience.


+1 for mentoring and pairing -- or mentoring via pairing. Getting with an expert, or even just another set of experienced eyes, is a big help


> For example, some students wrote queries containing ≠ instead of != or <>.

Was this done on paper? Typing ≠ takes some doing.


Yeah, IMO counting this as a syntax error is a pretty low blow. It's completely clear what the person intended, and they would probably have no problems clarifying if the researchers asked how they would type that query in.


From the study:

> Participants wrote their notes and answers on paper, which they showed in front of the webcam.

Yes, it seems they did.


That's pretty flawed methodology since you'd want to know what problems people encounter in the real world and how quickly they solve them.


I mean, who writes queries on paper? And upon pressing Execute, SQL will gladly provide an error message that the student will, I think, quickly resolve. Is this a contest measuring how well students can write compilable code without executing it, or what?

Moreover, an IDE would have helped those students for sure: SQL keywords are highlighted in a different color, plus autocomplete.


Easy if you have an APL keyboard :) (Hint, it's on the 8) https://www.dyalog.com/uploads/images/Business/products/us_r...


Option-= on a Mac with the US layout, fwiw.


Man, Mac's keyboard shortcuts for special characters irritate me so much.

First, they're there, and it's absolutely wonderful! I use far more semantically accurate Unicode rather than lossy ASCII approximations than I did back in my old Windows days. (If you don't know the special characters you can get, turn on Keyboard Viewer and whack your keyboard, especially modifier keys, a bit.)

But … I can't customise them. Even back in the days when macOS was OS X and believed in user customisation, these specific shortcuts were frozen and un-customizable. (Like the folder shortcuts in Finder. Maybe it makes sense to you for CMD-SHIFT-D to open the Downloads folder, not the Desktop. Too bad!)

(Boy, I hope I'm wrong and someone will come along and explain my stupidity to me.)


> these specific shortcuts were frozen and un-customizable

I think Karabiner should allow you to do this: https://karabiner-elements.pqrs.org/

It's a utility that "remaps" keys – you set up key/key combinations that fire the original key/key combinations (it does not remove the original combination). E.g. you could bind Cmd-Q to Caps Lock if you wanted a really fast way to quit stuff.


The memory of what happened with kext's always makes me leery of relying on anything that reaches too deeply into macOS's guts, so I've always shied away from Karabiner, probably unreasonably.

Old-style OS X believed in customisation—even now, you can set per-app keyboard shortcuts; it doesn't seem possible to bind Cmd-Q to Caps Lock, but, after slipping from Cmd-W to Cmd-Q and so quitting rather than closing a tab one too many times, I do have Cmd-Q bound to Cmd-Opt-Q for Safari only—and yet there's no Apple-blessed way of changing _those_ shortcuts, when it clearly had the architecture in place to allow it. That always irritated me.


Direct link to the research paper, instead of the summary: https://dl.acm.org/doi/pdf/10.1145/3446871.3469759 (pdf)


I noticed that the article doesn't mention relational-calculus at all, only relational-algebra. That's a huge oversight, imo - as I feel one needs to understand both RA and RC in order to grok SQL and other RC-like systems, like Linq in C#/.NET and List-comprehensions in Python (or even use those before RC/RA and SQL).

-------

Rather than improve how SQL is taught (which seems to be the paper's objective), why not improve SQL so it isn't as horrible to try to learn in the first place?

The barriers to grokking SQL could be lowered considerably if SQL made minor adjustments, like moving the projection part of a SELECT query to below (syntactically after) the WHERE clause instead of at the top, and making SQL more "natural" to write in without needing excessively verbose inner derived-table expressions when all you want is to perform some repetitive calculation that will be reused in later query steps.

Also, the GROUP BY clause really needs to be renamed to "AGGREGATE BY" or similar, because when normal people think "group" they're probably thinking of sorting/ORDER BY or PARTITION BY and they certainly don't imagine "don't display these rows at all, lol".

I just don't understand what drives the ISO SQL language design committee - I'd have thought that the newer revisions (e.g. SQL-2003) would have improved the language's ergonomics - on the contrary: the language's grammar and verbosity gets worse every release, and the team has strange priorities: apparently they feel needing to generate in-memory XML is more important than deferrable constraints - and I only ever see ISO SQL's XML features being abused to make-up for a lack of decent string-aggregation functions.

(...I could talk for hours about everything wrong with SQL.)


The trick to becoming really good at SQL (this applies to all areas of IT) is having a certain ambition to produce high-performance, beautiful (readable) code.

This requires experience, which can only be gained by rolling up your sleeves and working on stuff until that high ambition has been satisfied. Sometimes when I see old code of mine and can still follow what I was doing, I get really proud of myself. Many times I end up slightly improving it based on new knowledge I have acquired since I initially wrote it.


Readable code can easily be achieved via good formatting, but performance requires a combination of writing the query the right way and having the indexes to support it. The second part is not even visible from the query and most of the time is not self-explanatory. The best part is that indexing is not even universally valid: statistics decide execution plans, and the same query with the same indexes can result in very different performance on 2 different instances.


High performance and readable certainly. Not sure why beautiful would be something to strive for though


I believe they're intending to say that beautiful code == readable code.


If the general computing community can agree on anything over the last 20 years, it’s that Python 2->3 was a disaster. Even though Python 3 made several highly necessary (and irreversibly transformative) changes to Python 2, no one liked it because it fundamentally changed the language to something unfamiliar.

I’m guessing the SQL and C++ committees looked at that transition and decided that such transformative changes really need to be done in new languages (like the Perl -> Raku change) rather than in a new version which risks alienating your existing base.


Oh of course - I have no doubt that the reason the ISO SQL committee is so conservative (no... they're regressive) is the sheer collective industry investment in not only SQL tooling and SQL-compatible databases, but the energy spent teaching non-CS/SE/programmer types in businesses how to express their data queries in SQL. It's very, very difficult to get that kind of industry cohesiveness around any technical standard, so the fact that SQL is so widely supported is a miracle (though it probably has something to do with US federal government requirements for information systems to support it, just like how POSIX is a thing because of the fed pushing for it).

To be clear: I am not advocating for a brand new query-language syntax or any kind of Python3-style overhaul, but I'd like to see SQL start to take small steps towards integrating the lessons learned from the past 60+ years of language design rather than doing the complete opposite.


The Python 2 to Python 3 migration was such a disaster that Python 3 is now used by both more programmers, and a higher proportion of programmers, than Python 2 ever managed.

This is not to minimize the pain of switching. But it does not seem to have limited the success of the language.


Python 2->3 was a disaster, but I'd refrain from extrapolation because a lot of that difficulty was very specific to dynamic typing or python itself. There are tons of languages that have gone through far more transformative changes in the core semantics of the language, and have gone a lot smoother.


One of the best ways to learn advanced Oracle specifically is through the Ask Tom Q&A. The questions are difficult and the answers teach more than any course.

https://asktom.oracle.com/pls/apex/f?p=100:1000::::::


I have been developing software that includes SQL for twenty years, and watched my own mental progress from misunderstanding to understanding. I found the biggest initial problem is that I used to imagine SQL queries as an imperative language rather than as expressions of data. Maybe in the teaching of SQL, this should be highlighted so absolute beginners can have that mental model when they are formulating solutions and grappling with the syntax.


I don't know if that's enough.

Understanding the difference between declarative and imperative programming is rather hard with all the abstractions we have today.

People always say, declarative programming is defining what you want, not doing the steps needed to get it. But today no imperative interface requires you to do all the steps either, plus, most programming languages use both paradigms at the same time.


Understanding the difference between declarative and imperative programming is rather hard with all the abstractions we have today.

The distinction is are you telling the computer *how* to do it, or telling it *what* steps to take.

If, even with access to all of the code, you'd have to ask the computer how it chose to do it to figure out what it did, you have a declarative system. If the code reads like instructions for a recipe, it is imperative.

The complications come with the fact that these two paradigms do not describe all of the possibilities. Notably, object-oriented and functional designs are neither imperative nor declarative. (But may share some features with both.)


Yes, that's probably what was always my problem.

Theory is one thing, but actual programming languages are something different. An "impure" mix.

So, when people told my language X is imperative and language Y is declarative, I got confused, because they often had parts of both.


Can you give an example? "How to do it" and "what steps to take" still sound like the same thing to me.


Sorry, I meant "what you wish done" and not "what steps to take".

A "how to do it" would look like this:

Search A, pulling out foo and bar. For each record you get, go to index B.foo, find where the B records are such that A.foo = B.foo. Now go to those records, and then read baz. Return all triples of foo, bar, baz that you found.

An equivalent "what to do" is:

    SELECT A.foo, A.bar, B.baz
    FROM A JOIN B ON A.foo = B.foo;
Note, you're not saying which table to go to first. You're not saying which index to use. You're not saying how to do it. And indeed, if there are several possible ways to do it, you don't know which one it actually chose unless you run an EXPLAIN ANALYZE and ask it to tell you.

Honestly I highly recommend the exercise of building a toy database, complete with indexes and simple tables. Then take a few queries against a real database, look at EXPLAIN ANALYZE, and then write the same code against your toy database. See how much code there is. Walk through what it does. Compare to what the real database did. Etc.

It is an amazing exercise for understanding what a database does for you. And you haven't even dealt with transactions, concurrency, etc, etc, etc.

Furthermore if you do run into one of the rare cases where the database is truly the wrong tool (this has happened to me a couple of times), knowing what a database would have done is very helpful in figuring out how you can do it better.

(Honestly, doing the same thing as the database, but in a low-level language like C++, without transactions and concurrency, and with specialized data structures, can easily lead to 1 to 2 orders of magnitude performance improvement. It is rare for this to be a good trade-off. But having done the exercise that I described when it didn't matter will position you well if you do need it.)
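As a starting point for that exercise, here is a minimal Python sketch (invented data, no real storage engine) of the two access paths such a toy database might choose for the earlier A/B join: a plain nested-loop scan versus a hash "index" on B.foo.

```python
# Toy relations standing in for tables A and B from the SQL example.
A = [{"foo": 1, "bar": "x"}, {"foo": 2, "bar": "y"}]
B = [{"foo": 1, "baz": "p"}, {"foo": 2, "baz": "q"}, {"foo": 2, "baz": "r"}]

def nested_loop_join(A, B):
    # O(len(A) * len(B)): what a database falls back to without a usable index.
    return [(a["foo"], a["bar"], b["baz"])
            for a in A for b in B if a["foo"] == b["foo"]]

def index_join(A, B):
    # Build a hash "index" on B.foo once, then probe it per row of A.
    idx = {}
    for b in B:
        idx.setdefault(b["foo"], []).append(b)
    return [(a["foo"], a["bar"], b["baz"])
            for a in A for b in idx.get(a["foo"], [])]

assert nested_loop_join(A, B) == index_join(A, B)
```

Same relational answer, very different cost - which is the planner's whole job, and exactly what EXPLAIN ANALYZE reports on.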


I still feel like a functional query language would have made for an easier learning experience:

  PROJECT(
    INNER_JOIN(
      "A",
      "B",
      EQUALITY_EXPR(
        COLUMN_EXPR("A", "foo"),
        COLUMN_EXPR("B", "foo"),
      ),
    ),
    ["A.foo", "A.bar", "B.baz"],
  )
Conveys the same intent with a much more consistent syntax than a pseudo-English grammar.


Thank you!


How to do it:

    for (i = 0; i < a.length; i++) { b[i] = a[i] + 10 }

Define your wish:

    b = a.map(x => x + 10)

But things get muddy because many languages offer higher level for-loops and forEach functions, etc.


I guess my point was that if you are trying to achieve results in a language paradigm that isn't the one the language was designed for, the learning curve is _really_ steep, and to use the declarative features of any language still requires you to understand the paradigms' differences. When I was at uni it wasn't until the final year that Programming Paradigms was a course, but even a rough introduction to them in any of the languages I studied earlier would have helped. Something like, "This language is used like this. It is not used like this, for example."


I'm not even sure I totally understood the difference now, 10 years after I studied CS.


I've made significant $$ over the years being a person who can understand the imperative code the database server will likely execute to make that declarative goodness happen.


This is actually quite important.

Some tasks that are seemingly simple in a normal programming language can sometimes be impossible to achieve in SQL (e.g. dynamically generated columns...)


That's where you create a massive beast of code which dynamically generates SQL from fragments with string interpolation!


Exactly. That's when you create a massive beast of code which dynamically generates SQL from fragments with string interpolation... in SQL.
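For the curious, a hedged sketch of what that beast's tamer cousin looks like - dynamically generating a pivot whose columns aren't known until runtime (SQLite via Python, invented schema; the interpolated values come from the database itself as integers - never interpolate untrusted input like this):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amt INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("east", 2020, 10), ("east", 2021, 15), ("west", 2020, 7),
])

# The pivot columns come from the data, so the SELECT list must be built
# at runtime - this is the "generating SQL from fragments" step.
years = [y for (y,) in conn.execute(
    "SELECT DISTINCT year FROM sales ORDER BY year")]
cols = ", ".join(
    f"SUM(CASE WHEN year = {y} THEN amt ELSE 0 END) AS y{y}" for y in years)
pivot_sql = f"SELECT region, {cols} FROM sales GROUP BY region ORDER BY region"

rows = conn.execute(pivot_sql).fetchall()
assert rows == [("east", 10, 15), ("west", 7, 0)]
```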


To me the most difficult part is learning sql before knowing what can be done with a computer, both on the complexity and the language design part.

Before learning interpreters/compilers/prolog, I'd spend a lot of time trying to figure out naming/namespaces in queries. After doing some PLT it all becomes very, very obvious: you can focus on the operators, and since you know how far programming can go, you see faster how nested queries could make sense, what aggregating functions meant, etc.


I see they mentioned previous course knowledge - this is something you see in many (programming) classes.

Students that have zero prior knowledge of programming are able to pick up functional programming pretty easily. Students that have studied and used paradigms like OOP seem to have a hard time grokking functional programming, as they see everything through the lens of OOP (and the languages they've used).


Functional and OOP techniques do not seem to be best friends.

http://steve-yegge.blogspot.com/2006/03/execution-in-kingdom... uses Java to discuss what this can look like in an extreme case.


I can’t count how many times I’ve found developers doing “select *” and doing all of the logic and sorting themselves. That goes horribly wrong when you get to production and the database is larger than RAM. Then they just stare at you and blink in total disbelief that such a thing is possible.
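The fix is usually trivial: push the filtering, sorting and limiting into the query. A small sketch of both versions (SQLite in Python, made-up table) - same answer, but the second ships three rows instead of a thousand:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, ts INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, 1000 - i) for i in range(1000)])

# Anti-pattern: drag every row into the app, then filter/sort/slice there.
rows = conn.execute("SELECT * FROM events").fetchall()
latest_in_app = sorted((r for r in rows if r[1] > 990),
                       key=lambda r: r[1], reverse=True)[:3]

# Let the database do the work; only 3 rows cross the wire.
latest_in_db = conn.execute(
    "SELECT id, ts FROM events WHERE ts > 990"
    " ORDER BY ts DESC LIMIT 3").fetchall()

assert latest_in_app == latest_in_db
```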


> This was not always the case for our participants, who wrote queries with synonyms of the correct words, leading to queries that will not be executed.

This shows that those participants lack very basic foundational knowledge. It doesn't surprise me, because in my experience all programming courses that taught SQL early have been terrible.

Nobody who already has a basic understanding of computer science would make this mistake.

On the flipside there's really no point in devoting much time to teaching SQL later, because once you have a good understanding of data structures and algorithms, it is rather easy to make educated guesses about what is happening behind the scenes in a database - and you would have no trouble teaching yourself SQL if necessary at some point.

Not to mention that teaching databases before what makes up their implementation is teaching software development in precisely the opposite way it is practiced: the composition of lower level concepts into higher level abstractions.

Last but not least, when you're teaching future software engineers, at the end you don't want them to just say "I can use this", you want them to say "I could build this". Teaching SQL early smells like surrender.


SQL has a steep learning curve. It exposes almost zero insight into the underlying query execution. As a result, an increasing number of inefficient queries are being written by ML engineers, who in general care little about query efficiency. The solution right now seems to be having a team of data engineers optimize the queries.

Should we think about an alternative, at least for ML ETL workloads?


>SQL has a steep learning curve.

Does it? I think SQL just sucks and its tooling sucks too

Even SQL Server Management Studio, which felt way better than PGAdmin, is miles behind the IntelliSense offered by Visual Studio for C# (when it comes to reliability)

SQL would benefit a lot from being like C#'s LINQ (query syntax), e.g.:

    var result = from s in stringList
                 where s.Contains("Tutorials")
                 select s;

some SQLv2 is something we need


try DataGrip, it's got really nice intellisense and autocomplete. I'm not sure how your example from linq is any better than SQL

   select s from stringList where s like '%Tutorials%'


Sometimes I wonder why SQL syntax has SELECT at the beginning. The query execution order is FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> DISTINCT -> ORDER BY -> LIMIT.

Isn't it more intuitive to write?

  from stringList where s like '%Tutorials%' select s


SQL was designed back when https://en.wikipedia.org/wiki/Fourth-generation_programming_... was a thing. One part of that was a preference for "more natural" syntax, in a sense of mapping closer to regular English. The rationale was that a sufficiently high-level language with "natural" syntax can be easily taught to non-programmers, so that they can make reports directly etc.

This supposed benefit was never actually realized; instead, we got a language more convoluted than it had to be, and programmers specializing in solving hard problems with it. This seems to be one of those cyclical things in the industry where each generation of engineers has to re-learn by repeating the same mistake - witness all the "low-code" and "no-code" buzz today.


The difference is that when you type "select (here)", your tool cannot really give you hints about column names in the table, because you haven't specified the table yet.

Meanwhile LINQ starts with table name.

Ofc you can always go ahead and write queries in "different order"


It's annoying for sure but a smart editor could simply offer you all possible columns in the current context, moving the more likely candidates to the front (as determined through some sort of machine learning - which is in fact what major IDEs have started to do lately) and then automatically append "from $table". Same thing with joins.


>It's annoying for sure but a smart editor could simply offer you all possible columns in the current context

I'm using databases with >500 tables

I guess it's not that trivial



> SQL has a steep learning curve.

Overall I don't think it is that steep, though maybe I'm blinded by having worked with various implementations of it for more than two decades. The key sticking point is jumping to thinking in a set-based manner to get the best results. The rest of the difficult parts are when you need to think about implementation details because the query planners are not perfect (index hints and such), or being aware of limitations (like postgres before the latest major version having optimisation fences around CTEs).
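
To illustrate the set-based jump with a deliberately trivial sketch (SQLite, invented table): both snippets below do the same thing, but the second describes the result for the whole set in one statement instead of iterating rows in the client.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(i, 100) for i in range(5)])

# Row-at-a-time thinking: one round trip and one statement per row.
for (acct_id,) in conn.execute("SELECT id FROM accounts").fetchall():
    conn.execute("UPDATE accounts SET balance = balance + 1 WHERE id = ?",
                 (acct_id,))

# Set-based thinking: one statement describing the whole set.
conn.execute("UPDATE accounts SET balance = balance + 1")

# Both increments applied, so every balance is now 102.
assert all(b == 102 for (b,) in conn.execute("SELECT balance FROM accounts"))
```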

> It exposes almost zero insight into the underlying query execution.

That is pretty much by design. It is intended that you say what you want and let the query planner worry about implementation details. Of course how you tell it what you want involves learning to express those intentions in SQL. It does fall apart a bit when implementation limitations become an issue, at which point you are forced to think about the underlying implementation and how you might prod this more imperative code so that it interprets and process your relational descriptions most efficiently.

> As a result, an increasing number of inefficient queries are being written by ML engineers

That isn't specific to ML. I see a lot of inefficient data interaction from code written by other devs. This seems to be for two reasons:

1. People seem to have taken “make it work, make it work correctly, only then worry about making it work fast” to heart but tend to skip that last part, and assume that because all is well with their test sets of data at hundreds or thousands of rows (or sometimes tens and singles) it'll scale just fine to the hundreds of thousands or more that the clients' datasets will eventually contain.

2. People using further abstractions without much care for how they implement their directives (again, in an ideal world they shouldn't have to), resulting in massively overcomplex queries as the framework tries to be clever and helpful and preempt what might be needed: getting everything whether needed or not (effectively `SELECT *`), which means the query planner can't apply families of its internal tricks for better performance, or getting many rows individually instead of as a set, which sometimes means a lot of extra work for each row.

There is a definite “we'll worry about that when it happens” attitude in both cases which is dangerous. The moment when a live system has practically ground to a halt and the client needs their report by EOP or someone will get it in the neck (and be sure: they will pass that on to you!) is not a good time to be optimising data access, or worse, finding out the structure just doesn't support efficient generation of the required data. Another common failing is applying what would ideally be UI or BLL concerns (timezone conversions etc.) in the SQL statements in a way that blocks index use.
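That last point deserves emphasis: wrapping an indexed column in a function (a timezone conversion, a date() call) typically makes the predicate non-sargable, so the index is ignored. A sketch using SQLite's EXPLAIN QUERY PLAN (invented table; plan wording varies slightly across SQLite versions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER, created_at TEXT)")
conn.execute("CREATE INDEX idx_created ON logs (created_at)")

def plan(sql):
    # The plan detail text is the 4th column of each EXPLAIN QUERY PLAN row.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Wrapping the column in a function hides it from the index -> full scan...
p1 = plan("SELECT * FROM logs WHERE date(created_at) = '2021-08-29'")

# ...while an equivalent range predicate on the bare column can use it.
p2 = plan("SELECT * FROM logs WHERE created_at >= '2021-08-29'"
          " AND created_at < '2021-08-30'")

assert "SCAN" in p1 and "USING INDEX" not in p1
assert "USING INDEX" in p2
```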

> Should we think about an alternative, at least for ML ETL workloads?

I don't work with ML so that is a little outside my day-to-day expertise, but I'd wager ETL there has the same problem as everywhere: the basics are all well known and very well optimised for already. The rest differs so much between applications that no one abstraction would be optimal for more than a small portion of real-world needs.

I'd be wary of a separate team for optimising queries. I suggest a reasonable level of understanding across the whole dev team, with a data expert embedded who is involved in design work and code reviews, so issues are caught early and junior devs can be tutored as needed - so by the time they are seniors they don't need the data expert except for really gnarly problems or long-term planning.


I've built a desktop app where you can load a CSV and start writing SQL on it. A lower barrier to try it out might be helpful in learning SQL.

See: https://superintendent.app



Have been using openai codex for a week and it's shockingly good at SQL with well defined prompts.


I've done backend web development with a relational DB via an ORM for 10 years. I'm OK at that, but I'm fucking hopeless at SQL. I know that my opinions are thus undermined, but I really wish we could get rid of SQL and replace it with something like logica [1] like today.

SQL's pseudo-natural language syntax is an embarrassment and its lack of composability is even more of an embarrassment.

[1] https://opensource.google/projects/logica


Have you used CTEs much (aka the WITH statement)?

I find them to be a huge step forwards in terms of adding composability to complex queries.
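For readers who haven't met them, a small sketch (SQLite, invented data) of how each CTE names an intermediate relation, so a query composes top-down instead of nesting inside-out:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amt INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("ann", 5), ("ann", 7), ("bob", 3)])

# Each WITH block defines a named relation the next step can build on.
rows = conn.execute("""
    WITH totals AS (
        SELECT customer, SUM(amt) AS total
        FROM orders
        GROUP BY customer
    ),
    big_spenders AS (
        SELECT customer FROM totals WHERE total > 4
    )
    SELECT customer FROM big_spenders ORDER BY customer
""").fetchall()

assert rows == [("ann",)]
```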


Yes, I concur: CTEs are closer to the spirit of relational algebra: every step/expression should yield a table/relation.

As data munging is about combining/correlating/sorting/grouping data, why not have a sound (bag) algebra to do that? Such algebra would give us equational reasoning, proofs, etc.

And consequently: students would be learning an algebra which is easier to learn IMO.


Thanks, yes, I have learned to use the WITH statement, and I agree it gives more composability. But still a halfway house, wouldn't you say?


gone are the days of:

    SELECT name, address
    FROM user, location
    WHERE user.locationid = location.id


Step 1: "SQL sucks!"

Step 2: Let's make a database engine that doesn't use SQL.

Step 3: "This is hard!"

Step 4: Make SQL access layer.

Wash. Rinse. Repeat.

See: PartiQL

Those who ignore the lessons of SQL are doomed to reimplement them...poorly.


Corollary:

1. "SQL doesn't scale!"

2. We made this database engine that's "web scale"!

3. "This is hard to use!"

4. Make SQL access layer.

See: Spanner

Those who blame SQL for their performance problems are doomed to repeat them using a proprietary syntax.


This is completely orthogonal to having a saner query language for a relational database. SQL is like C - it does the job, but it has so many legacy warts, it's not all that hard to do better. The only real problem is overcoming how entrenched it is in the industry, which requires any replacement to be substantially better than SQL - so that the gains can justify the logistical costs of switching.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: