SQL has a steep learning curve. It exposes almost zero insight into the underlying query execution. As a result, an increasing number of inefficient queries are being written by ML engineers, who in general care little about query efficiency. The solution right now seems to be having a team of data engineers optimize the queries.
Should we think about an alternative, at least for ML ETL workloads?
Does it? I think SQL just sucks and its tooling sucks too
Even SQL Management Studio, which felt way better than PGAdmin, is miles behind the IntelliSense offered by Visual Studio for C# (when it comes to reliability)
SQL would benefit a lot from being like C#'s LINQ (query syntax), e.g.:
Sometimes I wonder why SQL syntax has SELECT at the beginning. The query execution order is FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> DISTINCT -> ORDER BY -> LIMIT.
Isn't it more intuitive to write:
from stringList where s like '%Tutorials%' select s
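For comparison, here is a rough sketch of what that same query has to look like in standard SQL today (keeping the stringList/s names from the example above), with the logical evaluation order noted in comments; the clause you write first is actually applied close to the end:

    -- Sketch only, assuming a table stringList with a text column s.
    SELECT s                      -- step 3: project the columns
    FROM stringList               -- step 1: pick the source table
    WHERE s LIKE '%Tutorials%';   -- step 2: filter the rows
    -- GROUP BY/HAVING would run between steps 2 and 3, and DISTINCT,
    -- ORDER BY and LIMIT after the projection, per the order listed above.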
SQL was designed back when https://en.wikipedia.org/wiki/Fourth-generation_programming_... was a thing. One part of that was a preference for "more natural" syntax, in a sense of mapping closer to regular English. The rationale was that a sufficiently high-level language with "natural" syntax can be easily taught to non-programmers, so that they can make reports directly etc.
This supposed benefit was never actually realized; instead, we got a language more convoluted than it had to be, and programmers specializing in solving hard problems with it. This seems to be one of those cyclical things in the industry where each generation of engineers has to re-learn the lesson by repeating the same mistake - witness all the "low-code" and "no-code" buzz today.
It's annoying for sure, but a smart editor could simply offer you all possible columns in the current context, moving the more likely candidates to the front (as determined through some sort of machine learning, which is in fact what major IDEs have started to do lately), and then automatically append "from $table". Same thing with joins.
Overall I don't think it is that steep, though maybe I'm blinded by having worked with various implementations of it for more than two decades. The key sticking point is jumping to thinking in a set-based manner to get the best results. The rest of the difficult parts come when you need to think about implementation details because the query planners are not perfect (index hints and such), or when you need to be aware of limitations (like Postgres before the latest major version having optimisation fences around CTEs).
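To make the set-based point concrete, a small sketch with hypothetical customers/orders tables; the looping habit issues one statement per row, while the set-based version hands the whole problem to the planner in one go:

    -- Hypothetical schema: customers(id, active), orders(customer_id, archived).
    -- Row-at-a-time thinking, usually driven from application code:
    --   for each id in (SELECT id FROM customers WHERE active = false):
    --       UPDATE orders SET archived = true WHERE customer_id = :id;
    -- Set-based thinking: one statement over the whole set.
    UPDATE orders
    SET archived = true
    WHERE customer_id IN (SELECT id FROM customers WHERE active = false);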
> It exposes almost zero insight into the underlying query execution.
That is pretty much by design. It is intended that you say what you want and let the query planner worry about implementation details. Of course, how you tell it what you want involves learning to express those intentions in SQL. It does fall apart a bit when implementation limitations become an issue, at which point you are forced to think about the underlying implementation and how you might prod the more imperative code underneath so that it interprets and processes your relational descriptions most efficiently.
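To be fair, when you are forced to look underneath, most engines will at least show you the plan on request; in Postgres, for instance, something like this (a minimal sketch with a made-up orders table):

    -- EXPLAIN shows the plan the planner chose; EXPLAIN ANALYZE also runs the
    -- query and reports actual row counts and timings.
    EXPLAIN ANALYZE
    SELECT customer_id, count(*)
    FROM orders
    WHERE created_at >= '2023-01-01'
    GROUP BY customer_id;
    -- Mismatches between estimated and actual rows, or sequential scans where
    -- you expected an index, are the usual starting points for tuning.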
> As a result, an increasing number of inefficient queries are being written by ML engineers
That isn't specific to ML. I see a lot of inefficient data interaction from code written by other devs. This seems to be for two reasons:
1. People seem to have taken “make it work, make it work correctly, only then worry about making it work fast” to heart, but tend to skip that last part and assume that because all is well with their test sets of hundreds or thousands of rows (or sometimes tens, or single rows) it'll scale just fine to the hundreds of thousands or more that the clients' datasets will eventually contain.
2. People using further abstractions without much care for how they implement their directives (again, in an ideal world they shouldn't have to), resulting in massively overcomplex queries as the framework tries to be clever and helpful and preempt what might be needed: either getting everything whether needed or not (effectively `SELECT *`), which means the query planner can't apply whole families of its internal tricks for better performance, or getting many rows individually instead of as a set, which sometimes means a lot of extra work per row (see the sketch below).
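As a concrete (entirely made-up) illustration of that second point, this is roughly the shape of SQL such abstractions often end up emitting, versus the single set-based query you would write by hand:

    -- Sketch with a hypothetical orders table. The chatty version: one round
    -- trip per row, every column fetched whether needed or not.
    SELECT * FROM orders WHERE id = 101;
    SELECT * FROM orders WHERE id = 102;
    SELECT * FROM orders WHERE id = 103;
    -- The set-based version: one statement, only the columns that are needed.
    SELECT id, total
    FROM orders
    WHERE customer_id = 42;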
There is a definite “we'll worry about that when it happens” attitude in both cases, which is dangerous. The point where a live system has practically ground to a halt and the client needs their report by EOP or someone will get it in the neck (and be sure: they will pass that on to you!) is not a good time to be optimising data access, or worse, to be finding out that the structure just doesn't support efficient generation of the required data. Another common failing is applying what would ideally be UI or BLL concerns (timezone conversions etc.) in the SQL statements in a way that blocks index use.
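The timezone case is a typical instance of the index-blocking problem: wrap the indexed column in a conversion and a plain index on it can no longer be used, whereas converting the search boundaries instead keeps the index in play (a sketch, assuming a hypothetical orders table with an index on created_at):

    -- Hypothetical: orders(created_at timestamptz), b-tree index on created_at.
    -- Wrapping the column in the conversion means the plain index can't be used:
    SELECT * FROM orders
    WHERE CAST(created_at AT TIME ZONE 'Europe/London' AS date) = DATE '2023-06-01';
    -- Converting the boundaries instead leaves the index usable:
    SELECT * FROM orders
    WHERE created_at >= TIMESTAMP '2023-06-01 00:00' AT TIME ZONE 'Europe/London'
      AND created_at <  TIMESTAMP '2023-06-02 00:00' AT TIME ZONE 'Europe/London';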
> Should we think about an alternative, at least for ML ETL workloads?
I don't work with ML so that is a little outside my day-to-day expertise, but I'd wager ETL there has the same problem as everywhere else: the basics are all well known and very well optimised for already, and the rest differs so much between applications that no single abstraction would be optimal for more than a small portion of real-world needs.
I'd be wary of a separate team for optimising queries. I'd suggest a reasonable understanding across the whole dev team, with an embedded data expert who is involved in design work and code reviews, so issues are caught early and junior devs can be tutored as needed; by the time they are seniors they won't need the data expert except for really gnarly problems or long-term planning.