If you are an application programmer needing to query the database, I wholeheartedly recommend "SQL Performance Explained" by Markus Winand, or his website: https://use-the-index-luke.com/
that was eye-opening indeed. Before reading his blog, I had no clue how indexes worked and how to write SQL that match the index structure (for instance the order of conditions in the where clause matter if you want to leverage a multi-column index).
And IIRC this holds MongoDB and I'd assume other non-SQL DBs.
If you have an index on <companyId, userId>, and you query with a userID only, that index wont be used. But if the index was <userId, companyId> then that index would be used. Or if you supplied both userId and companyId in your query, then either index would work.
sometimes will switch your query from doing a full table scan to using an index, fixing your problem for long enough to prepare a fix to add the appropriate index. Not that I've ever had to do something like that or anything.
> If you have an index on <companyId, userId>, and you query with a userID only, that index wont be used.
Surprisingly, it could be used in some circumstances, just not for the regular seek.
If the index is small (compared to the base table), the DBMS may decide to perform a full index scan (instead of the full table scan), especially if your SELECT list doesn't contain columns which are not in the index.
And Oracle can employ so called "skip scan" if it realizes that the number of distinct companies is small. This is essentially a separate seek under each distinct company.
I have also enjoyed his blog. Here's another way to understand indexes by implementing them in a simple SQL database. This includes both storing/retrieving indexed values and also deciding whether or not a query has an applicable index.
That book was great and I learned a huge amount by reading it. But, unless it's been updated (I read it years ago but it's also an older book), there were numerous typos and even some code errors.
I found the book apprisal somewhere on Reddit in "advanced books on postgres". I expected to find postgres optimization and at least a few pages on how to scale it (or where to look for solutions), instead I was dissapointed.
While it's still a very decent book on SQL syntaxes and advanced queries (I bet even 10y+ PostgreSQL users don't know half of what PostgreSQL can do) here is why I don't recommend it:
- poorly written instructions on how to set up Postgres 11 with all those crazy extensions
- No epub version (it's simply broken and nobody responds on my emails, instead I get a reminder to continue my learning course which has 2 links, a short intro and a link to the book. I've stopped at 50% just because the book was downloaded from another browser and now I get these irritating emails)
- There is a lot of Python in the book, I want to learn SQL and PostgreSQL, not Python or any other sort of languages. I use Py professionally but it does not mean I want it to be tied to SQL.
- There are unrelated interviews, for instance, a guy who worked on Django ORMs (how is this related to my SQL knowledge)
Don't get me wrong, don't treat this comment as a rant about how bad the book is, the SQL tricks are really awesome there, however if you think about buying this book, go for 49$ version and expect to have some pain to setup everything.
Fully agree with you. I bought it, just to discover, that it is somehow just a simple introduction to SQL.
Also what I found irritating: Nowhere on the marketing page of the book I found a reference to the PostgreSQL-version for which it was written or applicable. Most probably because it is only very loosely coupled with PostgreSQL.
If you want something more advanced and fundamental, I'd recommend classic Database System Implementation[0] by Hector Garcia-Molina; Jennifer D. Widom; Jeffrey D. Ullman. It's still relevant. Technologies change, but fundamental principles and algorithms... change order of magnitude slower.
Rob Conery's A Curious Moon (https://bigmachine.io/products/a-curious-moon/) is an interesting read for a PostgreSQL book. I wouldn't say it tries to be an advanced book by design, but I learned a lot of modern SQL from it.
I think the number of people needing truly advanced books is small whereas a lot of people need to jump from beginners to intermediate, or intermediate to advanced.
OMG yes! Andy Pavlo is such a good teacher. I watched his entire lecture series, and it really helped me prepare for an interview. I couldn't recommend his MOOC enough if you really want to know how a database works. I just wish that the homework assignments were available to the public.
Otherwise there is surprisingly a lack of information about architecting databases online and you have to go to the source code of SQLite and PostgreSQL. I suspect this is because databases are some of the most complex systems in computer science and involve many trade secrets.
I have long thought that the best training for database is a course on set theory and group theory, coupled with a close reading of William Kent's "Data and Reality".
The book is a mess, but its structure illustrates its several points.
I am a bit biased here in terms of resources, I worked with a few people to create a free resource that covers intro to sql optimization: https://dataschool.com/sql-optimization
I also created a bunch of animations to help people build mental models for joins, aggregations, window functions, and subqueries which I find is the main blocker for people using SQL optimally: https://dataschool.com/how-to-teach-people-sql/
There is something to be said for having an advanced knowledge of SQL and then carefully staying a very long distance from complicated queries. I can't think of a language where the syntax buckles worse than SQL when things get complicated.
Sometimes it makes sense to roll up the sleeves and do unspeakable things with SQL, but that isn't some sort of win.
Having occasionally wrangled with relativley complicated queries and jungles of views ans procedures, I have to ask that what would be a better way to get complicated stuff out from databases?
I can think of two other options, which I personally both dislike:
1. Some fancy graphical ETL monster which takes ages to learn and where the learned skills are more or less untransferable to anywhere else. And which, at the end, makes same things as raw SQL but just in a more opaque way.
2. Build the complexity outside database. Yes, you avoid complex SQL, but you also lose quite a lot. SQL prohibits you from doing quite a many different stupid mistakes with your data. At least 99 times out of 100 databases have better performance making the complicated calculations than your home brewed solution outside the database (Yes, I agree. The exceptions can be notable...) Finally, reusability of the results is way easier if you keep as much calculations as possible in the database.
But I can't say that I love hairy SQL, so if there are better ways, I would be keen to have a look.
Re: point 2. I'm not necessarily advocating it, but one advantage of doing work outside of SQL is testability. Esp. unit tests. It's still relatively difficult to test SQL, for a variety of reason. I'm working with dbt now, which offers some simple validation out of the box (verify such and such is non-null, for example). More complex tests, such as "verify that some rows/columns have some expected values," require writing a query (returning 0 rows on success, 1 or more on failure.) It works, but still rather clumsy because a) requires an actual database (ie, not really unit testing) and b) writing tests this way is pretty tedious and error prone compared to using a typical assertion framework a la python's unittest.
I totally agree the options are all somewhat unpleasant. I've found dbt to be a nice step in the right direction.
Yep, and another is that version control support of databases is still a bit limited. (There are some solutions, at least redgate is doing something for SQL server on this.) I have a workflow where I recreate all views and procedures in the beginning of the query batch from text files and that gives some ways to version control my views, but as you say, it is clumsy.
I've been having a bit of joy using Dataform [] recently to "modularise" SQL code into more manageable views. It brings some of the best practices of software development to the SQL hellscape, making management of complicated SQL a pleasure.
People do dumb things with SQL, but depending on the query/set of columns, 2k is a pretty quick line count to reach.
Just dont use select *, have a wide table, do more than a few trivial operations moving data around, and you'll get to 2k pretty quick.
That being said, people do dumb stuff with SQL all the time;
The worst query I have ever seen was 11MB, megabytes.
The worst query plan I have ever seen generated a plan that was ~130MB.
I regularly have people ask questions in #sqlhelp (https://sqlps.io/slack/ SQL Server focused) like "Why can I only have 2100 parameters to a stored procedure?" (dude was super mad that I asked if his design might use some rework.)