Hacker News new | past | comments | ask | show | jobs | submit login
The 3-minute SQL indexing quiz: Can you spot the five most common mistakes? (use-the-index-luke.com)
108 points by MarkusWinand on Nov 13, 2017 | hide | past | favorite | 32 comments



Hmm, I got 5 out of 5 for the Postgres, but the questions seem pretty easy for anyone who has dealt with trying to create useful indexes in different databases and debugged slow running queries on an index.

`EXPLAIN` is probably the most useful word in all of SQL. I didn't know how to read `EXPLAIN` results for the first few months of first starting to work with raw SQL for the first time. It made life so much easier when I could just take a query, throw an `EXPLAIN` in front of it, and have an answer as to what was being used to fetch my data.


> the questions seem pretty easy for anyone who has dealt with trying to create useful indexes in different databases and debugged slow running queries on an index.

"Should be obvious to the most dim-witted individual..." https://www.youtube.com/watch?v=j4F8jfeLog4


I got 5/5 as well. My only hesitations were, "Is this a trick question? Like, is there some little-know feature where they've special cased this so that something that doesn't seem like it ought to work actually does?"


I was on the fence for the "not enough information" question because I thought that for outlandish examples I thought there might be a chance the second one was faster, but I thought that the 99.5% case was covered.

I feel kinda robbed.


Yea, I was a little unsure about that one, but realized that the columns in the `SELECT` were the indexed column and a `COUNT` making it index-only scan. Even if the `SELECT` included `b` it still would've slowed down things a tiny bit, but I think I would've chosen "Doesn't really change performance" in that case.


The thing is, SQL Server still needs to get all of those rows from index. Job done in first case, but on the second: Given loop join, for EVERY of those rows returned you must now lookup clustered index row and see if b matches the case or not. Each lookup may take few logical reads, depending on index depth, which depends on table size. Say you have 5 logical reads for tables with millions of entries to lookup clustered index row.

Multiply returned rows which still must be filtered by b. If you get 200 rows, then 200*5 would be 1000 additional logical reads.

The result is the lower the selectivity for parameter a, the slower is the query (orders of magnitudes slower). For a low execution count query it doesn't matter that much, but if that one gets executed hundreds or thousands of times per minute, well, better to have b column either in index key column or included column - that would dramatically speed up things.


Playing devil's advocate here, especially for Postgres this isn't a clear cut case, as because of Postgres' MVCC implementation, depending on whether the rows were already frozen or not (caused by lots of mutations ongoing and not enough VACUUMs), it might have either used an index-only scan (only from 9.2 on and only if most pages are visible-for-all, so the query scanner allows the optimization). In the other cases, it would have had to fetch all rows from disk anyway, in which case "not enough information" would have been true...


Nit-Pick: Freezing of rows isn't relevant here, just whether they're all-visible.


5/5 as well. Even though I have only very limited experience in database design being a project/application manager professionally.


I also got 5 out of 5 for PostgreSQL, although I'll freely admit it was pretty much guessing on question 4.


I would guess that at least half of the programmers where I work would have failed. I wonder how competent developers are at other companies with their SQL databases?

I can tell you that as a production DBA when I'm brought in regarding a performance issue, the stuff that I see makes me wonder how some people remain employed as programmers. Full table scans on 50M row table, no problem. Query the database 10,000 times because you don't want to run one query and parse the results, no problem.

I just want to scream sometimes!!!


If you didn't get a 4 or 5 out of 5, you should go through this website, it's pretty awesome. It explains SQL performance really well, and provides equivalent examples for whatever database you use (including Oracle and DB2 in addition to MySQL and Postgres).


I got 4 out of 5 for mysql. Should have given it more thought :-)

A suggestion: also mention that the correct results with an explanation are shown at the end. It was not clear to me and therefore I decided to answer just quickly. But maybe that is intentional ;-)


How to get 5/5: Visualize a B-Tree that indexes the stated columns in the stated orders. Is the query faster to answer by traversing that B-Tree (where faster means looking up fewer records, or not needing to look them up at all)? Is there a different B-Tree we could build that makes this even faster? If you have problems visualizing a b-tree: any somewhat balanced ordered binary tree has basically the same properties.

Indexes are not complicated (at this level of detail). Maybe the bad results (only 40% got >50% right) are mostly a testament to the state of computer science education around the world.


I'm also getting 5 out of 5 for the Postgres, but I think these questions are helpful. I'm not a DBA, but just guess with my knowledge when I take the database implementation course.


I got 4/5 (background: C++/C# programmer who sometimes works with SQL queries). The one I got wrong was the first one; I thought the query optimizer would be smart enough to spot that the date function could be turned into a range query, but apparently not.

It's interesting to compare this with the 4th question, which more or less does exactly the same thing only with text instead of dates. They both have a WHERE clause that superficially tests the result of a function call but can be easily transformed into a simple range check on an indexed column. I don't know why the optimizer seems to be able to handle one case but not the other.


I got 4/5 on PostgreSQL. I selected "Not enough information" on question 5, which is supposed to be the wrong answer. But I think it's the correct answer.

Depending on the distribution of the data in the table (e.g. all rows have a = 38, but no rows have b = 1), PostgreSQL will chose a Sequential-Scan instead of an Index-Only scan for both queries. This of course leads to the 2nd query being faster due to not having to aggregate anything.

https://gist.github.com/felixge/af97c844cb1f24ac278e0357741c...


When I did the quiz (5/5), I assumed based on the query and the information given that the index was being helpful in the first example, so I assumed it was index scanning.

In that case it does have to recheck the additional condition which would definitely be slower.

However, following your assumption, I would say, both would be about the same speed because it still has to sequence scan the whole table (because more b=1 rows could exist)

Anyways. Seeing that the result could be either of the two answers, I agree that the question is under-specified


I don't agree. The same could be said for the other questions as well.

In quizzes like these its usually assumed the worst possible scenario/dataset.


Good point. But the other questions didn't have a "Not enough information" choice. So I think the question should be improved to avoid false negatives in the results.


I was under the impression that unlike MySQL, Postgres can use two indecies at once when doing a query. So in MySQL you could do a query like `SELECT * FROM tbl WHERE a = 1 ORDER BY b` and you'd want an index on (a, b). But in Postgres you could have a separate index on a and b and you'd get the performance gain. In fact I thought in Postgres land you'd be better off creating them as separate indecies.

Is this info now outdated?


Not's not just outdated, its wrong (in context of question 2) which is about indexing both, `where` and `order by` clauses. In that case you must provide a single index to benefit from the index order (so that the database doesn't actually need to perform a sort operation).


You get some speedup from both indexes, but not the same.

On a related situation, having `WHERE a = 1 AND b = 2` indexes on `a` and `b` just can not have the same performance as an index on `(a, b)`, because you will inevitably have to scan one of the indexes looking for matches. On the case you posted, of an order by, I don't think you get any speedup on the index on `b` at all.

Besides, that `LIMIT 1` at the end of the clause is important to the question.


I failed <100%. shrugs The problem I have with tests like this is it uses syntax I've never used before and I'm not sure how it works. Also the real world involves testing out and seeing what works and what doesn't. For example,

#1 CREATE INDEX tbl_idx ON tbl (date_column) SELECT COUNT() FROM tbl WHERE EXTRACT(YEAR FROM date_column) = 2017

I've never used EXTRACT() in my life, so I don't know if it's index aware. I do know in real life I would write "WHERE date_column >= '2017-1-1' AND date_column <= '2017-12-31' or if I were querying a school_year or something that spanned between two years I would add another column and probably an index on that column and query by that not by the date_column.

#2 CREATE INDEX tbl_idx ON tbl (a, date_column)

SELECT FROM tbl WHERE a = 12 ORDER BY date_column DESC FETCH FIRST 1 ROW ONLY

Who uses FETCH FIRST 1 ROW ONLY instead of LIMIT 1? Also indexes can be ordered by ASC or DESC so there's a possibility of a small optimization there.

#4 CREATE INDEX tbl_idx ON tbl (text varchar_pattern_ops)

SELECT * FROM tbl WHERE text LIKE 'TJ%'

I thought the general philosophy regarding indexes was create the ones you know you need like on foreign keys and very common ones like last_name, SSN etc and then if you notice queries running slow add more and test. This seems like one of those examples. Do you really need an index here? How often are you making this query and what's the speedup gained if you add an index?

#5 CREATE INDEX tbl_idx ON tbl (a, date_column)

SELECT date_column, count() FROM tbl WHERE a = 38 GROUP BY date_column

Let's say this query returns at least a few rows.

To implement a new functional requirement, another condition (b = 1) is added to the where clause:

SELECT date_column, count() FROM tbl WHERE a = 38 AND b = 1 GROUP BY date_column

This seems like another case in the real world where you might test what the slowdown is, and how often you're running this query. If needed you can add another index or run a subquery first, but the answer to this question can be found out in < 10seconds in the real world more quickly by testing it than even spending the time thinking about it.


> If needed you can add another index or run a subquery first, but the answer to this question can be found out in < 10seconds in the real world more quickly by testing it than even spending the time thinking about it.

This is true for some situations. But understanding indexing and databases can save you from headaches that come from structural problems. Not every query performance issue can be solved by throwing more indexes on a table and using it.

It's the idea of prophylaxis. An ounce of prevention is worth a pound of cure.

And if you are using production as a testing environment and waiting for problems in production to tweak your queries, then you probably won't last long in software development.

Indexes are the basics. Everyone should have grasp of it at the basic level.


5/5 for Postgres. I'd chalk that up less to being any kind of Postgres expert and more to having already read through most of that site (and having taken previous versions of the test).


In that case, everything below 5/5 would be very bad ;)


2 out of 5 for Oracle. But I haven't touched SQL in years, so not surprising I don't remember much. I do have to say I would have gotten 2 more correct if I wasn't so impatient :)


Working without JavaScript is cool and I got 2/5


Got 4/5. Assumed that text field is text type, which doesn't let you have index, so it was an obvious bad fit..


5/5 but only because I have ran into these problems on my own in the past.

edit: MySQL


EXPLAIN driven SQL programming is my motto.




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: