Hacker News new | past | comments | ask | show | jobs | submit login

“Think, before you DISTINCT”

(Not mine, and I don’t know the source)




I don't. I put distinct on everything by default, because it makes SQL more like relational algebra.

What's the argument for avoiding distinct? Is it performance? I've often wondered if the (sometimes redundant) distincts slow things down. Perhaps it does, but just not enough that I noticed.


Performance implication exist, but it is secondary.

Primary reason: distinct on every select shows either lack of knowledge of schema, in particular which columns make rows unique, or unfortunate schema design. (Apart from niche cases, schema should be somewhat normal. I.e. column parent_name belongs in the table parent, not in the table student)

Select a from x where myuniquekey=1; —- guaranted to return 1 or zero rows, if myuiniquekey is actually unique.

Select a from x join y on x.parent_id = y.y_id —- guaranteed to return same amount of rows as exist in y, never more, never duplicates y rows. (N-to-1 relation)

If distinct is used in any of above, then question “why?” naturally arises.

In more severe case, leads to bugs:

Select distinct student.student_name, parent.parent_name from student join parent on student.parent_id = parent.parent_id —- silently discards rows, where by accident student/parent name combo matches several times.

Technically sql allows comparing unrelated columns (colour=last_name), but for vast majority of cases, when joining, one of the side should be joined using it’s unique key, and other side should be using it’s foreign key, which ensures that duplicates don’t appear randomly, and thus distinct is not needed.


> If distinct is used in any of above, then question “why?” naturally arises.

Not if distinct is the default.

> Select distinct student.student_name, parent.parent_name from student join parent on student.parent_id = parent.parent_id —- silently discards rows, where by accident student/parent name combo matches several times.

Either with or without distinct can be a bug depending on what you are doing it for.

There are actually 4 variations on what you might want, and you can get all of them with distinct:

  select distinct student.student_id, student.student_name, parent.parent_id, parent.parent_name from ...
  select distinct student.student_name, parent.parent_id parent.parent_name from ...
  select distinct student.student_id, student.student_name, parent.parent_name from ...
  select distinct student.student_name, parent.parent_name from ...


Our main application at work is essentially a CRUD application, and I've worked on it for over 10 years now. I'm fairly confident I can count on one hand the number of cases where a join returned unexpected duplicates which DISTINCT would "fix".

Sometimes I wonder if we're just weird, somehow avoiding this issue.


These examples reminded me one more issue: change in column selection, might change number of rows, which means column addition/removal is so much riskier afair.

> Not if distinct is the default.

If that works for you, great, but let’s agree to disagree here.


We have different mental models.

Your mental model, if you will forgive the straw man, is that SELECT over multiple tables is conceptually equivalent to nested for-loops over each table, and the WHERE condition is an if-statement.

My mental model is that I'm working with sets. If yesterday I asked for the set of CITY,COUNTRY, and today I've changed that to the set of COUNTRY, then obviously the result set today is going to be much smaller. This is not a risk to me -- asking for a different set gives me a different set, I can't imagine being surprised by that.


The article clearly explains, as do a number of comments here.


If it was that clear, then I wouldn't have asked.




Consider applying for YC's Fall 2025 batch! Applications are open till Aug 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: