The Case of a Curious SQL Query (buttondown.email/jaffray)
184 points by petercooper on Oct 25, 2023 | 40 comments



I don't really know what I am talking about, but I thought that postgres function volatility was introduced for this exact reason: as a clue to the optimizer where it can and cannot push stuff down.

https://www.postgresql.org/docs/15/xfunc-volatility.html

I do note that the postgres explain appears to keep the join filter on the top level.

  explain select count(*) from
   generate_series(0,999) a
     inner join
   generate_series(0,999)
     on random() < 0.5
  ;
                                        QUERY PLAN
  --------------------------------------------------------------------------------------
   Aggregate  (cost=25843.34..25843.35 rows=1 width=8)
     ->  Nested Loop  (cost=0.01..25010.01 rows=333333 width=0)
           Join Filter: (random() < '0.5'::double precision)
           ->  Function Scan on generate_series a  (cost=0.00..10.00 rows=1000 width=0)
           ->  Function Scan on generate_series  (cost=0.00..10.00 rows=1000 width=0)


And if you make a function with a volatility level of stable, it does a One-Time Filter, so you will either get 0 or 1000000.

  create function random2() returns float as $$
  begin
      return random();
  end;
  $$ language plpgsql stable;

  select count(*) from one_thousand join one_thousand b on random2() > 0.5;
Query Plan

  Aggregate  (cost=97615.13..97615.14 rows=1 width=8) (actual time=0.028..0.029 rows=1 loops=1)
    ->  Result  (cost=0.25..81358.88 rows=6502500 width=0) (actual time=0.026..0.027 rows=0 loops=1)
          One-Time Filter: (random2() > '0.5'::double precision)
          ->  Nested Loop  (cost=0.25..81358.88 rows=6502500 width=0) (never executed)
                ->  Seq Scan on one_thousand  (cost=0.00..35.50 rows=2550 width=0) (never executed)
                ->  Materialize  (cost=0.00..48.25 rows=2550 width=0) (never executed)
                      ->  Seq Scan on one_thousand b  (cost=0.00..35.50 rows=2550 width=0) (never executed)


After submitting, I noticed this was submitted a few times without attracting any comments, and it might be because it's a slow burn to get started. It's a neat look at how optimization approaches taken by different databases can lead to very different outcome distributions for the same query.


This was a great read, super interesting, and I hadn't seen anything like this article before.


I'm glad you posted this. I haven't seen it here (or anywhere else) before, and it was a fun read.


> SQLite plans are unfortunately not the most enlightening

> sqlite> explain select

SQLite has EXPLAIN QUERY PLAN, which gives something more like what other databases give with just EXPLAIN (about 5 years ago its output was reworked into a much more readable tree format).

https://www.sqlite.org/eqp.html
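
For example (a rough sketch; exact output varies by SQLite version, and one_thousand here is just the table name used elsewhere in the thread):

  sqlite> explain query plan
     ...> select count(*) from one_thousand a join one_thousand b on random() < 0.5;
  QUERY PLAN
  |--SCAN a
  `--SCAN b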


I had no idea, thanks!


Other tricky cases:

WITH c AS ( SELECT random() AS r FROM t1 ) SELECT * FROM c WHERE r < 0.5;

I've seen bugs where this returned rows with r > 0.5. The semantics of such functions can be very tricky when the database engine assumes it can freely evaluate everything at any point. Similarly:

SELECT random() AS r FROM t1 WHERE r < 0.5 HAVING r < 0.5;

ANSI SQL doesn't allow this query (projection comes after filtering, so you cannot filter on something that is only created in projection), but some databases do, and can give counterintuitive results.
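
For the CTE case at least, Postgres 12+ lets you pin the semantics down explicitly by forcing the CTE to be evaluated once rather than inlined into the outer query (a sketch, reusing the hypothetical t1 from above):

  WITH c AS MATERIALIZED (
      SELECT random() AS r FROM t1
  )
  SELECT * FROM c WHERE r < 0.5;

With MATERIALIZED, each r is computed exactly once, so the filter can only see values that were actually generated.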


Well, since random() doesn’t depend on anything, it could also just be evaluated once for the whole query and be treated like a constant. But really, this is simply a case of the language semantics being underspecified.


I don't think they should optimize based solely on table access. What the function does matters as well. Now, the optimizer either does not know what the function does (it could be modifying rows, who knows?) or, in the specific case of random(), it is intentionally different every single call. The only sane thing to do in this case is to avoid moving it around trying to optimize it.

Now if you could mark a function as "pure", that is, its outputs are strictly dependent on its inputs, the optimizer would be free to, well, optimize its location.

I think this is what postgres is trying to do with its function volatility syntax (volatile, stable, immutable). One of the examples in the article, cockroachdb, has inherited this syntax, but they either drew a different conclusion as to what it means, or they ignore it.
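
For reference, the volatility class is declared when the function is created; a rough sketch of the three levels (the function names here are made up for illustration):

  -- volatile (the default): re-evaluated on every call, unsafe to push around
  create function my_rand() returns float as $$ select random() $$ language sql volatile;

  -- stable: treated as constant within a single statement
  create function my_rand_stable() returns float as $$ select random() $$ language sql stable;

  -- immutable: a pure function of its inputs, may be folded at plan time
  create function plus_one(x int) returns int as $$ select x + 1 $$ language sql immutable;

(Declaring a wrapper around random() as stable is of course lying to the planner, which is exactly how the random2() example above gets its One-Time Filter.)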


In practice SQL engines will know whether functions are deterministic or non-deterministic. So with random() it will know it needs to re-evaluate it all the time because it's non-deterministic. With left(somecolumn, 2) it knows it's deterministic, so it will only re-evaluate it when somecolumn changes.


> So with random() it will know it needs to re-evaluate it all the time because it's non-deterministic.

This is often not true. SQL Server for one, and I know it isn't the only common DB that behaves this way:

  SELECT RAND() AS NotSoRandom
       , RAND() AS NotSoRandomAgain 
    FROM SomeTable
results in:

  NotSoRandom         NotSoRandomAgain
  0.417713566427078   0.417713566427078
  0.417713566427078   0.417713566427078
  0.417713566427078   0.417713566427078
  0.417713566427078   0.417713566427078
  ...
This often confuses people trying to select random rows (or all results in a random order) with ORDER BY RAND().


Actually yes, now you remind me: SQL Server doesn't recognise Rand() as non-deterministic, but it does recognise some functions as being non-deterministic, e.g. if you use newid() you get a new value for every row.
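
Which is why the usual SQL Server idiom for random ordering or sampling reaches for NEWID() rather than RAND(); roughly:

  SELECT TOP (10) *
  FROM SomeTable
  ORDER BY NEWID();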



Nit: SQLite `random()` returns a full 64-bit signed integer, so `random() < 0.5` in SQLite is (very) slightly different from others. The relevant portion of SQLite source code by the way is the `pushDownWhereTerms` function in select.c [1]; apparently it doesn't do the deterministic function check (i.e. `SQLITE_FUNC_CONSTANT`).

[1] https://www.sqlite.org/src/info/64c9bc7494f3d220a27498137551...
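
If you want a [0, 1) value in SQLite comparable to the other engines, you have to rescale the signed 64-bit result yourself; one possible sketch (avoiding abs(), which errors on the minimum 64-bit value):

  -- map the signed 64-bit range roughly onto [0, 1)
  select (random() / 9223372036854775808.0 + 1.0) / 2.0;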


whoops embarrassing. thank you for pointing that out!


Interesting investigation. I'd argue that the SQL query itself is wrong in that the predicate is in no way related to the tables so the entire thing should be rejected.


Replace

  random() < 0.5
by

  abc.a + random() < abc.a + 0.5
and you have (almost, because of the inexactness of float computations) the same query, but that isn’t “in no way related to the tables”.

Now, that predicate is biased towards table abc, but that’s correctible:

  abc.a + def.d + random()
      < abc.a + def.d + 0.5
If the SQL engine is allowed to ignore the subtleties of floats, it can still prove that in this query the predicate isn’t really related to the tables, but it can’t do that in general.

So, you can change the spec to make the simple cases invalid sql, but that won’t get rid of this problem.


You are completely correct, but please see below where I acknowledge that the predicate not being linked to the tables is only half the problem - the nondeterministic input, the rand(), is the other half. Hopefully banning nondeterministic functions in predicates, and then requiring predicates to be linked to the tables, should fix it. Good catch, thanks.


Why shouldn't I be able to randomly sample a table? Some common use cases include queues and A/B tests.


You most definitely should be able to do that, and the SQL standard provides for this (sort of, kind of): TABLESAMPLE.

This applies specifically to exactly one table so the issue of dubious predicate pushdown never arises.

That said, I understand the default sampling is at the page level, so you get database-page-worths of results (8k pages in MSSQL) rather than a properly scattered sample. AIUI the standard allows for other kinds of sampling, but MSSQL doesn't support that (yet); I believe postgres does. https://stackoverflow.com/questions/49061229/in-postgresql-h...
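
In Postgres the two standard sampling methods look roughly like this (big_table is just a placeholder; the percentage is approximate in both cases):

  -- SYSTEM: picks whole pages, fast but clumpy
  SELECT * FROM big_table TABLESAMPLE SYSTEM (1);

  -- BERNOULLI: a per-row coin flip, slower but evenly scattered
  SELECT * FROM big_table TABLESAMPLE BERNOULLI (1);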


Out of curiosity, is there a list of built-in non-deterministic SQL functions?


In Postgres, run:

select proname from pg_catalog.pg_proc where provolatile='v';


You cannot reject a predicate just because it doesn't refer to a table. SELECT * FROM t1 WHERE 2+2=4 is a completely valid query and should be allowed.


It depends. For what you're suggesting, we have the cross join/cartesian join. That would be the correct thing to use in this case, not what you've put.

But actually you have a point, in that half the problem is that the predicate is not linked to the tables; the other half is that the predicate is not predictable - after all, it is a RAND(). That compounds the issue.


I have no idea what a Cartesian product would have to do with my query; there's only a single table there.

> But actually you have a point in that half the problem is the predicate is not linked to the tables

No, it really isn't a problem at all that the predicate isn't linked to the tables. That in itself is entirely legal and unproblematic.


I'm an idiot, sorry about that. Okay, let's try again: suppose you wrote

SELECT * FROM t1 WHERE 2+2=4

Which is semantically identical to

SELECT * FROM t1

While the first is correct in that it has a semantic meaning and a valid output, wouldn't you rather have the system warn you that you've written something redundant? Because it's hardly likely a human would write the former if they meant the latter.


There are _tons_ of SQL queries that involve redundant and/or stupid things. A lot of them come from autogenerated queries where the user has nearly zero control over the actual SQL. An optimizer's job is to optimize that away, not reject them as “please be more efficient”. They are legal queries. All legal queries should be accepted and executed, no matter how silly one may consider them.

As an aside, C compilers generally try to give warnings (not errors) on things like this, but need tons of heuristics to distinguish the “obviously wrong” cases from the “not obviously wrong” cases (e.g., those that arose from inlining and/or constant folding and/or preprocessor expansion and/or dead code removal).


It's not my style, but a previous team would always add a where-clause starting with

  WHERE 1=1
because then any and all query predicates would be nicely aligned on the following lines, like this:

  WHERE 1=1
      AND t1.col1 = 1
      AND t2.col2 = 1
(Why they couldn't just lower the indent to read

  WHERE t1.col1 = 1
    AND t2.col2 = 1
is beyond me.)

Point being, don't assume that humans will never write an always-true predicate.


That's a common way to implement (faceted) search where the search expressions are dynamic depending on the user input. With this style, if the user wants to search on one or more fields, you simply add "and field = :value" and you don't have to check whether it's the first or a subsequent search expression.

The 1=1 doesn't hurt, as it will be removed by the query optimizer anyways.


It's also useful to build a query string with where clauses appearing conditionally.

  where_clause = "where 1=1"
  if video.nsfw:
      where_clause += " and age > 18"
  if whatever:
      where_clause += " and whatever"
  
  query = f"select * from user {where_clause}"


Presumably it's to make copy/pasting or otherwise adding/removing entries easy. In the latter case you have to muck about with WHERE vs AND if you modify the top entry whereas if the immediate condition after WHERE is 1=1 then all the real conditions take on the identical form of 'AND x' which can be copy/pasted, reordered, etc without complications.


100% this: part of this crew.

I have, over the course of my career, found that the teams that tend to do this are pure-SQL teams, i.e. DB developers who spend all their time writing - and more importantly - reading SQL. Doing this actually removes a lot of debugging and code-fixing friction.


Surprisingly often in real situations, when I just want to get the columns of a table with no data, I have done

select * from table where (1=0)


why not just `select * from table where false` ?


Some databases, such as Oracle, do not support boolean literals, or boolean at all as a user-accessible type.


SQL Server doesn't have boolean literals like that, hence the (1=0) trick


I think that would probably be best, unless this type of thing is well documented (although I don't see it entering ANSI SQL).


This is very enlightening. It's not surprising that it's not well defined in the spec, but it does surprise me that it differs so much between engines.


RANDOM() isn't in the spec at all, which is one of the reasons why the spec doesn't have to worry about such functions. It's just an extension that a bunch of databases happen to implement in similar ways.



