I don't really know what I am talking about, but I thought that Postgres function volatility was introduced for this exact reason, that is, as a clue to the optimizer about where it can and cannot push stuff down.
And if you make a function with a volatility level of STABLE, it does a One-Time Filter. So you will either get 0 or 1000000:
create function random2() returns float as $$
begin
  return random();
end;
$$ language plpgsql stable;  -- stable: the planner may evaluate it once per statement
select count(*) from one_thousand join one_thousand b on random2() > 0.5;
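For anyone who wants to reproduce this, a minimal setup might be (the table name comes from the query above; its contents are my assumption):

-- hypothetical setup: 1000 rows, so the self-join yields 1,000,000 pairs
create table one_thousand as
  select generate_series(1, 1000) as id;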
After submitting, I noticed this was submitted a few times without attracting any comments, and it might be because it's a slow burn to get started. It's a neat look at how optimization approaches taken by different databases can lead to very different outcome distributions for the same query.
WITH c AS (
SELECT random() AS r FROM t1
)
SELECT * FROM c WHERE r < 0.5;
I've seen bugs where this returned rows with r > 0.5. The semantics of such functions can be very tricky when the database engine assumes it can freely evaluate everything at any point. Similarly:
SELECT random() AS r FROM t1 WHERE r < 0.5 HAVING r < 0.5;
ANSI SQL doesn't allow this query (projection comes after filtering, so you cannot filter on something that is only created in projection), but some databases do, and can give counterintuitive results.
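For what it's worth, Postgres 12+ lets you pin down the evaluate-once semantics of the earlier WITH query explicitly (a sketch, assuming the same t1):

WITH c AS MATERIALIZED (
  SELECT random() AS r FROM t1
)
SELECT * FROM c WHERE r < 0.5;
-- MATERIALIZED blocks inlining, so the filter only ever sees
-- the stored values and can never return a row with r > 0.5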
Well, since random() doesn’t depend on anything, it could also just be evaluated once for the whole query and be treated like a constant. But really, this is simply a case of the language semantics being underspecified.
I don't think they should optimize based solely on table access. What the function does matters as well. Now, the optimizer either does not know what the function does (it could be modifying rows, who knows?) or, in the specific case of random(), it is intentionally different on every single call. The only sane thing to do in this case is to avoid moving it around trying to optimize it.
Now, if you could mark a function as "pure", that is, its outputs are strictly dependent on its inputs, the optimizer would be free to, well, optimize its location.
I think this is what Postgres is trying to do with its function volatility syntax (VOLATILE, STABLE, IMMUTABLE). One of the examples in the article, CockroachDB, has inherited this syntax, but they either drew a different conclusion as to what it means, or they ignore it.
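As a rough illustration of the three levels (a sketch; the function bodies here are made up):

-- immutable: pure, outputs depend only on inputs; may be constant-folded
create function add_one(x int) returns int as $$ select x + 1 $$
language sql immutable;
-- stable: fixed within a single statement, like now()
create function stmt_time() returns timestamptz as $$ select now() $$
language sql stable;
-- volatile: may change on every call, like random(); this is the default
create function roll() returns float as $$ select random() $$
language sql volatile;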
In practice SQL engines will know whether functions are deterministic or non-deterministic. So with random() it will know it needs to re-evaluate it all the time because it's non-deterministic. With left(somecolumn, 2) it knows it's deterministic, so it will only re-evaluate it when somecolumn changes.
Actually yes, now you remind me: SQL Server doesn't recognise RAND() as non-deterministic, but it does recognise some functions as being non-deterministic, e.g. if you use NEWID() you get a new value for every row.
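A quick way to see the difference in SQL Server (assuming some table t1 with a few rows in it):

SELECT RAND()  AS r,   -- same value repeated on every row
       NEWID() AS id   -- a fresh GUID per row
FROM t1;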
Nit: SQLite's `random()` returns a full 64-bit signed integer, so `random() < 0.5` in SQLite is (very) slightly different from the others. The relevant portion of the SQLite source, by the way, is the `pushDownWhereTerms` function in select.c [1]; apparently it doesn't do the deterministic-function check (i.e. `SQLITE_FUNC_CONSTANT`).
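Concretely (a sketch, assuming some table t1): since the value spans the full signed 64-bit range, the comparison is effectively a sign test.

-- passes for every value <= 0, i.e. with probability (2^63 + 1) / 2^64
SELECT * FROM t1 WHERE random() < 0.5;
-- a closer analogue of the [0, 1) version the other engines use:
SELECT * FROM t1 WHERE (random() / 9223372036854775808.0 + 1.0) / 2.0 < 0.5;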
Interesting investigation. I'd argue that the SQL query itself is wrong in that the predicate is in no way related to the tables so the entire thing should be rejected.
Change the predicate from `random() < 0.5` to `abc.a + random() < abc.a + 0.5` and you have (almost, because of the inexactness of float computations) the same query, but that isn’t “in no way related to the tables”.
Now, that predicate is biased towards table abc, but that’s correctable:
abc.a + def.d + random()
< abc.a + def.d + 0.5
If the SQL engine is allowed to ignore the subtleties of floats, it can still prove that in this query the predicate isn’t really related to the tables, but it can’t do that in general.
So, you can change the spec to make the simple cases invalid SQL, but that won’t get rid of this problem.
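To make the float subtlety concrete (Postgres-flavoured, with explicit float8 casts so the literals don't default to exact numerics):

-- at this magnitude a float8's ULP is 2, so both sums round to 1e16
SELECT 1e16::float8 + 0.4::float8 < 1e16::float8 + 0.5::float8;  -- false
SELECT 0.4::float8 < 0.5::float8;                                -- true

Adding the same large term to both sides is not a no-op in floating point, which is why the engine can't prove such rewrites safe in general.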
You are completely correct, but please see below where I acknowledge that the predicate not being linked to the tables is only half the problem; a nondeterministic input, the rand(), is the other half. Hopefully banning nondeterministic functions in predicates, and then requiring predicates to be linked to the tables, should fix it. Good catch, thanks.
You most definitely should be able to do that, and the SQL standard provides for this (sort of, kind of): TABLESAMPLE.
This applies specifically to exactly one table so the issue of dubious predicate pushdown never arises.
That said, I understand the SQL standard samples at the page level, so you get a database-page's worth of results (8 KB in MSSQL) rather than a properly scattered sample. AIUI the standard allows for any other kind of sampling, but MSSQL doesn't support that (yet); I believe Postgres does. https://stackoverflow.com/questions/49061229/in-postgresql-h...
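The two stock Postgres methods look like this (a sketch, assuming a table t1):

-- SYSTEM samples whole pages (fast, but clustered);
-- BERNOULLI samples individual rows (slower, properly scattered)
SELECT * FROM t1 TABLESAMPLE SYSTEM (10);
SELECT * FROM t1 TABLESAMPLE BERNOULLI (10);
-- REPEATABLE seeds the sampler, sidestepping random()-in-a-predicate entirely
SELECT * FROM t1 TABLESAMPLE BERNOULLI (10) REPEATABLE (42);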
You cannot reject a predicate just because it doesn't refer to a table. SELECT * FROM t1 WHERE 2+2=4 is a completely valid query and should be allowed.
It depends. For what you're suggesting, we have the cross join/cartesian join. That would be the correct thing to use in this case, not what you've put.
But actually you have a point in that half the problem is that the predicate is not linked to the tables; the other half is that the predicate is not predictable, since after all it is a RAND(). That compounds the issue.
I'm an idiot, sorry about that. Okay, let's try again: suppose you wrote
SELECT * FROM t1 WHERE 2+2=4
Which is semantically identical to
SELECT * FROM t1
While the first is correct in that it has a semantic meaning and a valid output, wouldn't you rather have the system warn you that you've written something redundant? Because it's hardly likely a human would write the former if they meant the latter.
There are _tons_ of SQL queries that involve redundant and/or stupid things. A lot of them come from autogenerated queries where the user has nearly zero control over the actual SQL. An optimizer's job is to optimize that away, not reject them as “please be more efficient”. They are legal queries. All legal queries should be accepted and executed, no matter how silly one may consider them.
As an aside, C compilers generally try to give warnings (not errors) on things like this, but need tons of heuristics to distinguish the “obviously wrong” cases from the “not obviously wrong” cases (e.g., those that arose from inlining and/or constant folding and/or preprocessor expansion and/or dead code removal).
That's a common way to implement (faceted) search where the search expressions are dynamic, depending on the user input. With this style, if the user wants to search on one or more fields you simply add "and field = :value" and don't have to check whether it's the first or a subsequent search expression.
The 1=1 doesn't hurt, as it will be removed by the query optimizer anyways.
It's also useful to build a query string with where clauses appearing conditionally.
where_clause = "where 1=1"
if video.nsfw:
    where_clause += " and age > 18"
if whatever:
    where_clause += " and whatever"
query = f"select * from user {where_clause}"
Presumably it's to make copy/pasting or otherwise adding/removing entries easy. In the latter case you have to muck about with WHERE vs AND if you modify the top entry whereas if the immediate condition after WHERE is 1=1 then all the real conditions take on the identical form of 'AND x' which can be copy/pasted, reordered, etc without complications.
I have, over the course of my career, found that the teams that tend to do this are pure-SQL teams, i.e. DB developers who spend all their time writing, and more importantly reading, SQL. Doing this actually removes a lot of debugging and code-fixing friction.
This is very enlightening. It's not surprising that this isn't well defined in the spec, but it does surprise me that it differs so much between engines.
RANDOM() isn't in the spec at all, which is one of the reasons why the spec doesn't have to worry about such functions. It's just an extension that a bunch of databases happen to implement in similar ways.
https://www.postgresql.org/docs/15/xfunc-volatility.html
I do note that the Postgres EXPLAIN appears to keep the join filter at the top level.