Find all the coupons that are expired (90 day expiration):
SELECT * FROM coupon
WHERE created_at + INTERVAL '90 DAY' < now()
This will not use the index on the "created_at" column and will be slow.
You should rewrite the inequality to:
SELECT * FROM coupon
WHERE created_at < now() - INTERVAL '90 DAY'
and now the query will be much faster. There are a lot of cases in Postgres where simple, seemingly equivalent algebraic manipulations can completely change the query plan.
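If you want to see the difference yourself, EXPLAIN makes it visible. A minimal sketch, assuming a plain btree index on created_at (the index name here is hypothetical):

CREATE INDEX coupon_created_at_idx ON coupon (created_at);

-- The original form typically produces a sequential scan, because
-- the indexed column is buried inside an expression:
EXPLAIN SELECT * FROM coupon
WHERE created_at + INTERVAL '90 DAY' < now();

-- The rewritten form can use an index scan: the right-hand side is
-- computed once and compared directly against the indexed column:
EXPLAIN SELECT * FROM coupon
WHERE created_at < now() - INTERVAL '90 DAY';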
They are not strictly equivalent, since `created_at + INTERVAL '90 DAY'` has to be evaluated for every single row and can overflow, whereas `now() - INTERVAL '90 DAY'` is a constant for the purposes of query execution.
Yes - this is a common restriction in any DB I've used, certainly in MS SQL Server. The idea is that your queries need to be "SARGable": https://en.wikipedia.org/wiki/Sargable
The maximum value in a postgres timestamp is `294276-12-31 23:59:59.999999`. Overflow means that `created_at + interval '90 days'` exceeds this value. This causes an error.
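If you're curious, you can trigger the overflow directly; something like this should fail with a "timestamp out of range" error (the boundary value is from the Postgres docs):

-- Adding 90 days to a timestamp near the maximum overflows:
SELECT TIMESTAMP '294276-12-31 23:59:59' + INTERVAL '90 days';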
Ah, so a traditional overflow. It just hadn't occurred to me that dates that huge would actually be used, so I was wondering if it referred to something else. Thanks.
I was wondering the same thing, but after staring at it a bit I think the problem is that one of them is doing math on the values from every row and the other is doing the math once.
`created_at + INTERVAL '90 DAY' < now()` says: for every row, take the created_at column, add 90 days to it, and then see if the result is less than now().
`created_at < now() - INTERVAL '90 DAY'` says: take now(), subtract 90 days, and then see which rows are less than the result.
At least, that's my guess. I rarely do any DB stuff directly.
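One way to convince yourself of the "math once" part: now() returns the transaction start time, so it is fixed for the whole statement no matter how many rows are scanned.

-- Both calls see the same fixed value, so this is always true:
SELECT now() = now();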
Is "overflow" a term used to express "computed for every row"?
I can see where the optimization would come from when comparing `created_at` with a fixed value `now() - INTERVAL...` (assuming PostgreSQL is smart enough to evaluate it only once and reuse it for all the index comparisons), but the word "overflow" throws me off.
The predicate in the 1st statement is actually on the expression `created_at + INTERVAL '90 DAY'`, not on the column `created_at`.
Some databases allow users to create indexes on expressions. So if you want to write the 1st statement, you need an index on the expression, not a normal index.
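In Postgres that would look something like the sketch below (the index name is made up, and it assumes created_at is a plain timestamp; with timestamptz the addition is only STABLE, not IMMUTABLE, so it can't be indexed):

-- Note the extra parentheses Postgres requires around the expression:
CREATE INDEX coupon_expiry_idx
ON coupon ((created_at + INTERVAL '90 DAY'));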
Wow, is there any public list or documentation of these common cases and how to make them faster in PG? I would expect the PG query optimizer to fix this automatically, but since it doesn't, such documentation would be of great use to many developers. Thanks for sharing!
So write your product, then start monitoring it as you release it to production.
Postgres can track aggregate metrics for queries using the pg_stat_statements extension [1]. You then monitor this periodically to find queries that are slow, then use EXPLAIN ANALYZE [2] to dig in. Make improvements, then reset the statistics for the pg_stat_statements view and wait for a new crop of slow queries to arise.
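Roughly, the loop looks like this (pg_stat_statements has to be listed in shared_preload_libraries first, and the column is mean_exec_time on Postgres 13+, mean_time on older versions):

-- Enable the extension once per database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find the slowest queries by average execution time:
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- After making improvements, reset the statistics and wait:
SELECT pg_stat_statements_reset();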
When releasing a new application (or feature) I've always loaded each table in my development environment's database with a few million rows. Tools like Python's Factory Boy [1] or Ruby's Factory Bot [2] help get the data loaded.
After the data is loaded up, start navigating through the application and it will become evident where improvements need to be made. Tools like Django Debug Toolbar [3] help expose where the bad ORM calls are, as does tailing the Postgres log files.
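If you just need row volume rather than realistic objects, plain SQL with generate_series works too; a sketch against the coupon table from upthread (assuming created_at is its only required column):

-- Insert two million rows with timestamps spread over the past year,
-- so query plans start to resemble production:
INSERT INTO coupon (created_at)
SELECT now() - (random() * INTERVAL '365 days')
FROM generate_series(1, 2000000);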