
Another postgresql performance gotcha:

Find all the coupons that are expired (90 day expiration):

    SELECT * FROM coupon
    WHERE created_at + INTERVAL '90 DAY' < now()
This will not use the index on the "created_at" column and will be slow.

You should rewrite the inequality to:

    SELECT * FROM coupon
    WHERE created_at < now() - INTERVAL '90 DAY'
and now the query will be much faster. There are a lot of cases in postgres where simple equivalent algebraic manipulations can completely change the query plan.
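You can see the difference with EXPLAIN. A minimal sketch, assuming an ordinary btree index on the column (the index name here is made up):

    CREATE INDEX coupon_created_at_idx ON coupon (created_at);

    -- Arithmetic on the column side: typically a Seq Scan over the whole table
    EXPLAIN SELECT * FROM coupon
    WHERE created_at + INTERVAL '90 DAY' < now();

    -- Constant on the right-hand side: the planner can use the index
    EXPLAIN SELECT * FROM coupon
    WHERE created_at < now() - INTERVAL '90 DAY';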


They are not equivalent since `created_at + INTERVAL '90 DAY'` can overflow for every single row whereas `now() - INTERVAL '90 DAY'` is a constant for the purpose of the query execution.


Yes - this is a common restriction in any DB I've used, certainly in MS SQL Server. The idea is that your queries need to be "SARGable": https://en.wikipedia.org/wiki/Sargable


The DB we use (SQLAnywhere) doesn't consider now() a constant either, so no indexes are considered just due to that alone. #thingsilearnedinproduction


Wow, that's a dealbreaker for me.


But that would never be desirable, so it's just another reason to do the other?


Yeah this seems very logical to me. I wouldn't call it a "gotcha".


What does "can overflow for every single row" mean in this context?


The maximum value in a postgres timestamp is `294276-12-31 23:59:59.999999`. Overflow means that `created_at + interval '90 days'` exceeds this value. This causes an error.
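It's easy to reproduce (a quick sketch; any timestamp within 90 days of the maximum will do):

    -- Adding 90 days pushes the result past the maximum timestamp
    SELECT TIMESTAMP '294276-12-31' + INTERVAL '90 DAY';
    -- ERROR:  timestamp out of range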


Ah, so a traditional overflow. It's just that it didn't come to my mind that dates this huge would actually be used, so I was wondering if it was some other thing being referred to. Thanks.


I was wondering the same thing, but after staring at it a bit I think the problem is that one of them is doing math on the values from every row and the other is doing the math once.

created_at + INTERVAL '90 DAY' < now() says that for every row take the created_at column, add 90 days to it, and then see if it is less than now()

created_at < now() - INTERVAL '90 DAY' says take now() subtract 90 days, and then see which rows are less than the result.

At least, that's my guess. I rarely do any DB stuff directly.


Is "overflow" a term used to express "computed for every row"?

I can see where the optimization would come from, when comparing `created_at` with a fixed value `now() - INTERVAL...` (assuming PostgreSQL is smart enough to evaluate it only once and reuse it for all the index comparisons), but the word "overflow" throws me off.


This smells less like an optimization developers should make and more like a bug or low-hanging-fruit improvement to the engine.


These 2 predicates are totally different.

The predicate in the 1st statement is on the expression "created_at + INTERVAL '90 DAY'", not on the bare column "created_at".

Some databases allow users to create indexes on expressions. So if you want to write the 1st statement, you need an expression index, not a normal index.
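In Postgres that looks something like this (a sketch; the index name is made up, and note the double parentheses around the expression):

    -- Only works if created_at is a plain timestamp; with timestamptz the
    -- + operator is merely STABLE, not IMMUTABLE, and Postgres rejects it
    -- in an index expression.
    CREATE INDEX coupon_expiry_idx
        ON coupon ((created_at + INTERVAL '90 DAY'));

With that in place the planner can match the 1st statement's predicate against the indexed expression.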


Wow, is there any public list or documentation about these common cases and how to make them faster in PG? I would expect the PG query optimizer to fix this automatically, but as it doesn't, having this documentation would be of great use for many developers. Thanks for sharing!


Sites like https://use-the-index-luke.com/ capture a lot of wisdom around tuning. But IMO, it's easier to learn from doing.

So write your product, then start monitoring it as you release it to production.

Postgres can track aggregate metrics for queries using the pg_stat_statements extension [1]. You then monitor this periodically to find queries that are slow, then use EXPLAIN ANALYZE [2] to dig in. Make improvements, then reset the statistics for the pg_stat_statements view and wait for a new crop of slow queries to arise.

[1]: https://www.postgresql.org/docs/current/pgstatstatements.htm...

[2]: https://www.postgresql.org/docs/current/using-explain.html
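The kind of query you end up running against that view looks roughly like this (a sketch; the column is mean_exec_time on Postgres 13+, mean_time on older versions):

    -- Top 10 statements by average execution time
    SELECT query, calls, mean_exec_time
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 10;

    -- Start a fresh measurement window after a round of fixes
    SELECT pg_stat_statements_reset();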


When releasing a new application (or feature) I've always loaded each table in my development environment's database with a few million rows. Tools like Python's Factory Boy [1] or Ruby's Factory Bot [2] help get the data loaded.

After the data is loaded up, start navigating through the application and it will become evident where improvements need to be made. Tools like Django Debug Toolbar [3] help expose where the bad ORM calls are, as does tailing the Postgres log files.

[1] https://github.com/FactoryBoy/factory_boy

[2] https://github.com/thoughtbot/factory_bot

[3] https://github.com/jazzband/django-debug-toolbar
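If the schema is simple and you just want volume, you can also do it straight in SQL (a sketch against the coupon table from upthread; assumes the other columns have defaults):

    -- A million rows with created_at spread over the past year
    INSERT INTO coupon (created_at)
    SELECT now() - (random() * INTERVAL '365 DAY')
    FROM generate_series(1, 1000000);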


It can't "fix" it because it isn't broken; they are not the same predicate.



