Hacker News new | past | comments | ask | show | jobs | submit login

> There's no reason for SELECT COUNT(DISTINCT x)) to perform badly in Postgres, as long as you have an appropriate table design and indexes.

Meh. Postgres' planner doesn't know how to generate a skip-scan/loose index scan for DISTINCT. You can write it yourself, but it's a bit painful: https://wiki.postgresql.org/wiki/Loose_indexscan

If you have a low cardinality that can be a huge efficiency difference.




Actually that's wrong:

    EXPLAIN ANALYZE SELECT COUNT(DISTINCT(calc)), calc FROM price_history GROUP BY calc;
                                                          QUERY PLAN                                                           
    -------------------------------------------------------------------------------------------------------------------------------
     GroupAggregate  (cost=10713.04..11381.06 rows=10 width=5) (actual time=1010.383..1073.263 rows=11 loops=1)
       Group Key: calc
       ->  Sort  (cost=10713.04..10935.68 rows=89056 width=5) (actual time=1010.321..1049.189 rows=89041 loops=1)
             Sort Key: calc
             Sort Method: external merge  Disk: 1392kB
             ->  Seq Scan on price_history  (cost=0.00..3391.56 rows=89056 width=5) (actual time=0.007..20.516 rows=89041 loops=1)
     Planning time: 0.074 ms
     Execution time: 1076.521 ms
    (8 rows)

With Index:

    EXPLAIN ANALYZE SELECT COUNT(DISTINCT(calc)), calc FROM price_history GROUP BY calc;
                                                                         QUERY PLAN                                                                          
    -------------------------------------------------------------------------------------------------------------------------------------------------------------
     GroupAggregate  (cost=0.29..2804.82 rows=10 width=5) (actual time=0.117..47.381 rows=11 loops=1)
       Group Key: calc
       ->  Index Only Scan using price_history_calc_idx on price_history  (cost=0.29..2359.52 rows=89041 width=5) (actual time=0.054..18.579 rows=89041 loops=1)
             Heap Fetches: 83
     Planning time: 0.208 ms
     Execution time: 47.416 ms
    (6 rows)

Actually that is called a index only scan, and happens when you have a data type that is inside your index. Which means if you need a aggregate you could try to index everything you need. Mostly a aggregate only contains some values of a row so a index is mostly not a problem.


I didn't say an index couldn't be used at all. Just not to actually make the query fast. This will get all duplicates for a value from the index, before going to the next value. If you have a couple thousand or more of each to be counted value that'll make the query rather slow.


An index only scan isn't the same as a loose index scan. They're orthogonal tricks.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: