
It depends on the commutativity and associativity of your aggregation operator. With sufficient parallelism, aggregation can run in logarithmic time.
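For concreteness, a minimal sketch of that tree-shaped reduction in Python; `tree_reduce` is a hypothetical name of mine, and the sequential loop only stands in for rounds that real hardware would run in parallel:

    # Sketch: pairwise (tree) reduction. With one processor per pair at each
    # level, the depth is ceil(log2(n)) rounds, even though total work is O(n).
    # Requires `op` to be associative (and commutative if pairing order varies).
    import operator

    def tree_reduce(values, op=operator.add):
        """Reduce `values` in ceil(log2(n)) rounds of independent pairwise ops."""
        level = list(values)
        rounds = 0
        while len(level) > 1:
            # All ops in one round are independent -> could run in parallel.
            level = [op(level[i], level[i + 1]) if i + 1 < len(level) else level[i]
                     for i in range(0, len(level), 2)]
            rounds += 1
        return level[0], rounds

    total, depth = tree_reduce(range(1, 101))  # total=5050 in 7 parallel rounds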



Are you saying you can somehow sum numbers without reading all of them?


I'm saying that in terms of real time, if you have enough parallelism, you don't have to wait n units, but rather log n units. As a function of CPU time, sure, it's linear, but ignoring what parallelism buys you is just not realistic, IMHO.

As to the sibling comment, that's another good point. You say it works because the queries are known ahead of time, but you don't get excellent real-world performance without what an academic might consider cheating: special-case your data structures to the expected queries, whether those queries are discovered empirically and tuned for by hand or learned automatically over time. A closed-minded approach that shuts down creative thinking by parroting "linear time, no can do better, sorry" is poor form.
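As a toy example of that kind of special-casing: if you know the sum will be queried repeatedly, maintain it incrementally so the query itself is O(1). The `RunningSum` class below is hypothetical, purely for illustration:

    # Sketch of special-casing a structure to an expected query: pay O(1)
    # extra work per write so the known aggregate query costs O(1), not O(n).
    class RunningSum:
        def __init__(self):
            self._values = []
            self._total = 0  # precomputed aggregate, maintained on write

        def insert(self, x):
            self._values.append(x)
            self._total += x  # O(1) bookkeeping per insert

        def total(self):
            return self._total  # O(1) at query time instead of O(n)

    s = RunningSum()
    for x in (3, 1, 4, 1, 5):
        s.insert(x)
    assert s.total() == 14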


This doesn't change the fact that searching is at least as fast as aggregating, no matter how much you cheat. Besides, how is precomputing relevant to the discussion here? By that reasoning, I can convert any algorithm to O(1) time and O(1) space.

The topic is about how search was faster than aggregation, and I'm saying that searching is faster than aggregating in general. You can parallelize all you want; it's still going to be faster to search than to aggregate.


CouchDB does this. It stores precomputed sub-aggregates in the inner nodes of its B-tree, so that at query time it is only summing a handful of numbers instead of all of them.
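This isn't CouchDB's actual code, just a sketch of the same idea on a binary segment tree: each inner node caches the sum of its subtree, so a range query combines O(log n) cached sub-aggregates instead of touching every leaf:

    # Each internal node holds a precomputed sub-aggregate (here, a sum).
    class SumTree:
        def __init__(self, values):
            self.n = len(values)
            self.node = [0] * (2 * self.n)      # node[i] caches a sub-aggregate
            self.node[self.n:] = values          # leaves hold the raw values
            for i in range(self.n - 1, 0, -1):   # fill inner nodes bottom-up
                self.node[i] = self.node[2 * i] + self.node[2 * i + 1]

        def range_sum(self, lo, hi):
            """Sum of values[lo:hi], combining O(log n) cached sub-aggregates."""
            total, lo, hi = 0, lo + self.n, hi + self.n
            while lo < hi:
                if lo & 1:              # lo is a right child: take it, move right
                    total += self.node[lo]
                    lo += 1
                if hi & 1:              # hi is a right child: step left, take it
                    hi -= 1
                    total += self.node[hi]
                lo, hi = lo >> 1, hi >> 1
            return total

    t = SumTree([5, 8, 6, 3, 2, 7, 2, 6])
    assert t.range_sum(0, 8) == 39  # whole range: a handful of node reads
    assert t.range_sum(2, 5) == 11  # 6 + 3 + 2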


It can do this because the queries are known in advance, though...



