How PostgreSQL aggregation works and how it inspired our hyperfunctions’ design

codeismath · on Aug 5, 2021

The continuous aggregates portion of this blog (along with the breakdown of Transition, Combine, and Final Functions) reminded me of "A Theory of Changes for Higher-Order Languages: Incrementalizing Lambda-Calculi by Static Differentiation" (Giarrusso et. al.) [0].

Particularly the part of getting logically-consistent results via different routes of computation. David Kohn says this in the blog post: "But, you have to have in-depth (and sometimes rather arcane) knowledge about each aggregate’s internals to know which ones meet the above criteria – and therefore, which ones you can re-aggregate."

I think Giarrusso addresses the internal requirements in a precise mathematical manner (though I may be wrong). Giarrusso Section 2.1 is the specific area I'm thinking, particularly Definition 2.1 and then later the discussion of the abelian group where certain binary operations must be commutative and associative. Giarrusso Equation 1 in the introducton defines externally what it means to be logically consistent.

Also, Giarrusso talks about "Self-Maintainability" where updating the continuous result only needs to look at the changes to the input. What was obvious to me from reading Giarrusso was that something simple like "AVG" is not self-maintainable unless you keep the intermediate state (sum and count) seperately. Kohn's distinction of Transition Function, Combine Function, and Final Function - together with Giarrusso's abelian group and change structure - is kind of a big deal in making arbitrary functions "continuous"-capable while being "self-maintainable".

I can see designing a [data] structure that adheres to the specific algabraic laws defined by Giarrusso's abelian group along with the 3 functions from Kohn (missing from Giarrusso). You can then feed this structure to a function that spits out two new functions for you: a normal function that can be used to calculate a result, and it's "continuous" version which only needs change to the inputs. For example, "avg()" and "avg_continuous()" can be automatically derived from a data structure of algebraically well-behaved operations.

Plus this would have a really cool Category Theory diagram for it!

David: absolutely love the animated gifs and pictures. Very well written indeed.

[0] Giarrusso et. al. https://www.researchgate.net/publication/269126515_A_theory_...

thom · on Aug 5, 2021

Thanks for this comment and link, I think this area is one of the most interesting frontiers in computer science today. We're starting to see products like here in TimescaleDB and elsewhere in Materialize that build on these ideas, but in five or ten year's time the developer story here is going to be absolutely incredible. There is so much potential in being able to write declarative specifications for very complex stateful dataflows, that are then just maintained quickly and automagically, especially if we can do it without arbitrary windows and watermarks. Very exciting!

nixpulvis · on Aug 5, 2021

I'll have to read this paper, it seems interesting.

This problem also reminds me of this: https://jon.thesquareplanet.com/papers/phd-thesis.pdf

djk447 · on Aug 5, 2021

(NB: Post author here)

Glad you liked the GIFs! Will have to take a look at this. Thanks for sharing!

sharps_xp · on Aug 6, 2021

I also like the GIFs, which tools did you use to make these?

LoriP · on Aug 9, 2021

The GIFs were made with Adobe XD

sa1 · on Aug 6, 2021

This is the most interesting paper I’ve been pointed to in a while.

ecnahc515 · on Aug 5, 2021

This sounds a lot like how Prometheus works.

paulddraper · on Aug 5, 2021

> Continuous aggregation

I found this to be a very common need, so I created denorm [1] for doing this in vanilla PostgreSQL. (It also does incrementally updated joins as well.)

It's a real help to instant aggregation results for large data sets.

(I expect the performance of a native implementation like Timescale to be superior. Unfortunately, I use managed database services, like RDS.)

[1] https://github.com/rivethealth/denorm

nextaccountic · on Aug 6, 2021

Hey, that seems very cool!

I saw it is written in Python. Does the Python code run only when creating the sql query? Am I supposed to add the generated sql code to git?

paulddraper · on Aug 6, 2021

Correct. The Python code generates the triggers and functions to keep the rollup table up-to-date.

Add the SQL DDL to whatever migration control system you use.

nlehuen · on Aug 5, 2021

You can see the same pattern of two-step aggregation with the HLL_COUNT family of functions in BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sq...

This is really useful for this kind of metric (distinct count) that can't be trivially aggregated. It allows generating pre-aggregated tables containing the intermediate aggregation state. Now you can compute your DAUs, WAUs and MAUs from the same daily pre-aggregates.

I wish this design was generalized in SQL. For instance, getting the same family of function for variance computation would be super useful for efficient & correct computation of confidence intervals at different levels of aggregation.

djk447 · on Aug 6, 2021

Absolutely! We're actually developing a lot of that: https://github.com/timescale/timescaledb-toolkit/tree/main/d...

A number of the things you're looking for we've done experimentally and we'll be stabilizing over the next few releases. So we'd love some feedback while we're still able to futz with the API without making breaking changes.

But the two you're asking about are, I think, going to be covered by hyperloglog (we just reimplemented the internals with HLL++) and stats_agg family of functions, which have both 1D (which will give you avg, stddev, variance, etc) and 2D (co-variance, slope, intercept, x-intercept etc as well as all the 1D functions).

Would also love issues if you think we're missing other stuff, going to be generalizing this and want to make it useful for folks.

(NB: Post author here.)

nlehuen · on Aug 7, 2021

The stats_agg + extractor function pattern is exactly what I was thinking about in my last paragraph :-). Very cool!

infogulch · on Aug 6, 2021

This seems similar to differential datalog (DDlog)

https://hexgolems.com/2020/10/getting-started-with-ddlog/

https://github.com/vmware/differential-datalog

djk447 · on Aug 6, 2021

Hadn't heard of this, but really interesting, thanks for posting! Will have to see if we can use that somehow...

(NB: Post author here.)

noctarius · on Aug 5, 2021

That is a pretty good and detailed blog post. I wonder, however, if I can use any aggregate call to build out a continuous aggregate. I'm specifically think about things like sketches and hyperloglogs.

djk447 · on Aug 5, 2021

(NB: Post author here)

Yes! You can use any of the hyperfunction aggregates to build a continuous agg. Including sketches like the percentile approximation stuff.

With hyperloglog, it's still experimental, so you have to jump through some hoops to get it into a continuous agg as it could be dropped on next release, but once that's been stabilized it'll be very easy to use in continuous aggs. Feel free to test now, just know that it may get dropped and you'd have to re-create the continuous agg after the next release when it gets stabilized.

sgtfrankieboy · on Aug 5, 2021

Interesting article!

I was wondering if there plans for supporting Continuous Aggregates on Distributed hypertables? Its currently listed as not supported under the limitations docs page.

I think the two would fit together perfectly as it will allow users to aggregate very large tables.

djk447 · on Aug 5, 2021

Yes! They do fit together very well.

There are plans to do that, I'm not sure the exact timeline, I think it's in the design phase now.

We do already support partitionwise aggregation, and this meshes very well with that, so that you're sending much, much less data back over the network between the data node and access node when you're trying to do things like percentiles. But I agree that continuous aggregate support will be an even bigger win!

mfreed · on Aug 5, 2021

We plan to release support for continuous aggregates on distributed hypertables in the next month or two.

Initially version will support case where the raw data is a distributed hypertable, while the cagg table is a hypertable on the access node. Then follow-up with support where the cagg table can be distributed as well.

We are simultaneously working on support for compressed continuous aggregates as well, which we are targeting the same timeframe.

oconnor663 · on Aug 5, 2021

Noob question: Can you have time-series tables and regular Postgres tables in the same database? Is this distinction something that you decide on when you create a table?

noctarius · on Aug 5, 2021

Yes you can have both in the same database. We actually use TimescaleDB exactly that way, well at least for now. We'll in the process of separating certain tables out, so make it possible to select a pg database based on async or sync replication.

In terms of creating a Hypertable (TimescaleDB table) you create the normal table first and then transform it into a Hypertable.

clarkbw · on Aug 5, 2021

> ... make it possible to select a pg database based on async or sync replication

Can you explain why you doing this? Is this application data vs. time-series?

takeda · on Aug 5, 2021

Not op, but choosing async vs sync replication is typically done around guarantees. Async (the default mode) is generally preferred in most cases as it adds relatively small overhead. The problem with it, is that if something happens to master, the standby node might be behind (i.e. some data might be lost).

Sync on the other hand won't return from the transaction until the change is replicated.

For time series, generally async is acceptable, and even for regular data most people might also be fine with async (that's why it is the default). Sync is used if you absolutely aren't allowing any data loss. But because it waits for all nodes to confirm they wrote data to the disk writing will always be slower.

The problem is that if you use physical replication it applies to all data in the database (the logical resource not the physical server), so my guess is that some of the non-time series data has higher durability requirements.

claytonjy · on Aug 5, 2021

Yes you can, and yes you make the decision table by table. 'CREATE TABLE` is the same in either case, then `create_hypertable` to make it a timeseries table https://docs.timescale.com/api/latest/hypertable/create_hype...

miohtama · on Aug 5, 2021

I have been using TimescaleDB with Python, SQLAlchemy for a mixed timeseries and normal relational workload. The only issue I have encountered so far is that pq_restore needs some special, easy, wrapping. TimescaleDB is just an extension to PSQL.

GordonS · on Aug 5, 2021

Yes! It's a nice benefit of TimescaleDB being built on top of Postgres.

pvorb · on Aug 5, 2021

And it's the feature that sets Timescale apart from other, NoSQL, time-series databases. With Timescale you can have all your data in the same database and join regular tables against hypertables or do other useful stuff directly in the database.

throwdbaaway · on Aug 6, 2021

I don't quite understand how is it possible to rollup percentile_agg from 15 minutes buckets to 1 day buckets. Won't the resulting values be off by a lot?

djk447 · on Aug 6, 2021

Under the hood we use the uddsketch percentile approximation algorithm [1], which is able to both be combined and also provides relative error guarantees. I'll also be writing a post about our percentile approximation stuff in the future that will go into more detail on this, though I'm not going to go through all the math. The basic gist is it uses a histogram with logrithmically sized buckets to give some relative error guarantees.

(NB: Post author here)

[1]: https://docs.timescale.com/api/latest/hyperfunctions/percent...