Real-time event aggregation at scale using Postgres with Citus (citusdata.com)
106 points by samaysharma on Nov 29, 2016 | 11 comments



The whole idea behind the post is quite interesting, and it definitely shows the high-performance capabilities of the system. I really appreciate the addition of parallel inserts via INSERT ... SELECT, which is quite handy.
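For illustration, here's a minimal sketch of that kind of rollup via INSERT ... SELECT, with hypothetical page_views (raw events) and page_views_1min (rollup) tables; as I understand it, Citus can run this in parallel across shards when both tables are distributed and co-located on the same column:

    -- Hypothetical schema: page_views(site_id, url, ingest_time) and
    -- page_views_1min(site_id, url, minute, view_count), both
    -- distributed by site_id so the INSERT ... SELECT can be pushed
    -- down to the shards and executed in parallel.
    INSERT INTO page_views_1min (site_id, url, minute, view_count)
    SELECT site_id,
           url,
           date_trunc('minute', ingest_time),
           count(*)
    FROM page_views
    GROUP BY 1, 2, 3;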

I have a question about the topic of the post: could the method presented also be simulated with a MATERIALIZED VIEW? Are those supported in Citus? In any case, the technique presented is pretty cool :)


Materialized views are already supported by PostgreSQL, but refreshing one requires a full table rewrite: https://www.postgresql.org/docs/9.3/static/sql-createmateria... We're also experimenting with incremental materialized views in PrestoDB; that works if you have an incremental column in an append-only table or a server_time column for each row (https://rakam.io/doc/buremba/rakam-wiki/master/Analyze-Data#...). One possible way to implement incremental materialized views in PostgreSQL is logical decoding: https://www.postgresql.org/docs/9.4/static/logicaldecoding-e...
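To make the limitation concrete, here's a plain-PostgreSQL sketch over a hypothetical events table; REFRESH recomputes the whole view rather than applying only the new rows:

    -- The view is populated once at creation time...
    CREATE MATERIALIZED VIEW daily_event_counts AS
    SELECT event_type,
           date_trunc('day', created_at) AS day,
           count(*) AS event_count
    FROM events
    GROUP BY 1, 2;

    -- ...and REFRESH re-runs the full query over the source table.
    -- (9.4+ adds REFRESH ... CONCURRENTLY, which avoids blocking
    -- readers but still recomputes everything and requires a unique
    -- index on the view.)
    REFRESH MATERIALIZED VIEW daily_event_counts;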


(Ozgun from Citus Data)

We're planning on supporting VIEWS (but not MATERIALIZED VIEWs) in Citus 6.1: https://github.com/citusdata/citus/issues/442

If you're interested in using Materialized Views in Citus, please feel free to comment on the issue or open a new one. We'd be happy to incorporate your feedback into upcoming releases.

You're also correct that the method presented in this blog post could be simulated with a distributed materialized view. One potential difference is that materialized views in PostgreSQL currently refresh all of the data from the source tables, whereas this method provides the flexibility to do rolling inserts or upserts.
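As a rough sketch of what a rolling upsert can look like (hypothetical table names, assuming a unique constraint on (site_id, url, minute) and PostgreSQL 9.5+ for ON CONFLICT):

    -- Aggregate recent raw events and merge them into the rollup.
    -- A real job would track the last-aggregated window instead of
    -- using a fixed one-minute lookback.
    INSERT INTO page_views_1min (site_id, url, minute, view_count)
    SELECT site_id,
           url,
           date_trunc('minute', ingest_time),
           count(*)
    FROM page_views
    WHERE ingest_time >= now() - interval '1 minute'
    GROUP BY 1, 2, 3
    ON CONFLICT (site_id, url, minute)
    DO UPDATE SET view_count = page_views_1min.view_count + EXCLUDED.view_count;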


+1 for VIEW support :)

I agree that the suggested method is more flexible (which means more performance for these kinds of environments/problems), while materialized views provide simplicity. I'd say they're not a must, but they may be useful for users who don't want to dig deep into these techniques.


Pre-aggregating data like this is a last resort in data warehousing. It's fragile: in this example, it depends on your data coming in exactly once, in order. It's also the opposite of future-proof: soon you will have a query that your pre-aggregation can't support, and you'll have to build more pre-aggregations.

Pre-aggregation is still a tool in the toolbox, but you should first consider switching to a purpose-built column-store like Redshift/BigQuery/Snowflake.


(Marco from Citus Data)

Pre-aggregation is less common in data warehouse use cases than it is in real-time use cases. Data warehouses are generally used by a small number of users who run ad-hoc reporting queries, which can take anywhere from several seconds to a day to complete. With queries changing frequently and high query times being acceptable, pre-aggregation does not provide that much benefit.

Citus is meant for applications with a large number of users (e.g. dashboards [1]). Investing in pre-aggregation is often worth doing because it makes for more responsive applications and lower hardware cost.

> It's fragile: in this example, it depends on your data coming in exactly-once, in-order.

The solution proposed in the blog post does allow out-of-order writes, since it only looks at the database clock. Exactly-once delivery depends on a) the database clock not going backwards and b) writes taking less than a minute. However, it is also possible to guarantee exactly-once delivery with more complex solutions (e.g. creating a staging table, or using a version number that is increased atomically under a lock).
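For the staging-table variant, here is a rough plain-PostgreSQL sketch (hypothetical table names; whether this exact form works on distributed tables depends on the Citus version):

    BEGIN;

    -- Consume the pending raw events and fold them into the rollup in
    -- a single transaction, so each event is aggregated exactly once
    -- even if it arrived late or out of order.
    WITH pending AS (
        DELETE FROM page_views_staging
        RETURNING site_id, url, ingest_time
    )
    INSERT INTO page_views_1min (site_id, url, minute, view_count)
    SELECT site_id,
           url,
           date_trunc('minute', ingest_time),
           count(*)
    FROM pending
    GROUP BY 1, 2, 3
    ON CONFLICT (site_id, url, minute)
    DO UPDATE SET view_count = page_views_1min.view_count + EXCLUDED.view_count;

    COMMIT;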

[1] https://blog.cloudflare.com/scaling-out-postgresql-for-cloud...


> you should first consider switching to a purpose-built column-store like Redshift/BigQuery/Snowflake.

True, unless you have regulatory requirements that prevent you from storing your data off-site.


Something like self-hosted HDFS + Parquet + Presto/Hive would work though, wouldn't it?


Well, there are lots of good on-prem columnar data warehouses. They're expensive, but you have to think in terms of total cost of ownership: a good warehouse will require less time from your team building workarounds.


I'm not sure what your threshold for "expensive" is, but columnar storage is available in MS SQL Server. You don't necessarily need to purchase an application solution or a specific DW RDBMS.

https://www.microsoft.com/en-us/sql-server/sql-server-editio...


... tears of joy ...

