Hacker News new | past | comments | ask | show | jobs | submit login

Pre-aggregating data like this is a last resort in data warehousing. It's fragile: in this example, it depends on your data coming in exactly-once, in-order. It's also the opposite of future proof: soon, you will have a query that your pre-aggregation can't support, and you'll have to build more pre-aggregations.

Pre-aggregation is still a tool in the toolbox, but you should first consider switching to a purpose-built column-store like Redshift/BigQuery/Snowflake.




(Marco from Citus Data)

Pre-aggregation is less common in data warehouses use-cases than it is in real-time use-cases. Data warehouses are generally used by a small number of users that run ad-hoc reporting queries and can take anywhere from several seconds to a day to complete. With queries changing frequently and high query times being acceptable, pre-aggregation does not provide that much benefit.

Citus is meant for applications with a large number of users (e.g. dashboards [1]). Investing in pre-aggregation is often worth doing because it makes for more responsive applications and lower hardware cost.

> It's fragile: in this example, it depends on your data coming in exactly-once, in-order.

The solution proposed in the blog post does allow out-of-order writes since it only looks at the database clock. Exactly-once delivery depends on a) the database clock not going backwards, b) writes taking less than a minute. However, it is also possible to guarantee exactly-once delivery with a more complex solutions (e.g. create staging table, or use a version number that gets increased atomically through some locks).

[1] https://blog.cloudflare.com/scaling-out-postgresql-for-cloud...


> you should first consider switching to a purpose-built column-store like Redshift/BigQuery/Snowflake.

True, unless you have regulatory requirements that you can't store your data off-site.


Something like self hosted HDFS + Parquet + Presto/Hive would work though, wouldn't it?


Well there's lots of good on-prem columnar data warehouses. They're expensive, but you have to think in terms of total cost of ownership: a good warehouse will require less time from your team building workarounds.


I'm not sure what your threshold of expensive is, but this is available in MS SQL Server. You don't necessarily need to purchase an application solution or a specific DW RDBMS.

https://www.microsoft.com/en-us/sql-server/sql-server-editio...




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: