
I am genuinely interested in Materialize's capability to incrementally maintain views, and I understand there are all sorts of limitations as to when that's even possible - but I can't find a comprehensive list of them. I don't think it's fair to say you support every possible select statement and then just have some of them be slow. The lateral join case was the first warning I encountered in the docs - is that the ONLY case, and can every other possible select statement be incrementally maintained?


All queries are incrementally maintained with the property that we do work proportional to the number of records in difference at each intermediate stage of the query plan. That includes those with lateral joins; they are not an exception.

I'm not clear on your "all sorts of limitations"; you'll have to fill me in on them?


> I'm not clear on your "all sorts of limitations"; you'll have to fill me in on them?

This feels like bait, but honestly I'm under the impression that incrementally updating materialized views (where optimal means work proportional to the number of changed records) just isn't always possible. For example, max and min aggregates aren't supported in SQL Server because updating the current max or min record requires a query to find the new max or min record - that's not considered an incremental update, so it's not supported and trying to materialize the view fails. There are a number of cases like this, and a big part of problem solving with SQL Server is figuring out how to structure a view within these constraints. If you can, then you can rest assured that updates will be incremental and performant - this is important because performance is the feature; if the update is slow then my app is broken. If Materialize has a list of constraints shorter than SQL Server's then you're sitting on technology worth billions - it's hard for me to believe that your list of constraints is "there are none" especially when there are explicit-but-vague performance warnings in the docs.


(Disclaimer: I'm one of the engineers at Materialize)

> for example, max and min aggregates aren't supported in SQL Server because updating the current max or min record requires a query to find the new max or min record

This isn't a requirement in Materialize, because Materialize will store values in a reduction tree (which is basically like a min / max heap) so that when we add or remove a record, we can compute a new min / max in O(log(total_number_of_records)) time in the worst case (when a record is the new min / max). Realistically, that log term is bounded by 16 (it's a 16-ary heap and we don't support more than 2^64 records). Computing the min / max this way is substantially better than having to recompute with a linear scan. This post [1] provides a lot more detail on how we compute reductions in Materialize.
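
To give a rough picture of the shape of that structure, here is a toy Python sketch - not our actual implementation, which is a dataflow over streams of (record, diff) changes rather than an in-memory tree. Each layer stores the min of groups of 16 entries from the layer below, so changing one value only re-reduces the groups on its path to the root.

    # Toy sketch of a hierarchical ("reduction tree") MIN aggregate, not
    # Materialize's actual code. Each layer above the data stores the min of
    # groups of FANOUT entries from the layer below, so changing one value
    # re-reduces only the groups on its root path: O(log_FANOUT(n)) work
    # instead of a full rescan.

    FANOUT = 16

    class MinReductionTree:
        def __init__(self, values):
            if not values:
                raise ValueError("sketch assumes a non-empty input")
            self.layers = [list(values)]  # layers[0] holds the raw values
            while len(self.layers[-1]) > 1:
                below = self.layers[-1]
                self.layers.append(
                    [min(below[i:i + FANOUT]) for i in range(0, len(below), FANOUT)]
                )

        def update(self, index, new_value):
            """Change one value and re-reduce only the groups on its root path."""
            self.layers[0][index] = new_value
            pos = index
            for level in range(len(self.layers) - 1):
                group = pos // FANOUT
                lo = group * FANOUT
                self.layers[level + 1][group] = min(self.layers[level][lo:lo + FANOUT])
                pos = group

        def min(self):
            return self.layers[-1][0]

    tree = MinReductionTree([5, 3, 8, 1, 9])
    assert tree.min() == 1
    tree.update(3, 100)   # overwrite the old minimum; only one root path is re-reduced
    assert tree.min() == 3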

> there are obviously limits to what can be efficiently maintained

I think we fundamentally disagree here. In our view, we should be able to maintain every view either in linear time wrt the number of updates or sublinear time with respect to the overall dataset, and every case that doesn't do so is a bug. The underlying computational frameworks [2] we're using are designed for exactly that, so this isn't just a fantasy.
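
As a toy illustration of what "work proportional to the updates" means in the easy cases (again a sketch, not our code): a per-key SUM / COUNT can be maintained from a batch of (key, value, diff) changes by touching only the keys that appear in the batch, regardless of how large the underlying collection is.

    # Toy sketch: maintain SUM(value) and COUNT(*) per key from batches of
    # changes, where each change is (key, value, diff) and diff is +1 for an
    # insert or -1 for a delete. Work per batch is proportional to the number
    # of changes, independent of how many records the collection holds.
    from collections import defaultdict

    state = defaultdict(lambda: [0, 0])   # key -> [running sum, running count]

    def apply_batch(changes):
        """Apply a batch of (key, value, diff) changes; return the changed outputs."""
        touched = set()
        for key, value, diff in changes:
            agg = state[key]
            agg[0] += value * diff
            agg[1] += diff
            touched.add(key)
            if agg[1] == 0:                # the group emptied out: retract it
                del state[key]
        return {k: tuple(state[k]) if k in state else None for k in sorted(touched)}

    print(apply_batch([("a", 10, +1), ("a", 5, +1), ("b", 7, +1)]))
    # -> {'a': (15, 2), 'b': (7, 1)}
    print(apply_batch([("a", 5, -1)]))
    # -> {'a': (10, 1)}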

> if Materialize has a list of constraints shorter than SQL Server's then you're sitting on technology worth billions

Thank you! I certainly hope so!

[1]: https://materialize.com/robust-reductions-in-materialize/
[2]: https://github.com/timelydataflow/differential-dataflow/blob...


> In our view, we should be able to maintain every view either in linear time wrt the number of updates or sublinear time with respect to the overall dataset, and every case that doesn't do so is a bug.

This is awesome and I believe that should be technically possible for any query given the right data structure. The reduction tree works for min/max but is it a general solution or are there other data structures for other purposes - n per x and top subqueries come to mind. Is it all handled already or are there some limitations and a roadmap?


I'm not entirely sure what you mean by n per x, but if by top you mean something like "get top k records by group" then we support that. See [1] for more details. Top-k is actually also rendered with a heap-like dataflow.
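
Roughly, the shape is something like this sketch (my own illustration, not how we actually render it): per-group ordered state is updated only for the groups that show up in a change batch, and output is re-emitted only for groups whose top k actually changed.

    # Rough sketch of incrementally maintained "top K per group" (an
    # illustration, not Materialize's rendering, which is a hierarchy of
    # heap-like reductions in a dataflow). Only groups that appear in a
    # change batch are touched, and only groups whose top K actually
    # changed produce new output.
    import bisect
    from collections import defaultdict

    K = 3
    groups = defaultdict(list)   # group -> sorted (ascending) list of values
    current_topk = {}            # group -> last emitted top-K tuple

    def apply_batch(changes):
        """changes: list of (group, value, diff), diff is +1 (insert) or -1 (delete)."""
        touched = set()
        for group, value, diff in changes:
            vals = groups[group]
            if diff == +1:
                bisect.insort(vals, value)
            else:
                vals.pop(bisect.bisect_left(vals, value))   # assumes the value is present
            touched.add(group)
        output = {}
        for group in sorted(touched):
            new_topk = tuple(reversed(groups[group][-K:]))   # largest K, descending
            if current_topk.get(group) != new_topk:
                output[group] = new_topk
                current_topk[group] = new_topk
        return output

    print(apply_batch([("a", 5, +1), ("a", 9, +1), ("a", 1, +1), ("a", 7, +1), ("b", 4, +1)]))
    # -> {'a': (9, 7, 5), 'b': (4,)}
    print(apply_batch([("a", 1, -1)]))   # group "a"'s top 3 is unchanged, so nothing re-emits
    # -> {}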

When we plan queries, we render them into dataflow graphs that consist of one or more dataflow operators transforming data and sending it to other operators. Every single operator is designed to do work proportional to the number of changes in its inputs / outputs. For us, optimizing performance is a little bit less a matter of having the right data structures and more a matter of expressing things in a dataflow that can handle changes to its inputs robustly. But the robustness is more a question of "what are my constant factors when updating results" and not "is this being incrementally maintained or not".
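
A hand-wavy way to picture it (a toy sketch loosely in the spirit of differential dataflow, not its actual API): every operator consumes a batch of (record, diff) changes, updates whatever state it keeps, and forwards only the changes it produces to the next operator, so the work at each stage scales with the changes flowing through it.

    # Hand-wavy sketch of a two-operator dataflow maintaining "count of words
    # grouped by word length" over batches of (record, diff) changes. Each
    # operator only does work for the changes it receives and only forwards
    # the changes it produces, so downstream operators never rescan anything.
    from collections import Counter

    def map_operator(changes, f):
        """Stateless operator: rewrite each input change into an output change."""
        return [(f(record), diff) for record, diff in changes]

    class CountOperator:
        """Stateful operator: maintains a count per key and emits output changes
        as retractions of old counts plus assertions of new counts."""
        def __init__(self):
            self.counts = Counter()

        def step(self, changes):
            before = {key: self.counts[key] for key, _ in changes}
            for key, diff in changes:
                self.counts[key] += diff
            out = []
            for key, old in sorted(before.items()):
                new = self.counts[key]
                if old == new:
                    continue
                if old != 0:
                    out.append(((key, old), -1))   # retract the stale output row
                if new != 0:
                    out.append(((key, new), +1))   # assert the updated output row
            return out

    count_by_length = CountOperator()

    def dataflow(word_changes):
        # words --map(len)--> lengths --count--> (length, count) output changes
        return count_by_length.step(map_operator(word_changes, len))

    print(dataflow([("hi", +1), ("ox", +1), ("cat", +1)]))
    # -> [((2, 2), 1), ((3, 1), 1)]
    print(dataflow([("hi", -1)]))
    # -> [((2, 2), -1), ((2, 1), 1)]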

We have a known limitations page in our docs here [2], but it mostly covers things like incompleteness in our SQL support or Postgres compatibility. We published our roadmap in a blog post a few months ago here [3]. Beyond that, everything is public on GitHub [4].

[1]: https://materialize.com/docs/sql/idioms/
[2]: https://materialize.com/docs/known-limitations/
[3]: https://materialize.com/blog-roadmap/
[4]: https://github.com/MaterializeInc/materialize


Min and max work using a hierarchical reduction tree, the dataflow equivalent of a priority queue. They will update, under arbitrary changes to the input relation, in time proportional to the number of those changes.

> [...] it's hard for me to believe that your list of constraints is "there are none" especially when there are explicit-but-vague performance warnings in the docs.

I think we're done here. There's plenty to read if you are sincerely interested - it's all public, both to try and to read - but you'll need to find someone new to ask, ideally with a less adversarial communication style.


Sorry, I'm definitely overly pessimistic when it comes to new database tech - you'll find us industry-hardened RDBMS users hard to convince (we've been through a lot). Thanks for chatting!



