I agree with the other commenter. Eventual consistency has always been roughly a...

virgilp · on July 14, 2020

> the basis that maintaining actual consistency would be too expensive/complex/slow, which is frequently the case.

Maintaining actual consistency is seldom more complex. The opposite is true: eventual consistency can lead to mind-boggling complexity, because it's very hard to reason about your guarantees anymore, even the "eventual correctness" guarantee. In practice it's more often than not a handwavy "yeah, it's probably correct in many cases, and if you find something wrong, we'll take it as a bug and fix it. Or at least claim to fix it, because you know, it might be hard to reproduce". Good enough for use cases like advertising, I guess.

Too expensive/slow is the typical reason for eventual consistency - but the whole point of materialize.io is to challenge this "too expensive/slow" assumption.

dominotw · on July 14, 2020

> but the whole point of materialize.io is to challenge this "too expensive/slow" assumption.

How exactly is it challenging it? Spanner is too expensive.

frankmcsherry · on July 14, 2020

By moving up the stack a bit (managing computation, rather than storage) we can provide consistency using techniques other than just using the guarantees provided by the storage itself. This isn't a new observation (Dryad/DryadLINQ made it wrt MapReduce/Hadoop, among other examples I'm sure) but that is where the trade-off lies.

If you instead implement low-latency systems where each step along a dataflow involves a round-trip through replicated highly available storage, Spanner if you like or even just Kafka, then 100% you might reasonably conclude that eventual consistency is the right call. This is roughly the situation that microservice implementors currently find themselves in. I don't think it is a great situation to be in, personally.

The value proposition with something like Materialize (and there are other options) is that you can get consistency and performance if you can express your computation as something more structured than imperative code that writes to and reads from storage. In our case, the "something" is SQL.

Hope that helps!

virgilp · on July 14, 2020

Hey - great work with materialize.io, I've always wanted to play more with it but life always got in the way so far :(

One question I have for you is whether it would be appropriate for processing where you need to iterate (think e.g. connected components in a graph, where you repeatedly broadcast the component ID to the neighboring nodes). Can this somehow be done with materialize's version of SQL? You can of course do looping with timely, but how do you do that with SQL?
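For readers unfamiliar with the iterative pattern being asked about, here is a minimal Python sketch of connected components by label propagation. It is only an illustration of the loop, not how timely or Materialize implements it; the function name `connected_components` and the edge-list input format are assumptions made for the example.

```python
def connected_components(edges):
    """Label every node with the smallest node ID in its component."""
    # Build an undirected adjacency map from the (a, b) edge pairs.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each node starts out as its own component.
    label = {node: node for node in neighbors}

    # Repeatedly "broadcast" labels to neighbors until nothing changes.
    changed = True
    while changed:
        changed = False
        for node, adjacent in neighbors.items():
            best = min([label[node]] + [label[n] for n in adjacent])
            if best < label[node]:
                label[node] = best
                changed = True
    return label
```

Calling `connected_components([(1, 2), (2, 3), (4, 5)])` labels nodes 1, 2 and 3 with `1`, and nodes 4 and 5 with `4`. The number of passes depends on the graph, which is why this kind of computation needs a loop (or recursion) rather than a fixed query plan.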

frankmcsherry · on July 14, 2020

In SQL you would most likely be directed to use `WITH RECURSIVE`, which is something we plan to do, but not yet.

It can be a bit gross to use WITH RECURSIVE, because there are often some constraints on the types of queries you can express (e.g. that the recursive body must conclude with a UNION/UNION ALL with some base case). Differential dataflow doesn't have that requirement, but we'll have to sort out whether we'll remove that requirement for Materialize, or impose the traditional constraints. There is a Chesterton's fence moment to have first.
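To make the "base case plus UNION" shape concrete, here is a hedged sketch using SQLite through Python's standard `sqlite3` module. This is SQLite's `WITH RECURSIVE`, not Materialize's (which, per the comment above, did not support it at the time); the table name `edges`, the CTE name `reach`, and the helper function are all made up for the example.

```python
import sqlite3


def components_via_recursive_cte(edges):
    """Compute a component label (smallest reachable node ID) per node."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (a INTEGER, b INTEGER)")
    # Store each edge in both directions so the graph is undirected.
    both_ways = [(a, b) for a, b in edges] + [(b, a) for a, b in edges]
    conn.executemany("INSERT INTO edges VALUES (?, ?)", both_ways)

    rows = conn.execute(
        """
        WITH RECURSIVE reach(src, dst) AS (
            -- Base case: every node reaches itself.
            SELECT a, a FROM edges
            -- UNION (not UNION ALL) drops duplicates, so cycles terminate.
            UNION
            -- Recursive step: extend each known path by one edge.
            SELECT reach.src, edges.b
            FROM reach JOIN edges ON edges.a = reach.dst
        )
        SELECT src, MIN(dst) FROM reach GROUP BY src ORDER BY src
        """
    ).fetchall()
    conn.close()
    return dict(rows)
```

Note how the query is forced into the "base case UNION recursive step" form, and how the `MIN` has to live outside the recursion. It also materializes every reachable pair, which is far more work than propagating a single label per node, and is one reason the traditional constraints can feel awkward for graph computations.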

Whether it ends up being "appropriate" or not will be a great thing to determine. I anticipate eating a lot of crow when it turns out to be lots slower than bespoke graph processors. :)

edit: Thanks, btw!

dustingetz · on July 14, 2020

enterprise is rapidly approaching a data quality crisis where they have all these data warehouses but the final analytic artifacts end up being garbage and unusable for data science ... you will be hearing a lot more of this in the 2020s

majormajor · on July 14, 2020

A lot of this isn't related to data processing tools at all, but is a sort of downstream effect of the predominant "bugs are cheap" mentality of today.

The fewer guarantees of correctness on your daily/weekly/whatever releases, the messier your downstream data is gonna be. Monday's data is partially missing due to a bug in the client; Tuesday's data is weird/nonrepresentative because of a server bug that caused 5% of sessions to get disconnected; Wednesday's data is good; Thursday's data is good but was a release day and the feature changed so it means different stuff...

kqr · on July 14, 2020

I'd argue that is a completely orthogonal problem. Businesses have extracted useful metrics out of their "eventually consistent" operations ever since operational research was invented.

That companies have collected more data than they can afford to process is a separate issue, I think.

delusional · on July 14, 2020

I don't think that has as much to do with eventual consistency as with the old school system design of "the UI is a database editor, here are your plaintext fields" that still permeates a lot of businesses today.

PeterCorless · on July 14, 2020

If the term "asynchronous consistency" were adopted, I wonder if people would grok it more easily.