Hacker News
Eventual Consistency isn’t for Streaming (materialize.io)
165 points by arjunnarayan on July 14, 2020 | 49 comments



I agree with the other commenter. Eventual consistency has always been roughly a synonym for "tactical lack of consistency." The reason this works is that inconsistency is, in many business domains, not such a big deal as we make it out to be. Most businesses are used to data lagging behind, documents being filed incorrectly, decisions being changed and half the documents referring to the old decision, to mention just a few possibilities. As long as everything is dated and there are corroborating versions of all facts, this can be untangled by experts in the few cases it really matters. Most of the time, it doesn't matter that much.

Eventual consistency is embracing this philosophy of a lack of consistency for computer systems too, on the basis that maintaining actual consistency would be too expensive/complex/slow, which is frequently the case.

This can, of course, in principle lead to ever-degrading consistency, and since you can't assume everything is consistent, you also cannot really verify consistency in any way other than heuristically, as another commenter suggested.

Eventual consistency is a design driven by practical needs. It is never a path to reach complete data purity.

And this applies both to streaming and batch tasks alike.


> the basis that maintaining actual consistency would be too expensive/complex/slow, which is frequently the case.

Maintaining actual consistency is seldom more complex; the opposite is true. Eventual consistency can lead to mind-boggling complexity, because it becomes very hard to reason about your guarantees, even the "eventual correctness" guarantee. In practice it's more often than not a handwavy "yeah, it's likely correct in many cases, and if you find something wrong, we'll take it as a bug and fix it. Or at least claim to fix it, because, you know, it might be hard to reproduce." Good enough for use cases like advertising, I guess.

Too expensive/slow is the typical reason for eventual consistency - but the whole point of materialize.io is to challenge this "too expensive/slow" assumption.


> but the whole point of materialize.io is to challenge this "too expensive/slow" assumption.

How exactly is it challenging it? Spanner is too expensive.


By moving up the stack a bit (managing computation, rather than storage) we can provide consistency using techniques other than just using the guarantees provided by the storage itself. This isn't a new observation (Dryad/DryadLINQ made it wrt MapReduce/Hadoop, among other examples I'm sure) but that is where the trade-off lies.

If you instead implement low-latency systems where each step along a dataflow involves a round-trip through replicated highly available storage, Spanner if you like or even just Kafka, then 100% you might reasonably conclude that eventual consistency is the right call. This is roughly the situation that microservice implementors currently find themselves in. I don't think it is a great situation to be in, personally.

The value proposition with something like Materialize (and there are other options) is that you can get consistency and performance if you can express your computation as something more structured than imperative code that writes to and reads from storage. In our case, the "something" is SQL.

Hope that helps!


Hey - great work with materialize.io, I've always wanted to play more with it but life always got in the way so far :(

One question I have for you is whether it would be appropriate for processing where you need to iterate (think e.g. connected components in a graph, where you repeatedly broadcast the component ID to the neighboring nodes). Can this somehow be done with Materialize's version of SQL? You can of course do looping with timely, but how do you do that with SQL?


In SQL you would most likely be directed to use `WITH RECURSIVE`, which is something we plan to do, but not yet.

It can be a bit gross to use WITH RECURSIVE, because there are often some constraints on the types of queries you can express (e.g. that the recursive body must conclude with a UNION/UNION ALL with some base case). Differential dataflow doesn't have that requirement, but we'll have to sort out whether we'll remove that requirement for Materialize, or impose the traditional constraints. There is a Chesterton's fence moment to have first.
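For illustration, here is a minimal query of the shape being discussed, run through Python's bundled sqlite3 rather than Materialize (which, per the comment above, doesn't support `WITH RECURSIVE` yet). It shows the traditional constraint: the recursive body is a base case combined with a recursive step via UNION. Table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src INTEGER, dst INTEGER);
    INSERT INTO edges VALUES (1, 2), (2, 3), (4, 5);
""")

# All nodes reachable from node 1 -- a building block for connected components.
rows = conn.execute("""
    WITH RECURSIVE reachable(node) AS (
        SELECT 1                                  -- base case
        UNION
        SELECT e.dst FROM edges e
        JOIN reachable r ON e.src = r.node        -- recursive step
    )
    SELECT node FROM reachable ORDER BY node
""").fetchall()
print([n for (n,) in rows])  # [1, 2, 3]
```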

Whether it ends up being "appropriate" or not will be a great thing to determine. I anticipate eating a lot of crow when it turns out to be lots slower than bespoke graph processors. :)

edit: Thanks, btw!


Enterprise is rapidly approaching a data quality crisis: they have all these data warehouses, but the final analytic artifacts end up being garbage and unusable for data science... you will be hearing a lot more of this in the 2020s.


A lot of this isn't related to data processing tools at all, but is a sort of downstream effect of the predominant "bugs are cheap" mentality of today.

The fewer guarantees of correctness on your daily/weekly/whatever releases, the messier your downstream data is gonna be. Monday's data is partially missing due to a bug in the client; Tuesday's data is weird/nonrepresentative because of a server bug that caused 5% of sessions to get disconnected; Wednesday's data is good; Thursday's data is good but was a release day and the feature changed so it means different stuff...


I'd argue that is a completely orthogonal problem. Businesses have extracted useful metrics out of their "eventually consistent" operations ever since operational research was invented.

That companies have collected more data than they can afford to process is a separate issue, I think.


I don't think that has as much to do with eventual consistency as with the old school system design of "the UI is a database editor, here are your plaintext fields" that still permeates a lot of businesses today.


If the term "asynchronous consistency" was adopted, I wonder if people would grok it easier.


This article isn't very convincing to me. I mean, I one hundred percent buy that eventually consistent stream processing systems can theoretically be subject to unbounded error. But eventual consistency isn't just a theoretical model. It's also a practical engineering decision, and so in order to evaluate its use for any given business purpose we have to see how it performs in practice. That is, what is the average/99.9%/max error? And we have to understand how business-critical the correct answer is. This article has some great examples of theoretical issues with eventually consistent stream processing computation, but it doesn't demonstrate that any real systems evince these problems under any given workload.


> Not all is lost! There are stream processing systems that provide strong consistency guarantees. Materialize and Differential Dataflow both avoid these classes of errors by providing always correct answers

Yeah, I was expecting to see what tradeoffs Materialize made to get an 'always correct' result. There is definitely something 'lost' for 'always correct' too.

I can only attribute this one-sided take to deviousness. Personally, I would avoid whatever this company is selling.


Click around through the rest of their blog if you're interested in what those tradeoffs are. Let's not spread FUD unnecessarily.


> Let's not spread FUD unnecessarily

Ok, sorry, that was an overreaction on my part.


Do you, like, work for Materialize? You're awfully connected to many of the folks there, and this is a pretty plant-like statement to make in the comments section for a piece by the company.


"Please don't post insinuations about astroturfing, shilling, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data."

https://news.ycombinator.com/newsguidelines.html

https://hn.algolia.com/?sort=byDate&dateRange=all&type=comme...


They're at Cockroach according to their about page.


For more concise and precise explanations of the rationale for these kinds of tools, see this paper: https://github.com/TimelyDataflow/differential-dataflow/raw/... -- here's the abstract:

> Existing computational models for processing continuously changing input data are unable to efficiently support iterative queries except in limited special cases. This makes it difficult to perform complex tasks, such as social-graph analysis on changing data at interactive timescales, which would greatly benefit those analyzing the behavior of services like Twitter. In this paper we introduce a new model called differential computation, which extends traditional incremental computation to allow arbitrarily nested iteration, and explain—with reference to a publicly available prototype system called Naiad—how differential computation can be efficiently implemented in the context of a declarative dataparallel dataflow language. The resulting system makes it easy to program previously intractable algorithms such as incrementally updated strongly connected components, and integrate them with data transformation operations to obtain practically relevant insights from real data streams.

See also this friendlier (and lengthier) online book: https://timelydataflow.github.io/differential-dataflow/


materialize.io is literally timely dataflow/differential dataflow... same product, developed by Frank McSherry. It's not "the other tool", it's the very same.


Corrected. Thanks!


I'm actually just fundamentally confused about what is being argued.

I'm familiar with streaming, as a concept, from the likes of Beam, Spark, Flink, Samza - they do computations over data, producing intermediate results consistent with the data seen so far. These results are, of course, not necessarily consistent with the larger world because there could be unprocessed or late events in a stream, but they are consistent with the part of the world seen so far.

The advantage of streaming is the ability to compute and expose intermediate snapshots of the world that don't rely on the stream closing (As many streams found in reality are not bounded, meaning intermediate results are the only realizable result set). These intermediate results can have value, but that depends on the problem statement.

To examine one of the examples, let's use example 2, this aligns with the idea that we actually don't have a traditional streaming problem. The question being asked is "What is the key which contains the maximum value". There is a difference between asking "What is the maximum so far today" and "What was the maximum result today" -- the tense change is important because in the former the user cares about the results as they exist in the present moment, whereas the other cares about a view of the world in a time frame that is complete. It seems like the idea of "consistent" is being conflated with "complete", wherein "complete" is not a guaranteed feature of an input stream.

Could anyone clarify why the examples here aren't just a case of expecting bounded vs. unbounded streams?


The argument is that without stronger consistency guarantees you can't do joins between two streams (or even something like argmax over a single stream, since it splits the stream into two subcomputations, which then have to be joined back together).

I think when folks say that eventual consistency is okay, they're thinking about simple aggregates - where transient incorrectness in the result is indistinguishable from noise.

But if you want to do joins, you really want to be able to reason about your unbounded streams causally - Flink, Beam, (and as another commenter points out, Firebase as well) provide stronger consistency guarantees on computations over unbounded streams.
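The argmax anomaly can be sketched in a few lines of plain Python (not any particular streaming system; the operator split is simulated with list prefixes). Two operators derived from the same stream, read at different offsets, join into an answer that was never true at any point in the stream's history.

```python
# (key, value) updates arriving on an append-only stream.
stream = [("a", 5), ("b", 9), ("a", 12)]

def max_value(prefix):
    """Operator 1: the maximum value seen in this prefix."""
    return max(v for _, v in prefix)

def key_of_max(prefix):
    """Operator 2: the key holding the maximum value in this prefix."""
    return max(prefix, key=lambda kv: kv[1])[0]

# Without a consistency guarantee, the two operators may have
# processed different prefixes when their outputs are joined.
lagging = stream[:2]   # operator 2 has seen two records
current = stream[:3]   # operator 1 has seen all three

answer = (key_of_max(lagging), max_value(current))
print(answer)  # ('b', 12): key "b" never held the value 12 in any prefix
```

At offset 2 the true answer is ("b", 9); at offset 3 it is ("a", 12). The joined result ("b", 12) corresponds to no offset at all, which is the "inaccurate for any time in history" failure mode discussed downthread.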


This, of course, presumes that people want to do JOINs. Wouldn't eventual consistency be mostly NoSQL, where joins are not an issue?


The issue being pointed out here isn't that the computation is out of sync with the outside world... it's that it's out of sync with _itself_! It will return answers that are not just stale, but inaccurate for any time in history.

This might still be fine, depending on your needs, but IMO a legitimate distinction.


In both examples 2 and 3, the author reads the same stream twice independently and assumes that a join is not synchronized between the transformed streams. This seems like a fundamental flaw in their offering.

Pushing a timestamp along with the max/variance change stream [1], and then using the timestamp to synchronize the join [2], would naturally produce a consistent output stream.

I cited Flink because they have the best docs around, but it should be possible in most streaming systems. Disclaimer: I used to work for the FB streaming group and have collaborated with the Flink team very briefly.

[1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/t...

[2] https://ci.apache.org/projects/flink/flink-docs-release-1.11...
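A hypothetical sketch of that synchronization in plain Python (this is not the Flink API; `timestamp_join` and the sample streams are invented for illustration). The join only emits outputs for logical timestamps both inputs have reported, so each output row reflects a single consistent moment.

```python
def timestamp_join(left, right):
    """Join two derived streams of (timestamp, value) pairs,
    emitting only timestamps present on both sides."""
    l = dict(left)
    r = dict(right)
    common = sorted(set(l) & set(r))
    return [(t, l[t], r[t]) for t in common]

maxes = [(1, 9), (2, 12)]          # (ts, running max) from one operator
variances = [(1, 4.0), (2, 6.5)]   # (ts, running variance) from another

print(timestamp_join(maxes, variances))
# [(1, 9, 4.0), (2, 12, 6.5)] -- each row pairs values from the same ts
```

A real system also needs watermarks to know when a timestamp is complete on both sides, but the core idea is the same: align on logical time before joining.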


The aim of the examples is to show what goes wrong in eventually consistent systems where it's possible that two reads of a stream may not be consistent with respect to each other. The examples are not intended to say that such anomalies can't be fixed by providing stronger consistency guarantees by using timestamps.


> you should be prepared for your results to be never-consistent

Isn't this a core feature of distributed systems? How can you be "consistent" if there's a network failure between some writer and the stream? How can you tell a network failure from a network delay? How can you tell a network delay from any other delay?

And finally, how can you even talk about "up-to-date" data if the reader doesn't provide their "date" (ie, a logical timestamp)?


> Isn't this a core feature of distributed systems? How can you be "consistent" if there's a network failure between some writer and the stream? How can you tell a network failure from a network delay? How can you tell a network delay from any other delay?

This is covered by the CAP theorem. https://en.wikipedia.org/wiki/CAP_theorem

The basic solution is: If you need consistency and there's too much network failure (or delay), you'll have to pause operations and wait until the network is fixed.

If there's only a bit of network failure (or delay), consistency stays possible using quorum protocols such as Paxos and Raft.
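The quorum intersection behind those protocols can be sketched with a few assertions (replica sets invented for illustration): with n replicas, any write quorum of size W and read quorum of size R overlap whenever W + R > n, so a read always contacts at least one replica holding the latest acknowledged write.

```python
n = 5       # replicas
W, R = 3, 3  # write and read quorum sizes

# Overlap condition: any write quorum and any read quorum share a replica.
assert W + R > n

write_quorum = {0, 1, 2}  # replicas that acknowledged the last write
read_quorum = {2, 3, 4}   # replicas contacted by a subsequent read

overlap = write_quorum & read_quorum
print(overlap)  # {2}: this replica carries the new value to the reader
```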

> how can you even talk about "up-to-date" data if the reader doesn't provide their "date" (ie, a logical timestamp)?

Implicit causality helps.

You're right that there may be no definite logical time, but it often doesn't matter.

When a program issues a read command, the logical timestamp is, implicitly, greater than the timestamp of all results previously received from the network that were inputs to produce the read command.

So the rest of the network "knows" something about the logical time of the read command. It's not an exact logical time, and if the timestamps aren't passed around, it might not even be an inequality. It's more like a logical property that relates dependent values.

If done right, that's enough to ensure strict consistency in observable results.

Unless the program issuing reads does wild things with value speculation. You may have heard how much things can go wrong with speculative execution...
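The implicit causality described above is essentially a Lamport clock. A minimal sketch (a textbook construction, not tied to any system in the thread): every message carries a timestamp, and a receiver advances its clock past anything it has seen, so a read issued after receiving a result is timestamped after the write that produced it.

```python
class Process:
    def __init__(self):
        self.clock = 0

    def send(self):
        # Sending is an event: tick the clock and stamp the message.
        self.clock += 1
        return self.clock

    def receive(self, msg_ts):
        # Advance past both our own clock and the message's timestamp.
        self.clock = max(self.clock, msg_ts) + 1
        return self.clock

a, b = Process(), Process()
t_send = a.send()           # a writes: timestamp 1
b.receive(t_send)           # b receives the result: clock becomes 2
t_read = b.send()           # b issues a read: timestamp 3
assert t_read > t_send      # the read is causally after the write
```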


I'm still trying to figure out how you avoid speed-of-light propagation delays to get immediate consistency.


There's been plenty of work in the past on weaker correctness guarantees for stream processing system (e.g. concepts like rollback and gap recovery from Aurora). Not sure it's an either/or between eventually consistent and strong consistency.


Side question - has anyone tried using Materialize beyond toy workloads? Can I move billions of rows off of a batch workflow on Snowflake onto Materialize and suddenly everything is near realtime?


I keep falling for these clickbait titles in the hopes I will find a fair argument. However, the moment I realize the article is trying to sell me a product based around an argument, I lose faith on the perspective of the writer.

If the title was something more honest such as “How product X solves for Y” I’d feel more compelled to put trust on the analysis being objective.


Firebase provides causal consistency. By subscribing to streams (listen), the client opts into which data sources it was consistent snapshots of, then all distinct client streams are bundled up and delivered in order over the wire. It's a very elegant model which does not get in the way and has nice ergonomics.


So, if I understand the article correctly: for purposes of realtime reporting/monitoring (streaming, as stated), eventual consistency is not an appropriate "store" to hook into, because you can't know when things have become consistent, and reliable streaming of (near?) realtime data requires some chance for that to occur.

is that a correct interpretation?


TL;DR: accessing materializations is necessarily a snapshot.

This article reads as though the author hadn't shifted mindset from "the database will solve it for me" to "I'm taking on the relevant subset of problems in my use case". This seems off given that they're trying to sell a streaming product. They claim their product avoids problems by offering "always correct" answers which requires a footnote at the very least but none was given.

Point of note: the consistency guarantee is that upon processing to the same offset in the log, given that you have taken no other non-constant input, you will have the same computational result as all other processes executing semantically equivalent code.

I take this sort of comment as abusive of the reader:

> What does a naive application of eventual consistency have to say about
>
>     -- count the records in `data`
>     select count(*) from data
>
> It’s not really clear, is it?

A naive application of eventual consistency declares that, along some equivalent of a Lamport timestamp across the offsets of shards in the stream, the system will calculate a count of records in `data` as of that offset. Given the ongoing transmission of events that can alter the set `data`, that value will continue changing as appropriate and in a manner consistent with the data it processes. New answers will be given when the query is run again, or the system may even issue an ongoing stream of updates to that value.
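A toy sketch of that reading in plain Python (the log and `count_at` are invented for illustration): at every offset of the log, `select count(*) from data` has a well-defined answer, and the answer is revised as the log grows.

```python
# An event log of inserts and deletes against the set `data`.
log = [("insert", "x"), ("insert", "y"), ("delete", "x"), ("insert", "z")]

def count_at(offset):
    """The count of live records as of a given log offset."""
    live = set()
    for op, key in log[:offset]:
        live.add(key) if op == "insert" else live.discard(key)
    return len(live)

print([count_at(i) for i in range(len(log) + 1)])  # [0, 1, 2, 1, 2]
```

Each offset plays the role of the Lamport timestamp in the comment above: the count is never "the" answer, but it is always the answer as of some definite point in the log.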

Maybe it got better as the article went on...


I appreciate that the downvote mechanism is low friction. I wish it were easier to learn and improve from it too.


I didn't downvote you either. But, in addition to what quodlibetor wrote:

This is a good article from a high-profile author. The way you are criticizing it comes across as ignorant and narcissistic.

"This article reads as though the author hadn't shifted mindset from..."

You are claiming that the author is looking at the problem in the wrong way, and you came to that conclusion before even reading the entire post. It's okay to not be interested in a topic, but stating this in public does not add anything of substance to the discussion.

"I take this sort of comment as abusive of the reader"

This is just offensive, you're basically saying that the comment is stupid. It seems like you are looking for some validation of your intelligence, by the article or by the comment section here.


Thank you very much.

I'm sorry for expressing ignorant perspectives and behaving narcissistically. My intention was not to call the author stupid or ask for validation of my intellect but I can see how that comes out - thank you for identifying that. I'm sorry and thank you for making it through my comment and being willing to point my failure out to me. I really appreciate it.

I'm actually quite interested in the topic of stream processing and eventual consistency. I was one of the contributors to an open source stream processing based project that won architectural awards at a reasonably big conference, spent a little time working in a machine learning oriented startup based on stream processing with some of the creators of Apache Beam/DataFlow, and now my daily labor is implementing stream processing in another startup. This area of systems design (event sourcing specifically) has been an a bit of professional obsession and labor of love for me over the last five years or so now.

I very clearly failed to express any of that in my post. I further clearly failed to manage my emotional reaction to the article and threw anything useful I may have had an opportunity to add under the bus of my words. I'm sorry for that too.

Again - Thank you very much for helping me grow. I keep working on being a better human and appreciate the help.


I didn't downvote you, but as an outside observer I can see that folks might take issue with the fact that you start with a tl;dr, suggesting that you are summarizing the whole article, followed by an in-depth analysis, and then end by stating that you didn't read the whole thing.


Thank you very much. I was really excited to read the article and became very disappointed but my take was too hot and insufficient. I definitely should have done better. Thank you for expanding my perspective.


Almost every distributed system (including "simple" client-server systems) is eventually consistent. And all systems are distributed.

It's great that your DB is ACID and anyone who queries it gets the latest greatest but in reality you also have out of date caches, ORM models that haven't been persisted, apps where users modifying data that hasn't been pushed back to the server and a million other examples.

I'm sure it's possible to create a consistent system but I'm also sure it's not practical. No one does it.

Instead of constantly fighting eventual consistency just learn to embrace it and its shortcomings. Design systems and write code that are resilient to splits in HEAD and provide easy methods to merge back to a single truth.


There is a huge difference between having an ACID store of Truth surrounded by eventual consistency, vs making even your store of Truth eventually consistent. You're basically doubling or tripling your work for any given constraint because you have to both monitor after-the-fact violations and build in a way to resolve those violations.

This is on top of regular "nope, can't do that" code that you would write in both systems.


It's just the opposite, in my experience. If you have an ACID database that's supposed to represent the current state of the world, you have to handle both transaction rollbacks and logical inconsistencies. If you have a streaming system where you record an event log and generate the current state of the world from that, you already have the logic to recover from inconsistencies and can reuse it.


> I'm sure it's possible to create a consistent system but I'm also sure it's not practical. No one does it.

Billions of dollars flow through fully consistent systems every day. The basic IT concept for smaller hedge funds is "buy the biggest MSSQL machine available on the planet and move on." The big ones have custom frameworks that resemble Frank's arguments here, though the abstractions are different.


> The basic IT concept for smaller hedge funds is "buy the biggest MSSQL machine available on the planet and move on."

And the result is exactly what grandparent was complaining about: sure your database server is full ACID, but a trader is looking at numbers on their screen that are out of date (and pressing the trade button on that basis), and that's what actually matters.


> No one does it.

Oh, some people do. I used this EXACT phrase when I came in to fix an analytics system at a healthcare company that was plagued with analytics problems. They had 5 senior engineers, fulltime, working on this system for years. It had persistent problems and could not be modified in any meaningful way. Upstream systems sent data through multiple SQS topics (duplicate and out-of-order data), fed into lambda, fed into a giant cache-db which tried to catch dupes and order data, fed into files, processed in batch. It was a horror show in complexity and cost (despite the near-free lambdas). A distributed set of large data streams was feeding into a singular database, which was processed, multiple times, and put back in the same database. What's billions of inserts into an Amazon Postgres db, per hour? The company cloud infrastructure gave 0 other tools to work with. I shored up the batch processing (which had all kinds of try catch everywhere, despite a fixed schema) and went on to another company. Medical company software is always a ball of fail.


> I shored up the batch processing (which had all kinds of try catch everywhere, despite a fixed schema) and went on to another company

I was hoping for a happier ending there. What could have potentially fixed their situation?


My company is an automated insurance broking service which issues about 4% of all UK motor insurance policies (by number, not value).

We have consistency across our distributed system (~75 services currently) for all the fundamentals of our business. It is not difficult to do at all.



