Materialize: A Streaming Data Warehouse

luhn · on Feb 18, 2020

I didn't really understand what the product actually did after reading this blog post or the products page. I found the docs much more edifying:

> Materialize lets you ask questions about your data, and then get the answers in real time.

> Why not just use your database’s built-in functionality to perform these same computations? Because your database often acts as if it’s never been asked that question before, which means it can take a long time to come up with an answer, each and every time you pose the query.

> Materialize instead keeps the results of the queries and incrementally updates them as new data comes in. So, rather than recalculating the answer each time it’s asked, Materialize continually updates the answer and gives you the answer’s current state from memory.

> Importantly, Materialize supports incrementally updating a much broader set of views than is common in traditional databases (e.g. views over multi-way joins with complex aggregations), and can do incremental updates in the presence of arbitrary inserts, updates, and deletes in the input streams.

https://materialize.io/docs/

yomly · on Feb 19, 2020

This reminds me a lot about Noria DB. Wonder if anyone familiar with both can shed any further light?

benesch · on Feb 19, 2020

Indeed, Materialize is quite similar to Noria, and has the Frank McSherry stamp of awesomeness. [0] We know many of the Noria folks and have a lot of respect for them and their work. I also worked on the Noria project for a summer in college, and am a full-time engineer at Materialize now.

The biggest difference is one of intended use. Noria is, first and foremost, a research prototype, intended to explore new ideas in systems research. Materialize, by contrast, is intended to be a rock-solid piece of production infrastructure. (Much of the interesting research, in timely and differential dataflow, is already done.) We've invested a good bit in supporting the thornier bits of SQL, like full joins, nested subqueries, correlated subqueries, variable-precision decimals, and so on. Noria's support for SQL is less extensive; I think the decisions have been guided mostly by what's necessary to run its lobste.rs and HotCRP benchmarks. Make no mistake: Noria is an impressive piece of engineering, but, as an enterprise looking to deploy Noria, there's no one you can pay for support or to implement feature requests.

One area where Noria shines is in partial materialization. Details are in the Noria paper [1], but the tl;dr is that Noria has a lot of smarts around automatically materializing only the subset of the view that is actually accessed, while presently Materialize requires that you explicitly declare what subsets to materialize. We have some plans for how to bring these smarts to Materialize, but we haven't implemented them yet.

Also worth noting is that Materialize's underlying dataflow engine, differential dataflow, has the ability to support iterative computation, while Noria's engine requires an acyclic dataflow graph. We don't yet expose this power in Materialize, but will soon. Put another way: `WITH RECURSIVE` queries are a real and near possibility in Materialize, while (as I understand it) `WITH RECURSIVE` queries would require substantial retooling of Noria's underlying dataflow engine.

One of the creators of Noria, Jon Gjengset, did an interview on Noria [2] that covered some of differences between Noria and differential dataflow from his perspective, which I highly recommend you check out as well!

[0]: https://twitter.com/frankmcsherry/status/1056957760435376129...

[1]: https://jon.tsp.io/papers/osdi18-noria.pdf

[2]: https://notamonadtutorial.com/interview-with-norias-creator-...

benesch · on Feb 19, 2020

I forgot to mention: differential dataflow is capable of providing much stronger consistency guarantees than Noria is. Differential dataflow is consistency preserving—if your source of truth provides strict serializability, differential can also provide strict serializability—while Noria provides only eventual consistency.

Tapping into the consistency-preserving features in Materialize is a bit complicated at the moment, but we're actively working on improving the integration of our consistency infrastructure for MySQL and PostgreSQL upstreams.

choppaface · on Feb 19, 2020

Do you have an example of where one might use WITH RECURSIVE?

gbrgr · on Feb 22, 2020

A classical example is graph reachability. You'd express reachability via WITH RECURSIVE stating that nodes are reachable if there is a path of arbitrary length between them. (Recursion is needed since a plain SQL query can only query for paths up to a fixed length).

jadbox · on Feb 19, 2020

It also sounds a bit like InfluxDB to me..

Ar-Curunir · on Feb 19, 2020

That’s built on top of differential data flow, the same thing underlying Materialize

benesch · on Feb 19, 2020

It’s not, actually. Noria has its own custom dataflow engine.

Ar-Curunir · on Feb 19, 2020

oops, for some reason I thought that the paper talked about using differential dataflow, but it seems that I was mistaken.

pedalpete · on Feb 18, 2020

I had the same response. So reading what you've posted here, it appears to be a smart(er) cache.

Do you think it's valuable to have this as a service rather than roll your own solution?

jacques_chester · on Feb 18, 2020

I think the main step forward is that it has an efficient means for calculating what views change when a new datum arrives. Which means that it could, in theory, hold a very large amount of views for relatively low overhead.

It's a surprisingly difficult problem. I wouldn't roll my own.

benesch · on Feb 18, 2020

Yes, absolutely. Entire dissertations have been written on the topic: https://dash.harvard.edu/bitstream/handle/1/14226097/KATE-DI...

Efficiently maintaining views over arbitrarily complex computations is one of the two hard problems in computer science for a reason [0].

[0]: https://martinfowler.com/bliki/TwoHardThings.html

jacques_chester · on Feb 18, 2020

Never let HN haters (among whom I am frequently numbered, it ought to be noted) get you down.

After skimming a bit of the differential dataflow writing I am really impressed. This is deep computer science doing what it does best, which is to do much more with much less.

benesch · on Feb 18, 2020

I’m glad to hear it! One of my favorite ways to view Materialize is as bringing differential dataflow to the masses. Differential dataflow is deep, elegant stuff, but it requires some serious CS chops to grok.

SQL is lacking in elegance but abundant in popularity. Building the SQL translation layer has been a fun exercise in bridging the two worlds.

airstrike · on Feb 19, 2020

This comment reminded me of this classic: https://news.ycombinator.com/item?id=9224

Qwuke · on Feb 19, 2020

The differential dataflow codebase was really polished and optimized last I saw - when they say "demand milliseconds" I think they put the effort into delivering that.

Additionally, given it will be Apache licensed in 4 years, I think it'd be good to ask "Can I wait?" before going all in

slap_shot · on Feb 18, 2020

> We believe that streaming architectures are the only ones that can produce this ideal data infrastructure.

I just want to say this is a very dangerous assumption to make.

I run a company that helps our customers consolidate and transform data from virtually anywhere in their data warehouses. When we first started, the engineer in me made the same declaration, and I worked to get data into warehouses seconds after and event or record was generated in an origin system (website, app, database, salesforce, etc).

What I quickly learned was that analysts and data scientists simply didn't want or need this. Refreshing the data every five minutes in batches was more than sufficient.

Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data. So even if we could get the data in the warehouse in sub-minute latency, the jobs to transform that data ran every 5 minutes.

To be clear, I don't discount the need of telemetry and _some_ data to be actionable in a smaller time frame, I'm just weary of a data warehouse fulfilling that obligation.

In any event, I do think this direction is the future (an overwhelming amount of data sources allow change data capture almost immediately after an event occurs), I just don't think it's only architecture that can satisfy most analysts'/data scientists' needs today.

I would love to hear the use cases that your customers have that made Materialize a good fit!

Darkphibre · on Feb 18, 2020

I architected and implemented a true-realtime telemetry pipeline. The requirement was subsecond per-user aggregation and round-trip notification of thresholds exceeded. Took us a couple years, but when Halo 5 launched, we handled 2.5B events/hour without breaking a sweat (AMQP over Websockets). It's since been rolled out to multiple Microsoft 1st-party games.

The round-trip requirement was dropped before we launched, reducing the usage of the technology stack to pure telemetry gathering.

The analysts are all perfectly happy with 5-10 minute delays.

Link to my GDC talk, in case people are interested: https://www.youtube.com/watch?v=o098roxWAkA

benesch · on Feb 18, 2020

> I just want to say this is a very dangerous assumption to make.

I think we're actually arguing the same points here. It's not that every use case needs single-digit millisecond latencies! There are plenty of use cases that are satisfied by batch jobs running every hour or every night.

But when you do need real-time processing, the current infrastructure is insufficient. When you do need single-digit latency, running your batch jobs every second, or every millisecond, is computationally infeasible. What you need is a reactive, streaming infrastructure that's as powerful as your existing batch infrastructure. Existing streaming infrastructure requires you to make tradeoffs on consistency, computational expressiveness, or both; we're rapidly evolving Materialize so that you don't need to compromise on either point.

And once you have streaming data warehouse in place for the use cases that really demand the single-digit latencies, you might as well plug your analysts and data scientists into that same warehouse, so you're not maintaining two separate data warehouses. That's what we mean by ideal: not only does it work for the systems with real-time requirements, but it works just as well for the humans with looser requirements.

To give you an example, let me respond to this point directly:

> Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data. So even if we could get the data in the warehouse in sub-minute latency, the jobs to transform that data ran every 5 minutes.

The idea is that you would have your analysts write these ETL pipelines directly in Materialize. If you can express the cleaning/de-duplication/aggregation/projection in SQL, Materialize can incrementally maintain it for you. I'm familiar with a fair few ETL pipelines that are just SQL, though there are some transformations that are awkward to express in SQL. Down the road we might expose something closer to the raw differential dataflow API [0] for power users.

[0]: https://github.com/TimelyDataflow/differential-dataflow

vhold · on Feb 19, 2020

I think what might be really unique here that people aren't imagining, are the new possible applications of having <100ms updates on complex materialized views.

With sufficiently expressive SQL and UDF support there are whole classes of stateful services that are performing lookups, aggregations, etc, that could be written as just views on streams of data. Experts who model systems in SQL, but aren't experts in writing distributed stateful streaming services would basically be able to start deploying services.

Are there any plans to support partitioned window functions, particularly lag(),lead(),first(),last() OVER() ? That would be remarkably powerful.

benesch · on Feb 19, 2020

I agree wholeheartedly with your take!

Window functions are a particular favorite of mine, but we haven’t seen much customer demand for them yet, so they haven’t been officially scheduled on the roadmap. They require some finesse to support in a streaming system, as you have to reconstruct the potentially large window whenever you receive new data. Probably some interesting research to be done here, or at least some interesting blog posts from Frank.

Please feel free to file issues about any of these functions that you’d like to see support for! We especially love seeing sample queries from real pipelines.

jacques_chester · on Feb 19, 2020

I have a strong suspicion that bitemporalism makes a lot of these problems less problematic. The actual volumes of data are the same, but the all-or-nothingness of windowing over very large data sets in order to avoid missing anything that arrived late goes away.

I wrote shambolic stream-of-consciousness notes on it several years ago: https://docs.google.com/document/d/1ZlPp099_fV1lyYWACSyuWY_j...

The gist being that the mechanisms of windowing, triggering and retraction a la Beam are actually workarounds for a lack of bitemporalism.

benesch · on Feb 19, 2020

We’re thinking along very similar lines! We’ve got some of our thoughts around evolving how timestamps work in Materialize written down here: https://github.com/MaterializeInc/materialize/issues/1309

jacques_chester · on Feb 19, 2020

Sort of -- the problem I see in the event time / processing time distinction is that it's about instants rather than intervals. There are a number of models and queries that are not reliably expressible with instants alone, unless you reinvent intervals with them.

For example, if I rely on "updated-at" and infer that whatever record has the latest updated-at is the "current" record, then I may create the illusion that there are no gaps in my facts. That may not be so.

A reference system to look at is Crux: https://opencrux.com/

benesch · on Feb 19, 2020

> For example, if I rely on "updated-at" and infer that whatever record has the latest updated-at is the "current" record, then I may create the illusion that there are no gaps in my facts. That may not be so.

I believe that notion is captured by timely's capabilities [0]. Your capability has a current time, and you can only produce records at or greater than the current time. So you could produce a record at, say, t + 3, then t + 5, and then produce a record at t + 1. But not until you downgrade your capability to t + 6 will the record at t + 5 be considered final; downgrading your capability is how you indicate that you have the correct and final set of facts for all times less than t.

If your events can arrive out of order forever, then you have a problem, as you'll never be able to downgrade your capability because you'll never be willing to mark a time as "final." That's where bitemporalism (as mentioned in that issue I linked previously) comes into play. You can mark a result as final as of some processing time, and then issue corrections as of some processing time in the future if some out-of-order data arrives. Materialize will (likely) gain support for bitemporalism eventually, and the underlying dataflow engine supports arbitrary-dimension timestamps already.

Would be happy to chat about this more, if you're curious, but I feel like this discussion is getting a bit unwieldy for an HN thread! (At the very least I might need to put you in touch with Frank.) Feel free to reach out on GitHub [1] or our Gitter [2], or shoot me an email at benesch@materialize.io.

[0]: https://docs.rs/timely/0.11.1/timely/dataflow/operators/stru...

[1]: https://github.com/MaterializeInc/materialize/issues

[2]: https://gitter.im/MaterializeInc/community

frankmcsherry · on Feb 19, 2020

The underlying compute framework, differential dataflow, supports multi-temporal timestamps. The Crux folks were at one point looking at it for Juxt, though not sure what they concluded.

thom · on Feb 19, 2020

This feels like the philosophical conclusion that Kafka Streams has made, i.e. you don't have a strict watermark, and if you really want you can theoretically keep updating and retracting data forever, and build a pipeline that magically stays in sync.

jacques_chester · on Feb 19, 2020

Partially, in my understanding, but not fully. An advantage of bitemporalism that is hard to recreate is queries about past and future states of belief. "What do I believe is true today?" works well with accumulation and reaction and with standard normalised schemata.

"What do I believe I believed yesterday?" is slightly harder and needs additional information to be stored. You can rewind a stream and replay it up to the point of interest, but that can be quite slow.

"What did I believe today would be, last week?", "What is the history of my belief about X?", "have I ever believed Y about X?" etc are much harder to answer quickly without full bitemporalism. So too the problem of having implicit intervals that are untrue, which is where "updated at" can be so misleading.

chimychang77 · on Feb 19, 2020

I think the average data team at a startup using something like Redshift/BigQuery/Snowflake is using window functions quite extensively when writing analytical queries/building data pipelines so I'm surprised to hear you haven't see much customer demand for them.

If you trying to wholesale replace a "traditional" batch oriented data warehouse like the ones I mentioned above I think building support for window functions would be essential.

benesch · on Feb 19, 2020

To be clear, window functions are definitely on our radar! But you’d be surprised how many folks are delighted just to have more basic SQL features, like fully functional joins. Streaming SQL is surprisingly far behind batch SQL.

taurath · on Feb 19, 2020

I could see a usecase for rapid automated A/B testing, using ML to react to performance metrics. Why have human editors on cnn.com when you can in real time do story selection based on view traffic?

That said, hope we as technologists can find some better use cases than just stealing more of people’s a attention.

thom · on Feb 19, 2020

I work in the sports stats industry and we've been talking to Materialize for a while. We have a lot of demand for real time data, especially from the media and gambling sectors. The SLAs are mostly in seconds not milliseconds, so we currently don't have to wring every last ounce of latency out of our pipeline, but I'm very excited about the product because it offers a good balance of performance and developer productivity. I can even imagine a universe in which we directly offer the capability to customers to be able to build their own real-time KPIs in a language that is fairly accessible and easy to hire for.

derefr · on Feb 18, 2020

> Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data.

The point of Materialize, from my understanding, is that you don't put things into the data warehouse and then, as a separate step, run these enrichment/reporting jobs on it.

Instead, you register persistent, stateful enrichment "streaming jobs" (i.e. incrementally-materialized views) into the data warehouse; and then, when data comes into a table upstream of these views, it gets streamed into and through the related job to incrementally populate the matview.

I believe you end up with a consistent MVCC snapshot between the source data and its dependent inc-matviews, where you can never see new source data and old derived data in the same query; sort of as if the inc-matview were being updated by a synchronous ON INSERT trigger on the original raw-data ingest transaction. (Or—closer analogy—sort of like the raw table has a computed-expression index visible as a standalone relation.)

Also, the job itself is written as a regular OLAP query over a fixed-snapshot dataset, as a data-scientist would expect; but this gets compiled by the query planner into a continuous CTE/window query sorta thing, that spits out running totals/averages/etc. as it gets fed incremental row-batches—but only at the granularity that consumers are demanding from it, either by directly querying the inc-matview, or by sourcing it in their own streaming jobs. Which, of course, you do by just using a regular OLAP query on inc-matview X in your definition of inc-matview Y.

> I would love to hear the use cases that your customers have that made Materialize a good fit!

We haven't used Materialize yet, but it's a very promising fit for our use-case: high-level financial-instrument analytics on blockchain data. We need to take realtime-ish batches of raw transaction data and extract, in est, arbitrarily-formatted logs from arbitrary program-runs embedded in the transaction data (using heavy OLAP queries); recognize patterns in those that correspond to abstract financial events (more heavy OLAP queries); enrich those pattern-events with things like fiscal unit conversions; and then emit the enriched events as, among other things, realtime notifications, while also ending up with a data warehouse populated with both the high- and low-level events, amenable to arbitrary further OLAP querying by data scientists.

jamii · on Feb 19, 2020

> Instead, you register persistent, stateful enrichment "streaming jobs" (i.e. incrementally-materialized views) into the data warehouse; and then, when data comes into a table upstream of these views, it gets streamed into and through the related job to incrementally populate the matview.

This is correct. There is an example in the docs of dumping json records into a source and then using materialized views to normalize and query them - https://materialize.io/docs/demos/microservice/

je42 · on Feb 18, 2020

I think it would be helpful if you could dive deeper why you think " Refreshing the data every five minutes in batches" is "sufficient".

From my perspective: batching is more complicated, than batching. (Batching requires you to define parameters like batch size and interval, while streaming does not for example). But may be batching tools are simpler than streaming tools, but i am not so sure.

Batching in general has also high(er) latency. That's why I usually don't prefer it unless:

That said batching has an advantage over streaming, it can ammortise a cost that you only pay once per batch process. With streaming you would pay the cost for each items as it arrives.

Further, the mindset requirements for engineers that work with batching is different than for streaming.

Each of these items can be valid concern for batching vs streaming. However, I find it difficult to value statements like "Batching" is the default because the industry has been doing this for years by default.

I think the industry as a whole benefits when engineers in these kind of discussions repeat why certain conditions lead to a choice like batching.

TeMPOraL · on Feb 18, 2020

> I think it would be helpful if you could dive deeper why you think " Refreshing the data every five minutes in batches" is "sufficient".

Not OP, but I'm guessing because most of that data is not actionable in real-time. There's zero point to get real-time data to analysts or decision makers if they're not going to use it to make real-time decisions; arguably, it can be even counterproductive, leading to an organizational ADHD, where people fret over minute-to-minute changes, where they should be focusing on daily or monthly running averages.

jacques_chester · on Feb 18, 2020

While they focus on the very-fast-updates thing, I think their technology will apply to batch cases also. In either of streaming or batching I want to do the least possible work, their claim is that they can skip a lot of unnecessary computations automatically.

That said, I find that batch systems have enormous inertia due to simple don't-touch-it syndrome. A report got developed in 1992 for a manager who retired in 1998 and died in 2009. Each night it churns through 4 billion records in a twelve-way join that costs tens of thousands of dollars of computing time per year.

Who reads this report? Nobody. In fact, the person who asked for it read it two or three times and then stopped. But it's landed reliably in an FTP folder for 28 years and by god nobody is game to find out whether the CEO reads it religiously.

chrisjc · on Feb 18, 2020

Would it be fair to say this is a more OLAP-oriented approach to what KSqlDB (not KSql, but https://ksqldb.io/) does?

Seems that it's perhaps lacking the richness of how ksqldb uses Kafka Connectors (sinks and sources), but I don't see any reason you couldn't use Materialize in conjunction with ksqldb.

Eg:

KC-source --> ksql --> materialize --> kafka --> KC-sink

Question to Materialize...

What connectors (sinks and sources) do you have or plan to develop? Seems like it's mostly Kafka in and out at the moment.

Why would I use this over KSqlDB?

Can I snapshot and resume from the stream? Or do I need to rehydrate to re-establish state?

benesch · on Feb 18, 2020

> Would it be fair to say this is a more OLAP-oriented approach to what KSqlDB (not KSql, but https://ksqldb.io/) does?

I'm not sure I'd say it's "more OLAP." ksqlDB is about as OLAP as it gets, considering it doesn't support any sort of transactions or consistency. We think Materialize is quite a bit more powerful than what ksqlDB offers, thanks to the underlying technologies (timely/differential). For example, our joins are proper SQL joins, and don't require you to reason about the complicated and confusing difference between a stream and a table (https://docs.ksqldb.io/en/latest/developer-guide/joins/join-...). We also have preliminary support for maintaining the consistency properties of upstream OLTP data sources, and we'll be rolling out a more complete story here shortly.

> Seems that it's perhaps lacking the richness of how ksqldb uses Kafka Connectors (sinks and sources), but I don't see any reason you couldn't use Materialize in conjunction with ksqldb.

Is there something in particular about ksqlDB connectors that we don't seem to support? Our CREATE SOURCE command is quite powerful: https://materialize.io/docs/sql/create-source/.

> What connectors (sinks and sources) do you have or plan to develop? Seems like it's mostly Kafka in and out at the moment.

We already support file sources in a variety of formats, and support for Amazon Kinesis is on the short-term roadmap: https://github.com/MaterializeInc/materialize/issues/1239.

Are there other connector types you'd like to see?

> Can I snapshot and resume from the stream? Or do I need to rehydrate to re-establish state?

At the moment you can't snapshot and resume, but support for this is planned.

chrisjc · on Feb 19, 2020

Thanks for the detailed response. At this point i think the onus is on me to go and take a deeper look into timely/differential.

Important "sources" to me are obviously Kafka, but also MySQL and Mongo. Important "sinks" would be Snowflake (maybe through S3, or directly though PUT) and ElasticSearch. Although I imagine you might soon be telling me that you don't need a data warehouse once you have Materialize :)

dswalter · on Feb 18, 2020

Presumably one reason to use this is latency: materialize is built on differential and timely dataflow, some innovative frameworks (by Frank McSherry, who was at Microsoft Silicon Valley Research back in the day and is elsewhere on this HN discussion) that are intended to reduce the amount of computation in certain kinds of calculations. Materialized views are particularly ripe for those advances.

It's also written in rust instead of Java, so there's no JVM RAM penalty or GC to contend with.

drej · on Feb 19, 2020

I really like the pg protocol (like e.g. Cockroach), it let me use my usual tools. There are a few things I noticed:

1. It has a fairly rich support for types - these new-ish SQL engines often lack quite a lot of things, but this seems pretty decent. 2. I don't see any comparisons to KSQL, which seems to be the primary competitor. 3. Read the license. Read it carefully. It has a weird "will become open source in four years" clause, so keep that in mind. It also disallows it being hosted for clients to use (esentially as a DBaaS).

jamii · on Feb 19, 2020

The main difference vs KSQL is that we support standard SQL (roughly SQL92 + correlated subqueries + json so far) and provide strong consistency:

KSQL has a distinction between streams and tables, effectively giving you control over how views are materialized but also forcing you to do that work yourself. In Materialize you just write the same SQL that you would for a batch job and the planner figures out how to transform it into a streaming dataflow.

KSQL is also eventually consistent - late arriving records might cause changes to previously emitted answers. Materialize won't give you an answer for a given timestamp until it can guarantee that the answer is correct and won't be retroactively updated later.

Expect to see some performance comparisons soon too.

irfansharif · on Feb 18, 2020

(Linked in the post but) github repo: https://github.com/MaterializeInc/materialize

samuell · on Feb 18, 2020

For anyone interested in the details behind all of this, you should check out Frank's blog:

https://github.com/frankmcsherry/blog

kiwicopple · on Feb 19, 2020

For anyone that might be considering trying something similar with their own Postgres database (PG10+), we recently opensourced this: https://github.com/supabase/realtime

It's an Elixir (Phoenix) server that listens to PostgreSQL's native replication, transforms it into JSON, then blasts it over websockets.

I see that Materialize are using Debezium, which will give you a similar result, just with connectors to Kafka etc

thom · on Feb 19, 2020

The magic here isn't that it's giving you streaming updates of a database, it's that it's making some of those updates ridiculously fast. The point is that if you have a materialized view that takes a long time to update (or just a stack of views that are slow to return results), you now have results instantly whenever changes come in. If you built a bunch of code to work around this and do clever invalidation of cached data, you get to throw all that away. The input/output into the system is less of an issue overall.

yayr · on Feb 19, 2020

I am curious about the physical storage. Is it purely in-memory or is there a disk persistency possible? Is there some kind of data compression applied or what are the memory needs of it? Is it a row or column based data persistence pattern?

gaogao · on Feb 19, 2020

The "you may not cluster any server instances of the Licensed Work together for one use" in the license is a fairly tricky clause. Under this clause, how would one run a fault-tolerant instance of Materialize?

benesch · on Feb 19, 2020

You would need to contact us to buy an enterprise license. (Or wait four years for the license to convert to Apache 2.)

streblo · on Feb 18, 2020

How does materialize compare in performance (especially ingress/egress latency) to other OLAP systems like Druid or ClickHouse? Would love to see some benchmarks.

jamii · on Feb 19, 2020

We're working on benchmarks, but I expect that at the moment Materialize will be slower for one-off queries but faster for frequently repeated queries.

grumblestumble · on Feb 19, 2020

This appears to not be fully open source: https://github.com/MaterializeInc/materialize/blob/master/LI...

solidangle · on Feb 18, 2020

> Blazing fast results

I highly doubt this, given that the query engine is interpreted and non-vectorized. Queries are 10x to a 100x slower on a simple query, and 100x to 1000x slower on a query with large aggregations and joins without compilation of vectorization.

> Full SQL Exploration

Except for window functions it seems. These actually matter to data analysts.

benesch · on Feb 18, 2020

Considerations are completely different in a streaming context. It’s not so much about how fast you can churn through terabytes of data; it’s more about how quickly you can turn around the incremental computation with each new datum. There’s some serious research behind this product, in timely and differential dataflow, and I’d encourage you to check out some of that research before making sweeping performance claims. Frank’s blog post on TPC-H is a good place to start: https://github.com/frankmcsherry/blog/blob/master/posts/2017...

We definitely have some performance engineering work to do in Materialize, but don’t let the lack of vectorization scare you off. It’s just not as important for a streaming engine.

edmundsauto · on Feb 18, 2020

It's one thing to be skeptical and ask for evidence of speed, another to dismiss them out of hand due to a casual review of their website. Or did I miss it that you tried it out and found it wanting?

I work in an org with > 100 data scientists. I bet that 50% have never used window functions. I would guess than fewer than 20% know how to write one.

Your intent in this comment is unclear, but if you were looking to provide actionable feedback, you might want to reconsider your tone. This project looks like a pretty impressive feat of applied CS theory to doing useful stuff.

solidangle · on Feb 19, 2020

FWIW, I do think this project is really cool. I should have taken a little bit more time to write this comment, as it's overly negative right now.

1290cc · on Feb 19, 2020

Pretty cool tech although I feel they may have missed the moment as AWS, Azure and GCP are becoming hypercompetitive to solve all things related to data/storage. Azure has been churning out major updates to its services and clearly taking inspiration from companies like Snowflake. AWS I think hesitated to compete with Snowflake as they were running on AWS anyway - win/win for them.

Snowflake had incredible timing as they hit the market just before CFO's and non-tech business leaders realized the cost and talent needed to pull off a datalake successfully was more than they'd like. Those that were sick of the management jumped to Snowflake fast and AWS/Azure never really responded until recently.

Awesome to see all the innovative takes on solving these extremely technical problems! I love it!

manigandham · on Feb 19, 2020

Google's data product suite seems to be the most advanced today with a completely no-ops approach and very solid primitives to work with.

Snowflake still doesn't an answer to streaming data (other than their fragile Kafka connector) and Azure Synapse still isn't publically available even months after their announcement.

AWS had a good headstart but they keep piling on more products and features that it's now a big mess and requires yet another tool (AWS Lake Formation) just to wire it all up.

cbdumas · on Feb 18, 2020

I'm not sure if it's a mistake or just some cheeky humor but the "Known Limitations" link on their Docs page returns a 404.

frankmcsherry · on Feb 18, 2020

It looks like a caching bug somewhere in the web pipeline. It isn't running on Materialize, sadly.

There are a pile of known limitations, and our goal is to be 100% clear about them.

polskibus · on Feb 18, 2020

Hi Frank! Congrats on the progress - I remember listening to you describing Materialized at SIGMOD'19 in the big hall - it sounded like magic back then, glad to see it come true :) I have a question now - can you subscribe to view updates somehow, instead of querying them on an ad-hoc basis?

frankmcsherry · on Feb 18, 2020

Yes, the term we're using is "Sink", dual to a Source.

    https://materialize.io/docs/sql/create-sink/

At the moment you can create Kafka sinks, and we'll write the change log out to that topic. They are still baking though, and need some hardening for failure cases.

polskibus · on Feb 18, 2020

Thanks. What happens when a DDL is run on the source table (add or remove column for instance) ? Does it break the entire flow?

frankmcsherry · on Feb 18, 2020

DDLs that alter schemas definitely cause things to catch on fire at the moment (Debezium writes them to a side-topic, which isn't obviously synchronized with the data topics). In principle `CREATE` and `DROP` statements are things we could pass through, but some work needs to get done to pull the right information out of Debezium at the right moments (and drop the right prefixes of the existing Kafka topics if you re-create a table with the same name).

benesch · on Feb 18, 2020

To add a bit more context to Frank's answer: adding a column to a table will work just fine today, but the data in that column won't propagate through to Materialize. You'll continue to see the latest inserts, updates, and deletes, though. The new column can be incorporated by re-creating the source, though this does require re-reading the entire table from the beginning at the moment.

Removing a nullable column will also work fine along the same lines: that column will simply be filled in with NULL in Materialize for all new data. Removing a non-nullable column is backwards-incompatible, of course, and so Materialize will be unable to process any additional events for that topic, and so you'd have to recreate the source and restart from the beginning.

jacques_chester · on Feb 18, 2020

> Debezium writes them to a side-topic, which isn't obviously synchronized with the data topics

That's unfortunate. It seems like it would be ideal to feed schema changes through the same topic, so that the ordering is defined without the usual distributed-system hilarity.

arjunnarayan · on Feb 18, 2020

Agreed! We plan to contribute to the Debezium project over the next months along these lines. We’ll have more to say on this shortly!

gunnarmorling · on Feb 21, 2020

Nice, looking forward to your contributions. Let me know if you'd like to discuss anything upfront. On a general note, have you considered to rely on the schema of change events themselves (inline JSON schema, or Avro schema in a registry)? Not all connectors provide the separate schema history topic, you won't have to parse DDL and the synchronization issue is avoided at all. Happy to discuss in more detail.

Disclaimer: I'm the Debezium lead.

manigandham · on Feb 19, 2020

Congrats on the launch, always nice to see new products.

This is an interesting mix between the (now obsolete) PipelineDB, TimescaleDB with continuous aggregates, Kafka and other message systems with KSQL/ksqlDB/KarelDB, stream processing engines like Spark, and typical RDBMS like SQL Server with materialized views.

The amount of research to support complex and layered queries definitely sets this apart.

mason55 · on Feb 19, 2020

Not sure how the featuresets compare but AWS is releasing materialized views for Redshift sometime soon and one of the things it will support is incremental refresh (assuming your view meets some criteria).

I'm sure Materialize is better at this since it's purpose-built but if you're on Redshift you can get at least some of the benefits of incremental materialize.

jerglingu · on Feb 19, 2020

Materialized views, the upcoming ability to query directly from an RDS transactional DB through Redshift, and finally true separation of compute vs storage with RA nodes IMO make Redshift the market leader by a huge margin now. I haven't actually tested RA nodes yet, but if performance is even just a fraction of legacy nodes then the competition is already dead. Redshift already is one of the best engines to optimize a query

chrisjc · on Feb 19, 2020

It's been a while since I've used Redshift, but isn't it still dependent on data coming in via a COPY from S3? Any sort of Redshift materialized view offering would depend on batches of data landing in an underlying table or tables. The closest service offering from AWS is probably using Kinesis analytics (or Flink on KA) using their flavor of streaming SQL to join Kinesis streams forming new ones.

mason55 · on Feb 19, 2020

With the introduction of Spectrum you can back a Redshift table with data in S3 directly. I'm not sure how that interacts with the materialized views though. Probably not supported yet but I would expect it to be eventually.

wiradikusuma · on Feb 19, 2020

Materialize connects directly to event stream processors (like Kafka) --- how about Pulsar? (Goggling doesn't yield anything useful, Materialize and Pulsar are both name of brands of other things)

chris_overseas · on Feb 19, 2020

Connecting Pulsar and Materialize is of interest to me too and something I might try when I find time to do so. Note that Pulsar does have a Kafka compatibility layer already[0], so it might just work out of the box. If you try this I'd be keen to hear how it goes.

[0] https://pulsar.apache.org/docs/en/adaptors-kafka/

EDIT: I don't think this adaptor will work after all, it works by replacing the Kafka Java client library with its own, so is only applicable to Java applications.

jamii · on Feb 19, 2020

Adding sources is pretty easy. We just need them to provide:

* timestamped changes eg "record foo was deleted at time 42"

* watermarks eg "I've now sent you all the changes up to time 42"

Here's the implementation of the kafka source - https://github.com/MaterializeInc/materialize/blob/master/sr.... (There's also a bit of code in the parser/planner to wire up the CREATE SOURCE syntax, but it's fairly trivial.)

simo7 · on Feb 19, 2020

I'm wondering how this technology could work for OLAP cubes.

An OLAP cube that is automatically & incrementally kept in sync with the changes in the source data sounds promising.

Is that a potential use case?

polskibus · on Feb 19, 2020

Looking back at the project, knowing what you know now, if you were to start again (but without obtained rust skills), would you go with rust again or pick another toolbox?

jamii · on Feb 19, 2020

> would you go with rust again

Absolutely. Even aside from safety, rust has so many quality of life improvements over C++ - algebraic data types, pattern matching, checked errors instead of exceptions, sane standard library, easy packaging, readable error messages, #[must_use], immutable utf8 strings, traits instead of templates, hygienic macros etc.

Other than compile times and editor tooling, which are being actively worked on, the only real pain point I can think of is the poor support for custom allocators.

As for other alternatives, I don't personally have strong opinions on go but I think it's notable that none of the ex-cockroachdb folks at materialize suggested using it instead of rust.

mnkmnk · on Feb 23, 2020

How does this compare to Spark structured streaming? That too allows writing SQL on streams and having the state update incrementally.

gbrits · on Feb 18, 2020

Is this similar to TimescaleDB's Continuous Aggregates? Interested in knowing the overlap / differences.

benesch · on Feb 19, 2020

Similar in concept, but much more powerful in execution. We can incrementally materialize practically any SQL 92 query, with the killer features being joins and correlated subqueries. I’m not super familiar, but TimescaleDB’s continuous aggregates (just on a cursory glance) don’t support stacked aggregates, for example: https://github.com/timescale/timescaledb/issues/1400

rl3 · on Feb 19, 2020

Their careers page says it's built in Rust.

That's always nice to see, since Rust jobs are somewhat rare.

gbrits · on Feb 19, 2020

Looks promising! Can materialized views be backfilled?

vhold · on Feb 19, 2020

According to the documentation at https://materialize.io/docs/overview/architecture/

> Streaming sources must receive all of their data from the stream itself; there is no way to “seed” a streaming source with static data. However, you can union streaming and file sources in views, which accomplishes a similar outcome.

What I find odd is that the documentation doesn't show UNION support in the SELECT documentation : https://materialize.io/docs/sql/select/

Perhaps just an oversight in the documentation?

benesch · on Feb 19, 2020

Good catch, thanks! I've filed an issue about this: https://github.com/MaterializeInc/materialize/issues/2056

Though it's not yet documented, we do support all of the SQL set operations [0]: UNION [ALL], INTERSECT [ALL], and EXCEPT [ALL].

[0]: https://www.postgresql.org/docs/current/queries-union.html

sashavingardt2 · on Feb 18, 2020

I'm intrigued by the product. The SQL examples on the product page are atrocious though ((((

programmarchy · on Feb 18, 2020

Isn't BigQuery real-time as well? For me, the wow factor is that you can host this yourself.

chrisjc · on Feb 19, 2020

I'm unfamiliar with BQ, but from what I understand BQ doesn't have any concept of streams... with maybe an exception for change-streams (change tracking) and that's very different and less powerful than what Materialize is able to do.

After all, in BQ you're only able to query what has been already has been ingested or is being digested and that is somewhat by definition not 'real-time'.

manigandham · on Feb 19, 2020

GCP has PubSub which is their streaming/messagebus product. Cloud Dataflow is their stream/batch processing engine (backed by Apache Beam).

There's a new feature called "Dataflow SQL" that lets you submit a declarative SQL query through the BQ UI as a job to Dataflow which can join both BigQuery tables and PubSub topics in the same job.

https://cloud.google.com/dataflow/docs/guides/sql/dataflow-s...

chrisjc · on Feb 19, 2020

Thank you for the correction. Sounds like a very slick solution bringing batch and streaming (the B and EAM obviously) together.

manigandham · on Feb 19, 2020

BigQuery does have streaming ingest and can pickup new rows immediately but you still have to run a query to get the latest results.

This system is more like BigQuery with materialized views that are constantly updated as new data arrives.

tschmidleithner · on Feb 18, 2020

See also https://news.ycombinator.com/item?id=22346915

dang · on Feb 18, 2020

This comment was briefly downvoted and flagged because the convention on HN is only to link to past threads that have comments. Otherwise users click on the link, don't find the discussion they expected, and get ornery.

Since you're new and likely didn't know this, I've restored your post. (And welcome!)

justlexi93 · on Feb 19, 2020

Clickhouse has materilized views and is free.

vhold · on Feb 19, 2020

I think the biggest difference is that Materialize can do any kind of SQL join on many tables at once. Clickhouse materialized views can only reference one table.

What I'd like to know is if that would enable basically implementing social networks as just 3 tables and one materialized view, and how it would scale and perform.

Users, Relationships, Post, and a Feed materialized view that simply joins them together with an index of user_id and post_timestamp.

As relationships and messages are created or deleted, the feed view is nearly instantly updated. The whole entire view service logic then is just one really fast query. "select user,post,post_timestamp from feed where user_id = current_user and post_timestamp <= last_page_post_timestamp order by post_timestamp desc limit page_size"

wrmsr · on Feb 19, 2020

In the real world it is inevitably a lot more than just those 3 tables - add 'groups', different types of groups, different privacy settings, different per-user feed preferences, experiments, and any number of other things which _can_ be expressed as pure, normalized, joined tables in a matview but make naive approaches a lot less likely to actually work in prod.

In my experience the most successful approach to this is a midpoint - you materialize/denormalize enough to feed your app endpoints and search engines but retain flexibility in searching those fat but instantly available docs, and relatedly you also don't always need to preemptively materialize absolutely everything in any particular view - see https://engineering.fb.com/data-infrastructure/dragon-a-dist... . Without being able to transparently operate on arbitrarily partially populated matviews you are locked into a self-defeating all-or-nothing system that is likely to culturally do more harm than good with its rigidity. Imagine for example if there were no 'caches', just a binary choice of precomputing everything ahead of time or recomputing everything every time. Neither extreme is sufficient for all cases and real applications are comprised of many different points on that spectrum.

vhold · on Feb 19, 2020

That's fair. It will be interesting to see what people do along those lines, creating various materialized views, joining them at query time, chaining materialized views, and I think most important to your point, creating new kinds of sinks for the updates.

Right now Kafka is the only sink ( https://materialize.io/docs/sql/create-sink/ ), but because it integrates with Confluent's schema registry, I'm guessing it should work well with many of Confluent Connect's sinks, ( https://docs.confluent.io/current/connect/managing/connector... ). Elasticsearch would be an especially useful sink connector I think to your point.

I haven't used any of these things together, so right now I'm totally speculating on the potential.

What I'm mostly envisioning is that there are a lot of smaller scale applications where the complexity of adding an activity feed just isn't worth it. But if you could implement a feature like that trivially it could be game changing.