> I just want to say this is a very dangerous assumption to make.
I think we're actually arguing the same points here. It's not that every use case needs single-digit millisecond latencies! There are plenty of use cases that are satisfied by batch jobs running every hour or every night.
But when you do need real-time processing, the current infrastructure is insufficient. When you do need single-digit millisecond latency, running your batch jobs every second, or every millisecond, is computationally infeasible. What you need is a reactive, streaming infrastructure that's as powerful as your existing batch infrastructure. Existing streaming infrastructure requires you to make tradeoffs on consistency, computational expressiveness, or both; we're rapidly evolving Materialize so that you don't need to compromise on either point.
And once you have a streaming data warehouse in place for the use cases that really demand single-digit millisecond latencies, you might as well plug your analysts and data scientists into that same warehouse, so you're not maintaining two separate data warehouses. That's what we mean by ideal: not only does it work for the systems with real-time requirements, but it works just as well for the humans with looser requirements.
To give you an example, let me respond to this point directly:
> Secondly, almost all data is useless in its raw form. The analysts had to perform ELT jobs on their data in the warehouse to clean, dedupe, aggregate, and project their business rules on that data. These functions often require the database to scan over historical data to produce the new materializations of that data. So even if we could get the data in the warehouse in sub-minute latency, the jobs to transform that data ran every 5 minutes.
The idea is that you would have your analysts write these ETL pipelines directly in Materialize. If you can express the cleaning/de-duplication/aggregation/projection in SQL, Materialize can incrementally maintain it for you. I'm familiar with a fair few ETL pipelines that are just SQL, though there are some transformations that are awkward to express in SQL. Down the road we might expose something closer to the raw differential dataflow API [0] for power users.
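To make that concrete, a pipeline like the one described above might look something like this in Materialize's Postgres-flavored SQL (a minimal sketch; the source and column names are invented for illustration):

```sql
-- Hypothetical sketch: dedupe raw events by order_id (keeping the latest row),
-- then maintain a running aggregate per customer.
CREATE MATERIALIZED VIEW clean_orders AS
SELECT DISTINCT ON (order_id) order_id, customer_id, amount, updated_at
FROM raw_orders
ORDER BY order_id, updated_at DESC;

CREATE MATERIALIZED VIEW revenue_by_customer AS
SELECT customer_id, sum(amount) AS total_revenue, count(*) AS order_count
FROM clean_orders
GROUP BY customer_id;
```

As new rows arrive on raw_orders, both views are updated incrementally rather than recomputed from scratch.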
I think what might be really unique here, and what people aren't yet imagining, is the new class of applications made possible by <100ms updates on complex materialized views.
With sufficiently expressive SQL and UDF support, whole classes of stateful services that perform lookups, aggregations, and so on could be written as just views over streams of data. People who can model systems in SQL, but who aren't experts in writing distributed stateful streaming services, would basically be able to start deploying services.
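To sketch what I mean (the stream, table, and column names here are invented), a per-user lookup service could be little more than:

```sql
-- Hypothetical sketch: a "feature lookup service" as a view over a stream
-- of click events joined against a reference table of users.
CREATE MATERIALIZED VIEW user_click_features AS
SELECT u.user_id,
       u.plan,
       count(*)                  AS click_count,
       count(DISTINCT c.page_id) AS distinct_pages
FROM clicks AS c
JOIN users  AS u ON u.user_id = c.user_id
GROUP BY u.user_id, u.plan;
```

A request handler would then just read the current row for a user, instead of someone hand-rolling a stateful streaming job to keep those counts fresh.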
Are there any plans to support partitioned window functions, particularly lag(), lead(), first(), last() OVER()? That would be remarkably powerful.
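To give a flavor of what I mean (table and column names invented), something like:

```sql
-- Hypothetical example of a partitioned window function:
-- the gap between consecutive events for each user.
SELECT user_id,
       event_time,
       event_time - lag(event_time) OVER (PARTITION BY user_id ORDER BY event_time)
           AS gap_since_previous_event
FROM events;
```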
Window functions are a particular favorite of mine, but we haven’t seen much customer demand for them yet, so they haven’t been officially scheduled on the roadmap. They require some finesse to support in a streaming system, as you have to reconstruct the potentially large window whenever you receive new data. Probably some interesting research to be done here, or at least some interesting blog posts from Frank.
Please feel free to file issues about any of these functions that you’d like to see support for! We especially love seeing sample queries from real pipelines.
I have a strong suspicion that bitemporalism makes a lot of these problems more tractable. The actual volumes of data are the same, but the all-or-nothing quality of windowing over very large data sets just to avoid missing anything that arrived late goes away.
Sort of -- the problem I see in the event time / processing time distinction is that it's about instants rather than intervals. There are a number of models and queries that are not reliably expressible with instants alone, unless you reinvent intervals with them.
For example, if I rely on "updated-at" and infer that whatever record has the latest updated-at is the "current" record, then I may create the illusion that there are no gaps in my facts. That may not be so.
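Concretely, the inference I'm describing is the usual "latest row wins" query (a sketch with invented names):

```sql
-- "Latest row wins": treat the newest updated_at per key as the current state.
-- This implicitly invents an interval -- from updated_at until now -- during
-- which the fact is assumed to hold, which may simply not be true.
SELECT DISTINCT ON (account_id) account_id, status, updated_at
FROM account_events
ORDER BY account_id, updated_at DESC;
```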
> For example, if I rely on "updated-at" and infer that whatever record has the latest updated-at is the "current" record, then I may create the illusion that there are no gaps in my facts. That may not be so.
I believe that notion is captured by timely's capabilities [0]. Your capability has a current time, and you can only produce records at times greater than or equal to that time. So you could produce a record at, say, t + 3, then t + 5, and then produce a record at t + 1. But not until you downgrade your capability past t + 5 (say, to t + 6) will the record at t + 5 be considered final; downgrading your capability is how you indicate that you have produced the correct and final set of facts for all times less than the new capability's time.
If your events can arrive out of order forever, then you have a problem, as you'll never be able to downgrade your capability because you'll never be willing to mark a time as "final." That's where bitemporalism (as mentioned in that issue I linked previously) comes into play. You can mark a result as final as of some processing time, and then issue corrections as of some processing time in the future if some out-of-order data arrives. Materialize will (likely) gain support for bitemporalism eventually, and the underlying dataflow engine supports arbitrary-dimension timestamps already.
Would be happy to chat about this more, if you're curious, but I feel like this discussion is getting a bit unwieldy for an HN thread! (At the very least I might need to put you in touch with Frank.) Feel free to reach out on GitHub [1] or our Gitter [2], or shoot me an email at benesch@materialize.io.
The underlying compute framework, differential dataflow, supports multi-temporal timestamps. The Crux folks at Juxt were at one point looking at it, though I'm not sure what they concluded.
This feels like the philosophical conclusion that Kafka Streams has reached, i.e. you don't have a strict watermark, and if you really want you can theoretically keep updating and retracting data forever, and build a pipeline that magically stays in sync.
Partially, in my understanding, but not fully. An advantage of bitemporalism that is hard to recreate is queries about past and future states of belief. "What do I believe is true today?" works well with accumulation and reaction and with standard normalised schemata.
"What do I believe I believed yesterday?" is slightly harder and needs additional information to be stored. You can rewind a stream and replay it up to the point of interest, but that can be quite slow.
"What did I believe today would be, last week?", "What is the history of my belief about X?", "have I ever believed Y about X?" etc are much harder to answer quickly without full bitemporalism. So too the problem of having implicit intervals that are untrue, which is where "updated at" can be so misleading.
I think the average data team at a startup using something like Redshift/BigQuery/Snowflake uses window functions quite extensively when writing analytical queries and building data pipelines, so I'm surprised to hear you haven't seen much customer demand for them.
If you're trying to wholesale replace a "traditional" batch-oriented data warehouse like the ones I mentioned above, I think building support for window functions would be essential.
To be clear, window functions are definitely on our radar! But you’d be surprised how many folks are delighted just to have more basic SQL features, like fully functional joins. Streaming SQL is surprisingly far behind batch SQL.
I could see a use case for rapid automated A/B testing, using ML to react to performance metrics. Why have human editors on cnn.com when you can do story selection in real time based on view traffic?
That said, I hope we as technologists can find some better use cases than just stealing more of people's attention.
[0]: https://github.com/TimelyDataflow/differential-dataflow