Hacker News new | past | comments | ask | show | jobs | submit login
Differential Datalog: a programming language for incremental computation (github.com/vmware)
170 points by jitl on Nov 8, 2022 | hide | past | favorite | 38 comments



Do note that the existing name in the CS literature for algorithms that can cheaply provide an "updated" result given a change in the input is dynamic, not differential. "Differential" actually refers to something else altogether, namely being able to automatically compute the gradient of some general calculation on continuous-valued inputs! (The intersection of differential and dynamic problems, where the point is to recompute answers quickly when some continuous-valued input changes, is called "kinetic".)

"Incremental" is actually a special case of dynamic, where all updates are restricted to 'adding' input data only. (Thus "incremental" algorithms, properly understood, are closely related to "online" algorithms, e.g. those which process input data in a streaming fashion as opposed to assuming a "complete" input to begin with.)


I figured the name was a pun on differential dataflow, and the readme seems to agree. Differential dataflow is a framework for doing "dynamic" computations by modelling them as dataflow with diffs of values rather than values flowing between nodes. The main user, materialize, has bags of tuples on most nodes and diffs are then tulles and the change to their counts. Differential data flow has a system called timely dataflow at its core and timely dataflow sends timestamp-like metadata with values to allow loops and for the output of the dynamic computation to be internally consistent


A.f.a.i.k the name is based on the underlying library and research on Dfferential Dataflow by F. McSherry https://www.microsoft.com/en-us/research/publication/differe... and https://timelydataflow.github.io/differential-dataflow/ ... I think the pitch is that this concept is enough improvement on incremental approaches it warrants a new name.

But I love these. From Functional Reactive Programming through propagator networks and embedding functional languages in datalog datafun-style, there seems to be this new-ish approach to computation that maybe somebody finally cleaves out of academia! The eve-lang guys were close but too ambitious. Maybe VMWare has less ambitious, more focused tool in mind.

Because datalogs are so cool! But only time I see them in practice are bespoke implementations for like ... compiler optimization tuning?


Differential here comes from differentiating map/reduce type multiset operators, which are then run/executed using Timely Dataflow.


What is the difference with https://github.com/rust-lang/datafrog? It’s a Datalog engine written by Frank McSherry on top of differential dataflow, that’s used here also


Datafrog doesn't use differential dataflow. Datafrog also requires you to do index management and joins manually, but it's very performant. Some of the join strategies are super optimized, but they don't allow for certain things that may be desirable, like negation/anti-join with a dynamic relation.


If you click around a little, you end up on a blog post with this tidbit:

> This project got put together rather suddenly, in response to some work the Rust folks are doing[1] on their new and improved borrow checker.

I don't think I could tell you more than "Frank wrote it to help rust folks who were previously doing work with differential-dataflow directly."

1. https://github.com/rust-lang/polonius/pull/36#issuecomment-3...


datafrog doesn't use differential-dataflow as far as I can tell.


One prior discussion:

https://news.ycombinator.com/item?id=26514456 - March 20, 2021 (62 comments)

A few other submissions, but all with 0 comments.


This seems like it would rock for building Language Server Protocol implementations or other similar things.

There's a tool in Rust called Salsa that's tailored for this, but it's galaxy-brained (I watched an hourlong introduction video and could still barely grasp it).

https://github.com/salsa-rs/salsa


Funny you say that, the Rust borrow checker tried to adopt differential-dataflow directly - the library behind this language - but Frank McSherry told them it's too complicated, let me build you an easier-to-use Datalog instead [blog]. So he built [datafrog] which is at the heart of the borrow checker today.

I'm not sure what the delta between Salsa and differential-dataflow is. My experience with differential-dataflow is that it's also galaxy brain.

[blog]: https://github.com/frankmcsherry/blog/blob/master/posts/2018...

[datafrog]: https://github.com/rust-lang/datafrog


> I watched an hourlong introduction video and could still barely grasp it

I can relate to that. What helped me to understand it though is looking through some projects using Salsa [1] and reading the document about how the Rust compiler uses queries [2]. You might also want to watch the discussion video with Anders Hejlsberg [3] that gives an overview of the problems that Adapton/Salsa try to solve.

[1] https://crates.io/crates/salsa/reverse_dependencies

[2] https://rustc-dev-guide.rust-lang.org/query.html

[3] https://learn.microsoft.com/en-gb/shows/seth-juarez/anders-h...


If you like this kind of thing, google Self-Adjusting Computation.

One advantage of self-adjusting computation vs. the differential dataflow approach is you can convert existing imperative code to it very easily. For example, a ray tracer with self-adjusting computation is written very similarly to a ray tracer without it.

There's a C++ library that implements parallel-friendly self-adjusting computation here: https://github.com/cmuparlay/psac

I see no reason why a Rust version couldn't be implemented.


Is "Self-Adjusting Computation" the Computer Science term for fine-grained automatic "reactive" subscription tracking like Mobx & Starbeam?

Is it the same idea as presented here? https://github-com.translate.goog/nin-jin/slides/blob/master...


I'm not familiar with that approach, but it appears to be in the same design space (but not the same design). I'd examine both if you have time.


Self adjusting computation is such a great framework.


Why wouldn't I just use differential dataflow, bindings to it or a higher level library using it in a language I know like Rust or Python?

I think differential dataflow is very interesting and could be very useful at scale for certain businesses but I just don't get the need for the lang.

And I'll plug for https://bytewax.io/ which is one company I know doing interesting things within the niche!

(I have no association to it besides having talked with the founder once upon a time.)


Learning DDLog was much easier for me than learning Rust and differential-dataflow. I tried writing some differential-dataflow a few years ago and it was beyond me. DDLog was very quick to pick up.


For everyone's context, differential datalog is built on the differential dataflow library.


Even more context:

SQL interface on Differential/Timely Dataflow = https://materialize.com/

Python interface on Differential/Timely Dataflow = https://bytewax.io/

Grossly oversimplified, but directionally correct.


Don’t know by looking at it if it is the related to or the same as my link below but had a class where the professor liked Datalog for some reason. Entire class found it hard to learn and not really useful, and not much more work to do similar things in other languages. Very niche but I know HN loves niche-y languages.

http://www.learndatalogtoday.org/

Maybe some useful scientific applications I am not aware of?


Compared to SQL, Datalog lets you break down queries into smaller queries that are reusable.

I think Datalog got some attention in the 1980s as a logically pure dialect of Prolog, then it practically disappeared from the literature for a while. I remember it being really hard to find anything about it around 2008, then when people got really tired of OWL and other semantic web tools Datalog came roaring back in the 2010s.

Like the semantic web tools people have a hard time recognizing bog-standard database technology when they have been slightly reskinned and Datalog has long had that problem. One issue is that SQL has this crazy idiosyncratic syntax that hides the fact that it has a simple set of operators behind it, in the case of Datalog the operators are quite directly exposed which people seem to find too simple to understand.


So is the functionality provided by something like SQLAlchemy attempting to cover the same use cases?


In some ways, but it is more dynamic, that is, you can write recursive queries in a natural syntax like the way you write recursive functions in a normal programming language. You don't have awful hacks like

https://www.postgresql.org/docs/current/queries-with.html


For the record, the thread which I think triggered this one to be posted and getting to the front page:

https://news.ycombinator.com/item?id=33518320

(On cozo, a new embedded SQLite like database with datalog as a language, written in Rust).


Tutorial which I didn’t see linked in the README: https://github.com/vmware/differential-datalog/blob/master/d...


It’s under “Documentation” in the readme.


One computational tool I want to see is arbitrary materialized views over computation

In databases views are really powerful how we can build virtual tables that are based on joins and other data.

The value comes when we can update the underlying data from the view. This is what I want for computation. I'm not sure if differential dataflow can provide this.


> I'm not sure if differential dataflow can provide this.

Yes, it can, but you will have to write the views yourself, in Rust.

Materialize (https://materialize.com) exists, though, and compiles SQL to differential-dataflow programs, in order to provide exactly what you're asking for.

(I work for Materialize)


Why does it have to be a new programming language? Why not a library/framework for an existing language?


As a general answer to that question, syntax is often under-valued.

We didnt get whole fields of mathematics until a clear syntax was developed; and likewise, what patterns of thinking a syntax affords makes a big difference to the designs one chooses.

The effect is bigger when writing not writing "glue code". The goal of threading data through libraries and applying simple business rules to it is a kind of problem best solved by existing mainstream syntaxes.

That doesn't mean, as is often implicitly assumed, that therefore all syntax is basically equivalent.


A DSL for differential-dataflow would be great. A big niche/yawning chasm for such a DSL to fill would be to allow for runtime dataflow programs (i.e., to have a parser/interpreter that then relies on differential-dataflow).


Looks awesome!

It has been a while since a new version has been released. Has anything been published regarding future plans? Or is the current version "feature complete" for the foreseeable future?


Update - I emailed one of the contributors, and it is in fact in maintenance mode:

> DDlog is now in maintenance mode, because we're working on a new streaming and [incremental computation framework][1] that will be more powerful in multiple ways. We still have customers using DDlog, so we fix issues as they come up, but not working on any new features.

[1]: https://github.com/vmware/database-stream-processor


I talked to some of the devs during HYTRADBOI (https://www.hytradboi.com/) in April and they seemed to be going strong - but in these trying times it’s hard to know.


Maybe I'm just clueless, but after looking at the README I have no idea what this software does. Can someone ELI5?


Lets you define queries over some data set declaratively, and instead of recomputing the query over the entire data set every time you want an updated answer, it uses Differential Dataflow <https://github.com/frankmcsherry/differential-dataflow> to efficiently(^1) calculate the new results by updating the results of the previous query execution in response to new updates to the data set.

^1: I'm not an expert on Differential Dataflow, so I don't know what "efficiently" means in this context, other than "should be faster than running the query from scratch."


it looks like the main application layer is a haskell code base, cool!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: