From the programming language point of view, what's interesting to me is that this means, I think, that there is a differentiation monad. This arises because continuations and monads are closely related (delimited continuations can express any monad, and continuations themselves form a monad), so anything you can express with continuations should also be expressible as a monad.
I haven't seen this discussed before, except in the context of category theory (https://ncatlab.org/nlab/show/differentiation). I'm not sure how this ties into the programming language point of view.
If you speak Scala, here's code for reverse mode differentiation as a monad. The flatMap / bind operation becomes the chain rule.
If this "explanation" (scare quotes because it's not really an explanation, just a code dump) doesn't work for you let me know and I can try a different approach.
final case class AD[A](v: A, k: Double => Double) {
  // Chain rule: feed the downstream sensitivity through this
  // computation's backpropagator.
  def flatMap[B](f: A => AD[B]): AD[B] = {
    val next = f(v)
    AD(next.v, (d: Double) => this.k(next.k(d)))
  }

  def +(that: AD[Double])(implicit ev: A =:= Double): AD[Double] =
    AD(this.v + that.v, (d: Double) => this.k(d) + that.k(d))

  // Product rule
  def *(that: AD[Double])(implicit ev: A =:= Double): AD[Double] =
    AD(this.v * that.v, (d: Double) => (this.k(d) * that.v) + (that.k(d) * this.v))

  def sin(implicit ev: A =:= Double): AD[Double] =
    this.flatMap(x => AD(Math.sin(x), d => Math.cos(x) * d))

  def gradient: Double =
    this.k(1.0)
}

object AD {
  def pure(x: Double): AD[Double] =
    AD(x, d => d) // the identity backpropagator: d/dx x = 1

  // Example: gradient of sin(x * x) at x = 2.0
  //   val x = AD.pure(2.0)
  //   (x * x).sin.gradient  ==  2 * 2.0 * Math.cos(4.0)
}
Sorry, I don't understand your notation. To me a monad is defined by two functions, pure and bind (aka flatMap). I don't understand how — × M corresponds to this.
Sorry, that doesn't help. I literally do not understand the notation. I do not know what — means in this context, I don't understand what role × plays here, etc.
"—" means "the parameter of the type constructor goes here". I believe in Scala you use underscore for this, at the value level at least. "×" is what mathematicians call the type constructor for the usual product type. In Haskell this is called (,). I don't know what it's called in Scala.
So, — × M means something like "type T a = (a, M)", in Haskell notation.
// Sketch: M is some fixed monoid, with binary operation * and identity M.unit.
final case class Mon[A](v: A, m: M) {
  def flatMap[B](f: A => Mon[B]): Mon[B] = {
    val next = f(this.v)
    Mon(next.v, this.m * next.m) // * is M's binary operation
  }
}

object Mon {
  def pure[A](v: A): Mon[A] =
    Mon(v, M.unit) // M.unit is M's identity element
}
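If it helps, here's the same — × M shape made fully concrete, taking M to be String with concatenation as the binary operation and "" as the identity (the name Logged and the example values are mine, just for illustration):

```scala
// A concrete "— × M" monad with M = String: concatenation is the monoid
// operation, "" is the identity element.
final case class Logged[A](v: A, log: String) {
  def flatMap[B](f: A => Logged[B]): Logged[B] = {
    val next = f(v)
    Logged(next.v, log + next.log) // concatenation combines the logs
  }
  def map[B](f: A => B): Logged[B] =
    Logged(f(v), log)
}

object Logged {
  def pure[A](v: A): Logged[A] =
    Logged(v, "") // "" is the identity element of M
}

val result =
  for {
    a <- Logged(1, "got 1; ")
    b <- Logged(a + 2, "added 2; ")
  } yield a + b
// result == Logged(4, "got 1; added 2; ")
```

This is the writer monad, which Haskell calls Writer.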
Can arbitrary continuations work? I think that single-use delimited continuations, as in E, are not able to implement the reverse-mode AD technique directly, because here the continuations are called multiple times.
Thanks for this paper. Along with the string of "compiling to categories" papers, our understanding of AD has improved greatly.
I have not read the paper yet but, reading the abstract, the idea of using continuations to store the differentiation information is reminiscent of the technique used in Zygote[0].
I believe Zygote's approach is based on the backpropagator of Pearlmutter and Siskind. The backpropagator approach seems much simpler to me, although the authors of this paper suggest that it requires non-local program transformations.
> The implementation proposed by Pearlmutter and Siskind returns a pair of a value and a backpropagator ... Doing this correctly requires a non-local program transformation ... Further tweaks are required if a lambda uses variables from an outer scope ... In contrast to Pearlmutter and Siskind [2008], using delimited continuations enables reverse-mode AD with only local transformations. Any underlying non-local transformations are implicitly resolved by shift and reset.
I'll have to look more carefully to understand how the CPS version avoids non-local program transformations.
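My current rough understanding of the CPS trick, as a hypothetical sketch (loosely modeled on the paper's approach, but with an explicit continuation argument instead of shift/reset; the names Num, times, and grad are mine):

```scala
// Each operator runs the rest of the program (the continuation k) first,
// then propagates gradients backward -- all within a local definition.
final class Num(val x: Double) {
  var d: Double = 0.0 // accumulated gradient

  def times(that: Num)(k: Num => Unit): Unit = {
    val y = new Num(this.x * that.x)
    k(y) // forward: run the remainder of the computation...
    // ...backward: accumulate this operator's contributions
    this.d += that.x * y.d
    that.d += this.x * y.d
  }
}

def grad(f: Num => (Num => Unit) => Unit)(x: Double): Double = {
  val in = new Num(x)
  f(in) { out => out.d = 1.0 } // seed the output sensitivity
  in.d
}

// d/dx (x * x) at 3.0
val g = grad(x => k => x.times(x)(k))(3.0)
// g == 6.0
```

The continuation gives each operator access to "the rest of the computation" locally, which seems to be how the non-local transformation is avoided.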
I don't find this very convincing. Continuations are well known to be completely mystifying. I wrote up what I believe to be a much simpler reverse mode AD transformation that uses only basic concepts:
For starters, what if you want to differentiate through code that traverses a collection, for example:
val x = ...
ys.map(y => y + x)
Each loop iteration needs to contribute a gradient update to x, which is defined in an outer scope. And what if y => y + x is not given as an inline lambda, but is defined elsewhere? It doesn't seem like your blog post discusses any such cases.
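For concreteness, one standard way to handle this fan-out (a hypothetical sketch, not taken from the post under discussion) is to give each variable a mutable gradient cell that every use site accumulates into:

```scala
// Each variable carries a mutable gradient cell; every use site
// accumulates into it during the backward pass.
final class Var(val v: Double) {
  var grad: Double = 0.0
}

// Forward pass returns the result plus a backpropagator that accumulates.
def add(a: Var, b: Var): (Var, Double => Unit) = {
  val out = new Var(a.v + b.v)
  (out, d => { a.grad += d; b.grad += d })
}

val x = new Var(3.0)
val ys = List(new Var(1.0), new Var(2.0))

// ys.map(y => y + x), keeping each iteration's backpropagator
val (outs, backs) = ys.map(y => add(y, x)).unzip

// Backward pass: every loop iteration contributes to x's gradient
backs.foreach(back => back(1.0))
// x.grad is now 2.0 -- one contribution per element of ys
```

Without some such accumulation, a variable used in several places only receives the gradient from one of its uses.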
Correct, it doesn't discuss that case, indeed it doesn't treat user-defined functions at all. I just don't understand in what sense it doesn't treat binding or sharing.
I don't believe that continuations are a corner of programming language theory that is well-loved and well-understood by programmers at large, are they? Even amongst experienced functional programmers they seem to be thought of as rather difficult. Oleg Kiselyov has written extensively on them but his work has not been brought to the masses in the same way that his work on iteratees has.
Ahhh! Right! See, the context here is important. Continuations are pretty widely known and important in the context of programming language implementation and theory, which is why I was confused by your comment. In the context where this paper was presented (to programming language researchers), making the link between delimited continuations _would_ demystify a bunch, and could allow for transferring of knowledge and techniques between the fields.
Having looked again, the authors do explain why they believe that continuations (as opposed to simply higher-order functions) are necessary. See my other comment at https://news.ycombinator.com/item?id=22347428.
I long for a program that can “dream”: Do offline analysis of the statistics from the previous run, speculate on different settings and replay them in a sandbox to find out.
With more and more cores coming, I expect to see this sort of thing happen once we accept that we can’t really use all of them.
Databases have perhaps done this the longest. I think this notion first came up for me when I was thinking about complex operations on large data structures in application memory, and the way that hand-coded tuning parameters rot over time, or are counterproductive for your largest customers.
I think a few compilers sprouted the ability to take perf logs from previous runs into account when compiling (profile-guided optimization), maybe ten years back now, but nothing is there for off-line optimization in situ. And again my choice of input data will never be perfect for all of my users. I can tune for my biggest customer or be populist.
Defining a loss function comes to mind. Check out JAX[1] or Autograd[2]. As this is essentially a different programming paradigm there are a ton of opportunities, but it is fresh enough that there is little to no history of software engineering differentiable programs. Design patterns, rules of thumb, and standard practices have yet to be defined.
You can define 'better' as quantifiable metrics like execution time. Check out the use of program synthesis in compilers for an actual use case of something similar :)
In the case of a thermostat you could define a goal temperature for a given time range (e.g. 18C during the day, 10C at night) and then your loss is, say, the squared difference from goal and current temperature.
But I agree that in many cases it is difficult to define a goal.
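Concretely, that loss might look something like this (a hypothetical sketch; the schedule and the readings are made up for illustration):

```scala
// Squared-error loss against a time-dependent goal temperature
// (18C during the day, 10C at night).
def goalTemp(hour: Int): Double =
  if (hour >= 7 && hour < 22) 18.0 else 10.0

// readings: (hour of day, measured temperature)
def loss(readings: Seq[(Int, Double)]): Double =
  readings.map { case (hour, temp) =>
    val diff = temp - goalTemp(hour)
    diff * diff
  }.sum
```

For example, loss(Seq((12, 19.0), (2, 10.0))) is 1.0: one degree off at noon, on target at night.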
My point was supposed to be that your goal will vary across time.
Week of vacation at home? Probably optimizing for comfort. Standard work week where you aren't home all day? Cost will be big.
Of course, getting costs in there could be a little tricky. Not impossible, just likely a ton of heuristics. And there is some level of stress where maintaining a temp might be cheaper than quickly getting to it, depending on how far you will vary based on being off. Which is just a long way to say it is easier to keep the house warm on warm days. :)
If you drive a lot, this would be akin to trying to find the optimum spot to hold the gas pedal. Sounds like something you could look for, but odds are high that it has to vary based on circumstances.
So, in the end, we aren't looking for a static parameter, but a system of dynamic parameters to keep in tune.
The usual way a thermostat works is the user specifies goal temperatures in time ranges and the thermostat tries to achieve them. E.g. our thermostat allows, IIRC, 4 different segments per day so we have morning (warmish), daytime (minimal heating), evening (warmish), night (minimal). The optimization problem then becomes one of achieving these goals.
To account for the dynamics / circumstances you can augment the input the thermostat receives from just the current temperature to also include information about the rate of change of temperature (velocity and acceleration). This is the idea behind PID controllers and it allows things like easing off the heating if the house is warming quickly so the thermostat doesn't overshoot the goal.
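A minimal sketch of that idea (the gains kp/ki/kd are illustrative, not tuned for any real system):

```scala
// Minimal PID controller: the output is a heating signal combining the
// current error, its accumulated history, and its rate of change.
final class PID(kp: Double, ki: Double, kd: Double) {
  private var integral = 0.0
  private var lastError = 0.0

  def step(goal: Double, current: Double, dt: Double): Double = {
    val error = goal - current
    integral += error * dt
    val derivative = (error - lastError) / dt // negative if warming fast
    lastError = error
    kp * error + ki * integral + kd * derivative
  }
}
```

The derivative term is what implements the "easing off": if the house is warming quickly, the error is shrinking, so the derivative is negative and the heating signal is reduced before the goal is overshot.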
You're right that having an optimization system turns the problem from one of solving the problem to one of defining a metric that we directly care about.
For battery optimization use cases, e.g. screen brightness, one metric that works is minimising the number of times a user goes and changes the settings (with reinforcement learning), and I think that could definitely be a starting point for thermostats.
Obviously, you also care about other things in a thermostat, e.g. energy usage (eg when you're not home), and maybe the amount of changes a user needs to make is not sufficiently tight supervision for a thermostat, but it's a starting point that does take personalization into account.
Optimization against a kinda-good-enough loss function is better than no optimization at all, which is what usually happens with the giant pile of random program parameters that every program relies on.
In particular, my challenge is that in most programs the "loss" function that is of interest to the user is actually not visible. Why I picked the thermostat example. It doesn't know if you are there or not. Or what the cost of electricity is right now. Or what the average temperature outside will be in 2 hours.
It can be made aware of all of these things, but that will itself be a large part of the complexity. Such that "just optimize this loss function" is not wrong. Just, I challenge how much it helps.
I think it can definitely help in the thermostat example. This was literally the original value proposition of the Nest Thermostat, and they went to some length to demonstrate that they were effective, using both internal and third-party studies. The claim is an average of $130 in energy savings per customer per year... Which translates directly into a positive effect for carbon emissions.
Exactly, I feel there is stuff there. I just don't feel it is as simple as a loss function. It generally is more capability and more data.
The Nest is odd, to me, as I live without an air conditioner. So, at best, it can keep things off when we are not at home. But it can't just turn on when we get home, as at that point it is too late to hit and maintain comfort before bed.
So, we are back to adding more to my house just to support the data and control this is looking to bolster. Which is almost certainly a net negative, all told. :(