From the programming language point of view, what's interesting to me is that this means, I think, that there is a differentiation monad. This arises because continuations and monads are closely related (delimited continuations can express any monad, and continuations themselves form a monad), so anything you can express with continuations should also be expressible as a monad.
I haven't seen this discussed before, except in the context of category theory (https://ncatlab.org/nlab/show/differentiation). I'm not sure how this ties into the programming language point of view.
If you speak Scala, here's code for reverse mode differentiation as a monad. The flatMap / bind operation becomes the chain rule.
If this "explanation" (scare quotes because it's not really an explanation, just a code dump) doesn't work for you let me know and I can try a different approach.
final case class AD[A](v: A, k: Double => Double) {
  // Chain rule: feed the downstream sensitivity through this
  // computation's backpropagator.
  def flatMap[B](f: A => AD[B]): AD[B] = {
    val next = f(v)
    AD(next.v, (d: Double) => this.k(next.k(d)))
  }

  def +(that: AD[Double])(implicit ev: A =:= Double): AD[Double] =
    AD(this.v + that.v, (d: Double) => this.k(d) + that.k(d))

  // Product rule
  def *(that: AD[Double])(implicit ev: A =:= Double): AD[Double] =
    AD(this.v * that.v, (d: Double) => (this.k(d) * that.v) + (that.k(d) * this.v))

  def sin(implicit ev: A =:= Double): AD[Double] =
    this.flatMap(x => AD(Math.sin(x), d => Math.cos(x) * d))

  def gradient: Double =
    this.k(1.0)
}

object AD {
  def pure(x: Double): AD[Double] =
    AD(x, d => d) // the identity backpropagator: d/dx x = 1

  // Example: gradient of sin(x * x) at x = 2.0
  //   val x = AD.pure(2.0)
  //   (x * x).sin.gradient  ==  2 * 2.0 * Math.cos(4.0)
}
Sorry, I don't understand your notation. To me a monad is defined by two functions, pure and bind (aka flatMap). I don't understand how — × M corresponds to this.
Sorry, that doesn't help. I literally do not understand the notation. I do not know what — means in this context, I don't understand what role × plays here, etc.
"—" means "the parameter of the type constructor goes here". I believe in Scala you use underscore for this, at the value level at least. "×" is what mathematicians call the type constructor for the usual product type. In Haskell this is called (,). I don't know what it's called in Scala.
So, — × M means something like "type T a = (a, M)", in Haskell notation.
// Sketch: M is some fixed monoid, with binary operation * and identity M.unit.
final case class Mon[A](v: A, m: M) {
  def flatMap[B](f: A => Mon[B]): Mon[B] = {
    val next = f(this.v)
    Mon(next.v, this.m * next.m) // * is M's binary operation
  }
}

object Mon {
  def pure[A](v: A): Mon[A] =
    Mon(v, M.unit) // M.unit is M's identity element
}
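If it helps, here's the same — × M shape made fully concrete, taking M to be String with concatenation as the binary operation and "" as the identity (the name Logged and the example values are mine, just for illustration):

```scala
// A concrete "— × M" monad with M = String: concatenation is the monoid
// operation, "" is the identity element.
final case class Logged[A](v: A, log: String) {
  def flatMap[B](f: A => Logged[B]): Logged[B] = {
    val next = f(v)
    Logged(next.v, log + next.log) // concatenation combines the logs
  }
  def map[B](f: A => B): Logged[B] =
    Logged(f(v), log)
}

object Logged {
  def pure[A](v: A): Logged[A] =
    Logged(v, "") // "" is the identity element of M
}

val result =
  for {
    a <- Logged(1, "got 1; ")
    b <- Logged(a + 2, "added 2; ")
  } yield a + b
// result == Logged(4, "got 1; added 2; ")
```

This is the writer monad, which Haskell calls Writer.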
Can arbitrary continuations work? I think that single-use delimited continuations, as in E, are not able to implement the reverse-mode AD technique directly, because here the continuations are called multiple times.
Thanks for this paper. Along with the string of "compiling to categories" papers, our understanding of AD has improved greatly.
I have not read the paper yet but, reading the abstract, the idea of using continuations to store the differentiation information is reminiscent of the technique used in Zygote[0].
I believe Zygote's approach is based on the backpropagator of Pearlmutter and Siskind. The backpropagator approach seems much simpler to me, although the authors of this paper suggest that it requires non-local program transformations.
> The implementation proposed by Pearlmutter and Siskind returns a pair of a value and a backpropagator ... Doing this correctly requires a non-local program transformation ... Further tweaks are required if a lambda uses variables from an outer scope ... In contrast to Pearlmutter and Siskind [2008], using delimited continuations enables reverse-mode AD with only local transformations. Any underlying non-local transformations are implicitly resolved by shift and reset.
I'll have to look more carefully to understand how the CPS version avoids non-local program transformations.
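My current rough understanding of the CPS trick, as a hypothetical sketch (loosely modeled on the paper's approach, but with an explicit continuation argument instead of shift/reset; the names Num, times, and grad are mine):

```scala
// Each operator runs the rest of the program (the continuation k) first,
// then propagates gradients backward -- all within a local definition.
final class Num(val x: Double) {
  var d: Double = 0.0 // accumulated gradient

  def times(that: Num)(k: Num => Unit): Unit = {
    val y = new Num(this.x * that.x)
    k(y) // forward: run the remainder of the computation...
    // ...backward: accumulate this operator's contributions
    this.d += that.x * y.d
    that.d += this.x * y.d
  }
}

def grad(f: Num => (Num => Unit) => Unit)(x: Double): Double = {
  val in = new Num(x)
  f(in) { out => out.d = 1.0 } // seed the output sensitivity
  in.d
}

// d/dx (x * x) at 3.0
val g = grad(x => k => x.times(x)(k))(3.0)
// g == 6.0
```

The continuation gives each operator access to "the rest of the computation" locally, which seems to be how the non-local transformation is avoided.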
I don't find this very convincing. Continuations are well known to be completely mystifying. I wrote up what I believe to be a much simpler reverse mode AD transformation that uses only basic concepts:
For starters, what if you want to differentiate through code that traverses a collection, for example:
val x = ...
ys.map(y => y + x)
Each loop iteration needs to contribute a gradient update to x, which is defined in an outer scope. And what if y => y + x is not given as an inline lambda, but is defined elsewhere? It doesn't seem like your blog post discusses any such cases.
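For concreteness, one standard way to handle this fan-out (a hypothetical sketch, not taken from the post under discussion) is to give each variable a mutable gradient cell that every use site accumulates into:

```scala
// Each variable carries a mutable gradient cell; every use site
// accumulates into it during the backward pass.
final class Var(val v: Double) {
  var grad: Double = 0.0
}

// Forward pass returns the result plus a backpropagator that accumulates.
def add(a: Var, b: Var): (Var, Double => Unit) = {
  val out = new Var(a.v + b.v)
  (out, d => { a.grad += d; b.grad += d })
}

val x = new Var(3.0)
val ys = List(new Var(1.0), new Var(2.0))

// ys.map(y => y + x), keeping each iteration's backpropagator
val (outs, backs) = ys.map(y => add(y, x)).unzip

// Backward pass: every loop iteration contributes to x's gradient
backs.foreach(back => back(1.0))
// x.grad is now 2.0 -- one contribution per element of ys
```

Without some such accumulation, a variable used in several places only receives the gradient from one of its uses.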
Correct, it doesn't discuss that case, indeed it doesn't treat user-defined functions at all. I just don't understand in what sense it doesn't treat binding or sharing.
I don't believe that continuations are a corner of programming language theory that is well-loved and well-understood by programmers at large, are they? Even amongst experienced functional programmers they seem to be thought of as rather difficult. Oleg Kiselyov has written extensively on them but his work has not been brought to the masses in the same way that his work on iteratees has.
Ahhh! Right! See, the context here is important. Continuations are pretty widely known and important in the context of programming language implementation and theory, which is why I was confused by your comment. In the context where this paper was presented (to programming language researchers), making the link between delimited continuations _would_ demystify a bunch, and could allow for transferring of knowledge and techniques between the fields.
Having looked again, the authors do explain why they believe that continuations (as opposed to simply higher-order functions) are necessary. See my other comment at https://news.ycombinator.com/item?id=22347428.
I long for a program that can “dream”: Do offline analysis of the statistics from the previous run, speculate on different settings and replay them in a sandbox to find out.
With more and more cores coming, I expect to see this sort of thing happen once we accept that we can’t really use all of them.
Databases have perhaps done this the longest. I think this notion first came up for me when I was thinking about complex operations on large data structures in application memory, and the way that hand-coded tuning parameters rot over time, or are counterproductive for your largest customers.
I think a few compilers sprouted the ability to take perf logs from previous runs into account when compiling (profile-guided optimization), maybe ten years back now, but nothing is there for off-line optimization in situ. And again my choice of input data will never be perfect for all of my users. I can tune for my biggest customer or be populist.
Defining a loss function comes to mind. Check out JAX[1] or Autograd[2]. As this is essentially a different programming paradigm there are a ton of opportunities, but it is fresh enough that there is little to no history of software engineering differentiable programs. Design patterns, rules of thumb, and standard practices have yet to be defined.
You can define 'better' as quantifiable metrics like execution time. Check out the use of program synthesis in compilers for an actual use case of something similar :)
In the case of a thermostat you could define a goal temperature for a given time range (e.g. 18C during the day, 10C at night) and then your loss is, say, the squared difference from goal and current temperature.
But I agree that in many cases it is difficult to define a goal.
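Concretely, that loss might look something like this (a hypothetical sketch; the schedule and the readings are made up for illustration):

```scala
// Squared-error loss against a time-dependent goal temperature
// (18C during the day, 10C at night).
def goalTemp(hour: Int): Double =
  if (hour >= 7 && hour < 22) 18.0 else 10.0

// readings: (hour of day, measured temperature)
def loss(readings: Seq[(Int, Double)]): Double =
  readings.map { case (hour, temp) =>
    val diff = temp - goalTemp(hour)
    diff * diff
  }.sum
```

For example, loss(Seq((12, 19.0), (2, 10.0))) is 1.0: one degree off at noon, on target at night.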
My point was supposed to be that your goal will vary across time.
Week of vacation at home? Probably optimizing for comfort. Standard work week where you aren't home all day? Cost will be big.
Of course, getting costs in there could be a little tricky. Not impossible, just likely a ton of heuristics. And there is some level of stress where maintaining a temp might be cheaper than quickly getting to it, depending on how far you will vary based on being off. Which is just a long way to say it is easier to keep the house warm on warm days. :)
If you drive a lot, this would be akin to trying to find the optimum spot to hold the gas pedal. Sounds like something you could look for, but odds are high that it has to vary based on circumstances.
So, in the end, we aren't looking for a static parameter, but a system of dynamic parameters to keep in tune.
The usual way a thermostat works is the user specifies goal temperatures in time ranges and the thermostat tries to achieve them. E.g. our thermostat allows, IIRC, 4 different segments per day so we have morning (warmish), daytime (minimal heating), evening (warmish), night (minimal). The optimization problem then becomes one of achieving these goals.
To account for the dynamics / circumstances you can augment the input the thermostat receives from just the current temperature to also include information about the rate of change of temperature (velocity and acceleration). This is the idea behind PID controllers and it allows things like easing off the heating if the house is warming quickly so the thermostat doesn't overshoot the goal.
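A minimal sketch of that idea (the gains kp/ki/kd are illustrative, not tuned for any real system):

```scala
// Minimal PID controller: the output is a heating signal combining the
// current error, its accumulated history, and its rate of change.
final class PID(kp: Double, ki: Double, kd: Double) {
  private var integral = 0.0
  private var lastError = 0.0

  def step(goal: Double, current: Double, dt: Double): Double = {
    val error = goal - current
    integral += error * dt
    val derivative = (error - lastError) / dt // negative if warming fast
    lastError = error
    kp * error + ki * integral + kd * derivative
  }
}
```

The derivative term is what implements the "easing off": if the house is warming quickly, the error is shrinking, so the derivative is negative and the heating signal is reduced before the goal is overshot.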
You're right that having an optimization system turns the problem from one of solving the problem to one of defining a metric that we directly care about.
For battery optimization use cases, e.g. screen brightness, one metric that works is minimising the number of times a user goes and changes the settings (with reinforcement learning), and I think that could definitely be a starting point for thermostats.
Obviously, you also care about other things in a thermostat, e.g. energy usage (eg when you're not home), and maybe the amount of changes a user needs to make is not sufficiently tight supervision for a thermostat, but it's a starting point that does take personalization into account.
Optimization against a kinda-good-enough loss function is better than no optimization at all, which is what usually happens with the giant pile of random program parameters that every program relies on.
In particular, my challenge is that in most programs the "loss" function that is of interest to the user is actually not visible. Why I picked the thermostat example. It doesn't know if you are there or not. Or what the cost of electricity is right now. Or what the average temperature outside will be in 2 hours.
It can be made aware of all of these things, but that will itself be a large part of the complexity. Such that "just optimize this loss function" is not wrong. Just, I challenge how much it helps.
I think it can definitely help in the thermostat example. This was literally the original value proposition of the Nest Thermostat, and they went to some length to demonstrate that they were effective, using both internal and third-party studies. The claim is an average of $130 in energy savings per customer per year... Which translates directly into a positive effect for carbon emissions.
Exactly, I feel there is stuff there. I just don't feel it is as simple as a loss function. It generally is more capability and more data.
The Nest is odd, to me, as I live without an air conditioner. So, at best, it can keep things off when we are not at home. But it can't just turn on when we get home, as at that point it is too late to hit and maintain comfort before bed.
So, we are back to adding more to my house just to support the data and control this is looking to bolster. Which is almost certainly a net negative, all told. :(