Differentiable Programming – A Simple Introduction (assemblyai.com)
159 points by dylanbfox on April 13, 2022 | 49 comments



The most interesting thing I've seen on AD is "The simple essence of automatic differentiation" (2018) [1]. See past discussion [2], and talk [3]. I think the main idea is that by compiling to categories and pairing up a function with its derivative, the pair becomes trivially composable in forward mode, and the whole structure is easily converted to reverse mode afterwards.

[1]: https://dl.acm.org/doi/10.1145/3236765

[2]: https://news.ycombinator.com/item?id=18306860

[3]: Talk at Microsoft Research: https://www.youtube.com/watch?v=ne99laPUxN4 Other presentations listed here: https://github.com/conal/essence-of-ad
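
A minimal sketch of that pairing idea in plain Python (illustrative only, not the paper's categorical formulation): each function carries its derivative, and composition is just the chain rule, which is why forward mode falls out mechanically.

    import math

    # Represent a differentiable function as a pair (f, df),
    # where df(x) is the derivative of f at x.
    sin = (math.sin, math.cos)
    square = (lambda x: x * x, lambda x: 2 * x)

    def compose(outer, inner):
        f, df = outer
        g, dg = inner
        # Chain rule: (f . g)'(x) = f'(g(x)) * g'(x)
        return (lambda x: f(g(x)),
                lambda x: df(g(x)) * dg(x))

    h, dh = compose(sin, square)  # h(x) = sin(x^2)
    print(dh(1.5))                # 2 * 1.5 * cos(2.25)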


> the whole structure is easily converted to reverse mode afterwards.

Unfortunately it's not. Elliott never actually demonstrates in the paper how to implement such an algorithm, and it's very hard to write compiler transformations in "categorical form".

(Disclosure: I'm the author of another paper on AD.)


I think JAX effectively demonstrates that this is indeed possible. The approach they use is to first linearise the JAXPR and then transpose it, pretty much in the same fashion as the Elliott paper did.
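
A sketch of that linearise-then-transpose pipeline using JAX's public API (jax.linearize and jax.linear_transpose are real functions; f here is just an example):

    import jax
    import jax.numpy as jnp

    def f(x):
        return jnp.sin(x) * x

    x = jnp.float32(2.0)
    y, f_jvp = jax.linearize(f, x)          # f(x) plus its linearisation at x
    f_vjp = jax.linear_transpose(f_jvp, x)  # transpose the linear map -> reverse mode
    (grad_x,) = f_vjp(jnp.float32(1.0))     # pull back the cotangent 1.0
    # grad_x now equals jax.grad(f)(x)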


A JAXPR is a normal functional-style expression tree, with free variables, like

    let b = f a in g (a, b)
Elliot's "compiling to categories" requires you to translate this to

    g ∘ (id × f) ∘ dup
It's pretty baffling to work with terms like the latter in practice! The main ideas that JAX is based on were already published in "Lambda the Ultimate Backpropagator" in 2008. Elliott's work is a nice way of conceptualising the AD transformations and understanding how they all relate to each other, but it's not particularly practical.


Which paper?


My professor has talked about this. He thinks that the real gem of the deep learning revolution is the ability to take the derivative of arbitrary code and use that to optimize. Deep learning is just one application of that, but there are tons more.


That's part of why Julia is so exciting! Building it specifically to be a differentiable programming language opens so many doors ...


Julia wasn't really built specifically to be differentiable; it was just built in a way that gives you access to the IR, which is what Zygote uses. Enzyme AD is the most exciting to me because it lets any LLVM language be differentiated.


Ah I see, thank you for clarifying. And thank you for bringing Enzyme to my attention - I've never seen it before!


Enzyme.jl works quite well (but the possibility of using it across languages is appealing).


I am just happy that the previously siloed fields of operations research and the various control theory sub-disciplines are now incentivized to pool their research together thanks to the funding in ML. Also, a lot of the expensive, proprietary optimization software in industry is finally getting some competition.


Hm, I didn't know different areas of control theory were siloed. Learning about control theory in graduate school was awesome, and it seems like a field that would benefit a lot from ML. I know RL agents are used for control tasks like cartpole, but I would've thought it would be more widespread! Do you think the development of differentiable programming (i.e. the observation that it has more generality beyond pure ML/DL) was really the missing piece?

Also, just curious, what are your studies in?


Control theory has a very, very long parallel history alongside ML. ML, specifically probabilistic and reinforcement learning, uses a lot of dynamic programming ideas and Bellman equations in its theoretical modeling. Look up the term "cybernetics"; it's an old pre-internet-era term for control theory and optimization. The Soviets even had a grand scheme to build networked factories that could be centrally optimized and resource-allocated. Their Slavic communist AWS-meets-Walmart efforts spawned a Nobel laureate: Kantorovich was given the award for inventing linear programming.

Unfortunately, the CS field is only just rediscovering control theory, while it has been a staple of EE for years. There hadn't been many new innovations in the field until recently, when ML became the hottest new thing.


Interesting, I didn't know about cybernetics.

For those interested, there is a book that discusses both: 'Reinforcement Learning and Optimal Control' by Dimitri P. Bertsekas. It covers exact and approximate dynamic programming, finite- and infinite-horizon problems, deterministic and stochastic models, and model-based and model-free optimization.

Aside from this book, Ben Recht has an interesting blog series about optimal control and reinforcement learning: http://www.argmin.net/2018/06/25/outsider-rl


This is some insanely cool history! I had no idea the Soviets had such a technical vision, that's actually pretty amazing. I've heard the term "cybernetics" but honestly just thought it was some movie-tech term, lol.

It seems really weird that control theory is in EE departments considering it's sooo much more mathematical than most EE subdisciplines except signal processing. I remember a math professor of mine telling us about optimization techniques that control systems practitioners would know better than applied mathematicians, because they were developed specifically for the field; can't remember what the techniques were, though ...


There is an excellent HN-recommended novel called Red Plenty that dramatised those Soviet planning efforts.

https://news.ycombinator.com/item?id=8417882

> It seems really weird that control theory is in EE departments considering it's sooo much more mathematical than most EE subdisciplines except signals processing.

I agree. Apparently Bellman called it "dynamic programming" because he needed grant funding during the Cold War and was advised to give his mathematical theories a more "interesting" name.

https://en.m.wikipedia.org/wiki/Dynamic_programming#History

The generalised form of the Bellman equation (co-formulated by Kalman, of Kalman filter fame) is in some ways to control theory and EE what the maximum likelihood function is to ML.

https://en.m.wikipedia.org/wiki/Hamilton%E2%80%93Jacobi%E2%8...


Looks really cool, added to my amazon cart. Thanks for the rec!

That's hilarious and sadly insightful. I remember thinking "what the hell is so 'dynamic' about this?" the first time I learned about dynamic programming. Although "memoitative programming" sounds pretty fancy too, lol


Laplace transforms are one such trick. Given a linear differential equation describing your system, the Laplace transform lets you solve it using basic algebra. Unfortunately, this doesn't work on nonlinear systems.
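
A worked instance of that algebra (the standard first-order example): for y' + a·y = u(t) with y(0) = 0, transforming turns the ODE into

    s·Y(s) + a·Y(s) = U(s)   =>   Y(s) = U(s) / (s + a)

    For a unit step u(t) = 1, U(s) = 1/s, so by partial fractions
    Y(s) = 1/(s(s + a)) = (1/a)(1/s - 1/(s + a)),
    which inverts back to y(t) = (1 - e^{-at}) / a.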


The reduction of convolution to FFT is beautiful ;)


How do you differentiate a string? Enum?


The answer to that is a huge part of the NLP field. The current answer is that you break down the string into constituent parts and map each of them into a high dimensional space. “cat” becomes a large vector whose position is continuous and therefore differentiable. “the cat” probably becomes a pair of vectors.
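
A toy sketch of that mapping in JAX (the vocabulary, sizes, and loss are all made up for illustration): the string indexes into an embedding table, and gradients flow into the table entries.

    import jax
    import jax.numpy as jnp

    vocab = {"the": 0, "cat": 1}       # toy vocabulary
    emb = jnp.full((2, 4), 0.1)        # 2 tokens x 4-dim embedding table

    def loss(emb, token_ids):
        vectors = emb[token_ids]       # "the cat" -> a pair of vectors
        return jnp.sum(vectors ** 2)   # stand-in for a real loss

    ids = jnp.array([vocab["the"], vocab["cat"]])
    grads = jax.grad(loss)(emb, ids)   # d(loss)/d(embedding table)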


It's weirder than that. You typically are differentiating a loss function of strings and various opaque weights. You are optimizing the loss function over the weight space, so in some informal contrived sense you are actually differentiating with respect to the string.


Not all functions are differentiable.

Sometimes there are other, better ways to describe "how does changing x affect y". Derivatives are powerful, but they are not the only possible description of such relationships.

I'm very excited for what other things future "compilers" will be able to do to programs besides differentiation. That's just the beginning.


If you were dealing with e.g. English words rather than arbitrary strings, one approach would be to treat each word as a point in n-dimensional space. Then you can use continuous (and differentiable) functions to output into that space.



Generally you consider them to be piecewise constant.


Or more precisely, discrete.


Nice article, but the intro is a little lengthy.

I have one remark, though: If your language allows for automatic differentiation already, why do you bother with a neural network in the first place?

I think you should have a good reason why you choose a neural network for your approximation of the inverse function, and why it has exactly that number of layers. For instance, why shouldn't a simple polynomial suffice? Could it be that your neural network ends up as an approximation of the Taylor expansion of your inverse function?


It's a good question, which I answer in much more depth here: https://www.reddit.com/r/MachineLearning/comments/ryw53x/d_f... . Specifically, that answer is about NNs vs Fourier series, but it's the same point: polynomials, Fourier series, and NNs are all universal function approximators, so why use a NN? If you write out what happens, though, you can easily see the curse of dimensionality in action and see why a 100-dimensional function shouldn't be approximated by tensor products of polynomials. See the rest there.

So it's not really about NNs vs polynomials/Fourier/Chebyshev, etc., but rather what is the right thing to use at a given time. In the universal differential equations paper (https://arxiv.org/abs/2001.04385), we demonstrate that some examples of automatic equation discovery are faster using Fourier series than NNs (specifically, the discovery of a semilinear partial differential equation). That doesn't mean NNs are a bad way to do all of this; it just means they are one trainable object that is good for high-dimensional approximations, and libraries should let you move easily between classical basis functions and NNs to achieve the highest performance.
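
To make the curse-of-dimensionality point concrete: a tensor-product polynomial basis of degree d in n dimensions needs (d+1)^n coefficients, so even a modest degree blows up:

    d = 5, n = 100  =>  6^100 ≈ 6.5 × 10^77 coefficients

    By contrast, a NN's parameter count grows only polynomially in
    width, depth, and input dimension.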


Great answer! Very much appreciated!


I think for more complicated examples, like RL control systems, a neural network is the natural choice. If you want to incorporate physics into your world model, then you'd need differentiable programming + NNs, right? Or am I misunderstanding the question.

If you're talking about the specific cannon problem, you don't need to do any learning at all; you can just solve the kinematics. So in some sense you could ask why you're using any approximating function at all.


Requiring an exact justification for a specific NN architecture is not productive. Simply put, such justification almost never exists, yet NNs clearly outperform simple polynomials in a wide variety of tasks. Even if you know that a NN performs better than a simple polynomial on a task, you're very unlikely to get an exact explanation of why that specific NN was chosen over the infinite number of other possible NNs. If you require such an explanation, you're going to miss out on a lot of performance.


Most of the time we hear about neural networks because the linear models didn't work.


In low-dimensional spaces there are a lot of good techniques, but once you go above 10 or so dimensions, a NN is generally the only option. Pretty much everything else degrades exponentially with the number of dimensions.


The nice thing about differentiable programming is that we can use all sorts of optimizers beyond gradient descent, some of which offer quadratic convergence instead of linear!


Yes, exactly! This is huge. Hessian optimization is really easy with JAX; haven't tried it in Julia though.
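
For instance, a minimal Newton-style step with JAX (f and the sizes are illustrative; jax.hessian builds the full matrix, so this only works for small parameter counts):

    import jax
    import jax.numpy as jnp

    def f(w):                       # example scalar objective
        return jnp.sum(jnp.sin(w) * w ** 2)

    w = jnp.arange(1.0, 4.0)
    H = jax.hessian(f)(w)           # full 3x3 Hessian
    g = jax.grad(f)(w)
    w_next = w - jnp.linalg.solve(H, g)  # Newton step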


Here's Hessian-Free Newton-Krylov on neural ODEs with Julia: https://diffeqflux.sciml.ai/dev/examples/second_order_adjoin... . It's just standard tutorial stuff at this point.


And very fast, given that you compile the procedure! I am considering writing an article on this and posting it here, because I have seen enormous improvements over non-jitted code, and that's excluding jax.vmap.


There's a comparison of JAX with PyTorch for Hessian calculation here!

https://www.assemblyai.com/blog/why-you-should-or-shouldnt-b...

Would definitely be interested in an article like that if you decide to write it


Why can't we use this quadratic convergence in deep learning?


Well, quadratic convergence usually requires the Hessian, or an approximation of it, and that's difficult to get in deep learning due to memory constraints and the difficulty of computing second-order derivatives.

Computing the derivatives is not very difficult with e.g. JAX, but ... you get back to the memory issue. The Hessian is a square matrix, so in deep learning, if we have a million parameters, the Hessian is a million-by-million matrix with a trillion entries...
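
One standard way around that memory wall is to never materialize the Hessian and only take Hessian-vector products, which cost about as much as two gradient evaluations (a sketch in JAX of the well-known forward-over-reverse trick):

    import jax

    def hvp(f, x, v):
        # Differentiate grad(f) along direction v: computes H(x) @ v
        # without ever forming the n x n Hessian. Memory stays O(n).
        return jax.jvp(jax.grad(f), (x,), (v,))[1]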


Not only does it have 1 trillion elements, you also have to invert it!


Indeed! BFGS (and its derivatives) approximate the inverse, but they have other issues that make them prohibitively expensive.



To add, one could think of schemes like "momentum" and cousins as attempts to estimate something in the spirit of the inverse Hessian using various hacks/heuristics.


Does someone have an example where the ability to “differentiate” a program gets you something interesting?

I understand perfectly what it means for a neural network, but how about more abstract things.

I'm not even sure that, as currently presented, the implementation actually means something. What is the derivative of a function like List, Sort, or GroupBy? These articles all assume that somehow it just looks like the derivative from calculus.

Approximating everything as some non-smooth real function doesn't seem entirely morally correct. A program is more discrete, or synthetic. I think it should be a bit more algebraically flavoured, like differentials over a ring.


At first glance, this approach appears to reinvent an applied-mathematics approach to optimal control. There, one writes a generalized Hamiltonian, from which forward- and backward-in-time paths can be iterated.

The Pontryagin maximum principle (or minimum, if you define your objective function with a minus sign) is the essence of that approach to optimal control.


I've never heard of Pontryagin's maximum principle, but thanks for bringing it to my attention. I think my knowledge of dynamical systems and control theory isn't quite up to snuff at the moment to fully understand it, but it's bookmarked for another day! Thanks again.


The article is okay, but it would have helped to have labelled the axes of the graphs.



