OK, Deep Learning has outlived its usefulness as a buzz-phrase.
Deep Learning est mort. Vive Differentiable Programming!
Yeah, Differentiable Programming is little more than a rebranding of the modern collection of Deep Learning techniques, the same way Deep Learning was a rebranding of the modern incarnations of neural nets with more than two layers.
But the important point is that people are now building a new kind of software by assembling networks of parameterized functional blocks and by training them from examples using some form of gradient-based optimization.
An increasingly large number of people are defining the networks procedurally in a data-dependent way (with loops and conditionals), allowing them to change dynamically as a function of the input data fed to them. It's really very much like a regular program, except it's parameterized, automatically differentiated, and trainable/optimizable. Dynamic networks have become increasingly popular (particularly for NLP), thanks to deep learning frameworks that can handle them, such as PyTorch and Chainer (note: our old deep learning framework Lush could handle a particular kind of dynamic nets called Graph Transformer Networks back in 1994; it was needed for text recognition).
People are now actively working on compilers for imperative differentiable programming languages. This is a very exciting avenue for the development of learning-based AI.
Important note: this won't be sufficient to take us to "true" AI. Other concepts will be needed for that, such as what I used to call predictive learning and now decided to call Imputative Learning. More on this later....
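To make the dynamic-networks point concrete, here is a minimal PyTorch sketch (an illustration in the spirit of the post, not LeCun's code) of a model whose depth depends on its input, yet which remains trainable end to end:

import torch
import torch.nn as nn

# A tiny network whose depth depends on the input: keep applying the same
# learned layer until the activation norm passes a threshold (capped so the
# loop always terminates). Autodiff records whichever path actually ran.
class DynamicNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.layer = nn.Linear(dim, dim)

    def forward(self, x):
        steps = 0
        while x.norm() < 10.0 and steps < 8:    # data-dependent control flow
            x = torch.relu(self.layer(x)) + x   # residual step
            steps += 1
        return x.sum()

net = DynamicNet(4)
loss = net(torch.randn(4))
loss.backward()    # gradients flow through however many steps were taken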
As much of a pity as that, 70 years later, we are still using transistor-based computers originally derived from three wires stuck in a piece of rock[1] by some very innovative fellows at Bell Labs[2]?
As much of a pity as that, after 3 billion years, all life is still based upon selection bias and random genetic mutations, a brute-force trial and error?
The algorithms that allow AlphaGo or the Libratus poker AI to achieve superhuman play have a direct correspondence with natural selection (and learning rate with selection strength). There is idea sharing between evolutionary game theory and learning, up to algorithms for playing extensive-form games with imperfect information, such as this one based on replicator dynamics: http://dl.acm.org/citation.cfm?id=2617448
Evolution, per genome trajectory, also improves in its ability to evolve, a phenomenon known as evolvability.
Returning to the grandparent's original lament about stochastic gradient descent, I do agree with them. I suspect the need for something better has not been felt because unsupervised and incremental learning remain minor areas of study. The artificial separation between learning and prediction allows "hacks" like batch norm to somewhat suffice. It seems unlikely that we will never need to take the curvature of information manifolds into account while learning. Note that evolution uses curvature too.
Anything other than SGD has proven expensive, but there are promising approximations, such as KFAC or Projected Natural Gradient Descent, allowing me to close this post with yet another link between learning and evolution.
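As a toy illustration of what a curvature-aware step means here (a sketch with a made-up Fisher matrix; the numbers are illustrative only), natural gradient preconditions the ordinary gradient with the inverse Fisher information:

import numpy as np

# Natural-gradient step: solve F * step = g instead of stepping along g,
# so each direction is rescaled by the (illustrative) information geometry.
g = np.array([1.0, 1.0])        # ordinary gradient
F = np.array([[4.0, 0.0],
              [0.0, 0.25]])     # Fisher information matrix (made up)
natural_step = np.linalg.solve(F, g)
print(natural_step)             # [0.25 4.  ]: per-direction rescaling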
Following a gradient is smarter than trial and error. You can make an argument that, in high-dimensional parameter spaces, it's hard to do better (because gradient descent is linear in the number of dimensions).
Ordinary Metropolis-Hastings, for example, is closer to trial and error.
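For reference, a toy random-walk Metropolis-Hastings sampler (a minimal sketch targeting a standard normal; step size and iteration count are arbitrary choices) shows the propose-then-accept, trial-and-error flavour:

import numpy as np

# Random-walk Metropolis-Hastings: propose a blind random move, then
# accept or reject it based on the density ratio.
rng = np.random.default_rng(0)

def log_p(x):                   # unnormalised log-density: standard normal
    return -0.5 * x @ x

x = np.zeros(2)
samples = []
for _ in range(1000):
    proposal = x + 0.5 * rng.standard_normal(2)   # the "trial"
    if np.log(rng.random()) < log_p(proposal) - log_p(x):
        x = proposal                              # accept; else keep x
    samples.append(x)
print(np.mean(samples, axis=0))  # near zero, the target's mean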
> You can make an argument that, in high-dimensional parameter spaces, it's hard to do better (because gradient descent is linear in the number of dimensions).
Nitpick: If you're interested in solving optimisation problems, it's very easy to do better than gradient descent. Gradient descent performs very poorly when the directional curvature of the objective function varies too much with the direction. Newton's method, quasi-Newton methods, and nonlinear conjugate gradient are some of the more ingenious, beautiful, and clean ways; there are also some dirty, hacky ways to go.
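To illustrate on a deliberately ill-conditioned quadratic (a minimal numpy sketch; the problem is a toy choice), Newton's method solves H d = -g for its step, so curvature that varies wildly by direction is handled immediately:

import numpy as np

# Newton's method on f(x) = 0.5 * x^T H x with wildly uneven curvature.
# Gradient descent would need a tiny step size and crawl; Newton's step
# solves H d = -g and lands on the minimiser in one iteration.
H = np.diag([1.0, 1000.0])
def grad(x):
    return H @ x

x = np.array([1.0, 1.0])
for _ in range(5):
    x = x + np.linalg.solve(H, -grad(x))
print(x)    # the origin, reached after the very first step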
It is a little bit interesting that fancier optimisation algorithms than gradient descent are unnecessary or unhelpful in some large-scale applications.
You mention Newton’s method, but of course that requires second order information which, as I mentioned, is not generally workable in high dimensions. You have to be careful with quasi-Newton methods like conjugate gradient for the same reason.
> You mention Newton’s method, but of course that requires second order information which, as I mentioned, is not generally workable in high dimensions.
Why would you say that second-order information is "not generally workable in high dimensions"? We regularly run Newton's method on problems with tens of millions of variables today.
And Newton's method isn't the only way to use second-order information. It is easy to access, for example, Hessian-times-vector information using the same reverse-mode differentiation that's so popular today, using only a constant factor more time.
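That trick is a few lines in any reverse-mode framework; a minimal PyTorch sketch (the test function is a toy choice): differentiate the inner product of the gradient with v to get Hessian-times-v at roughly the cost of two backward passes:

import torch

# Hessian-vector product by double reverse mode: grad of (grad(f) . v).
def hvp(f, x, v):
    x = x.detach().requires_grad_(True)
    (g,) = torch.autograd.grad(f(x), x, create_graph=True)
    (hv,) = torch.autograd.grad(g @ v, x)
    return hv

f = lambda x: (x ** 3).sum()    # Hessian is diag(6 * x)
x = torch.tensor([1.0, 2.0])
v = torch.tensor([1.0, 0.0])
print(hvp(f, x, v))             # tensor([6., 0.])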
> You have to be careful with quasi-Newton methods like conjugate gradient for the same reason.
I meant that, in general settings (i.e., no special problem structure), you need full Hessians for Newton's method. And, regarding conjugate gradient: in the (non-DL) settings I'm used to, you need preconditioners, which are also second order, for good results.
Could you provide a reference to a 10^7 size problem that is being optimized with Newton’s method? I’d be indebted.
> I meant that, in general settings (i.e., no special problem structure), you need full Hessians for Newton's method. And, regarding conjugate gradient: in the (non-DL) settings I'm used to, you need preconditioners, which are also second order, for good results.
Yep, but often sparsity is present or the objective function is reasonably well-behaved. The former can save straight Newton; the latter will make nonlinear CG or quasi-Newton methods converge rapidly. (Quasi-Newton methods build a simple model of the Hessian from gradient and step information, usually using repeated low-rank modifications of a diagonal matrix. There are variants, like L-BFGS, that have an explicit, tunable limit on the number of additional vectors to be stored. These work really well for some reason: usually far better than gradient descent, and almost never more than a small constant factor slower.)
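L-BFGS is available off the shelf; a minimal scipy sketch on the classic Rosenbrock function (test problem and starting point are arbitrary choices):

import numpy as np
from scipy.optimize import minimize, rosen

# L-BFGS-B keeps only a few correction vectors instead of a full Hessian,
# yet typically converges far faster than plain gradient descent here.
x0 = np.full(10, 1.3)
res = minimize(rosen, x0, method='L-BFGS-B')
print(res.x)    # should be close to the all-ones minimiser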
> Could you provide a reference to a 10^7 size problem that is being optimized with Newton’s method? I’d be indebted.
Interior-point methods for linear programming form the Hessian of a barrier function at each iteration, then (sparse) Cholesky factorise it, then do a few backsolves to find a search direction. (Special hacks generally go into these "few backsolves" to speed convergence.) This is a damped Newton method. Commercial implementations include CPLEX, Gurobi, and MOSEK; there are a great many less-commercial implementations as well.
Chih-Jen Lin's LIBLINEAR uses a truncated Newton method (solve Hessian*direction=gradient by linear conjugate gradient, stopping early when sufficient accuracy has been obtained to take a step) to create linear classifiers for data.
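That truncated-Newton pattern is also easy to try in scipy, supplying only Hessian-times-vector products rather than a full Hessian (again on Rosenbrock, an arbitrary test problem):

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess_prod

# Newton-CG solves Hessian*direction = -gradient approximately by linear
# conjugate gradient each iteration, needing only the hessp callback.
x0 = np.full(10, 1.3)
res = minimize(rosen, x0, jac=rosen_der, hessp=rosen_hess_prod,
               method='Newton-CG')
print(res.x)    # should be close to the all-ones minimiser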
Its simplicity is its power. More complex methods (e.g. second order methods) tend to get attracted to saddle points and produce bad results. Some metaheuristics like evolution strategies are also used in some specific cases (reinforcement learning). Minibatch gradient descent + reasonable minibatch size + some form of momentum is the best we have.
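The momentum update itself is tiny; a minimal numpy sketch on a toy ill-conditioned quadratic (the constants are illustrative choices):

import numpy as np

# Heavy-ball momentum: velocity accumulates a decaying sum of past
# gradients, damping oscillation along the high-curvature direction.
def grad(w):                     # gradient of 0.5 * (w1^2 + 100 * w2^2)
    return np.array([1.0, 100.0]) * w

w = np.array([1.0, 1.0])
v = np.zeros(2)
lr, momentum = 0.009, 0.9
for _ in range(200):
    v = momentum * v - lr * grad(w)
    w = w + v
print(w)    # approaches the minimiser at the origin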
I think the primary reason that such methods are not used much in practice is memory and computational cost: each function evaluation is expensive and you need to solve a very large system at every iteration.
Also to reply to a sibling comment, you can add momentum and step length adjustments to second-order methods in much the same way as in steepest-descent to help escape saddles. The only difference is how the descent direction is chosen for the optimization.
Second order methods are attracted to saddle points in high dimensional spaces. The math and practice of optimizing these surfaces has a lot of nuances like this, so much of what you learn in a convex optimization class doesn't apply too well.
Do you have any recommendations on sources to read about this? Everything I've read discusses using the Hessian not only to determine that you are at a saddle point, but also to use its eigenvalues to escape it.
The question is: why do you need to optimize in the first place? Why not look up an answer instead of solving a mathematical optimization problem?
Assuming that AI tries to mimic the way humans learn and evolve, those methods haven't changed for hundreds of thousands of years and brute-force trial and error is just one of them. It's kind of fundamental...
You may have seen DeepMind's results last year where it trained 3D models to move through space in different ways, entitled "Emergence of Locomotion Behaviours in Rich Environments" ( https://arxiv.org/pdf/1707.02286.pdf , https://www.youtube.com/watch?v=hx_bgoTF7bs&feature=youtu.be ). If you have a look at the paper, the method they use, "Proximal Policy Optimization", is a great example of differentiable programming that does not include a neural network. I actually realized this last month when I was preparing a talk on deep learning, because I thought it used deep neural nets in its application, but found that it didn't.
Scanning through the paper, I see this: "We structure our policy into two subnetworks, one of which receives only proprioceptive information, and the other which receives only exteroceptive information. As explained in the previous paragraph, with proprioceptive information we refer to information that is independent of any task and local to the body, while exteroceptive information includes a representation of the terrain ahead. We compared this architecture to a simple fully connected neural network and found that it greatly increased learning speed."
It seems to me they do use neural nets. Proximal Policy Optimization is just a novel way of optimizing them.
I wish we could come up with a catchier name, but I LOVE the idea of calling this programming, because that is precisely what we do when we compose deep neural nets.
For example, here's how you compose a neural net consisting of two "dense" layers (linear transformations), using Keras's functional API, and then apply these two layers to some tensor x to obtain a tensor y:
from keras.layers import Input, Dense

n = 64                 # layer width (arbitrary choice)
x = Input(shape=(n,))  # a symbolic input tensor
f = Dense(n)
g = Dense(n)
y = f(g(x))            # function composition: apply g, then f
This looks, smells, and tastes like programming (in this case with a strong functional flavor), doesn't it?
Imagine how interesting things will get once we have nice facilities for composing large, complex applications made up of lots of components and subcomponents that are differentiable, both independently and end-to-end.
NNs are _just_ transfer functions. Look-up tables. Or really dense maps. So f(g(x)) makes total sense. But I don't think these are the interesting combinations. I think giving one NN the training experience of another, plus the feedback on "correct inference", will be when one NN trains its replacement.
Edit: I don't mean Marcus inspired the term differentiable programming; he inspired LeCun to emphasize the wider scope of deep learning after Marcus attacked it. In fact, LeCun liked a post on twitter rebutting Marcus' paper that also talks about differentiable programming: https://twitter.com/tdietterich/status/948811917593780225
I think LeCun doesn't want a repeat of the AI winter caused by exponentially rising hype and expectations around Deep Learning. There have been a few examples, like Selena, where he seems to think that people are trying to ride the deep learning wave to generate false buzz (and cash!) for themselves.
He does oppose Marcus' views, but he also knows neural nets are only one approach to differentiable programming. The term is confusing though. It should read like "linear programming" does, but people are not interpreting it that way.
> Differentiable Programming is horrible branding. It's hard to say, not catchy, and not as easily decipherable
Tell that to the people who deliberately popularized the term Dynamic Programming for something that was neither dynamic nor programming.
____
(From Wiki)
Bellman explains the reasoning behind the term dynamic programming in his autobiography, Eye of the Hurricane: An Autobiography (1984, page 159). He explains:
"I spent the Fall quarter (of 1950) at RAND. My first task was to find a name for multistage decision processes. An interesting question is, Where did the name, dynamic programming, come from? The 1950s were not good years for mathematical research. We had a very interesting gentleman in Washington named Wilson. He was Secretary of Defense, and he actually had a pathological fear and hatred of the word research. I’m not using the term lightly; I’m using it precisely. His face would suffuse, he would turn red, and he would get violent if people used the term research in his presence. You can imagine how he felt, then, about the term mathematical. The RAND Corporation was employed by the Air Force, and the Air Force had Wilson as its boss, essentially. Hence, I felt I had to do something to shield Wilson and the Air Force from the fact that I was really doing mathematics inside the RAND Corporation. What title, what name, could I choose? In the first place I was interested in planning, in decision making, in thinking. But planning, is not a good word for various reasons. I decided therefore to use the word “programming”. I wanted to get across the idea that this was dynamic, this was multistage, this was time-varying. I thought, let's kill two birds with one stone. Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that it's impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to. So I used it as an umbrella for my activities."
>> Let's take a word that has an absolutely precise meaning, namely dynamic, in the classical physical sense. It also has a very interesting property as an adjective, and that it's impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible.
Now I have to try:
"Dynamic, multimodal failure" (fail).
"Dynamic instigation of pain for information retrieval" (torture).
"Dynamic evisceration of underage humans" (slaughtering of children).
"Dynamic destruction of useful resources" (environment destruction).
"An algorithm for calculating dynamic stool-rotor collision physics" (shit hits the fan).
Not terribly good I guess but I think not that bad either.
I like it, at least from the little that I've read about it. The name describes the core of what it is: differentiate programs in order to figure out how changes to the program affect the output, and use that for optimization purposes. Do we really need to invent obscure new names for everything just so that it sounds catchy?
> 2. Isn't the evolution of Deep Networks more advanced setups such as GANs, RNNs, and so on?
That's not what Yann LeCun is getting at, I think. Most neural network models are sort of like a non-programmable computer. They are built around assuming the data is a fixed-size, fixed-dimensional array. But in computer programming there is an ocean of data structures. You have sorted sets, linked lists, dictionaries, and everything else. Imagine that we knew that a data set was arranged as a sort of "fuzzy" dictionary and we wanted the computer to do the rest. All we need to do is load up the right deep neural network (I mean, differentiable programming something or other) and voilà.
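A minimal sketch of what a "fuzzy", differentiable dictionary lookup could be (an illustration, essentially soft attention over key-value pairs; all names and sizes here are made up):

import torch
import torch.nn.functional as F

# Soft dictionary: a lookup returns a softmax-weighted blend of values,
# so gradients flow through the query, the keys, and the values alike.
def soft_lookup(query, keys, values):
    weights = F.softmax(keys @ query, dim=0)   # query/key similarity
    return weights @ values                    # differentiable "read"

keys = torch.randn(5, 8, requires_grad=True)    # 5 entries, 8-dim keys
values = torch.randn(5, 3, requires_grad=True)  # 3-dim stored values
query = torch.randn(8, requires_grad=True)
print(soft_lookup(query, keys, values))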
Something like where the values in a piece of data dictate the layers that get stacked together, and how those layers connect to the layers for the next value in that piece of data.
it seems like it's made by analogy with "probabilistic programming", i.e. defining complex probability distributions by writing familiar-looking imperative code (with loops and whatnot).
I think the idea is that thinking in terms of passing data through layers in a graph is cumbersome sometimes, and that expressing it as a "regular" program that just happens to come with gradients could be more comfortable.
I'd argue that GANs in particular are a natural fit for this style. The training procedure doesn't really fit exactly into the standard "minimize loss function of many layers using backprop".
Just a rebranding, though a necessary one. Deep Learning is not really all about 'deep' anymore; many successful models don't need a lot of layers.
Is a single SGD layer a neural net? Is an image filter or an audio filter a neural net? Is matrix multiplication a neural net? This would strain the intended definition even farther than it's already been strained.
But all of those are differentiable programming, and rightly so because they're all pieces that you use and compose together to make interesting learning mechanisms, including the ones that we vaguely refer to as "deep learning" now.
I like the terminology. It's not about what the original long-abandoned motivation for the design was ("neural"). It's not about how gratuitously complex you can make it ("deep"). "Differentiable" is about how it works and how we design it.
I'm not really buying your argument here. Neural networks are just a collection of artificial neurons. There is no requirement for multiple layers or depth of any kind.
Basically you're implying that binary neurons, neuroevolution (which works on non-differentiable functions), etc. aren't a thing. Or at least that they're not (working with) neural nets.
It's almost like SGD made decades of AI research into neural networks just vanish.
Nobody is claiming that the definition of "differentiable programming" should be identical to the definition of "neural net". The claim is, if you want to assign a name to the thing that TensorFlow, PyTorch, and similar frameworks do, it's "differentiable programming".
If you want to make a non-differentiable neural net, knock yourself out. The research still exists and nobody is stopping you.
But while we're talking about terminology, I'd encourage you to stop referring to the units as "neurons". The false analogy to biology just confuses people.
I'm in favor of getting rid of "neural". That would help dispense with all the unhelpful discussions about how differently DNNs work compared to the human brain.
i remember people jokingly referring to stuff like word2vec (one layer, millions of dimensions) as "wide learning". this is definitely better branding than that.
But differentiable programming would exclude deep neural nets trained by evolutionary methods/genetic algorithms since those are gradient free. With the term deep learning I think the focus is correctly on the “deep” (compositional) nature of these models and not necessarily the training algorithm, of which there are many.
Interesting comment regarding Conal Elliott. I've always thought there is some similarity between probabilistic programming (specify a probabilistic model as a graph), functional reactive programming (specify some reactivity as a graph) and deep learning (specify some linear algebra / calculus / optimization operations as a graph). Too bad the word "graphical programming" would be interpreted as "visual programming" (or programming using plots and charts!) and not programming using an explicit graph structure.
I'm seriously pondering what this means for PL research. There has been some work on probabilistic programming languages, and a significant part of the community would like to avoid imperative features. However, it seems like this is a chance for some real invigoration in PL research agendas.
Working with dynamic neural net libraries that have autodiff (PyTorch, regular Torch, and I think also MXNet now) feels a lot like this. You just write normal functions, but after you execute them you can ask for gradients too.
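For instance (a minimal PyTorch sketch with a made-up function), ordinary Python control flow is fine, and the gradient you get back belongs to the branch that actually executed:

import torch

# Write a plain function with a branch, run it, then ask for gradients.
def f(x):
    if x.sum() > 0:              # ordinary Python control flow
        return (x ** 2).sum()
    return (x ** 3).sum()

x = torch.tensor([1.0, 2.0], requires_grad=True)
(g,) = torch.autograd.grad(f(x), x)
print(g)    # tensor([2., 4.]): gradient of the branch that ran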
How long did you wait on the page? At first when it loads you can read it just fine, but then an obnoxious pop up appears. There is a little "not now" link on the bottom that will get you back to the post, but I can see how someone would miss that (it's a great example of a bad UI).
I think it's a good example of UX that encourages visitors to sign up, which is why it exists. Part of what makes it good is precisely what you mentioned: it's very easy to miss the button.
Agreed. I think many posters here assume that everyone has an account at FB, WSJ, NYT, and so on. The web search trick doesn't work for all of these situations either.
Reg/pay-walled sites are not the only kind that should be discouraged; sites that will track you globally unless you block them globally should be discouraged too, IMHO. Facebook fits both categories.
You need to up your crap-blocking game. This bookmarklet can help:
javascript:(function () {
  // Remove every fixed-position element (overlays, sticky pop-ups) from the page.
  var i, elements = document.querySelectorAll('body *');
  for (i = 0; i < elements.length; i++) {
    if (getComputedStyle(elements[i]).position === 'fixed') {
      elements[i].parentNode.removeChild(elements[i]);
    }
  }
})()
“most socially apt people should have Facebook accounts”
This is like saying that most people who know a lot about food should regularly visit McDonalds since it is the most eaten-at restaurant.
I would venture the opposite: people who are socially apt and attuned are probably least likely to be on Facebook... they've likely learned enough about it to stay far, far away.
No. E.g. I'm fine with Telegram & Signal, plus WhatsApp & Skype for the elderly. Having no intention of becoming a public person, do I really need a public profile on a web site whose whole purpose is to spy on me everywhere, analyse my behavior and contacts, sell the data to others, and show me ads?
I absolutely believe you. When I actually need to voice-call somebody (e.g. mom, granny or a client company CEO, all the other people I meet prefer texting) I use Skype - it does this much better than any of the competitors. So it sounds fairly probable it does video better than others too.