Gradient-Based Hyperparameter Optimization Through Reversible Learning

taliesinb · on Feb 2, 2016

> The last remaining parameter to SGD is the initial parameter vector. Treating this vector as a hyperparameter blurs the distinction between learning and meta-learning. In the extreme case where all elementary learning rates are set to zero, the training set ceases to matter and the meta-learning procedure exactly reduces to elementary learning on the validation set. Due to philosophical vertigo, we chose not to optimize the initial parameter vector.

Comedy gold.

duvenaud · on Feb 2, 2016

Authors here, feel free to ask us anything!

zackmorris · on Feb 2, 2016

First off, thanks for your efforts!

From what I've followed of machine learning, I feel like there are areas that are not very rigorous, for example where humans have to tune weights and parameters. Does your work address those places?

Specifically do you think you can train on enough parameters, and compare enough results, to begin to estimate how that tuning can be done automatically?

duvenaud · on Feb 2, 2016

Yes, gradient-based hyperparameter optimization is one option for taking the human out of the loop to some extent. Of course, we can't avoid having to specify hyper-hyper-parameters!

More generally, automatic hyperparameter tuning (without gradients) is already a standard part of many researchers' pipelines.

It's also part of good scientific practice, since it makes it harder to bias the results towards a particular method.

taliesinb · on Feb 2, 2016

Very stimulating paper!

You kind of allude to things like this with the "discrete parameters" comment, but here's how I imagine one could use this to meta-optimize layer sizes, which are after all continuousish: make all layers too large, then compute the hyper-derivative dn_i of a per-layer cutoff that starts regularizing units more and more strongly beyond some unit index n_i for each layer i (as in the regularization strength is sigmoidal, or better exponential, centered around n_i). Does that make sense?

Also, I immediately wanted to see vector gradient field plots in hyperparameter space of a bunch of toy models. It'd be fun to actually see "hyperconvexity".

duvenaud · on Feb 2, 2016

Thanks! Yes, great suggestion. One could also use a similar trick to learn the number of layers, by regularizing unused layers to be close to the identity function.

One thing that we wanted to try but didn't have time was to learn a generalization of convolution. Since convnets are a special case of fully-connected nets with some weights tied and some set to zero, we were thinking of ways to parameterize this weight tying that included both fully-connected and convnets as a special case.

taliesinb · on Feb 2, 2016

> One thing that we wanted to try but didn't have time was to learn a generalization of convolution. Since convnets are a special case of fully-connected nets with some weights tied and some set to zero, we were thinking of ways to parameterize this weight tying that included both fully-connected and convnets as a special case.

I see, that's cool. So you could define a 'symmetry operator' the defines a regularization term as e.g. k * sum_{ij}[(w_{ij} - w_{i+i',j+j'})^2] (modulo my hand-wavy tensor indexing). Then the dk tells you how much a layer wants to be a convolution and di', dj' tells you what convolution kernel size it wants to be.

(edit: made less badly articulated)

duvenaud · on Feb 2, 2016

Exactly. The problem is that it's hard to find a good balance between having too many hyperparameters, versus basically telling it the answer beforehand. In particular, it was hard to figure out how to let it learn the stride of the convolution, but perhaps we could have done without it.

The other problem is that to get a reasonably-sized convnet by pruning and tying a fully-connected net, you need to start with a huge fully-connected net, which slows things down.

taliesinb · on Feb 3, 2016

> In particular, it was hard to figure out how to let it learn the stride of the convolution, but perhaps we could have done without it.

Yeah, it seems impossible because changing the stride changes the size of the output tensor completely.

mozumder · on Feb 3, 2016

Did you try unplugging it and plugging it back in?

thearn4 · on Feb 3, 2016

Just gave a brief run through the paper, and I'm curious: is reverse-mode differentiation in this context similar in concept to adjoint-type methods used in computational fluid dynamics (and increasingly, in design optimization in other fields)?

Backwards-propagation of a gradient calculation to allow for extremely high-dimensional parameter spaces is a trick that has been around for awhile in some circles, but also seems to have been missed in a lot of other disciplines. I'm seeing a lot of publications making use of it in the last few years, and it's pretty exciting to see them used in more places. My hope is that more scientific and engineering analysis codes will expose derivative interfaces for use in numerical optimization.

Here's some short papers with some background for those interested in the topic from a physical engineering perspective:

http://www.piercelab.caltech.edu/assets/papers/ftc00.pdf

http://www.nt.ntnu.no/users/skoge/prost/proceedings/npcw09/A...

duvenaud · on Feb 3, 2016

As far as I understand, the adjoint method computes gradients for functions that obey hard constraints (such as fluid solvers), and the main advantage is that it avoids differentiating through iterative constraint satisfaction procedures.

I had some success with naively differentiating through fluid solvers, though. Here is a fluid field whose initial velocities have been initialized so that it will end up matching a given image (i.e. blowing a fancy smoke ring:)

https://github.com/HIPS/autograd/blob/master/examples/fluids...

and here's a free-form wing shape in the middle of being optimized to maximize lift-to-drag ratio:

https://github.com/HIPS/autograd/blob/master/examples/fluids...

I too find it baffling when I see engineers doing gradient-free optimization of simulated objective functions, although it's not always easy to compute gradients, especially when using massive Fortran codebases or very large-scale simulations.

johntb86 · on Feb 2, 2016

Do we now need hypervalidation sets?

duvenaud · on Feb 3, 2016

Right, once we start heavily hyper-parameterizing our models, there's the potential to overfit the hyperparameters. I think most people agree that has already happened on common datasets such as MNIST.

However, the current method of avoiding hyperparameters is just to have very few of them. This is kind of like avoiding overfitting of parameters by only having 10 parameters in your model - barbaric!

That being said, for hyper-gradients to really be useful, someone needs to develop hyperparameter optimization schemes such as BayesOpt that can condition on gradient information, to allow us to try different hyperparameter settings in parallel. I know at least one group is working on this, but as far as I know it's not ready for prime time yet.

bmh100 · on Feb 3, 2016

Could overfitting also be addressed by sharing and optimizing hyperparameters over multiple datasets? For example, could initialization be shared across multiple handwriting sets?

davidwihl · on Feb 3, 2016

Will this research continue at HIPS or is it all going to Twitter?