Unsupervised Learning with Even Less Supervision Using Bayesian Optimization (sigopt.com)
85 points by Zephyr314 on March 11, 2016 | 21 comments



One of the co-founders of SigOpt (YC W15) here. I'm happy to answer any questions about this post or the methods used. More info on the Bayesian methods behind this can be found at sigopt.com/research as well!


Well, just my two cents: the title feels inaccurate. You're tuning hyperparameters with respect to the performance of the classification task, so the Bayesian optimization is really optimizing the unsupervised -> supervised pipeline. I was expecting Bayesian optimization of strictly unsupervised representation learning (e.g. we have an autoencoder and use Bayesian optimization to tune its hyperparameters to minimize reconstruction error). This is really just supervised learning with even less supervision (which is quite typical).
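
Something like this is what I had in mind, i.e. tuning the unsupervised piece purely on its own objective (a rough sketch; scikit-optimize stands in for the Bayesian optimizer here, and build_autoencoder / X_unlabeled are made-up placeholders):

    # Sketch: Bayesian optimization of a purely unsupervised objective.
    # build_autoencoder and X_unlabeled are illustrative placeholders.
    from skopt import gp_minimize

    def reconstruction_loss(params):
        n_hidden, learning_rate = int(params[0]), params[1]
        model = build_autoencoder(n_hidden=n_hidden, lr=learning_rate)
        model.fit(X_unlabeled)                 # no labels involved anywhere
        return model.reconstruction_error(X_unlabeled)

    result = gp_minimize(
        reconstruction_loss,
        dimensions=[(16, 512), (1e-4, 1e-1, "log-uniform")],
        n_calls=50,
    )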


Thanks for the note!

We're using Bayesian optimization to tune the hyperparameters of both the unsupervised model and the supervised model, but you're correct that they're tuned in unison with overall accuracy as the target. The lift you get from adding the unsupervised step (and tuning it) is quite substantial (and statistically significant).

The idea of tuning just the unsupervised part (or doing it independently) is great though. All the code for the post is available at https://github.com/sigopt/sigopt-examples/tree/master/unsupe.... It would be interesting to see if doing that would make for a better overall accuracy.
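
For context, the joint tuning in the post has roughly this shape (illustrative stand-ins only, not the actual repo code; MiniBatchKMeans and LogisticRegression substitute for whatever unsupervised/supervised pair you use, and the data splits are assumed):

    # Unsupervised and supervised hyperparameters tuned together,
    # with held-out accuracy as the single metric the optimizer sees.
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.linear_model import LogisticRegression

    def pipeline_accuracy(n_clusters, C):
        features = MiniBatchKMeans(n_clusters=n_clusters).fit(X_unlabeled)
        X_train_f = features.transform(X_train)   # unsupervised representation
        X_valid_f = features.transform(X_valid)
        clf = LogisticRegression(C=C).fit(X_train_f, y_train)
        return clf.score(X_valid_f, y_valid)      # value reported to the optimizer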


Actually, out of curiosity, would there be some way to use the inverse coloring transform + a lil noise to generate some kind of equivalence class of free training examples, sort of a la skip-gram?


Author here. You can definitely augment your training data with slight transforms of the labelled set you have. Common strategies for images are adding noise, rotations, etc.
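
For an image array that can be as simple as (a quick sketch with numpy/scipy; the noise scale and rotation range are arbitrary):

    # Quick sketch of label-preserving augmentations for a single image array.
    import numpy as np
    from scipy.ndimage import rotate

    def augment(image, rng=np.random):
        noisy = image + rng.normal(0.0, 0.05, image.shape)                   # additive Gaussian noise
        rotated = rotate(image, angle=rng.uniform(-15, 15), reshape=False)   # small random rotation
        return noisy, rotated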


Interesting. I guess the question then becomes what constitutes a "big" transformation that preserves relevant invariants.


No constrained optimization?


SigOpt allows specifying constraint ranges for every parameter. If the parameter space isn't a tensor product, you can report constraint-violating suggestions as failures via our API [1] and SigOpt will take that into account as it optimizes.

SigOpt isn't a constrained optimization package for solving things like k-SAT, though, if that is what you were asking.

[1]: https://sigopt.com/docs/endpoints/observations/create
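
With the Python client that flow looks roughly like this (a sketch; the exact fields are in the endpoint docs above, and violates_constraints / evaluate_model / EXPERIMENT_ID are your own stand-ins):

    # Sketch of reporting a constraint-violating suggestion as a failure.
    from sigopt import Connection

    conn = Connection(client_token="YOUR_TOKEN")
    suggestion = conn.experiments(EXPERIMENT_ID).suggestions().create()

    if violates_constraints(suggestion.assignments):   # your own feasibility check
        conn.experiments(EXPERIMENT_ID).observations().create(
            suggestion=suggestion.id,
            failed=True,
        )
    else:
        conn.experiments(EXPERIMENT_ID).observations().create(
            suggestion=suggestion.id,
            value=evaluate_model(suggestion.assignments),
        )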


In addition to bounds, say I'd like to solve a problem where the variables must satisfy a linear or nonlinear system of equations. So overall there are fewer degrees of freedom than the number of variables, but the problem structure is such that it may be much more efficient to express the objective function and constraints in terms of the higher-dimensional space.


This is where the reporting of "failures" mentioned above can be helpful. Any suggestion that violates your system-of-equations constraints can be immediately reported as a "failure," and this will be taken into account as SigOpt converges to the best parameters.

You could also try to bake this into the objective function (with an L2 penalty for how badly it violates the constraints), depending on how hard the constraints of the problem actually are.
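
That alternative would look something like this (a sketch; g and true_objective are your own functions, and the penalty weight is up to you):

    # Fold the constraint violation into the reported value instead of failing
    # the observation. Subtract the penalty if maximizing, add it if minimizing.
    def penalized_objective(x, penalty_weight=10.0):
        residual = g(x)   # your system of equations, so residual == 0 when feasible
        return true_objective(x) - penalty_weight * sum(r * r for r in residual)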


Equality constraints mean you're working in a lower-dimensional subspace. I'd have to find out more about your algorithms, but I'd be surprised if you wound up evaluating many feasible points at all (especially with nonlinear equalities).

An L2 penalty would likely converge to an infeasible solution. An augmented Lagrangian would be better, but then you're making users handle dual updates. At that point I'd rather use an actual constrained optimization library that implements the algorithm carefully, and use a primal-dual interior point method. Not having this kind of thing built in counts as "no constrained optimization" IMO.


Is this the first OHAAS (Optimize Hyperparameters As A Service)?


We were the first company to launch (Whetlab was bought and shut down before they got out of private beta).

We're currently the only active company offering it as a service.

While hyperparameter optimization is one of the most common use cases of SigOpt right now, the general Bayesian Optimization As A Service we provide has also been used to tune simulations and even manufacturing and process engineering [1].

[1]: https://sigopt.com/cases/process_engineering


There is/was Whetlab, which got bought by Twitter if I remember rightly. It's a shame, as I was using them and wanted to do more with it.


We provide a very similar interface to Whetlab and I would be happy to get you set up on SigOpt. We offer a free trial to get started [1] and a free academic plan [2].

[1]: https://sigopt.com/get_started

[2]: https://sigopt.com/edu



We've found that SigOpt compares very well to spearmint, as well as MOE [1], which I wrote and open sourced around the same time spearmint was open sourced. We have a paper coming out soon comparing SigOpt rigorously to standard methods like random and grid search as well as other open source Bayesian methods like MOE [1], spearmint, HyperOpt [2], and SMAC [3] with good results.

[1]: https://github.com/Yelp/MOE

[2]: https://github.com/hyperopt/hyperopt

[3]: http://www.cs.ubc.ca/labs/beta/Projects/SMAC/


With spearmint I had the ability to modify the parameters of the MCMC sampling (e.g. burn-in iterations). Will SigOpt expose parameters for those of us who want to manipulate them? Will there be options to use different types of function estimators to model the mapping between hyperparameters and performance (i.e. what if I would like to use a neural network or a decision tree instead of Gaussian processes)?

I ask because, as someone who is active in machine learning, I often want to optimize hyperparameters. The kind of people who are serious about optimizing hyperparameters (i.e. people who may not want to use grid or random search) for a model are usually somewhat technical. Your product seems to be catered to those who may not be too technical (very simple interface, etc.). How will you balance what you expose in the future without giving away too much of your underlying algorithms?


As you pointed out, it is all about a balance, and every feature has different tradeoffs.

SigOpt was designed to unlock the power of Bayesian optimization for anyone doing machine learning. We believe you shouldn't need to be an expert, or spend countless hours on administration, to get great results for every model. We wrap an ensemble of the best Bayesian methods behind a simple interface [0] and constantly make improvements, so that people can focus on designing features and applying their domain expertise instead of building and maintaining their own hyperparameter optimization tools to see the benefit.

For experts who want to spend a lot of time and effort customizing, administering, updating, and maintaining a hyperparameter tuning solution I would recommend forking one of the open source packages out there like spearmint [1] or MOE [2] (disclaimer, I wrote MOE while working at Yelp).

[0]: https://sigopt.com/docs

[1]: https://github.com/JasperSnoek/spearmint

[2]: https://github.com/Yelp/MOE


Thanks for all the great responses!


Now just throw some compressive sensing at the problem ;)

