If you've ever used "Bayesian optimization" to choose hyperparameters, it was almost certainly a Gaussian process under the hood.
The article focused on the case where you have a finite number of test points, which is probably a good idea for an article like this. Still, there is another interpretation of Gaussian processes where they are actual stochastic processes (hence the name), a probability distribution over a set of functions.
I would have found an article that covered that interpretation even more helpful, although I'm not sure an easy-to-follow version could exist.
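If it helps to see the "distribution over functions" view concretely: restricted to any finite grid of inputs, a GP prior is just a multivariate normal whose covariance comes from the kernel, so each draw from that normal looks like a plausible function. Here is a minimal numpy sketch (the squared-exponential kernel and the grid are arbitrary choices of mine, not anything from the article):

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel: k(a, b) = variance * exp(-(a - b)^2 / (2 l^2))."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / length_scale ** 2)

# A dense grid stands in for "all" input points; the GP prior restricted to it
# is just a multivariate normal with mean 0 and covariance K.
x = np.linspace(-5, 5, 200)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)
# Each row of `samples` is one "function" drawn from the prior; plotting x
# against each row gives the familiar wiggly prior draws from chapter 2 of
# Rasmussen and Williams.
```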
While a lot of Bayesian optimization methods use GPs (MOE, Spearmint, BayesOpt, etc.), some use TPEs (tree-structured Parzen estimators) as well (most notably, hyperopt [0]), and some, like SigOpt (YC W15) [1], ensemble these and other methods (disclaimer: I'm one of the founders).
I tried to briefly go over the functional interpretation of GPs in this talk [2], although the book by Rasmussen and Williams does a much more thorough job [3] (free online, check out chapter 2 for this approach).
I'm happy to answer any questions about the differences. If you're a student/academic SigOpt is also completely free [4].
While this is mostly true, there has been some research into using non-GP models for Bayesian optimization, or the more general Sequential Model-Based Optimization (SMBO) problem. Some examples:
The point that combining kernels lets you compose arbitrarily complex functions could come at the beginning. That's sort of why GPs are so exciting in the first place.
In particular, there's a kernel we could call a "change point," which is a way to have a totally different model fitted before a point in time versus after. It's used frequently in models fitted by the Automatic Statistician (https://automaticstatistician.com/examples/), which in my opinion are the state of the art of what you can use GPs for. They also developed a LISP-like representation of the GP kernels, which lets them sample kernel expressions, fit them, and publish the simplest ones.
You can see an example of "create GP functions and try fitting them" here: https://github.com/probcomp/notebook/blob/master/tutorials/e... . Near the end of the notebook, you can see the "source code" of the fitted GP function. Note this implementation supports change points, but does not happen to need them on the sample data.
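To make the kernel-composition and change-point ideas concrete: sums and products of valid kernels are themselves valid kernels, and a change point can be written as a sigmoid-weighted mixture of two kernels, one governing the region before the change and one after. A rough numpy sketch follows (the parameterization is my own simplification, not the Automatic Statistician's actual code):

```python
import numpy as np

def rbf(x1, x2, l=1.0):
    """Smooth squared-exponential kernel."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l ** 2)

def periodic(x1, x2, period=1.0, l=1.0):
    """Periodic kernel for repeating structure."""
    d = np.pi * np.abs(x1[:, None] - x2[None, :]) / period
    return np.exp(-2.0 * np.sin(d) ** 2 / l ** 2)

def changepoint(k1, k2, x1, x2, x0=0.0, steepness=1.0):
    """Sigmoid-weighted mixture: k1 governs inputs before x0, k2 after."""
    s1 = 1.0 / (1.0 + np.exp(steepness * (x1 - x0)))   # ~1 before x0, ~0 after
    s2 = 1.0 / (1.0 + np.exp(steepness * (x2 - x0)))
    w1 = s1[:, None] * s2[None, :]
    w2 = (1 - s1)[:, None] * (1 - s2)[None, :]
    return w1 * k1(x1, x2) + w2 * k2(x1, x2)

x = np.linspace(-5, 5, 100)
# Composition: smooth trend plus seasonality before x0, plain smooth trend after.
K = changepoint(lambda a, b: rbf(a, b) + periodic(a, b),
                lambda a, b: rbf(a, b, l=2.0),
                x, x, x0=0.0, steepness=3.0)
```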
But generally, I wonder in which applications GPs shine.
On the one hand, with their emphasis on time-series data, GPs have a lot to offer finance, especially options pricing. On the other hand, GP regression boils down to "the near future looks a lot like the near past," which most people already know.
Clearly, what we want to know is: when will change points occur? Whoever cracks that nut will have found GPs their breakthrough application.
It should be clear from the example that some form of fitting over generated "Gaussian process programs" is a good first step.
One major application is in geospatial statistics, for the variety of fields that need to perform regression on data sampled irregularly across space. Although it is typically called Kriging there, it is, to my understanding, mathematically equivalent to Gaussian process regression.
The killer feature of Gaussian processes for me is that they provide a full posterior distribution rather than just a maximum likelihood estimate. This lets you incorporate fancy loss functions, and it can be used to optimize data collection in real-world cases where collecting data is expensive and time-consuming.
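As a concrete illustration, here is a minimal scikit-learn sketch with made-up data and an assumed RBF-plus-noise kernel: the posterior standard deviation falls out of the same fit as the mean, and a simple way to optimize data collection is to measure next wherever the posterior is most uncertain. The same machinery covers the spatial/Kriging case, just with 2-D coordinates as inputs.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)

# Irregularly sampled, expensive-to-collect observations of some latent function.
X_obs = rng.uniform(0, 10, size=(8, 1))
y_obs = np.sin(X_obs).ravel() + 0.05 * rng.standard_normal(8)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_obs, y_obs)

# Posterior mean *and* uncertainty over candidate locations.
X_cand = np.linspace(0, 10, 500).reshape(-1, 1)
mean, std = gp.predict(X_cand, return_std=True)

# Uncertainty sampling: collect the next data point where the posterior is
# least certain.
next_x = X_cand[np.argmax(std)]
```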
Maybe one use could be Thompson sampling. You're trying to find the maximum of a function, so you sample a random function from the GP posterior, find its maximum, evaluate the real function there, get back a y value, and then update your GP.
Maybe that's what the above comment meant by Bayesian optimisation.
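That loop is Thompson sampling with a GP surrogate, and it is indeed one flavour of Bayesian optimisation. A rough sketch of what was just described, using scikit-learn (the objective, kernel, and candidate grid are placeholders of mine):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Stand-in for the expensive black-box function being maximised."""
    return -(x - 2.0) ** 2 + np.sin(5 * x)

rng = np.random.default_rng(0)
X_cand = np.linspace(0, 4, 200).reshape(-1, 1)

# Start with a couple of random evaluations.
X = rng.uniform(0, 4, size=(2, 1))
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(20):
    gp.fit(X, y)
    # Thompson sampling: draw one random function from the posterior over the
    # candidate grid, then evaluate the true objective where that draw peaks.
    f_sample = gp.sample_y(X_cand, n_samples=1,
                           random_state=rng.integers(1 << 31)).ravel()
    x_next = X_cand[[np.argmax(f_sample)]]
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

best_x = X[np.argmax(y)]
```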
I'm sorry to be a little off-topic here, but can someone please tell me how this guy got those math equations looking like that? They look like the usual LaTeX font, but they're neither SVG nor PNG output of LaTeX as I first thought. How did he manage it?
From my perspective, this is an outstanding intro to Gaussian processes, and indeed, I much prefer it to the posted article. Your code and visuals make things a lot cleaner. You could make it even better by including a bit more info on different types of kernels / combining kernels - otherwise, this is fantastic! Thank you!
For X (your test data) and Y (your training data), we have our prior: X ~ N(0, <kernel>) and Y ~ N(<training values>, <identity>). For the joint distribution I can accept that the covariance will have the same kernel function (albeit on |X| + |Y| dimensions) but what would the mean be?
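In the standard zero-mean treatment (Rasmussen and Williams, chapter 2), the prior means simply stack, so the joint mean is just the zero vector; with a non-zero mean function m it would be the concatenation of m evaluated at the training and test inputs. Written out, with y the noisy training values and f_* the test-point function values:

```latex
\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix},
\begin{bmatrix}
K(X, X) + \sigma_n^2 I & K(X, X_*) \\
K(X_*, X)              & K(X_*, X_*)
\end{bmatrix}
\right)
```

The training values don't appear in the joint mean at all; they enter when you condition on y, which gives the posterior mean K(X_*, X) [K(X, X) + \sigma_n^2 I]^{-1} y.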