If you've ever used "Bayesian optimization" to choose hyperparameters, it was almost certainly a Gaussian process under the hood.
The article focused on the case where you have a finite number of test points, which is probably a good idea for an article like this. Still, there is another interpretation of Gaussian processes where they are actual stochastic processes (hence the name), a probability distribution over a set of functions.
I would have found an article that covered that interpretation even more helpful, although I'm not sure an easy-to-follow version could exist.
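If it helps to see the "distribution over functions" view concretely: restricted to any finite grid of inputs, a GP prior is just a multivariate normal whose covariance comes from the kernel, so each draw from that normal looks like a plausible function. Here is a minimal numpy sketch (the squared-exponential kernel and the grid are arbitrary choices of mine, not anything from the article):

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel: k(a, b) = variance * exp(-(a - b)^2 / (2 l^2))."""
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / length_scale ** 2)

# A dense grid stands in for "all" input points; the GP prior restricted to it
# is just a multivariate normal with mean 0 and covariance K.
x = np.linspace(-5, 5, 200)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)
# Each row of `samples` is one "function" drawn from the prior; plotting x
# against each row gives the familiar wiggly prior draws from chapter 2 of
# Rasmussen and Williams.
```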
While a lot of Bayesian optimization methods use GPs (MOE, Spearmint, BayesOpt, etc.), some use TPEs (tree-structured Parzen estimators) as well (most notably, hyperopt [0]), and some, like SigOpt (YC W15) [1], ensemble these and other methods (disclaimer: I'm one of the founders).
I tried to briefly go over the functional interpretation of GPs in this talk [2], although the book by Rasmussen and Williams does a much more thorough job [3] (free online, check out chapter 2 for this approach).
I'm happy to answer any questions about the differences. If you're a student/academic SigOpt is also completely free [4].
While this is mostly true, there has been some research into using non-GP models for Bayesian optimization, or the more general Sequential Model-Based Optimization (SMBO) problem. Some examples:
The point that combining kernels lets you compose arbitrarily complex functions could come at the beginning. That's sort of why GPs are so exciting in the first place.
In particular, there's a kernel we could call a "change point," which is a way to have a totally different model fitted before a point in time versus after. It's used frequently in models fitted by the Automatic Statistician (https://automaticstatistician.com/examples/), which in my opinion are the state of the art of what you can use GPs for. They also developed a LISP-like representation of the GP kernels, which lets them sample kernel expressions, fit them, and publish the simplest ones.
You can see an example of "create GP functions and try fitting them" here: https://github.com/probcomp/notebook/blob/master/tutorials/e... . Near the end of the notebook, you can see the "source code" of the fitted GP function. Note this implementation supports change points, but does not happen to need them on the sample data.
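To make the kernel-composition and change-point ideas concrete: sums and products of valid kernels are themselves valid kernels, and a change point can be written as a sigmoid-weighted mixture of two kernels, one governing the region before the change and one after. A rough numpy sketch follows (the parameterization is my own simplification, not the Automatic Statistician's actual code):

```python
import numpy as np

def rbf(x1, x2, l=1.0):
    """Smooth squared-exponential kernel."""
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / l ** 2)

def periodic(x1, x2, period=1.0, l=1.0):
    """Periodic kernel for repeating structure."""
    d = np.pi * np.abs(x1[:, None] - x2[None, :]) / period
    return np.exp(-2.0 * np.sin(d) ** 2 / l ** 2)

def changepoint(k1, k2, x1, x2, x0=0.0, steepness=1.0):
    """Sigmoid-weighted mixture: k1 governs inputs before x0, k2 after."""
    s1 = 1.0 / (1.0 + np.exp(steepness * (x1 - x0)))   # ~1 before x0, ~0 after
    s2 = 1.0 / (1.0 + np.exp(steepness * (x2 - x0)))
    w1 = s1[:, None] * s2[None, :]
    w2 = (1 - s1)[:, None] * (1 - s2)[None, :]
    return w1 * k1(x1, x2) + w2 * k2(x1, x2)

x = np.linspace(-5, 5, 100)
# Composition: smooth trend plus seasonality before x0, plain smooth trend after.
K = changepoint(lambda a, b: rbf(a, b) + periodic(a, b),
                lambda a, b: rbf(a, b, l=2.0),
                x, x, x0=0.0, steepness=3.0)
```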
But generally, I wonder in which applications GPs shine.
On the one hand, with their emphasis on time-series data, GPs have a lot to offer finance, especially options pricing. On the other hand, GP regression boils down to "the near future looks a lot like the near past," which most people already know.
Clearly, what we want to know is: when will change points occur? Whoever cracks that nut will have found GPs their breakthrough application.
It should be clear from the example that some form of fitting over generated "Gaussian process programs" is a good first step.
One major application is in geospatial statistics, for the variety of fields that need to perform regression on data sampled irregularly across space. Although it is typically called Kriging there, it is, to my understanding, mathematically equivalent to Gaussian process regression.
The killer feature of Gaussian processes for me is that they provide a full posterior distribution rather than just a maximum likelihood estimate. This lets you incorporate fancy loss functions, and it can be used to optimize data collection in real-world cases where collecting data is expensive and time-consuming.
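As a concrete illustration, here is a minimal scikit-learn sketch with made-up data and an assumed RBF-plus-noise kernel: the posterior standard deviation falls out of the same fit as the mean, and a simple way to optimize data collection is to measure next wherever the posterior is most uncertain. The same machinery covers the spatial/Kriging case, just with 2-D coordinates as inputs.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)

# Irregularly sampled, expensive-to-collect observations of some latent function.
X_obs = rng.uniform(0, 10, size=(8, 1))
y_obs = np.sin(X_obs).ravel() + 0.05 * rng.standard_normal(8)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_obs, y_obs)

# Posterior mean *and* uncertainty over candidate locations.
X_cand = np.linspace(0, 10, 500).reshape(-1, 1)
mean, std = gp.predict(X_cand, return_std=True)

# Uncertainty sampling: collect the next data point where the posterior is
# least certain.
next_x = X_cand[np.argmax(std)]
```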
Maybe one use could be Thompson sampling. You're trying to find the maximum of a function, so you sample a random function from the GP posterior, find its maximum, evaluate the real function there, get back a y value, and then update your GP.
Maybe that's what the above comment meant by Bayesian optimisation.
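That loop is Thompson sampling with a GP surrogate, and it is indeed one flavour of Bayesian optimisation. A rough sketch of what was just described, using scikit-learn (the objective, kernel, and candidate grid are placeholders of mine):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    """Stand-in for the expensive black-box function being maximised."""
    return -(x - 2.0) ** 2 + np.sin(5 * x)

rng = np.random.default_rng(0)
X_cand = np.linspace(0, 4, 200).reshape(-1, 1)

# Start with a couple of random evaluations.
X = rng.uniform(0, 4, size=(2, 1))
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(20):
    gp.fit(X, y)
    # Thompson sampling: draw one random function from the posterior over the
    # candidate grid, then evaluate the true objective where that draw peaks.
    f_sample = gp.sample_y(X_cand, n_samples=1,
                           random_state=rng.integers(1 << 31)).ravel()
    x_next = X_cand[[np.argmax(f_sample)]]
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).ravel())

best_x = X[np.argmax(y)]
```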
I'm sorry to be a little off-topic here, but can someone please tell me how this guy got those math equations looking like that? They look like the usual LaTeX font, but they're neither SVG nor PNG output of LaTeX as I first thought. How did he manage it?
From my perspective, this is an outstanding intro to Gaussian processes, and indeed, I much prefer it to the posted article. Your code and visuals make things a lot cleaner. You could make it even better by including a bit more info on different types of kernels / combining kernels - otherwise, this is fantastic! Thank you!
For X (your test data) and Y (your training data), we have our prior: X ~ N(0, <kernel>) and Y ~ N(<training values>, <identity>). For the joint distribution I can accept that the covariance will have the same kernel function (albeit on |X| + |Y| dimensions) but what would the mean be?
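In the standard zero-mean treatment (Rasmussen and Williams, chapter 2), the prior means simply stack, so the joint mean is just the zero vector; with a non-zero mean function m it would be the concatenation of m evaluated at the training and test inputs. Written out, with y the noisy training values and f_* the test-point function values:

```latex
\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix}
\sim \mathcal{N}\!\left(
\begin{bmatrix} \mathbf{0} \\ \mathbf{0} \end{bmatrix},
\begin{bmatrix}
K(X, X) + \sigma_n^2 I & K(X, X_*) \\
K(X_*, X)              & K(X_*, X_*)
\end{bmatrix}
\right)
```

The training values don't appear in the joint mean at all; they enter when you condition on y, which gives the posterior mean K(X_*, X) [K(X, X) + \sigma_n^2 I]^{-1} y.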