Gaussian Processes, not quite for dummies (yugeten.github.io)
257 points by danso on Sept 17, 2019 | 47 comments



Implementing Gaussian Processes in Python is a fun way to practice your programming and mathematical abilities.

I was working on a course project that used GPs for regression, with the inputs taken from the output of a CNN. We were trying to predict crop yield from satellite imagery, and GPs helped to incorporate spatio-temporal locality: think of predictions for a neighboring county, or for an image from a previous year.

One issue we ran into while using the model from the original authors [1] was a memory constraint, since we were building on top of Google Colab instead of a private AWS instance and using a larger dataset for prediction.

The kernel matrix grows as O(n^2) in the number of data points, and we had up to 22,000 data points to use. To get around the memory constraint, we had to re-implement the GP from scratch, carefully avoiding the memory spikes that came with computing the kernel and the mean predictions.
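
For scale, a back-of-the-envelope estimate (float64 entries):

  22,000^2 entries x 8 bytes/entry ≈ 3.9 GB

for the dense kernel matrix alone, before counting any temporaries.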

It was a two-fold process. For understanding the exact equations, I would highly recommend the resource mentioned in the article [2], Gaussian Processes for Machine Learning, Chapter 2. For a lecture video, I would recommend [3], which is from an ML course at Cornell.

As for implementing GPs efficiently, this is still a challenge: numpy has a bunch of matrix-vector functions that are helpful, but there isn't a way to exploit the symmetry of the kernel matrix and use half the required memory.
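
For illustration (not our actual code), a minimal sketch of the general pattern: keep one Cholesky factor of the training kernel and accumulate the predictive mean over chunks of test points, so the full test-train cross-kernel is never held in memory at once. The RBF kernel, chunk size and noise value here are all just placeholders.

  import numpy as np
  from scipy.linalg import cho_factor, cho_solve
  from scipy.spatial.distance import cdist

  def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
      # Squared-exponential kernel between rows of A and rows of B.
      return variance * np.exp(-0.5 * cdist(A, B, 'sqeuclidean') / lengthscale**2)

  def gp_mean_in_chunks(X_train, y_train, X_test, noise=1e-3, chunk=1000):
      K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
      factor = cho_factor(K, lower=True)      # held once, O(n^2) memory
      alpha = cho_solve(factor, y_train)      # solves K alpha = y without forming K^-1
      mean = np.empty(len(X_test))
      for i in range(0, len(X_test), chunk):  # cross-kernel built chunk by chunk
          K_star = rbf_kernel(X_test[i:i + chunk], X_train)
          mean[i:i + chunk] = K_star @ alpha
      return mean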

[1] https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14435/...

[2] http://www.gaussianprocess.org/gpml/chapters/RW.pdf

[3] https://www.youtube.com/watch?v=R-NUdqxKjos


Another easy solution, if your kernel is appropriate (one that goes to 0 at infinity, such as a Gaussian, Matérn or exponential kernel), is to set a threshold below which the covariance is set to zero and to use a sparse representation for the covariance matrix (you can even be more extreme and fix the number of non-zero entries, giving you a fixed memory footprint).
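
A minimal sketch of that idea, assuming an RBF kernel and a hand-picked threshold (both illustrative):

  import numpy as np
  from scipy.spatial.distance import cdist
  from scipy.sparse import csr_matrix

  def sparse_rbf_cov(X, lengthscale=1.0, variance=1.0, threshold=1e-6):
      # Dense kernel here for clarity; in practice you would build it in blocks
      # or use a neighbour search so the dense matrix is never formed.
      K = variance * np.exp(-0.5 * cdist(X, X, 'sqeuclidean') / lengthscale**2)
      K[K < threshold] = 0.0          # zero out negligible covariances
      return csr_matrix(K)            # sparse storage for the thresholded matrix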


This does not solve the issue. You still need to invert this sparse matrix for inference. The inverse of a sparse covariance matrix is dense.


The raw mathematical formula uses an inversion, but what you actually need is to solve a linear system (which is why most implementations store a Cholesky decomposition of the covariance matrix), and that can be done without turning the matrix into a dense one.
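
For example, a sketch with scipy's sparse solver and a made-up sparse SPD matrix standing in for the thresholded covariance:

  import numpy as np
  from scipy.sparse import random as sparse_random, identity
  from scipy.sparse.linalg import spsolve

  n = 200
  A = sparse_random(n, n, density=0.02, random_state=0)
  K = (A @ A.T + identity(n)).tocsc()   # sparse, symmetric positive-definite stand-in
  y = np.random.default_rng(0).standard_normal(n)

  alpha = spsolve(K, y)                 # solves K alpha = y; no dense inverse is ever formed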


+1 This is how I solved for the optimal parameter: instead of inverting the matrix, which is memory-heavy and computationally expensive, I used the solve function from scipy.linalg.

I looked into numpy and scipy methods for symmetry and sparsity, but my kernel matrix is still too dense.


What would you use to solve the system? Iterative methods or gradient descent even?


Something like MINRES or GMRES or BiCGSTAB. Correct me if I'm wrong, but the covariance matrix in GPR should be diagonally dominant, so the diagonal of the matrix can be used as a somewhat effective preconditioner.
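
A sketch of that diagonal (Jacobi) preconditioner idea in scipy, on a made-up SPD matrix standing in for the covariance:

  import numpy as np
  from scipy.sparse import diags
  from scipy.sparse.linalg import minres

  rng = np.random.default_rng(0)
  n = 500
  B = rng.standard_normal((n, n))
  K = B @ B.T + n * np.eye(n)        # made-up symmetric positive-definite "covariance"
  y = rng.standard_normal(n)

  M = diags(1.0 / np.diag(K))        # Jacobi preconditioner: inverse of the diagonal
  alpha, info = minres(K, y, M=M)    # info == 0 means it converged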


With most kernels the matrix is symmetric and positive-definite so I would probably use a conjugate gradient: https://en.wikipedia.org/wiki/Conjugate_gradient_method
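
And with CG you only ever need matrix-vector products, so the full kernel never has to exist in memory at once. A rough sketch (kernel, data and block size are all made up):

  import numpy as np
  from scipy.sparse.linalg import cg, LinearOperator
  from scipy.spatial.distance import cdist

  rng = np.random.default_rng(0)
  X = rng.standard_normal((2000, 5))      # made-up inputs
  y = rng.standard_normal(2000)
  noise = 1e-2

  def matvec(v):
      # K v computed in row blocks, so only a (block x n) slice of K exists at a time.
      out = np.empty_like(v)
      for i in range(0, len(X), 500):
          K_block = np.exp(-0.5 * cdist(X[i:i + 500], X, 'sqeuclidean'))
          out[i:i + 500] = K_block @ v + noise * v[i:i + 500]
      return out

  K_op = LinearOperator((len(X), len(X)), matvec=matvec)
  alpha, info = cg(K_op, y)               # conjugate gradients: only matvecs needed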


yes


Regarding the challenge of an efficient implementation, pytorch+pyro attempt to tackle it. It's still a work in progress, but you might want to have a look at: https://pyro.ai/examples/gp.html


The real issue is that inversion / solving is O(n^3).

This really hammers you for high dimensional settings (e.g. brain imaging n>100k).

Solutions:

* Pick a very special kernel with a known / approximable solution.

* Reduce the dimension of the problem (sparse GPs; rough sketch below).
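
To make the second bullet concrete, a rough sketch of the inducing-point (Nyström / subset-of-regressors) flavour of sparse GP, with made-up data and m << n inducing points chosen at random; real sparse-GP methods (FITC, variational inducing points) optimise the inducing locations rather than sampling them:

  import numpy as np
  from scipy.linalg import cho_factor, cho_solve
  from scipy.spatial.distance import cdist

  def rbf(A, B, ell=1.0):
      return np.exp(-0.5 * cdist(A, B, 'sqeuclidean') / ell**2)

  rng = np.random.default_rng(0)
  n, m, noise = 5000, 100, 1e-2
  X = rng.standard_normal((n, 2))
  y = rng.standard_normal(n)
  Z = X[rng.choice(n, m, replace=False)]   # m inducing points (picked at random here)

  Knm = rbf(X, Z)                          # (n, m): O(nm) memory, never n x n
  Kmm = rbf(Z, Z) + 1e-8 * np.eye(m)
  # Subset-of-regressors weights: (noise * Kmm + Knm^T Knm)^{-1} Knm^T y
  A = noise * Kmm + Knm.T @ Knm            # an (m, m) system instead of (n, n)
  w = cho_solve(cho_factor(A, lower=True), Knm.T @ y)

  X_test = rng.standard_normal((10, 2))
  mean = rbf(X_test, Z) @ w                # approximate predictive mean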


The n parameter there was the number of datapoints, not the size of the inputs (that would've been way worse). For brain imaging, is the 100k parameter also referencing datapoints, or dimensionality of the input?

Chapter 8 of the Rasmussen reference does talk about approximation methods for large datasets, but I couldn't find any reference implementations to go off of.


"number of datapoints, not the size of the inputs"

For brain imaging, 100k was 'the dimensionality of the input' (but also the number of datapoints). This could easily be O(10^6) though with higher res imaging.

I'd check out, for discussion:

https://www.prowler.io/blog/sparse-gps-approximate-the-poste...

For code:

https://github.com/GPflow/GPflow

That recommendation is based on the people involved knowing what they are talking about, rather than on experience using that particular work.


Seeing the complications implied by the O(n^2) problem, I'm curious why it was worth it. What did using GPs to solve the problem buy you that you couldn't get with other methods?

I think GPs are interesting, but the kernel matrix is a real problem with them in practice. I have yet to understand why they are worth the cost compared to just using least squares or some other regression method. Confidence intervals are an obvious benefit, but since you can get them via Monte Carlo, at what point does the GP method pay off?


Compared to training the CNN, running a GP on the outputs seemed like a worthwhile tradeoff of time if it resulted in a discernible difference in validation error; and the original authors of the paper we were following already had an open implementation.

I am open to suggestions if you know of any "post-processing" methods that would help to incorporate locality into the overall prediction.


What I mean is, why a GP instead of, say, polynomial regression, or training a NN on the CNN outputs?

I'm curious if the decision was due to some feature of GPs that you really needed, and could not easily get otherwise. I'm not doubting your choice, really, I'm just trying to get a feeling for what niche GPs fill (given their difficult computational requirements), as I have never found the opportunity to use them.

> that would help to incorporate locality into the overall prediction.

Not sure what you mean by this so I am sure I am missing something.


We used the Gaussian process on the outputs from the CNN, not as a separate model.

What I meant by locality is the idea that two points which are close to each other should also have predictions close to each other, assuming that there is some underlying Gaussian variable we can't see immediately from the data or from the CNN.

One example could be housing prices: if your neighbor sold their house, then, controlling for other factors like square footage or the number of bedrooms, you should value your own house similarly. We assume that there is this underlying quality of the neighborhood, possibly given by the distance between houses.
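
As a toy illustration of that locality idea (illustrative numbers, squared-exponential kernel):

  import numpy as np

  def k(a, b, lengthscale=1.0):
      # Squared-exponential covariance: nearby points get covariance near 1, distant ones near 0.
      return np.exp(-0.5 * (a - b) ** 2 / lengthscale**2)

  print(k(0.0, 0.1))   # ~0.995: neighbouring houses, predictions pulled close together
  print(k(0.0, 5.0))   # ~3.7e-6: distant houses barely influence each other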

The niche that GPs fill is broad. Two applications I've seen were their use in trying to find the missing MH370 flight, and another is in hyperparameter tuning of ML models.


I would highly recommend David Duvenaud's PhD thesis [1] for anybody interested in Gaussian Processes. I find his chapter [2] on expressing structure with kernels super interesting and quite intuitive. It's also one of the PhD theses with the best-looking graphics I've read!

[1] http://www.cs.toronto.edu/~duvenaud/thesis.pdf

[2] https://raw.githubusercontent.com/duvenaud/phd-thesis/master...


The OP needs some correction:

The title is "Gaussian Processes". Okay, so, we're considering "processes". A Gaussian process is a special case of a stochastic process.

Here is the usual, first definition of a stochastic process: For the set of real numbers R and for each t in R (t usually regarded as time) we have a random variable X_t (TeX notation for subscripts) where the values of X_t are in R. That is, X_t is a real valued random variable. So, the stochastic process is, say,

{X_t | t in R}

This stochastic process is a Gaussian process if, for each t in R, X_t has a Gaussian distribution.

Commonly we continue and assume that the expectation and the variance of X_t are the same for each t in R.

In addition for any s in R we might assume that the covariance

Cov(X_t, X_{s + t})

(more TeX notation) is a constant for all t.
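
Compactly, in the same TeX-ish notation, these assumptions amount to weak (wide-sense) stationarity:

  E[X_t] = \mu,   Var(X_t) = \sigma^2,   Cov(X_t, X_{t+s}) = C(s)   for all t, s in R.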

With these assumptions we can do a lot in power spectral estimation and extrapolation.

We can apply the central limit theorem to argue that an independent increments process is Gaussian.

For more, see, e.g.,

I. A. Ibragimov and Y. A. Rozanov, Gaussian Random Processes, ISBN 0-387-90302-x, Springer-Verlag, New York, 1978.

For one more, if real valued random variables X and Y are independent, then they are uncorrelated. Net, independence is a much stronger assumption than being uncorrelated.

But, amazingly, if X and Y are joint Gaussian and uncorrelated, then they are independent.

For one more, for the example of curve fitting and interpolation, I'd suggest not polynomials but splines or least squares splines.

For one more, for Gaussian X_t for finitely many t, the sample mean and sample variance are sufficient statistics. That means that in any work in statistics, we can keep the sample mean and variance, throw away the X_t, and in principle do as well. The foundation paper on sufficient statistics was by Halmos and Savage back in the 1940s. The concept of sufficient statistics is an application of the Radon-Nikodym theorem. E. Dynkin showed that the Gaussian assumption is critical -- with small changes the result is false.


Original author here: thanks a lot for that, this is very helpful! I was a bit confused about that myself, will make corrections.


> This stochastic process is a Gaussian process if for each t in R X_t has Gaussian distribution.

This is not quite strong enough. You need (X_t1, X_t2, ..., X_tn) to jointly have a multivariate Gaussian distribution for every finite (t1, ..., tn).
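
In symbols: for every finite set of times t_1, ..., t_n,

  (X_{t_1}, \dots, X_{t_n}) \sim N(m, \Sigma),   m_i = E[X_{t_i}],   \Sigma_{ij} = Cov(X_{t_i}, X_{t_j}),

or equivalently, every finite linear combination \sum_i a_i X_{t_i} is a (univariate) Gaussian random variable.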


Usually! I gave a weak definition!

Also, for correlation, need to assume the expectations exist.


I've been teaching myself stochastics; I personally find it really interesting. I really like Peter Medvegyev's book Stochastic Integration Theory, do you have any thoughts on it? You really did a good job of summarizing the bigger picture in a few words; it really helped, so thank you. Any additional recommended resources that tie things together like this? Edit: also liked the write-up, yugeten, it was helpful.


To get very far in stochastic processes, you usually need a good background in measure theory. Here is an overview of roughly two semesters of a fast (and often rare) grad course, plus much more:

The integral of freshman calculus is the Riemann integral. Too soon/easily it was seen to be clumsy, even inadequate, in theoretical developments. Near 1900 in France, E. Borel's student H. Lebesgue improved the situation; he did his Lebesgue integral. Don't worry: for every function integrated in freshman calculus, Lebesgue's integral gives the same answer. Otherwise Lebesgue's integral is much more general and, it turns out, is the solid foundation of probability and stochastic processes.

Lebesgue's work starts with measure which is just a notion of area. So, yes, on the real line, the Lebesgue measure of the closed interval [0,1] is 1, just the same as we learned in grade school. But with Lebesgue's work, it's tough to think of a subset of the real line that is not measurable (has a measure or area in Lebesgue's work) -- the usual example of a subset of R that is not measurable needs the axiom of choice. With measure defined on R, Lebesgue can define the Lebesgue integral of functions

f: R --> R

For this we need a meager assumption (which essentially always holds), but let's charge on:

For the generalizations, crucial for probability and stochastic processes, start with a non-empty set M. R is an example of such a set M. Define a measure on M, say, m. Then for some subsets A of M, we have the measure of set A as m(A). That is, we have the area (volume, whatever) of A. But to do much, we need more assumptions: We want the set of all subsets A of M to which we assign measure m(A) to be a sigma algebra, say, F. As at

https://en.wikipedia.org/wiki/Sigma-algebra

we want M to be an element of F, for each A in F we want the complement, say, the set M - A (obvious but sloppy notation) to be an element of F, and for A_i in F for i = 1, 2, ..., we want the union of the A_i to be an element of F. So, we want F closed under complements and countable unions. No, we don't permit uncountable unions, a point that keeps popping up in stochastic processes. Permitting uncountable unions results in, in one word, a mess.

The sigma part of sigma algebra is supposed to refer to the countable unions as in the capital letter sigma used in calculus, etc., for countable sums.

So, we have the set M, the sigma algebra of subsets F, and the measure m. Then the pair (M, F) is a measurable space and the triple (M, F, m) is a measure space. In probability theory and stochastic processes, the measure space used usually has the space denoted by capital Omega, the sigma algebra by script F, and the measure by P.

Right -- I just let the cat out of the bag: Probability P, as we saw in elementary treatments of probability, is a measure on the measure space with Omega, script F, and P.

So for A a subset of the space Omega and an element of sigma algebra script F, P(A) is the probability of set A. Set A is called an event. Now we are back to what you saw in elementary treatments of probability.

So, it went too fast -- we mentioned measure theory with (M, F, m) and then took a special case Omega, script F, P, a probability space, and got back to P(A) we saw in elementary treatments.

This jump from a measure space to a probability space was made by A. Kolmogorov in 1933 and has been the foundation of 99 44/100% (as in Ivory soap!) of advanced work in probability and stochastic processes since then.

So, back to the case of measure space (M, F, m). Suppose we have a function

f: M --> R

We want to integrate f as in freshman calculus. Well, for each subset A of M and an element of sigma algebra F, we have the measure (area) of A, that is, m(A). Well with one more assumption, and what Lebesgue did, that's enough! And it's assuming less than Riemann did and, thus, getting more generality in some quite good ways.

Then for the case of Lebesgue integrating some

f: R --> R

compared with the Riemann integral, Lebesgue partitions on the Y axis and Riemann partitions on the X axis. So, Riemann partitions on the domain of the function and Lebesgue partitions on the range. In both cases we are working with lots of little rectangles that get smaller as we converge. So, Lebesgue lets the domain be much more general, e.g., enough for probability theory.

So, back to a probability space and a function

f: Omega --> R

With the meager assumption (that f is measurable -- I'm skipping the details here only because nearly all functions are measurable; it's tough to construct one that isn't, so general is Lebesgue's work), f is a random variable. It is more common to denote random variables by X, Y, Z, etc. Now, finally, maybe after getting lost in a swamp for some years, you have a solid definition of a random variable. Then each point in Omega is one probability or statistical trial.

Then, sit down for this amazing fact: The expectation of real valued random variable X, denoted by E[X], is just the Lebesgue integral of the function

X: Omega --> R

So, in this way, Kolmogorov borrowed from Lebesgue and got a solid foundation for probability and stochastic processes.

Now you know why Lebesgue's measure theory is key to advanced work in probability and stochastic processes -- E[X] is just Lebesgue's integral generalization of the calculus Riemann integral.
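
In symbols, with the notation above:

  E[X] = \int_\Omega X \, dP = \int_\Omega X(\omega) \, P(d\omega).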

Long-standard, very careful, polished treatments of measure theory are in Royden, Real Analysis, and the first half of Rudin, Real and Complex Analysis. Continuing with probability are Breiman, Probability; Neveu, Mathematical Foundations of the Calculus of Probability; Loeve, Probability; etc. Uh, Loeve was long at Berkeley, and both Breiman and Neveu were Loeve students.

Can use this work to make good sense out of Brownian motion, and there are plenty of books there.

And can do more on stochastic processes, e.g.,

Ioannis Karatzas and Steven E. Shreve, Brownian Motion and Stochastic Calculus.

R. S. Lipster and A. N. Shiryayev, Statistics of Random Processes.

Ronald K. Getoor, Markov Processes and Potential Theory.

This material is now decades old. With this material it is possible to give high end treatments of some topics in Wall Street type investing, e.g., exotic options and in particular the Black-Scholes formula.

This material was called by Breiman "graduate probability" as he left the field and did, e.g., CART -- classification and regression trees, maybe the most important seed for the current work in machine learning. Breiman was interested in making sense out of some complicated medical data.

The potential of such graduate probability for careers in academics and/or applications seems now to be open to question. But, with a larger view, probability and stochastic processes just will NOT go away and will remain some of the most powerful tools for making sense out of reality -- we can be sure that over the horizon we will be glad to have such a solid foundation and WILL encounter important, new applications, ours or those of others.

There is an old point: For a long time it was common for the pure mathematicians, yes, to pay close attention to measure theory but essentially to ignore its role in probability theory, saying that probability theory was just a trivial special case of measure theory for positive measures (we can also have measures that take negative or complex values) where the measure of the whole space is 1.

But: Probability theory does a LOT with the assumption of independence tough to find in texts in just measure theory. Also the Radon-Nikodym theorem (an amazing generalization of the fundamental theorem of calculus) IS treated in measure theory but the applications crucial in probability and stochastic processes to conditioning and sufficient statistics usually are not.

So, with this overview, you might find Royden, Rudin, Breiman, Neveu (especially elegant) not so hard reading and a lot of fun.

You will learn about the cases of convergence of sequences of random variables and the classic results of the central limit theorem, the laws, weak and strong, of large numbers, ergodic theory (pour cream into coffee and stir until the cream returns to the distribution it had in the coffee just after pouring it in -- Poincare recurrence), and martingales. Martingales are amazing and provide the easiest proof of the strong law of large numbers and some of the strongest inequalities in math.

So, amazing stuff: Go into a lab, measure something, walk out, and have the value of a random variable -- okay, that's good enough for an elementary course. But typically in math, we can't calculate just what we want in just one step so have to iterate and approximate and hopefully converge in some appropriate sense to just what we want. Well, in those classic results, we have notions of convergence. The most amazing case is of strong convergence where we have a sequence of random variables that converge to some random variable. From the elementary treatments, having a sequence of randomness converge to more randomness seems absurd. But with Kolmogorov's treatment, a random variable is just a function, and from math such as Rudin, Principles of Mathematical Analysis we know quite a lot about convergence, and in measure theory we learn more. Then the case of a sequence of random variables converging to another random variable is just a special case of one case of convergence of sequences of functions. So, we can approximate a random variable X by having a sequence of random variables converge to random variable X. How 'bout that! Thank you Borel, Lebesgue, Kolmogorov, von Neumann, etc.

Have fun! Get rich???


Great work Yuge!

This post from Distill may also be worth checking out for those interested in Gaussian Processes:

https://distill.pub/2019/visual-exploration-gaussian-process...


> if all k components are Gaussian random variables, then X must be multivariate Gaussian

They need to be independent as well or there should be some other criteria. Otherwise I have counterexamples.


> They need to be independent as well

No, the sum of correlated Gaussians is also Gaussian. [Edit: or more precisely it can be, if they are jointly Gaussian; the point is that they don't need to be uncorrelated.]


They don't need to be independent, but they do need to be jointly Gaussian; it is a very small edge case, I guess.


You're right and the author's note is wrong. Joint Normality of a random vector implies component-wise Normality, but not the other way around, so it's a stronger condition.


You can check if this holds by inspecting the covariance matrix. It can be degenerate, or the variables may be correlated (thus not jointly normal), and even uncorrelated Gaussian ones need not be independent.

https://en.wikipedia.org/wiki/Multivariate_normal_distributi...

It's best to check using the second definition from: https://en.wikipedia.org/wiki/Multivariate_normal_distributi...

This results in the BHEP test: https://en.wikipedia.org/wiki/Multivariate_normal_distributi...


> You can check if this holds by inspecting the covariance matrix.

Just a clarification: if that’s somehow correct then it’s not about the 2x2 covariance matrix that gives the linear relationship between two variables. Because two variables can be correlated and the joint distribution be a multivariate Gaussian. And two variables could be uncorrelated and not be independent; the covariance matrix doesn’t contain much information about the joint distribution (unless joint normality is assumed).


Correct, that's only useful to check for degeneracy. The test is much more certain.

And it's bigger than 2x2; it depends on the number of dimensions you're checking. Three variables gives a 3x3 matrix, etc. For fitting as shown (and for GANs), it gets big really fast.


I don't understand what covariance matrix you're talking about.

Let's say we have two variables and the joint distribution is given by the probability density function p(x,y) which can have any form as long as it's always >=0 and integrates to 1.

The covariance matrix is a 2x2 matrix and contains only three real numbers (four, but the off-diagonal terms are necessarily equal).

The covariance matrix (together with the vector of means) gives a complete description of the joint distribution only when the joint distribution is completely specified by its first two moments and all its cumulants beyond the second are zero (i.e. it is a multivariate normal).

The fact that the correlation between the variables is zero doesn't mean necessarily that the variables are independent when the joint distribution is not a multivariate normal.

The covariance matrix alone cannot be used to tell whether the joint distribution is a multivariate normal or not. Every possible covariance matrix defines a multivariate normal but is also compatible with an infinity of other distributions.


A~N(0,1) and I~Bernoulli(1/2) are independent random variables. Then both X1=A and X2=(1-2I)*A are Gaussian random variables. They even have 0 covariance but they are not independent. Also X1+X2 is not Gaussian.
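
A quick numerical check of that counterexample (just a sketch; the sample size and the KS test are arbitrary choices):

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(0)
  n = 100_000
  A = rng.standard_normal(n)
  I = rng.integers(0, 2, n)                    # Bernoulli(1/2)
  X1, X2 = A, (1 - 2 * I) * A                  # both marginally N(0, 1)

  print(np.corrcoef(X1, X2)[0, 1])             # ~0: uncorrelated
  print(stats.kstest(X2, 'norm').pvalue)       # large: X2 alone looks Gaussian
  print(stats.kstest(X1 + X2, 'norm').pvalue)  # ~0: the sum is clearly not Gaussian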


You were right on the “or there should be some other criteria” part, but not on the “They need to be independent” part.

If you start with independent Gaussians A and B and accept that A+B and A-B are Gaussians you’ll easily see that the sum of those is also Gaussian (because it’s 2A) even though they are not independent.


Yes, you need that any linear combination of the k Gaussians is Gaussian.


The most intuitive explanation of GPs came to me from McElreath's Statistical Rethinking. What stuck most was that it was explained as the thing you need for a simple regression problem that becomes hairy with a continuous variable. The example was modeling a cafe's wait time, which varies continuously with covariates like the time of day and, IIRC, the size of the cafe.



Yes


Great article, easy to understand. I also like the unpretentious tone of it.


Was going through this article this morning, and absolutely loved it.

After that, you can go through the video lecture recommended in the article. It makes it easy to follow.


May I suggest the next article to read: An intuitive, visual guide to copulas: https://twiecki.io/blog/2018/05/03/copulas/


Thanks mate. This is awesome. The explanations + code made it very easy to understand.


I don’t get much of this kind of articles, but I love just looking at these beautiful graphs.


Great read! Well done!


Am I the only one who found the abstract definition easier and more intuitive, but the mathematical explanation much harder to follow? I immediately thought of branching conditions mapped to a function table (like a knowledge-base table with pattern matching, related to a functions table).

I understand the real value here is the Gaussian distribution and how to find or create a distribution that fits complex real-world problems, which would be severely complicated to implement, analyse or visualize with classic code.

P.S. I am speaking from an IT engineer's point of view; the years I studied math were more than a decade ago...


I'd like to point out that the mathematical/formal statements are the result of loooong research. That is, the one who stated them most probably had a real-life problem first. Then multiple other ones. Then he built up an understanding of those problems and some kind of intuition. Then he gathered all of that in a nice formula.

When we learn that formula, (sometimes) we go the other way around: we see the formula, the maths, the formal stuff, and then we try to understand its consequences and see how it fits reality.

To me that's a much harder process...

Basically : when I see a bridge, I can't figure out how it was built, or why.



