This really brings me back to my days in college, when this was exactly the sort of stuff that ML classes focused on. You could have an entire course on various interpretations and extensions of kernel-based methods. I wonder how much these insights are worth anymore in the age of LLMs and deep neural networks. I haven't kept up much with the NTK literature, but it seems like a theoretical understanding of kernel-based methods and Gaussian processes doesn't confer any advantage in being a better modern ML (specifically LLM) engineer, where the skill set is geared much more heavily toward systems engineering and/or DevOps for babysitting all your experiments.
I basically share your sentiment. However, Greg Yang's work, which builds on the NTK literature, seems to have produced something of direct practical benefit for training large neural nets: µ-Parametrization is apparently very useful in real-world practice.
While the DevOps and data management are critical, if you are able to understand what the different components and layers do, you can go a lot further than someone who can only retrain someone else's models on different data. Academia hasn't provided reference architectures for a lot of difficult industry problems that still need solving.
I guess the systems engineering / babysitting part is an intermediate step in the process (the fits and starts of a new engineering discipline), and I hope CS/ML research goes sideways or backs up a bit there, toward (maybe less brute-force) alternatives or a better/deeper understanding.
It's great to see so much interest in this post on HN! In view of the attention, I feel like I need to emphasize that although my writing and illustrations are original, the "big idea" certainly is not. In fact, my goal was just to present the simplest example I could find that illustrates a connection between gradient descent and kernel methods. There's already a lot of work being done on this relationship, most of which I'm just now learning of.
(Although I don't have time to respond meaningfully today, I really appreciate the comments pointing out other relationships by mvcalder, Joschkabraun, yagyu and others.)
Excellent article and a really clear view on statistical model training dynamics. This perspective will no doubt contribute to the development of deep learning theory.
I'm interested especially in the lessons we can learn about the success of overparametrization. As mentioned at the beginning of the article:
> To use the picturesque idea of a "loss landscape" over parameter space, our problem will have a ridge of equally performing parameters rather than just a single optimal peak.
It has always been my intuition that overparametrization makes this ridge an overwhelming statistical majority of the parameter space, which would explain the success in training. What is less clear, as mentioned at the end, is why it hedges against overfitting. Could it be that "simple" function combinations are also overwhelmingly statistically likely vs more complicated ones? I'm imagining a hypersphere-in-many-dimensions kind of situation, where the "corners" are just too sharp to stay in for long before descending back into the "bulk".
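To make that geometric picture concrete, here's a throwaway Monte Carlo sketch (my own toy numbers, nothing from the article): ask how much of the unit cube [0,1]^d lies within distance 0.5 of a given corner. Essentially none, once d gets large, which is one way to see that "corner" regions carry almost no volume.

    # Toy Monte Carlo: what fraction of the unit cube [0,1]^d lies within
    # distance 0.5 of the corner at the origin?  It collapses as d grows.
    import numpy as np

    rng = np.random.default_rng(0)
    for d in (2, 5, 10, 20):
        x = rng.random((200_000, d))                 # uniform points in the cube
        frac = np.mean(np.linalg.norm(x, axis=1) < 0.5)
        print(f"d={d:2d}  fraction near the corner ~ {frac:.6f}")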
Interested to hear others' perspectives or pointers to research on this in the context of a kernel-based interpretation. I hope understanding overparametrization may also go some way toward explaining the unreasonable effectiveness of analog-based learning systems such as human brains.
Reflecting a bit more on the article I think the key lies close to this notion, quoted from the article:
> Since ker Π can be described as the orthogonal complement to the set {K_{t_i}}, the orthogonal complement to ker Π is exactly the closure of the span of the vectors K_{t_i}.
{K_{t_i}} is going to be very large in the overparametrized case, so the orthogonal complement will be small. Note also this part:
> Because v is chosen with minimal norm [in the context of the corresponding RKHS], it cannot be made smaller by adjusting it by an element of ker Π...
So it sounds like all the "capacity" is taken up by representing the function itself, and, seemingly paradoxically, the parameters λ_i are more constrained by the implicit regularization imposed by gradient descent (hypothetically enforcing the minimal-norm constraint). So the parameter space of functions that can possibly fit is tiny. The rub in practical applications is that many combinations of NN parameters can correspond to one set of parameters in this kernel space, so the connection between p and λ (via f?) seems key to understanding the core of the issue.
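As a sanity check on that implicit-regularization claim, here's a toy sketch of my own (a generic underdetermined linear model, not the article's exact setup): gradient descent on the squared loss, started from zero, lands on the same interpolant as the explicit minimum-norm (pseudo-inverse) solution.

    # Underdetermined least squares: 100 parameters, only 10 observations.
    # Gradient descent from p = 0 converges to the minimum-norm solution,
    # i.e. exactly what the pseudo-inverse (kernel picture) would give.
    import numpy as np

    rng = np.random.default_rng(1)
    F = rng.standard_normal((10, 100))     # design matrix: 10 data points, 100 params
    y = rng.standard_normal(10)

    p = np.zeros(100)
    lr = 1e-2
    for _ in range(50_000):
        p -= lr * F.T @ (F @ p - y)        # gradient of 0.5 * ||F p - y||^2

    p_min_norm = np.linalg.pinv(F) @ y     # explicit minimum-norm interpolant
    print(np.allclose(p, p_min_norm, atol=1e-6))   # True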
Not out of vanity (ok, a little) but because I think the idea has importance that has not been fully explored. The article's Bayesian perspective may be the whole story but somehow I don't think so. Unlike the article's author, my work left me feeling model architecture was the most important thing (behind training data) whereas they seem to feel it is ancillary.
It's data, then loss, and then, finally, architecture. And including some additional conditioning or metadata to help prediction will often have higher value than an architecture change...
What is a kernel method? I know what an operating system kernel is. I know what the kernel of a homomorphism is. I have taken courses in computational statistics and neural networks. Yet I’ve never encountered this use of the word kernel, which the article unhelpfully does not define (yet is critically based on). Googling for kernel didn’t help either, because the term is extremely overloaded.
I guess an intuitive way to approach it, given your background, is by learning SVMs and how the kernel method is used to do non-linear classification. Or start with the Wikipedia article: https://en.m.wikipedia.org/wiki/Kernel_method
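If a concrete snippet helps, here's a minimal illustration of the kernel trick (just my own toy example): a kernel function computes an inner product in a higher-dimensional feature space without ever constructing that space explicitly.

    # The polynomial kernel k(x, y) = (1 + x.y)^2 equals the dot product of
    # explicit degree-2 feature maps -- but never builds those features.
    import numpy as np

    def phi(x):
        # explicit feature map matching the 2-D degree-2 polynomial kernel
        x1, x2 = x
        return np.array([1.0,
                         np.sqrt(2) * x1, np.sqrt(2) * x2,
                         x1**2, x2**2,
                         np.sqrt(2) * x1 * x2])

    def k(x, y):
        return (1.0 + x @ y) ** 2          # the kernel trick: no phi needed

    x = np.array([0.3, -1.2])
    y = np.array([2.0, 1.0])
    print(k(x, y), phi(x) @ phi(y))        # same number via both routes

In an SVM (or kernel ridge regression) only these k(x, y) values ever appear, which is what lets you work implicitly in very high- or infinite-dimensional feature spaces.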
This seems like a young talent that we'll see more of. I like your to-the-point writing style and obvious passion for mathematical clarity. Keep it up, and best wishes for your PhD studies.
I'd be interested in your thoughts on the case where the f_i are optimizable: f_i(t) = K(t, z_i), i = 1..m, with m << N. Like the representer theorem, but with far fewer terms than you have data points to fit. The points z_i are usually called inducing points and may themselves be optimized by gradient descent.
There is literature on approximating exact GP inference with (something like) these objects when m << N (variational inference).
However, I’m not aware of anyone drawing a clear picture of the other direction, starting from the optimization picture and explaining it in terms of inference, similar to what TFA does.
In TFA the number of functions is large, so the system is underdetermined. In the variational-inference setting the system is overdetermined, and I wonder what inference, if any, gradient descent does...
Caveat: 1am and a few drinks deep so if I’m not making sense that’s ok
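To at least make the setup concrete, here's a rough sketch of what I mean (synthetic 1-D data, RBF kernel, hand-written gradients; purely illustrative, no variational machinery):

    # Kernel model with m << N inducing points: fit both the coefficients
    # lam and the inducing-point locations z by plain gradient descent.
    import numpy as np

    rng = np.random.default_rng(0)
    N, m, ell = 200, 5, 0.3
    t = np.sort(rng.uniform(0, 1, N))            # N data locations
    y = np.sin(2 * np.pi * t) + 0.1 * rng.standard_normal(N)

    z = rng.uniform(0, 1, m)                     # m inducing points
    lam = np.zeros(m)
    lr = 1e-3
    for _ in range(20_000):
        D = t[:, None] - z[None, :]              # pairwise differences, (N, m)
        K = np.exp(-D**2 / (2 * ell**2))         # K(t_i, z_j)
        r = K @ lam - y                          # residuals
        grad_lam = K.T @ r                                         # dL/dlam
        grad_z = (r[:, None] * lam * K * D / ell**2).sum(axis=0)   # dL/dz
        lam -= lr * grad_lam
        z -= lr * grad_z

    K = np.exp(-(t[:, None] - z[None, :])**2 / (2 * ell**2))
    print("train RMSE:", np.sqrt(np.mean((K @ lam - y) ** 2)))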
Took a preliminary look through and it feels pretty exciting. Thanks for sharing.
My own post basically covers the simplest possible case where gradient descent does something related to kernels. The problem is that the "tangent kernel" driving the evolution of the model over training is typically not constant. (In my case it is constant because my model is linear.)
Domingos' solution seems to be: in general, just integrate the tangent kernel over the path taken by your optimization and call it the path kernel. Then your resulting model can always be viewed as a kernel machine, with the subtlety that the kernel now depends on the trajectory you took during training. So in that sense, kernels are everywhere! I'll take another look on Monday :)
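For anyone who wants to poke at it numerically, here's a toy sketch of mine (not from either paper) of the objects involved: the tangent kernel at a given training step is the Gram matrix of parameter-gradients of the model, it drifts as the parameters move, and the path kernel is, roughly, its average over the trajectory (glossing over the weighting in Domingos' actual definition).

    # Tangent kernel for a tiny nonlinear model f(x; a, w) = a * tanh(w * x).
    # K_s(x, x') = grad_p f(x) . grad_p f(x') changes as (a, w) move during
    # training; averaging it over the trajectory gives a crude "path kernel".
    import numpy as np

    def grad_f(x, a, w):
        th = np.tanh(w * x)
        return np.array([th, a * x * (1 - th**2)])    # [df/da, df/dw]

    x_train = np.array([0.5, -1.0, 2.0])
    y_train = np.array([1.0, -0.5, 0.3])
    a, w, lr = 0.1, 0.1, 0.05
    kernels = []
    for _ in range(500):
        G = np.stack([grad_f(x, a, w) for x in x_train])  # (3, 2) Jacobian
        kernels.append(G @ G.T)                           # tangent kernel K_s
        resid = a * np.tanh(w * x_train) - y_train
        grad = G.T @ resid                                # gradient of 0.5*||resid||^2
        a, w = a - lr * grad[0], w - lr * grad[1]

    print("tangent kernel at start:\n", kernels[0])
    print("tangent kernel at end:\n", kernels[-1])
    print("path kernel (plain average over steps):\n", np.mean(kernels, axis=0))

For my linear post the analogue of grad_f doesn't depend on the parameters at all, which is exactly why the kernel stays fixed there.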
Hah, me too! I read the article title and thought "Hm, that's a bit of a weird idea, but I suppose syscalls do add up" before jumping in and realizing my curiosity wouldn't quite be satisfied in the exact way I expected it to be.
I had Chapter 6 open while working on this post :) I should probably cite it too. As far as I could see that book doesn't talk about kernels arising from gradient descent, but it does say many other things about GP kernels. Certainly it has the main things I used.
It's amusing how facts that look surprising and mysterious to the rest of the world are just table stakes at the right sort of math department. As a researcher I feel the pressure to make things that are "my own", but there's so much that already exists just waiting to be grokked and plugged in!
Feels the same in linear algebra, where you may think you're building new stuff (e.g. a specific trick to fit your specific kind of complex-covariance Toeplitz-matrix parallel and vectorized batch-inversion problem with very little fast shared memory), but it turns out all the parts are already explored in a hard-to-parse textbook somewhere.
I would need 20 hours of hard study, minimum, to understand it at any level. I need the Quanta Magazine version (or a Karpathy video) before I can comment.
Although excellent, the article could start off with the practical implications (maybe faster GPT training?) so that there is at least some motivation.
There are quite a few takeaways to be had without fully understanding the, ah... esoterica.
1. Gradient descent is path-dependent and doesn't forget the initial conditions. Intuitively reasonable - the method can only make local decisions, and figures out 'correct' by looking at the size of its steps. There's no 'right answer' to discover, and each initial condition follows a subtly different path to 'slow enough'...
because...
2. With enough simplification, the path taken by each optimization process can be modeled using a matrix (their covariance matrix, K) with defined properties. This acts as a curvature of the mathematical space, and has some side effects, like being able to use eigen-magic to justify why the optimization process locks some parameters in place quickly while others take a long time to settle (there's a rough sketch of this after the list).
which is fine, but doesn't help explain why wild over-fitting doesn't plague high-dimensional models (would you even notice if it did?). Enter implicit regularization, stage left. And mostly passing me by on the way in, but:
3. Because they decided to use random noise to generate the functions they combined to solve their optimization problem, there is an additional layer of interpretation they put on the properties of the aforementioned matrix, which implies the result will only use each constituent function 'as necessary' (i.e. regularized, rather than wildly amplifying pairs of coefficients).
And then something something Bayesian, which I'm happy to admit I'm not across.
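As promised in takeaway 2, here's a rough sketch of the eigen point (my own toy linear example, not the article's exact setup): the residual component along each eigenvector of K shrinks by a factor of (1 - lr * eigenvalue) per step, so large-eigenvalue directions lock in almost immediately while small ones crawl.

    # Gradient descent on a linear least-squares problem: watch the residual
    # decay at different rates along different eigenvectors of K = F F^T.
    import numpy as np

    rng = np.random.default_rng(2)
    F = rng.standard_normal((20, 50))      # 20 data points, 50 parameters
    y = rng.standard_normal(20)
    K = F @ F.T                            # the covariance/kernel matrix
    evals, evecs = np.linalg.eigh(K)       # eigenvalues in ascending order

    p = np.zeros(50)
    lr = 1.0 / evals[-1]                   # stable step size
    for step in range(1, 101):
        p -= lr * F.T @ (F @ p - y)
        if step in (1, 10, 100):
            proj = evecs.T @ (F @ p - y)   # residual in the eigenbasis of K
            print(f"step {step:3d}: |resid| along largest eigval = "
                  f"{abs(proj[-1]):.1e}, along smallest = {abs(proj[0]):.1e}")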
It's pretty close to the theory end of the theory-application spectrum, so it could touch on many aspects of implementation. I suggest it may inform architectural improvements and/or better justifications for gradient descent training schedules.
Thanks. I think when I saw "kernel" my mind jumped to the conclusion that it was a GPU kernel, so I was expecting something a bit more software-engineering-y!
Kernel is an overloaded term, and there are actually two different usages of the word in the article itself: one referring to the kernel operator[1][2] (the primary subject of the article) and another referring to the kernel of a homomorphism[3] (in the linear case, essentially the subspace that a linear map sends to zero, i.e. its null space).
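A quick side-by-side of the two usages, in case a snippet helps (just an illustrative sketch):

    # Two unrelated "kernels": (1) a kernel function building a Gram matrix of
    # similarities, (2) the kernel (null space) of a linear map.
    import numpy as np

    # (1) RBF kernel function -> 4x4 Gram matrix of pairwise similarities
    x = np.linspace(0, 1, 4)
    gram = np.exp(-(x[:, None] - x[None, :])**2 / 0.1)

    # (2) kernel of a linear map = null space of its matrix (via SVD)
    A = np.array([[1.0, 2.0, 3.0],
                  [2.0, 4.0, 6.0]])        # rank 1, so its kernel is 2-D
    _, s, Vt = np.linalg.svd(A)
    null_space_basis = Vt[1:]              # rows spanning ker(A)

    print(gram.shape, null_space_basis.shape)   # (4, 4) and (2, 3)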
My guess would be that I have to take another look at optimizing the javascript I use for drawing the interactive widgets. It's just bare-bones javascript and canvas, but a previous version managed to freeze some browsers for unknown reasons...