The reason the gradient is preferred (as I understand it) is actually computational considerations. Assuming you can actually compute both of them, a Newton method (which uses the gradient and the derivative of the gradient, the Hessian) usually has faster convergence (quadratic instead of linear). However, that Hessian can be big and difficult to compute, and then you need a linear solve with it.
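To make the cost difference concrete, here's a rough numpy sketch (toy objective and learning rate of my own choosing, nothing canonical) of one gradient step next to one Newton step; the Newton step is where the Hessian and the linear solve show up:

```python
# Minimal sketch: one gradient step vs one Newton step on a toy function.
import numpy as np

def f(x):        # toy objective, chosen just for illustration
    return x[0]**4 + x[0]*x[1] + (1 + x[1])**2

def grad(x):     # hand-coded gradient of f
    return np.array([4*x[0]**3 + x[1], x[0] + 2*(1 + x[1])])

def hess(x):     # hand-coded Hessian of f
    return np.array([[12*x[0]**2, 1.0],
                     [1.0,        2.0]])

x = np.array([1.0, 1.0])
lr = 0.1  # arbitrary learning rate

x_gd     = x - lr * grad(x)                       # gradient step: cheap
x_newton = x - np.linalg.solve(hess(x), grad(x))  # Newton step: needs the linear solve
print(f(x), f(x_gd), f(x_newton))
```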
However, in most ML applications you don't even have the full gradient. You have a stochastic estimate, since your loss is generally additive in the data. So you're not (as far as I know) even going to bother trying to form a Hessian. I believe many have investigated quasi-Newton methods based on estimated gradients, but I haven't investigated that thoroughly.
This is a really important point and I wish I'd mentioned it. The computational considerations (as you've said) make the classic gradient descent method infeasible in practice. Therefore we resort to stochastic estimates or quasi-Newton approaches (which I'm still looking into).
My main objective was to highlight that, given we are performing classic gradient descent, moving against the gradient yields the greatest local reduction in the function value. Essentially it was a point to highlight the underlying calculus. Wayne Winston, in his book Operations Research: Applications and Algorithms, has an interesting passage where he discusses the gradient being the direction of maximum increase (he was looking at steepest ascent).
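For completeness, the calculus fact I was getting at can be stated in a couple of lines (the standard Cauchy-Schwarz argument, not anything specific to the post):

```latex
\[
  D_{d} f(x) \;=\; \nabla f(x)\cdot d
  \;\le\; \|\nabla f(x)\|\,\|d\| \;=\; \|\nabla f(x)\|
  \qquad \text{for any unit vector } d,
\]
% with equality exactly when
\[
  d \;=\; \frac{\nabla f(x)}{\|\nabla f(x)\|},
\]
% so the gradient is the direction of maximum increase and its negative is
% the direction of steepest local decrease.
```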
> You have a stochastic estimate since your loss is generally additive in the data.
Just to be clear, an additive loss doesn't imply a stochastic gradient estimate. Rather, because the loss function is additive, stochastic gradient estimates of the loss become possible. But this of course does not mean one has to use stochastic gradient estimates.
It's just that it's easier to update and monitor progress this way, rather than computing the gradient term for every single example in the training set and then taking a descent step. The surprising thing is that stochastic gradient descent converges quickly in practice relative to proper gradient descent. All of the justification and whatnot for SGD in ML is largely post-hoc, because it works so unreasonably well and is so intuitive to anyone who has taken calculus.
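To make the "additive in the data" point concrete, here's a toy least-squares sketch (my own made-up example, using numpy): the full gradient is an average of per-example gradients, so averaging over a random mini-batch gives a stochastic estimate of it, but nothing forces you to subsample.

```python
# Full-batch gradient vs a mini-batch stochastic estimate for an additive loss.
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)   # made-up data
w = np.zeros(5)

def grad_i(w, i):                        # per-example gradient of 0.5*(x_i.w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

full_grad = sum(grad_i(w, i) for i in range(len(y))) / len(y)    # "proper" GD
batch = rng.choice(len(y), size=32, replace=False)
sgd_grad = sum(grad_i(w, i) for i in batch) / len(batch)         # stochastic estimate
print(np.linalg.norm(full_grad - sgd_grad))
```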
The other aspect (with respect to optimization in machine learning) is that this optimization is performed over a loss on a training dataset, for which you don't really even want convergence to an exact minimum. What you really care about is the expected generalization loss. Converging to the exact minimum of the training loss doesn't necessarily give the best generalization loss. I mention this because it contributes to the general aloofness towards optimization convergence rates in ML.
> I believe many have investigated quasi-Newton methods based on estimated gradients, but I haven't investigated that thoroughly.
Until fairly recently, quasi-Newton methods were not explored in the stochastic setting because of the question of how to extend the Wolfe conditions to this arena. There's been a bit of work on this [1], but I don't think it's caught on outside of the optimization community (not that it necessarily should, considering the points above).
Your point about convergence to the exact minima over training loss not guaranteeing the best generalization loss reminds me of the point made in this lecture here https://www.youtube.com/watch?v=k3AiUhwHQ28.
You also made an interesting comment about work not catching on outside of the optimization community - can you recommend some resources or websites to follow in order to see what the optimization community is working on? I've developed an interest in the area but don't really know where to go for "up to date" information.
I always find it interesting to see the different paths people take toward learning the same thing. When I first did multivariable calculus, I learned that the gradient points uphill, and the negative of the gradient points downhill. I'm definitely a spatial learner, and mostly thought of surfaces the way one walks over hills. The idea of using gradient descent to find a local minimum is the simplest part of neural networks to me.
It's interesting to see someone first write an article about nearest neighbor classifiers (a topic I really don't know much about), and then, 2-3 months later, figure out why we use gradient descent.
Yeah, that's what the gradient is literally defined as: rate of change. Not trying to be snarky, but to me this means not taking the time to learn the basics before jumping into far more advanced concepts. Sadly this seems to be a pattern in a lot of machine learning curricula today.
When private schools offer a 1-year course to become a "Machine Learning Consultant" with no prior mathematics or programming knowledge required, you know that something has to be off ...
I mean, a quality program will include both numerical analysis and optimization methods classes, both of which will (hopefully) go through things like gradient descent with rigour.
And long before that, students will hopefully have a solid fundamental understanding of what rate of change means - not just as in banging out derivatives on paper.
Thank you for taking the time to look through the post. When I learned this formally it was introduced as steepest descent (Cauchy's variant). Like yourself, it didn't become concrete to me until I looked at the surface plots. I concur with your point that it's interesting to see the different paths people take toward learning the same thing.
I too, like the commenter above, find it fascinating how various concepts are quickly accessible (or not!) for different people. Usually I blame it on the teacher's presentation (for my part, I am completely unable to explain pointers in, say, C to people who haven't already been able to get it), but sometimes it is also the building blocks that the learner already has in place.
So - maybe a rude question - but had you taken Calculus, Differential Equations or Vector Calculus (div, grad and curl and all that) before you started in on this more integrated material?
Not rude at all! Thanks for the question. My background is in Statistics, not Machine Learning (that came later), so I covered these topics without reference to ML applications. I suppose I never learned to connect these ideas when I was learning about ML; it was only when I went back to my old material that I realised how these ideas relate. I agree with your point about the teacher's presentation - when I was learning about ML it was separated from the raw Calculus.
A short answer to your question - yes I took those courses before I started looking at the more integrated material.
I think there may be some confusion as to the terminology in the article. The gradient does not necessarily lead to the greatest decrease in the function. Specifically, if our function of interest is f and we're currently at the point x, then f(x - alpha grad f(x)) for some alpha may or may not be less than f(x + beta dx) for a different dx and beta.
For example, consider the quadratic f(x) = 0.5 x' A x - b' x where ' denotes transpose and we have implicit multiplication. Let A = [2 1; 1 2], b = [3; 4], x = [5; 6]. This is convex and quadratic with a global minimum of [2/3; 5/3], which is inv(A) b. Now, grad f(x) = A x - b = [13; 13], and moving in the direction [-13; -13] will not get us there in a single step. However, dx_newton = -inv(hess f(x)) grad f(x) = [-13/3; -13/3], which brings us to the global minimum in a single step.
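If anyone wants to check the arithmetic, here's a quick numpy sanity check of those numbers:

```python
# Verifies the gradient, the global minimum, and the Newton step for the example above.
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([3.0, 4.0])
x = np.array([5.0, 6.0])

grad = A @ x - b                         # [13, 13]
x_star = np.linalg.solve(A, b)           # [2/3, 5/3], the global minimum
dx_newton = -np.linalg.solve(A, grad)    # [-13/3, -13/3]
print(grad, x_star, x + dx_newton)       # x + dx_newton lands exactly on x_star
```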
The value in gradient descent is that, combined with an appropriate globalization technique such as a trust region or a line search, it guarantees convergence to a local minimum. Newton's method does not, unless started close enough to the minimum. As such, most good, fast optimization algorithms based on differentiable functions use the steepest descent direction as a metric, or fallback, to guarantee convergence and then use a different direction, most likely a truncated-Newton method, to converge quickly. Meaning, the gradient descent direction rarely leads to the greatest decrease. Unless, of course, we want to make the argument in an infinitesimal sense, which is fine, but I'd denote that explicitly.
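To illustrate what I mean by a globalization technique, here's a bare-bones backtracking line search with the Armijo (sufficient decrease) condition. This is a sketch only; the quadratic objective at the bottom is made up and stands in for whatever differentiable f you actually have.

```python
# Backtracking (Armijo) line search along a descent direction.
import numpy as np

def backtracking_line_search(f, grad_f, x, direction, alpha0=1.0, c=1e-4, rho=0.5):
    g = grad_f(x)
    slope = g @ direction                 # should be negative for a descent direction
    alpha = alpha0
    while f(x + alpha * direction) > f(x) + c * alpha * slope:
        alpha *= rho                      # shrink the step until sufficient decrease
    return alpha

# usage: a safeguarded steepest-descent step on a made-up quadratic
f = lambda x: (x[0] - 1)**2 + 10*(x[1] + 2)**2
grad_f = lambda x: np.array([2*(x[0] - 1), 20*(x[1] + 2)])
x = np.array([5.0, 5.0])
d = -grad_f(x)
x_next = x + backtracking_line_search(f, grad_f, x, d) * d
print(f(x), f(x_next))                    # the step actually decreases f
```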
That's a great point and to be honest I could have been a lot tighter with the terminology. Good advice to take on board for next time - thanks!
Your point about combining optimisation techniques is interesting and I'd love to learn about it a little more. When you say "As such, most good, fast optimization algorithms based on differentiable functions use the steepest descent direction as a metric, or fallback, to guarantee convergence and then use a different direction, most likely a truncated-Newton method, to converge quickly", does this mean that both algorithms are being used together? So first steepest descent is run for a few iterations and then the truncated-Newton method takes over?
If you have some resources where I could read up on this it would be much appreciated!
Though I have complaints with it, Numerical Optimization by Nocedal and Wright is probably the best reference for modern optimization techniques. My complaint with it is that they also present many historical techniques that I would argue should not be used, and don't provide clear guidance as to which are the modern, robust algorithms. And, to be sure, arguments can be made for all sorts of algorithms, but I will contend: (unconstrained) trust-region Newton-CG [algorithm 7.2 in Numerical Optimization], (equality) composite-step SQP method [algorithm 15.4.1 in Trust-Region Methods by Conn, Gould, and Toint], (inequality) NITRO interior point algorithm [algorithm 19.4 in Numerical Optimization], (equality and inequality) combination of the above. There are many implementation nuances with these algorithms and they can be made better than their presentation, but I believe them to be a good starting point for modern, fast algorithms.
As far as switching back and forth between the Newton and gradient descent steps goes, this is largely done in a class of algorithms called dogleg methods. Essentially, the Newton step is tried against some criterion. If it satisfies this criterion, that step is taken. If not, the step is reduced until eventually it becomes the gradient descent step. I'll contend that truncated-CG (Steihaug-Toint CG) does this, but better. Essentially, it's a modified conjugate gradient algorithm for solving the Newton system that maintains a descent direction. The first Krylov vector this method generates is the gradient descent step, so it eventually reduces to that step if convergence proves difficult.
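For the curious, here's roughly what Steihaug-Toint truncated CG looks like (my own sketch of the textbook algorithm, not production code). It approximately minimizes the quadratic model g'p + 0.5 p' H p inside a trust region of radius delta using only Hessian-vector products, and the first direction it explores is the steepest descent direction.

```python
# Steihaug-Toint truncated CG for the trust-region subproblem (sketch).
import numpy as np

def steihaug_cg(g, Hv, delta, tol=1e-8, max_iter=50):
    """Approximately minimize g.p + 0.5*p.Hv(p) subject to ||p|| <= delta."""
    p = np.zeros_like(g)
    r = g.copy()          # residual of the Newton system, starts at the gradient
    d = -g.copy()         # first search direction: steepest descent
    if np.linalg.norm(r) < tol:
        return p
    for _ in range(max_iter):
        Hd = Hv(d)
        dHd = d @ Hd
        if dHd <= 0:                               # negative curvature: run to the boundary
            return p + _to_boundary(p, d, delta) * d
        alpha = (r @ r) / dHd
        p_next = p + alpha * d
        if np.linalg.norm(p_next) >= delta:        # step would leave the trust region
            return p + _to_boundary(p, d, delta) * d
        r_next = r + alpha * Hd
        if np.linalg.norm(r_next) < tol:
            return p_next
        beta = (r_next @ r_next) / (r @ r)
        p, r, d = p_next, r_next, -r_next + beta * d
    return p

def _to_boundary(p, d, delta):
    # largest tau >= 0 with ||p + tau*d|| == delta
    a, b, c = d @ d, 2.0 * (p @ d), p @ p - delta**2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

# usage on the quadratic from the earlier comment (Hessian A, gradient A x - b)
A = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([3.0, 4.0])
x = np.array([5.0, 6.0])
p = steihaug_cg(A @ x - b, lambda v: A @ v, delta=10.0)
print(x + p)   # recovers the unconstrained minimizer [2/3, 5/3]
```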
More broadly, there's a question of whether all of the trouble of using second-order information (Hessians) is worth it away from the optimal solution. I will contend, strongly, yes. I base this on experience, but there are some simple thought experiments as well. For example, say we have the gradient descent direction. How far should we travel in this direction? Certainly, we can conduct a line search or play with a "learning parameter". (Also, if you do this, please use a line search, because it will provide vastly better convergence guarantees and performance.) However, if we have the second derivative, we have a model to determine how far to go. Recall, a Taylor series tells us that f(x + dx) ~= f(x) + grad f(x)' dx + 0.5 dx' hess f(x) dx. We can use this to figure out how far to travel in this direction by minimizing J(alpha) = f(x + alpha dx) ~= f(x) + alpha grad f(x)' dx + (alpha^2/2) dx' hess f(x) dx over alpha. If dx' hess f(x) dx > 0, the model along this direction is convex, and we can simply look for where J'(alpha) = 0, which occurs at alpha = -grad f(x)' dx / (dx' hess f(x) dx). When dx' hess f(x) dx < 0, this implies we should take a really long step, since the model predicts the slope becomes even more negative the farther we go in this direction.

Though both methods must be safeguarded (the easiest safeguard is to just halve the step if we don't get descent), the point is that the Hessian provides information that the gradient did not, and this information is useful. This is only one place where this information can be used; others include the direction calculation itself, which is what truncated-CG does.
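Here's what that step-length recipe looks like in code (a sketch under the same notation; the capped halving safeguard and the arbitrary "long step" length for the negative curvature case are my own choices):

```python
# Step length along a direction dx from the local quadratic model, then safeguarded.
import numpy as np

def model_step_length(f, x, dx, g, hess_vec, alpha_long=10.0, max_halvings=50):
    curvature = dx @ hess_vec(dx)        # dx' hess f(x) dx
    slope = g @ dx                       # grad f(x)' dx, negative for a descent direction
    if curvature > 0:
        alpha = -slope / curvature       # minimizer of the 1-D quadratic model J(alpha)
    else:
        alpha = alpha_long               # negative curvature: the model says go far (arbitrary cap)
    for _ in range(max_halvings):        # safeguard from above: halve until we actually descend
        if f(x + alpha * dx) <= f(x):
            break
        alpha *= 0.5
    return alpha
```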
As a brief aside, the full Hessian is rarely, if ever, computed. Hessian-vector products are enough, which allows the problem to scale to really anything that a gradient descent method can scale to.
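For example, a Hessian-vector product can be approximated with a single extra gradient evaluation (a finite difference of gradients; autodiff is the other common route). Sketch below, checked against the quadratic from earlier:

```python
# Hessian-vector product without ever forming the Hessian.
import numpy as np

def hess_vec(grad_f, x, v, eps=1e-6):
    # hess f(x) @ v  ~=  (grad f(x + eps*v) - grad f(x)) / eps
    return (grad_f(x + eps * v) - grad_f(x)) / eps

# quick check on f(x) = 0.5 x'Ax - b'x, whose Hessian is A
A = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([3.0, 4.0])
grad_f = lambda x: A @ x - b
print(hess_vec(grad_f, np.array([5.0, 6.0]), np.array([1.0, 0.0])))  # ~ A[:, 0]
```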
As one final comment, the angle observation that you make in the blog post is important. It comes up in a different form when proving convergence of methods, as can be seen in Theorem 3.2 of Numerical Optimization, which uses expression 3.12. Essentially, to guarantee convergence, the angle between the gradient descent direction and whatever direction we choose must be controlled.
Thank you for taking the time to write a thorough and considerate response. I have been working through Engineering Optimization: Methods and Applications by Ravindran, Ragsdell and Reklaitis so far, but I will spend some time in the coming weeks with Nocedal and Wright, per your recommendation.
I intend to write more about what I learn in this area, and I'd be honoured if you would contribute like you did here with your comments, corrections and suggestions! Thank you for the help and the reference - I will definitely be following up.
What I find interesting is the implicit assumption that the underlying function being learned is differentiable or continuous to begin with. That's not always the case; for example, we often work with "categorically labeled" discrete binning problems.
In those cases, people tend to use something like a categorical cross-entropy loss where you assign a continuous likelihood score to each discrete possibility, thereby making things differentiable again.
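A tiny sketch of that trick (softmax scores to class probabilities, then cross-entropy, in numpy): the made-up scores here stand in for whatever your model actually outputs.

```python
# Softmax + cross-entropy: continuous, differentiable surrogate for a discrete label.
import numpy as np

def softmax(scores):
    z = np.exp(scores - scores.max())     # shift for numerical stability
    return z / z.sum()

def cross_entropy(scores, true_class):
    return -np.log(softmax(scores)[true_class])

scores = np.array([2.0, -1.0, 0.5])       # made-up model outputs for 3 classes
print(cross_entropy(scores, true_class=1))
```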
Yeah, but it's still enough to stall and hinder newbies (like me) who deal primarily with categorical data, until you can start to intuitively map the continuous back onto the discrete.
Nature of the beast it seems but still kind of a pain.
The gradient is defined as a limit. That it points to the direction of greatest increase is a consequence of the definition, not the definition itself.
Small side note: don't forget to escape your trig functions in TeX! Otherwise it will render your "cos" in italics (as the product of the variables c, o and s), as in this article and thousands of others; the only thing more surprising than people not noticing this is TeX not giving a warning about it.
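For anyone unsure what that means in practice:

```latex
$y = cos(\theta)$                  % wrong: italic c, o, s multiplied together
$y = \cos(\theta)$                 % right: upright operator name
$y = \operatorname{sinc}(\theta)$  % for operators without a predefined macro
```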
Please try this exercise and report back :) Suppose it happened that when setting up your parameter space you found yourself working with ξ and η instead of x and y, where the relationship is given simply by (ξ, η) = A (x, y) for A an invertible linear mapping (2×2 matrix). This could easily happen in practice. Is gradient descent in (ξ, η) the same procedure as gradient descent in (x, y)? What should we make of any difference?