The calculus answer is that gradient descent approximates the shape of your optimization space with the first order taylor expansion at your currently optimized point (first derivative information), and Newton's method approximates it with the second order taylor expansion (first and second derivative information).
The geometric answer is that gradient descent approximates the shape of your currently optimized point by a plane that is tangent to the current point, whereas Newton's approximates it with a quadratic "bowl" using the local curvature. In gradient descent you have to take short steps down the path, because you know your approximation cannot be true (you will not always be able to lower your error by moving in the same direction, if you assume there is some minimum error). In Newton's method you can assume that your bowl approximation is correct and go straight to the bottom of that bowl. If your space really does look like a bowl, you will get to the bottom super quickly.
The ELI5 answer: you are a ship captain looking for the lowest point in a part of the ocean. You cannot afford to send a submersible down, so your approach is to send an anchor down to check the depth and take a panoramic photo a few feet around the anchor. After analyzing the results of each anchor drop you can move your ship anywhere and try again. In gradient descent you look for the lowest point on the edge of the camera's vision and move in that direction. You can move as far as you'd like, but you will not go too far because you know the drop will not go on forever. In Newton's method you will look at the curvature in the pano and extrapolate it out assuming it is a "bowl" shape. For example maybe your anchor is on an underwater J-shaped cliff getting less steep as it goes north, so you map it out as a U shape that starts going up again if you go a mile north. So, you move your ship a mile north and try again there. In gradient descent you would have moved maybe 100 feet north then checked again. Newton's method can intuitively get you to the low point in very few anchor readings, but crunching the numbers of the bowl can be very time-consuming (this becomes relevant in higher dimensions). In this analogy, the long and lat coordinates in the ocean are your two parameters to be optimized, and the height of the ocean floor is the error in your optimization. You want to find the parameters that give you the lowest error.
Great explanation, and the ship example could make for a fun math game! I'm in NYC also and interested in creating spaces for mathematics, will be at the Museum of Mathematics hackathon next week. Drop me a line if you're interested in chatting, sandy.vanderbleek@gmail.com
This looks at the algorithm the same way that the OP does (mathematically, visually, programmatically), only it applies it to a simpler Linear Regression model.
In short, Newton's method uses second order derivative information in the search direction, while gradient descent only uses first order derivative. In between, there are "quasi-newton" methods which include generalizations of the "secant method". I should also mention that there are all sorts of ad-hoc approaches for attempting to increase the convergence rate of gradient descent, e.g., pre-conditioning, "momentum" terms, etc.
Since the author is reading, a few small typos, followed by one slightly more substantial comment: 'simgoid' should be 'sigmoid' (S-shaped); `x y = log(x) + log(y)` should be `log(x y) = log(x) + log(y)`;'guarentee' should be 'guarantee'; 'recipricol' should be 'reciprocal'.
I would like to see some mention of the fact that the division by the gradient is a meaningless, purely formal motivation for the correct step (inverting the Hessian) that follows.
The motivation for the Hessian is the same as for dividing by the second derivative. Suppose we want to solve f(x) = 0. Taylor expand f(x) around the current iterate (a)
f(x) =~ f(a) + f'(a)(x-a)
We want f(x) = 0 so,
0 = f(a) + f'(a)(x-a)
Multiply both sides by the inverse of f'(x):
0 = f'(a)^-1 f(a) + x-a
So:
x = a - f'(a)^-1 f(a)
This is the update equation for Newton's method where a is the current iterate and x is the next iterate.
If f is a multi dimensional function f : R^n -> R^n then the derivative f'(a) is the Jacobian, and inversion becomes matrix inversion.
When we use Newton's method for minimisation of a function g we solve g'(x) = 0, so we pick f(x) = g'(x). Since the formula above contains f' we get a second derivative. The second derivative in multiple dimensions is the hessian.
I'd also like to note that the Hessian matrix elements should have a \partial{\partial{l}} in each numerator, not a single \partial{l} [1].
If you aren't using LaTeX for formatting, then think partial^2 of l. FWIW: I just found this[2] which would make (using the physics package) even simpler to represent.
> As to the motivation for the correct step: can you point me to a resource that explains this? Not sure I follow...
You write an equation involving division by the gradient. This is an illegal operation (one cannot divide by a vector), and your final recipe doesn't do it. As far as I can tell, you are writing down the incorrect, illegally-vector-inverting formula as motivation for the correct formula involving the (inverse of the) Hessian. All I am suggesting is that you say explicitly something like "Of course, this formula as written is not literally correct; one cannot actually divide by a vector. The correct procedure is explained below."
(Incidentally, speaking of inverses, another poster (https://news.ycombinator.com/item?id=14881265) has mentioned that it may be a bit confusing to speak of the inverse of a matrix rather than the reciprocal, since (as I interpret that other poster's point) the reciprocal of a matrix is just its inverse. I might prefer to say something like "We write $H_{\ell(\theta)}^{-1}\nabla\ell(\theta)$ rather than $\frac{\nabla\ell(\theta)}{H_\ell(\theta)}$ to emphasise that we are inverting a matrix, not a scalar, so that the order of multiplication matters.")
Bishop has a nice treatment of Newton's method in "Pattern recognition and machine learning". Good book to have on your shelf of you are learning this stuff.
Dialing up the complexity a bit from Newton's method, it would be interesting to know whether there are now better explanations of the conjugate gradient method online than this classic (or at least high-profile) intro: https://www.cs.cmu.edu/~quake-papers/painless-conjugate-grad...
I just took the index of the data point (i.e. 1, 250) as the y-axis with the intention of "Stretching out" the data set along the y-axis. Otherwise the data would otherwise be illegibly compressed on a 1-D number line.
In retrospect, I could have better represented the data with 2 overlying histograms, but this (somewhat) captures the intent of showing that "more expensive houses tend to have more than 2 bathrooms".
This is a really nice introduction to logistic regression, well done! My one quibble with the OP is the jump into Newton's method. Maybe a derivation to explain the five steps would help. Thanks!
Thank you. I err'ed on the side of caution to (try) to reduce information overload, but I also didn't want to assume too much prior knowledge. Delicate balance...
Maybe a linked post that dives deeper into the 5 steps of Newton's Method? Would that be more approachable?
Yeah, that would work. Honestly, when I studied logistic regression, we used gradient descent, so seeing it with Newton's method was the draw for me. I can go find the resources myself of course, but since I was reading the article I thought it would be convenient there :)
I search everywhere about the difference between Newton's Method and Gradient Descent? but I couldn't find something that useful. Can u suggest any website/ article where I can learn the difference?
The author has far too little mathematical understanding to be teaching anybody (that’s my impression at least). If you don’t understand Newton’s method before reading this, you won’t understand it afterwards. “A method for finding the roots of a polynomial”. Why polynomials? Does it work, always? Is it fast? Why would following the tangent repeatedly be a good idea? “We take the inverse instead of the reciprocal because it’s a matrix”. Not impressed.
I am a programmer trying to learn math, so are my intended audience members.
That said, I should include facts related to convergence, and maybe even speed compared to SGD.
As to the reciprocal -> inverse generalization, do you have any resources you could point we towards to better understand this?
Additionally, a concrete answer to "Why would following the tangent repeatedly be a good idea?" has been hard to come by for me. I am able to visualize this, but if you have resources that explain this well please share.
In general, it’s not a good idea. And in general, Newton’s method won’t converge.
Newton’s method boils down to replacing your function by a first-order approximation. For a differentiable function, in a small neighbourhood(!), that’s a good approximation (by definition), though, and the zero of the model function will be very close to the zero of the original function (if it lies in that neighbourhood).
PS: i did not expect the poster and author to be the same person, otherwise I would’ve phrased my criticism differently. A SHOW HN would have helped.
PPS: basically the whole reciprocal/inverse confusion only arises because you start the multidimensional case from your iteration formula. If you back to its derivation, and start again from there, you can avoid that.
> In general, it’s not a good idea. And in general, Newton’s method won’t converge.
Right, but this blog post isn't about the general case of using Newton's method to find roots, it's about using Newton's method for solving logistic regression for which it is perfectly suited, though there are better methods as well, of course.
To elaborate a bit: Newton’s method is amazing. But it’s also dangerous and difficult to use. You can talk about it for a long time and use it as the centrepiece of an course on vector analysis. This treatment does not explain anything - it might as well just say “we use the magical newton method, a discussion of which is beyond the scope of this work”. I’d find that honest.
Back when I was in high school Newton's Method was typically introduced as a method to find roots of a polynomial, usually in "Algebra 2" or Trigonometry or about then. That's because there is an easy to grasp formula for it that can be used in the context of a very familiar mathematical construct (polynomials) and produced from differentiation that doesn't technically require an understanding of calc to use (as opposed to actually understand, derive, etc.). I still use it pre-calc when I tutor younger students because it's a great exploratory tool.
Edit: Found an answer: https://www.quora.com/In-optimization-why-is-Newtons-method-...