The calculus answer is that gradient descent approximates the shape of your optimization space with the first-order Taylor expansion at your current point (first-derivative information), while Newton's method approximates it with the second-order Taylor expansion (first- and second-derivative information).
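To make the two update rules concrete, here is a minimal sketch in Python on a one-dimensional quadratic. The function f(x) = (x - 3)^2 + 1, the step size, and the starting point are all illustrative choices, not from the original comment:

```python
# f(x) = (x - 3)**2 + 1 has its minimum at x = 3.

def f_prime(x):
    # first derivative: 2 * (x - 3)
    return 2.0 * (x - 3.0)

def f_double_prime(x):
    # second derivative: constant for a quadratic
    return 2.0

def gradient_step(x, lr=0.1):
    # first-order rule: take a short, fixed fraction of a step down the slope
    return x - lr * f_prime(x)

def newton_step(x):
    # second-order rule: jump to the minimum of the local quadratic model
    return x - f_prime(x) / f_double_prime(x)

x_gd = 0.0
for _ in range(5):          # gradient descent needs many small steps
    x_gd = gradient_step(x_gd)

x_nt = newton_step(0.0)     # on a true quadratic, one Newton step lands exactly at the minimum
```

Since f here really is a quadratic, Newton's "bowl" model is exact and a single step reaches x = 3, while gradient descent only creeps toward it.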
The geometric answer is that gradient descent approximates the surface around your current point with a tangent plane, whereas Newton's method approximates it with a quadratic "bowl" built from the local curvature. In gradient descent you have to take short steps down the slope, because you know your approximation cannot hold everywhere (you will not always be able to lower your error by moving in the same direction, assuming there is some minimum error). In Newton's method you assume your bowl approximation is correct and go straight to the bottom of that bowl. If your space really does look like a bowl, you will get to the bottom very quickly.
The ELI5 answer: you are a ship captain looking for the lowest point in a part of the ocean. You cannot afford to send a submersible down, so your approach is to send an anchor down to check the depth and take a panoramic photo a few feet around the anchor. After analyzing the results of each anchor drop you can move your ship anywhere and try again. In gradient descent you look for the lowest point at the edge of the camera's vision and move in that direction. You can move as far as you'd like, but you will not go too far, because you know the drop will not go on forever. In Newton's method you look at the curvature in the pano and extrapolate it outward, assuming it is a "bowl" shape. For example, maybe your anchor is on an underwater J-shaped cliff that gets less steep toward the north, so you map it out as a U shape that starts going up again a mile north. So you move your ship a mile north and try again there. In gradient descent you would have moved maybe 100 feet north and then checked again. Newton's method can intuitively get you to the low point in very few anchor readings, but crunching the numbers for the bowl can be very time-consuming (this becomes relevant in higher dimensions). In this analogy, the longitude and latitude coordinates in the ocean are your two parameters to be optimized, and the height of the ocean floor is the error in your optimization. You want to find the parameters that give you the lowest error.
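The ship analogy can be run as a toy two-parameter optimization. Everything here is invented for the demo: the "ocean floor" is a bowl f(lon, lat) = 2*(lon - 5)^2 + 0.5*(lat + 2)^2 with its lowest point at (5, -2), and each gradient evaluation plays the role of one anchor drop:

```python
# Toy "ocean floor": a quadratic bowl with different curvature in each direction.
# Lowest point is at (lon, lat) = (5, -2); coefficients are illustrative.

def grad(lon, lat):
    # slope of the floor in each direction (one "anchor drop")
    return (4.0 * (lon - 5.0), 1.0 * (lat + 2.0))

def gd_search(lon, lat, lr=0.1, steps=200):
    # gradient descent: many short, cautious steps downhill
    for _ in range(steps):
        g_lon, g_lat = grad(lon, lat)
        lon, lat = lon - lr * g_lon, lat - lr * g_lat
    return lon, lat

def newton_search(lon, lat):
    # Newton's method: the Hessian of this bowl is diagonal (4 and 1),
    # so dividing each slope by its curvature jumps straight to the bottom
    g_lon, g_lat = grad(lon, lat)
    return lon - g_lon / 4.0, lat - g_lat / 1.0

gd_pos = gd_search(0.0, 0.0)      # hundreds of anchor drops, approaches (5, -2)
nt_pos = newton_search(0.0, 0.0)  # one drop lands exactly at (5, -2)
```

Note the trade-off the comment describes: the Newton step needs the curvature in every direction (here trivially two numbers, in general a full Hessian matrix to build and invert), which is exactly the number-crunching that gets expensive in higher dimensions.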
Great explanation, and the ship example could make for a fun math game! I'm in NYC too and interested in creating spaces for mathematics; I'll be at the Museum of Mathematics hackathon next week. Drop me a line if you're interested in chatting: sandy.vanderbleek@gmail.com