The Matrix Calculus You Need for Deep Learning

hypersoar · on June 29, 2018

Does anyone know of good resources for studying machine learning or data science given a strong mathematical background? I'm transitioning careers from pure math research into industry. I know very little about machine learning, but I know the crap out of linear algebra and real analysis (and other, less relevant fields). It'd be great to have some sources that leverage that without assuming much prior CS knowledge.

soVeryTired · on June 29, 2018

For non-deep learning, read David Barber's book:

http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/090310.pdf

Some sections may be less relevant, depending on what you want to do, but Section III is a very good introduction to machine learning methods.

Do the exercises as you're reading. Theory is one thing, but in ML my rule of thumb is that you don't really understand a model until you've coded it up. A collection of written exercises would be a good way to impress an interviewer, too.

glitch · on July 2, 2018

Note: 090310.pdf (9 March 2010) is an older version of the textbook. The latest revision of the textbook PDF (091117.pdf, 9 November 2017) is available at:

→ http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/091117.pdf

Aside:

⸰ Textbook homepage: http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=...

⸰ Online version homepage (should always contain link to latest revision): http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=...

⸰ Directory sorted by date: http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/?C=M;O=D

nightski · on June 29, 2018

It really is a great book. However, my friend and I attempted to make it through and struggled quite a bit. Eventually we'd figure things out, but it felt like it would have been significantly easier with the help of a teacher/mentor to ask questions. Unfortunately it's actually kind of hard to find colleges nearby that even have courses in Bayesian statistics (from an inference perspective). That was frustrating.

man-and-laptop · on June 29, 2018

It's far from state of the art, but reading Computer Vision by Simon J D Prince [1] alongside David Barber's BRML can really help.

[1] - http://www.computervisionmodels.com/

nightski · on June 29, 2018

I had this book on my computer back from when it was released, but never got around to reading it. Thank you for the recommendation!

mindviews · on June 29, 2018

For Machine Learning, I found "Learning From Data" http://amlbook.com/ to be a very strong, foundational book. It does the best job of any reference I've seen of answering the question: "Is learning possible and feasible?" It provides an excellent mathematical foundation for understanding when and why ML systems should succeed and fail. Associated MOOC from Caltech at https://work.caltech.edu/telecourse.html

saamm · on June 30, 2018

My ML professor used this book, and I loved it. Some presentations of ML make it seem like a bag of tricks; this feels like the opposite of that.

Eridrus · on July 1, 2018

I'm a big fan of the MOOC. Even though it didn't explicitly cover the topic, it basically explained what people mean when they say "we don't know why neural nets work": our previous generalization bounds would imply that neural nets should not generalize well, but in practice they do.

cs702 · on June 29, 2018

If you have at least some coding experience and you are interested in the practical aspects of ML/DL (i.e., you want to learn the how-to, not the why or the whence), my recommendation is to start with the fast.ai courses by Jeremy Howard (co-author of this "Matrix Calculus" cheat sheet) and Rachel Thomas[a]:

* fast.ai ML course: http://forums.fast.ai/t/another-treat-early-access-to-intro-...

* fast.ai DL course: part 1: http://course.fast.ai/ part 2: http://course.fast.ai/part2.html

The fast.ai courses spend very little time on theory, and you can follow the videos at your own pace.

Books:

* The best books on ML (excluding DL), in my view, are "An Introduction to Statistical Learning" by James, Witten, Hastie and Tibshirani, and "The Elements of Statistical Learning" by Hastie, Tibshirani and Friedman. The Elements arguably belongs on every ML practitioner's bookshelf -- it's a fantastic reference manual.[b]

* The only book on DL that I'm aware of is "Deep Learning," by Goodfellow, Bengio and Courville. It's a good book, but I suggest holding off on reading it until you've had a chance to experiment with a range of deep learning models. Otherwise, you will get very little useful out of it.[c]

Good luck!

[a] Scroll down on this page for their bios: http://course.fast.ai/about.html

[b] Introduction to Statistical Learning: http://www-bcf.usc.edu/~gareth/ISL/ The Elements of Statistical Learning: https://web.stanford.edu/~hastie/ElemStatLearn/

[c] http://www.deeplearningbook.org/

thanatropism · on June 29, 2018

There are many new methods cropping up that most people in the data science hype train will be full-on unable to access, including methods on manifolds (even kernel methods on manifolds) and algebraic-topological methods (persistent homology) with enough maths to give Kagglers the screaming meemies.

I'm using some of those for $(redacted, the plan is to make money). Don't follow the crowd.

mjw · on June 29, 2018

When I started out in ML I was really keen to learn about the most 'mathsy' approaches out there.

I think with hindsight, it's great to have a broad spectrum of methods available to you, but if you focus too much on methods at the hard-math end of the spectrum just for the sake of an intellectual challenge, you can end up fixated on an exotic solution looking for a problem while the rest of the field moves on, rather than doing useful engineering people care about.

Maybe you find a niche where something exotic really helps, maybe you don't -- maybe for research this is a risk worth taking. But just something to keep in mind.

IMO: breadth is good. Mathematical maturity helps. If one sticks around one finds uses for interesting maths eventually, but not worth trying to force it.

Another avenue for people who want to use some hardcore math: try and use it to find some good theory around why things which work well, work well. Not an easy task either by any means.

spitfire · on June 29, 2018

Any papers or books to get started with this? I'm in ML but not on the DL hype train - disappointed by its limits.

tzahola · on June 29, 2018

Math? Meh.

Just throw more GPUs at it bro. Gradient descent 4ever! xD

/s

mlthoughts2018 · on June 29, 2018

I made the same transition earlier in my career. One book on deep learning that meets your requirements is [0]. It’s readable, covers a broad set of modern topics, and has pragmatic tips for real use cases.

For general machine learning, there are many, many books. A good intro is [1] and a more comprehensive, reference sort of book is [2]. Frankly, by this point, even reading the documentation and user guide of scikit-learn has a fairly good mathematical presentation of many algorithms. Another good reference book is [3].

Finally, I would also recommend supplementing some of that stuff with Bayesian analysis, which can address many of the same problems, or be intermixed with machine learning algorithms, but which is important for a lot of other reasons too (MCMC sampling, hierarchical regression, small data problems). For that I would recommend [4] and [5].

Stay away from bootcamps or books or lectures that seem overly branded with “data science.” This usually means more focus on data pipeline tooling, data cleaning, shallow details about a specific software package, and side tasks like wrapping something in a webservice.

That stuff is extremely easy to learn on the job and usually needs to be tailored differently for every different project or employer, so it’s a relative waste of time unless it is the only way you can get a job.

[0]: < https://www.amazon.com/Deep-Learning-Adaptive-Computation-Ma... >

[1]: < https://www.amazon.com/Pattern-Classification-Pt-1-Richard-D... >

[2]: < https://www.amazon.com/Pattern-Recognition-Learning-Informat... >

[3]: < http://www.web.stanford.edu/~hastie/ElemStatLearn/ >

[4]: < http://www.stat.columbia.edu/~gelman/book/ >

[5]: < http://www.stat.columbia.edu/~gelman/arm/ >

Iwan-Zotow · on June 29, 2018

Goodfellow book [0] is available for free, http://www.deeplearningbook.org/

soVeryTired · on June 29, 2018

+1 for Gelman, but I hate Bishop's book [2]. It was an early go-to reference in the field, but there are better books out there now.

hikarudo · on June 29, 2018

What do you hate about Bishop's book? I'm genuinely curious.

soVeryTired · on June 29, 2018

Honestly, I don't understand the way he explains things. The maths is difficult to follow, and it just never clicks for me. Maybe he's writing for someone with a physics background or something, but I feel stupid when I read Bishop.

I just read over his description of how to transform a uniform random variable into a variable with a desired distribution (p. 526). It's a fairly easy trick, but if I didn't already know it I wouldn't understand his explanation.
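For readers who haven't seen it: the trick referred to is presumably inverse transform sampling (that is an assumption, since the page itself isn't quoted here). A minimal sketch for the exponential distribution, where the function name and the rate of 2.0 are illustrative choices and not taken from the book:

```python
import math
import random


def sample_exponential(rate, rng):
    """Draw one Exponential(rate) sample via inverse transform sampling."""
    u = rng.random()                     # u is uniform on [0, 1)
    # The CDF is F(x) = 1 - exp(-rate * x); solving F(x) = u for x gives:
    return -math.log(1.0 - u) / rate


rng = random.Random(0)                   # fixed seed so runs are repeatable
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)                       # should land close to 1 / rate = 0.5
```

The same recipe works for any distribution whose CDF you can invert: draw u uniformly, then return the x with F(x) = u.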

bllguo · on June 29, 2018

I'm trying to read through it and I have to agree, his math isn't that clear to me. What do you recommend?

soVeryTired · on June 30, 2018

David Barber!

Iwan-Zotow · on June 29, 2018

Sure, Goodfellow book

http://www.deeplearningbook.org/

srean · on June 30, 2018

Given your math bent, especially analysis, I would recommend you start with Luc Devroye and follow it up with Vapnik. These are a bit dated but still very relevant. After these two, go with Alex Smola and Bernhard Schölkopf. The books that have been suggested to you so far aren't bad, but they have a high fluff-to-meat ratio and aren't meant for a reader like you. If you have functional analysis in your bag you will feel right at home with the books I suggested. Deep NNs are much in the news, but on the math side of the practice of DNNs, there is not much to it apart from the chain rule for differentiation.

poster123 · on July 1, 2018

The book "A Probabilistic Theory of Pattern Recognition" by Devroye, Györfi and Lugosi is available from the site of Gyorfi at http://www.szit.bme.hu/~gyorfi/pbook.pdf .

poster123 · on July 1, 2018

Thanks for the recommendations. Devroye wrote several books, and I assume the one being mentioned is "A Probabilistic Theory of Pattern Recognition".

srean · on July 1, 2018

Yes that's the one I had in mind. His other books are also very good, but are more specialized.

You may also enjoy graycat's comments on HN. He knows his math and is contrarian about machine learning, but it's good to have that viewpoint.

ilzmastr · on June 29, 2018

Are you only interested in ML practicum? What did you take after real analysis?

What are your thoughts/interests on analysis for ML, like the approximation theory that branched from Fourier and wavelet analysis, catching on with Cybenko [1], continuing with (among others) Mhaskar [2], and most recently added to by Bolcskei [3]? And then there are other areas where analysis applies, like studying the optimization of such networks...

(And of course, this is just for NNs. There are other areas of research where analysis comes into ML. And of course, real analysis lays the foundations for probability and statistics, and is not abstracted away in many research areas in these fields.)

[1](https://pdfs.semanticscholar.org/05ce/b32839c26c8d2cb38d5529...) [2](https://pdfs.semanticscholar.org/694a/d455c119c0d07036792b80...) [3](https://arxiv.org/abs/1705.01714)

utk09 · on June 30, 2018

https://github.com/llSourcell/Learn_Machine_Learning_in_3_Mo...

This is also a good resource for machine learning.

RobertDeNiro · on June 29, 2018

You don't need much CS knowledge for deep learning, in fact pure math/stats is more useful. I'd say CS knowledge is only useful if you do RL stuff and even then you don't really need it that much.

tachyonbeam · on June 29, 2018

That really depends on what you're doing. I've seen people coming from a pure math background really struggle. If you've never programmed before, then using something like TensorFlow isn't exactly going to feel natural. Familiarity with Python and the Linux ecosystem would definitely be quite helpful. Yes, I'm sure you could run ML models on a Windows machine, but at some point, you might want to perform experiments on a cluster, and it's going to be running Linux.

RobertDeNiro · on July 1, 2018

True. I was assuming programming experience but no data structures and algorithms. I can attest that a lot of people in my ML class couldn't program and they sucked at the assignments we had.

hypersoar · on June 29, 2018

Tell that to my potential employers :|

RobertDeNiro · on July 1, 2018

Depends what jobs you're applying to. But I've seen firsthand that traditional HR doesn't know how to filter for ML resumes.

luk32 · on June 29, 2018

If someone likes a more lecture-style explanation, I can recommend 3blue1brown's material on YouTube. He explains it in a pretty good and accessible way imho.

I didn't learn artificial neural network stuff from there. I knew those concepts but I didn't know the matrix formalism applied to it. So this was really nice for understanding why GPUs are good for this. Math-wise it was a really nice watch.

tw1010 · on June 29, 2018

It's amazing what a rich-get-richer effect products and content that really manage to solve problems in a high quality way get in comments sections around the web. (E.g. 3B1B.)

Twisol · on June 29, 2018

I'm not really sure why you're being downvoted. You're right that 3B1B really does manage to solve a problem in a high-quality way, and it's amazing how much of an effect that really has on people, especially considering the relatively niche topics that 3B1B goes into. (You'd think that his "Essence of" series would be more popular, but the one-off problem analyses have ridiculously more views in general.)

In some ways, it is a "rich-get-richer" effect. But creators like 3B1B expend a lot of time and resources to do what they do, and the word of mouth he gets is an acknowledgement that the work he does is worth the money and views we provide.

closetCS · on June 29, 2018

Is that such a bad thing?

The web has allowed the sharing of high quality content across the world for little to no cost, but has also created so much noise for the average user that they have little to no hope of finding the high quality content on their own. Comment sections across the web fix this problem by promoting producers that offer a superior product. This encourages everyone to make better content.

shrimp_emoji · on June 29, 2018

Yep. And 3b1b seems to agree: https://youtu.be/VcgJro0sTiM

roel_v · on June 29, 2018

So I have a question somewhat related to this that I never knew where/who to ask. (Well, actually I asked a few mathematicians at a university I work with, whose answers I couldn't understand - their answers were almost as impenetrable as the Wikipedia page - and some engineering scientists who I thought would be more into 'applied math', but they didn't know.) So I'm hoping some data science people reading this will better understand where I'm coming from, and be able to explain at my primitive math skills level.

So, some time ago I contracted out writing some code for fitting a logistic regression onto a given set of observations. There were some specific requirements, but I think I should have been able to piece something together myself using mainstream LA libraries; some of them even hint at 'you could fit a logistic regression using these functions' but give no complete examples. But I didn't understand it well enough, so I contracted it out.

The woman who ended up writing the code used a 'Hessian' matrix to do so (she actually wrote two functions doing the same thing, one used this Hessian approach - I think the idea was that it would be faster but there wasn't a lot of time and it never got tweaked enough to make a difference).

So my question - is there a layman's explanation for what a Hessian matrix is, and how it applies to fitting a logistic model? Also (with an eye to the future of my project), does it have applications for non-linear regressions?

Alternatively, are there any books where this is covered? I have most standard stats/applied stats/operations research books, as well as a few like the no bullshit guide to linear algebra, but none cover this specific issue - or even how to fit a logistic regression at all, on a practical level (so not just 'conceptually you do xyz, implementation is left as an exercise for the reader').

archgoon · on June 29, 2018

So, when you perform a regression of any sort, what you're doing is saying "Hey, I want to find parameters X,Y,Z, etc, that make this curve best fit the data that I have". One interpretation of 'best fit' is 'minimize the mean squared error'.

So regression is just a minimization problem. You're trying to find the values that minimize

  f(X,Y,Z...)

And, well, that means that you just want to find values for X,Y,Z such that

  ∇f(X,Y,Z) = 0

Why? Because the derivative tracks the rate of change. If the function isn't changing much locally, you've hit either a minimum or maximum (assuming the function is smooth).

Let's see how this plays out when you have a function that you're trying to minimize that only has 1 parameter. This is then just a regular old function of one variable, and we can easily visualize it.

  |     | |     x^2 + 2x
   \    |/
  ------------
     \_/|

And take its derivative

       |   / 
       |  /
       | /      2x + 2
       |/
       |
      /|
     / |
  ------------

And then you'd say, oh hey I know the roots of 2x + 2 ! It's just x = -1. So x^2 + 2x would have a minimum at -1.

But for many functions you may not know the functional form, or there may be no way to solve for the roots symbolically (or you don't know the tricks to do so), so you might want to try a strictly numerical approach. You say "Look, I know how to compute the original function, can't I just use that directly?"

Well, one option would be to find two points on your function, one where it's positive, the other where it's negative, and then keep bisecting the interval until you zoom in on the point where it crosses the x axis (with additional logic to handle multiple crossings). Of course, if your function is discontinuous, or is effectively discontinuous due to a floating point error, you're going to be hosed. This is the bisection method. However, it's not as fast as is desirable. Since each additional decimal place is a factor of 10 in precision, and you typically want several decimal places at least, increasing your precision by a mere factor of 2 each time may not be good enough. This is especially true if you're doing this by hand. :)
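To make the bisection idea concrete, here is a minimal sketch applied to the derivative 2x + 2 from the plots above. The function name `bisect_root`, the starting interval, and the tolerance are illustrative choices, and the "additional logic to handle multiple crossings" is deliberately left out:

```python
def bisect_root(g, lo, hi, tol=1e-10, max_iter=200):
    """Find a root of g in [lo, hi], assuming g(lo) and g(hi) differ in sign."""
    g_lo, g_hi = g(lo), g(hi)
    if g_lo == 0.0:
        return lo
    if g_hi == 0.0:
        return hi
    if g_lo * g_hi > 0:
        raise ValueError("g(lo) and g(hi) must have opposite signs")
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        g_mid = g(mid)
        if g_mid == 0.0 or (hi - lo) / 2.0 < tol:
            return mid
        # Keep the half-interval whose endpoints still bracket the sign change.
        if g_lo * g_mid < 0:
            hi = mid
        else:
            lo, g_lo = mid, g_mid
    return (lo + hi) / 2.0


def derivative(x):
    return 2.0 * x + 2.0      # derivative of x^2 + 2x


root = bisect_root(derivative, -4.0, 3.0)
print(root)                   # approximately -1
```

Each pass halves the bracket, so going from an interval of width 7 to a tolerance of 1e-10 takes a few dozen evaluations.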

So a faster method is to start at a point, draw the tangent line to the function at that point, and see where that line intersects the x-axis. Evaluate the function at that new point, and normally it'll be closer to a zero than before. Repeat. This, in many cases, will converge much faster than performing bisection.

This is known as the Newton-Raphson Method. Now, in order to draw a tangent line, you have to know how the function is changing at that point, so the line and the function's slopes will match, and well, that means you take the derivative. Since, however, the function which we're trying to find the root of is itself a derivative, this is now a second derivative.
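A minimal sketch of the tangent-line iteration, assuming you can evaluate both the function whose root you want (here the first derivative of the error) and its derivative (the second derivative). The helper name and the second, non-quadratic test function are illustrative choices, not something from the thread:

```python
import math


def newton_raphson(g, g_prime, x0, tol=1e-12, max_iter=50):
    """Find a root of g by following tangent lines, starting from x0."""
    x = x0
    for _ in range(max_iter):
        step = g(x) / g_prime(x)   # distance to where the tangent crosses zero
        x -= step
        if abs(step) < tol:
            break
    return x


# The thread's example: f(x) = x^2 + 2x, so g = f' = 2x + 2 and g' = f'' = 2.
x_min_quadratic = newton_raphson(lambda x: 2 * x + 2, lambda x: 2.0, x0=10.0)

# A non-quadratic example chosen for illustration: f(x) = exp(x) - 2x,
# so g = f' = exp(x) - 2 and g' = f'' = exp(x); the minimum is at ln(2).
x_min_exp = newton_raphson(lambda x: math.exp(x) - 2, math.exp, x0=0.0)

print(x_min_quadratic, x_min_exp)
```

For the quadratic, the tangent line of the derivative is the derivative itself, so the first step already lands on x = -1. The exponential example needs a handful of steps.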

So it turns out Newton-Raphson generalizes upwards in dimension. When you start off with the error function that you're trying to minimize, you take its derivative, but now you have to track how it changes in N dimensions, so the derivative object is now a vector.

Now, we're trying to find where this vector-valued function (the gradient) is zero. So we take its derivative, which, since it's vector valued, will now be a matrix, since each component can vary in N directions. So we now have an NxN matrix that tracks how everything is changing. This second order derivative is called the Hessian. And we can use it the same way (implementation left to the reader ;) ) as we did in one dimension.
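For anyone who wants to see the "implementation left to the reader", here is a minimal two-parameter sketch. The objective function is made up for illustration (it is convex with its minimum at the origin), and the 2x2 linear solve is written out by hand so the example needs nothing beyond the standard library:

```python
import math


def gradient(x, y):
    # f(x, y) = exp(x) - x + exp(y) - y + (x - y)^2 / 2, minimized at (0, 0)
    return (math.exp(x) - 1 + (x - y), math.exp(y) - 1 - (x - y))


def hessian(x, y):
    # 2x2 matrix of second partial derivatives of f
    return ((math.exp(x) + 1, -1.0), (-1.0, math.exp(y) + 1))


def newton_minimize(x, y, tol=1e-12, max_iter=50):
    """Drive the gradient to zero using the Hessian, as in the 1-D case."""
    for _ in range(max_iter):
        gx, gy = gradient(x, y)
        (a, b), (c, d) = hessian(x, y)
        det = a * d - b * c
        # Solve H @ step = grad for the 2x2 case with the explicit inverse.
        step_x = (d * gx - b * gy) / det
        step_y = (-c * gx + a * gy) / det
        x, y = x - step_x, y - step_y
        if math.hypot(step_x, step_y) < tol:
            break
    return x, y


x_star, y_star = newton_minimize(1.0, -1.0)
print(x_star, y_star)   # both approximately 0
```

In N dimensions the hand-written inverse would be replaced by a linear solve against the NxN Hessian, which is where the cost of the method comes from.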

Twisol · on June 29, 2018

This is really a fantastic explanation, but I want to clarify one thing from early on for other readers. If the function isn't changing much locally, it's possible that you are at neither a minimum nor a maximum, but rather a saddle point. For instance, every function f(x) = x^k for odd k ≥ 3 has a saddle point (a stationary inflection point) at x=0. In these one-dimensional examples both the first and second derivatives are 0 at the saddle point -- another reason we might be interested in the second derivative and its kin.

victor106 · on June 29, 2018

Wow!!! This is very clearly explained. Thank you. You should write a textbook.

TJSomething · on June 29, 2018

Those Wikipedia pages are kind of awful for pedagogy, but they have the right equations, so I won't cover those.

Say we have a curve that corresponds to how good of a fit your model is. We want to try to find the maximum on that curve. However, calculating every point of the curve is too expensive, so we want to minimize the number of points we have to check. So, we start with a guess as to the highest point on the curve and take the first and second derivatives of the curve at that point. This gives us enough to fit a parabola that approximates the curve in the neighborhood of our initial guess point. Then, it's pretty easy to solve for the highest point on the parabola. That's our new guess. Repeat that a few times until the guesses stop shifting much. If the curve is nicely shaped (i.e. smooth everywhere, only has one maximum), the guesses will converge on the highest point.

This is often faster than a similar method, gradient ascent, which relies upon only taking the first derivative. This would yield a line in the vicinity of our guess, and then we just move our guess a little bit, such that it goes up the line. This is pretty slow, since it can't just go straight to a guess of the top, and if you go too fast, then it'll blow right past the maximum.
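A rough way to see the speed difference is to run both methods on the same one-dimensional curve and count iterations. The curve ln(x) - x (with its maximum at x = 1), the learning rate, and the stopping tolerance below are all illustrative assumptions:

```python
def f_prime(x):
    return 1.0 / x - 1.0          # derivative of f(x) = ln(x) - x, for x > 0


def f_second(x):
    return -1.0 / (x * x)         # second derivative of the same f


def newton_ascent(x, tol=1e-10, max_iter=1000):
    """Jump to the top of the parabola fitted at the current guess."""
    for i in range(1, max_iter + 1):
        step = -f_prime(x) / f_second(x)
        x += step
        if abs(step) < tol:
            return x, i
    return x, max_iter


def gradient_ascent(x, learning_rate=0.1, tol=1e-10, max_iter=10000):
    """Take a small step uphill along the local slope."""
    for i in range(1, max_iter + 1):
        step = learning_rate * f_prime(x)
        x += step
        if abs(step) < tol:
            return x, i
    return x, max_iter


x_newton, n_newton = newton_ascent(0.5)
x_gradient, n_gradient = gradient_ascent(0.5)
print(x_newton, n_newton)       # guess and number of iterations used
print(x_gradient, n_gradient)
```

On this curve the parabola-fitting version needs only a handful of iterations, while the fixed-step version needs many more. Raising the learning rate speeds it up until it starts overshooting the maximum.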

The Hessian matrix is the higher dimensional equivalent to the second derivative there and the gradient is the equivalent for the first derivative. For example, if we have a two dimensional surface in 3D, then the gradient will have 2 entries and the Hessian will be 2x2, and together they capture the slope and curvature of the 3D paraboloid fitted in the vicinity of the guess. As you go up in dimensionality, the fitted surfaces are called quadric hypersurfaces.

When you're fitting a logistic regression, your hypersurface is the logarithm of the likelihood that the data you have fits the logistic curve with parameters at that point. The logarithm makes the hypersurface better behaved and makes the calculus easier. You just need the gradient and the Hessian, evaluate those at your initial guess, fit a quadric hypersurface to the guess there, pop up to the top of that hypersurface, repeat a few times, and you've got your model.
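Since the original question was about fitting a logistic regression this way, here is a minimal sketch for one feature plus an intercept. The toy data is made up, the function names are hypothetical, and a real implementation would add safeguards (step damping, handling of perfectly separable data) that are omitted here:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_newton(xs, ys, n_iter=25, tol=1e-10):
    """Fit p(y=1 | x) = sigmoid(b0 + b1 * x) by Newton's method."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of the (symmetric) Hessian
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            g0 += y - p
            g1 += (y - p) * x
            w = p * (1.0 - p)
            h00 -= w
            h01 -= w * x
            h11 -= w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: beta <- beta - H^{-1} g, with the 2x2 inverse written out.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (-h01 * g0 + h00 * g1) / det
        b0, b1 = b0 - d0, b1 - d1
        if math.hypot(d0, d1) < tol:
            break
    return b0, b1


# Made-up, non-separable toy data: larger x makes y = 1 more likely.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic_newton(xs, ys)
print(b0, b1)
```

At the returned parameters the gradient of the log-likelihood is (numerically) zero, which is exactly the "top of the hypersurface" described above.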

ghaff · on June 29, 2018

Unfortunately, mathematics is one of the areas where Wikipedia is pretty awful in general. The articles seem mostly written for people who pretty much already understand the topic in question. Of course, you always have to assume some knowledge base but the stereotypical jargon-filled Wikipedia approach is particularly off-putting in this area.

Myrmornis · on June 29, 2018

I understand what you’re trying to say, but Wikipedia is a fantastic resource for mathematics. “Pretty awful” is not a correct choice of words. But yes, much of it is written at beyond-undergrad-math level. And undergrad math is already advanced! And no I’m not someone with a math PhD talking down! I’m struggling through teaching myself undergrad math.

pvg · on June 29, 2018

The only way "pretty awful" is incorrect is that it is too polite and reserved. Reams upon reams of pages are written completely at odds with Wikipedia's own style guidelines and common-sense expectations of what one might find in an encyclopedia. Unlike some famously dense mathematical texts, wikipedia maths pages don't even come with any of the benefits of brevity or focus. It's like a giant joke competition of who can describe every trivial thing in the most abstract and abstruse way except it got out of hand and the participants forgot it was supposed to be a joke. Mathworld and similar sites will help you much more with undergrad maths.

Myrmornis · on July 1, 2018

We’re saying much the same thing, it’s just that I find your and GP’s use of the absolute “pretty awful” to be hyperbolic and something of a loss of perspective. Remember what we have here: a free, actively maintained, accurate, comprehensive and advanced corpus of expository writing on mathematics. Adjectives that are missing there are “intuition-rich” and “helpful for undergraduates”. I do understand if you are disappointed with it. As noted, I (undergrad level) don’t approach it with an expectation that it will be my favorite reading on a topic.

pvg · on July 2, 2018

> We’re saying much the same thing

No.

> something of a loss of perspective.

You'd have to provide some alternative perspective or argument that goes beyond 'pretty awful sounds kinda mean'.

> Remember what we have here: a free, actively maintained, accurate, comprehensive and advanced corpus of expository writing on mathematics

We already have a few of those. As I mentioned, mathworld is far better at this and it's been around longer than Wikipedia.

Sinidir · on June 29, 2018

My uni lectures used to be in the same style as wiki articles. Professor reading from an extremely dense abstract script every lecture and not really explaining anything.

Bad times.

cimmanom · on June 29, 2018

Thank you for a super clear explanation that was easy to grok even with no more math background than (extremely rusty) high school calculus.

hotwire · on June 29, 2018

yep, these are the kinds of posts on HN that I really appreciate, not the contrarian "well akshually" kind that usually appear.

mlevental · on June 29, 2018

what is the name of the first method (fit parabolic surface)?

mxwsn · on June 29, 2018

In optimization it's known as Newton's method.

See the third section here for an intuitive image of repeated parabola-fitting. https://ardianumam.wordpress.com/2017/09/27/newtons-method-o...

Wiki: https://en.wikipedia.org/wiki/Newton%27s_method_in_optimizat...

mlevental · on June 29, 2018

i feel silly for never realizing before that this was the appropriate geometric interpretation of newton's method.

dataflow · on June 29, 2018

Are you familiar with what the 2nd derivative of a function is? The Hessian is just that, when you have a function with multiple inputs. What's the second derivative of f(x,y) = x/y? Well there are four of them depending on the order in which you differentiate: f_xx = 0, f_xy = -1/y^2, f_yx = -1/y^2, f_yy = 2x/y^3. You just put these in a nice matrix and call it the Hessian matrix. So for functions of N input variables, there are N^2 second-derivatives, so the Hessian is the matrix of all those second derivatives.
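As a sanity check on those four second derivatives, here is a small sketch that approximates the Hessian of x/y with central finite differences and prints it next to the formulas above. The evaluation point (3, 2) and the step size are arbitrary choices:

```python
def f(x, y):
    return x / y


def numerical_hessian(func, x, y, h=1e-4):
    """Approximate the 2x2 Hessian of func at (x, y) with central differences."""
    f_xx = (func(x + h, y) - 2 * func(x, y) + func(x - h, y)) / h**2
    f_yy = (func(x, y + h) - 2 * func(x, y) + func(x, y - h)) / h**2
    f_xy = (func(x + h, y + h) - func(x + h, y - h)
            - func(x - h, y + h) + func(x - h, y - h)) / (4 * h**2)
    # f_xy and f_yx agree for smooth functions, so one estimate fills both slots.
    return [[f_xx, f_xy], [f_xy, f_yy]]


def analytic_hessian(x, y):
    # The second derivatives quoted in the comment above.
    return [[0.0, -1.0 / y**2], [-1.0 / y**2, 2.0 * x / y**3]]


approx = numerical_hessian(f, 3.0, 2.0)
exact = analytic_hessian(3.0, 2.0)
print(approx)
print(exact)    # [[0.0, -0.25], [-0.25, 0.75]]
```

The finite-difference entries should match the exact ones to several decimal places, and the matrix is symmetric because the order of differentiation does not matter here.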

I'm not sure how you saw it used for fitting a logistic model, but in general knowing the concavity of a function is useful for minimizing or maximizing it (e.g. by repeatedly approximating it as a 2nd-order polynomial and minimizing that), so maybe that's what you saw.

thomasahle · on June 29, 2018

> I'm not sure how you saw [the Hessian] used for fitting a logistic model

Probably some 2nd derivative version of gradient descent or Newton's method.

https://en.wikipedia.org/wiki/Newton%27s_method_in_optimizat...

dataflow · on June 29, 2018

Yeah that's what I was referring to when I said "approximating it as a 2nd-order polynomial and minimizing that".

GistNoesis · on June 29, 2018

In general, you have a complex function that you want to minimize to obtain your parameters. One way to proceed is to assume some initial parameters and refine them. To refine them, you locally approximate your function, get the minimum of the approximation, take a step toward this minimum and loop until convergence.

At first order you can locally approximate your function as a plane (the plane that goes through the point and has the same first derivative), and to minimize that you take a small step (because your approximation is only valid locally) in the direction in which the plane slopes downward.

Alternatively you can make a better approximation of your function using higher order derivatives. So instead of approximating your function with a plane, you approximate it with a quadratic form (the multidimensional extension of the parabola). This quadratic form matches the first derivatives and also the Hessian (second derivatives in multi-dimension) of your function at your current parameters.

Once you have the approximation, there is a closed-form formula for the minimum of the quadratic form, so you can directly jump closer to the result (but how well it performs will depend on how closely your function resembles your approximation).
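Written out, the two approximations and the closed-form minimum look roughly as follows. The symbols θ for the current parameters, Δ for the step, H for the Hessian and η for the step size are notation introduced here, not taken from the comment:

```latex
% Second-order (quadratic) model of f around the current parameters \theta
f(\theta + \Delta) \approx f(\theta)
  + \nabla f(\theta)^{\top} \Delta
  + \tfrac{1}{2}\, \Delta^{\top} H(\theta)\, \Delta

% Minimizing the model over \Delta (assuming H(\theta) is positive definite)
\nabla f(\theta) + H(\theta)\, \Delta^{*} = 0
\quad\Longrightarrow\quad
\Delta^{*} = -H(\theta)^{-1}\, \nabla f(\theta)

% First-order alternative: a small step against the gradient
\Delta_{\mathrm{GD}} = -\eta\, \nabla f(\theta), \qquad \eta > 0 \text{ small}
```

The first-order model has no minimum of its own (a plane keeps going down), which is why the step size has to be chosen by hand there.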

When to pick a first order or a second order approximation is problem dependent, but a quick rule of thumb is that second order is faster when close to the solution but consumes memory quadratically with respect to the number of dimensions, so it is only applicable when the dimension is low, or when you have problem-specific simplifications like your problem being a sum of squares.

In practice, interesting problems are too big and a first order method is all you can do. But you can also improve things a little by approximating the diagonal of the Hessian, or by using some low-rank approximation of the Hessian. (This is another tractable kind of approximation of your function.) You can also make some probabilistic approximation of your function (only considering a few examples instead of the whole training set), and from that you can derive all "on-line" methods, but this is a story for another day.

imh · on June 29, 2018

Look up maximum likelihood estimation [0] and look up newton's method[1]. Take the log likelihood of your data and find the parameters which maximize it. If that doesn't make sense, read up on numerical optimization.

[0] https://en.wikipedia.org/wiki/Maximum_likelihood_estimation

[1] https://en.wikipedia.org/wiki/Newton%27s_method_in_optimizat...

enriquto · on June 29, 2018

The Hessian matrix is covered in just about any standard calculus book.

The Hessian is a generalization of the second derivative of elementary calculus. Recall that the second derivative of a function f(x) lets you distinguish convex (f''>0) and concave (f''<0) parts of the function. If you are at a critical point (f'=0), it lets you distinguish maxima from minima.

The Hessian does the same thing, but for functions of several variables. It says in which spatial directions your function is convex or concave. At a local maximum, it is concave in all directions, and at a local minimum it is convex in all directions. At a saddle point, there will be directions where it is concave and directions where it is convex. The eigendecomposition of the Hessian lets you find these directions.
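A small sketch of that classification for two variables, using the closed-form eigenvalues of a symmetric 2x2 matrix. The function names and the three test surfaces are illustrative, and the point is assumed to already have zero gradient:

```python
import math


def symmetric_2x2_eigenvalues(a, b, d):
    """Eigenvalues of [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius


def classify_critical_point(a, b, d, eps=1e-12):
    """Classify a point where the gradient is zero from its Hessian entries."""
    lo, hi = symmetric_2x2_eigenvalues(a, b, d)
    if lo > eps:
        return "local minimum"      # curves upward in every direction
    if hi < -eps:
        return "local maximum"      # curves downward in every direction
    if lo < -eps and hi > eps:
        return "saddle point"       # up in some directions, down in others
    return "inconclusive"           # a zero eigenvalue: need higher-order terms


print(classify_critical_point(2.0, 0.0, 2.0))    # x^2 + y^2 at the origin
print(classify_critical_point(2.0, 0.0, -2.0))   # x^2 - y^2 at the origin
print(classify_critical_point(-2.0, 0.0, -2.0))  # -(x^2 + y^2) at the origin
```

The signs of the eigenvalues carry the classification; the corresponding eigenvectors (not computed here) give the directions of upward and downward curvature.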

It is useful for function approximation because the 2nd degree coefficients of the polynomial that best fits your function near a point are precisely the entries of the Hessian matrix, up to a factor of 1/2 (due to Taylor's theorem).

matheist · on June 29, 2018

Random guess: a logistic regression is fit by maximizing a particular log likelihood. [+]

You can find the maximum by (1) gradient ascent, or (2) the analogue of Newton's method in multiple dimensions [++], which involves computing the Hessian matrix.

So there's your Hessian.

[+] https://en.wikipedia.org/wiki/Logistic_regression#Model_fitt...

[++] https://en.wikipedia.org/wiki/Newton%27s_method_in_optimizat...

nerdponx · on June 29, 2018

Good guess!

pX0r · on June 29, 2018

A logistic regression model is typically used for a classification task. 'Fitting a logistic model' entails finding optimal coefficients / 'weights' of input features such that classification error is minimised.

For a binary classification task, one could simply calculate mean squared error between predicted values and actual labels (as in linear regression) and then proceed to find the optimal weights iteratively using gradient descent. But the sigmoid shape of the logistic function makes gradient descent a poor choice of an optimization technique (w.r.t. lack of guarantee of finding a global optimum).

A surer way to find globally optimal weights is to use Newton's method for calculating the weight updates. This is a numerical optimization technique that requires one to calculate the 1st and 2nd order derivatives of the error function. The matrix that 'calculates' the 1st order derivative is called a Jacobian and the one that calculates the 2nd order derivative is called a Hessian...

Rainymood · on June 29, 2018

>So my question - is there a layman's explanation for what a Hessian matrix is

(I'm approaching this from a graduate level stats angle). Just like the score vector is the derivative of the log-likelihood wrt its parameters, the Hessian matrix is just the derivative of the score vector. It's just the second derivative of the log-likelihood wrt the parameters. It's the derivative of the derivative of the log-likelihood function.
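For the logistic model specifically, those two objects have a compact form. The notation below (design matrix X with rows x_i, labels y_i in {0, 1}, parameters β, logistic function σ) is introduced here for illustration:

```latex
% Log-likelihood of a logistic model with p_i = \sigma(x_i^{\top}\beta)
\ell(\beta) = \sum_{i} \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr]

% Score vector: first derivative of the log-likelihood
s(\beta) = \nabla \ell(\beta) = \sum_{i} (y_i - p_i)\, x_i = X^{\top} (y - p)

% Hessian: derivative of the score vector
H(\beta) = \nabla^{2} \ell(\beta) = -\sum_{i} p_i (1 - p_i)\, x_i x_i^{\top}
         = -X^{\top} W X, \qquad W = \operatorname{diag}\bigl(p_i (1 - p_i)\bigr)

% Newton update built from the two
\beta \leftarrow \beta - H(\beta)^{-1} s(\beta)
      = \beta + (X^{\top} W X)^{-1} X^{\top} (y - p)
```

Because every p_i(1 - p_i) is positive, the Hessian is negative semidefinite, so the log-likelihood is concave and any stationary point is a global maximum.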

anilakar · on June 29, 2018

Engineering maths is usually "put your numbers in a matrix and calculate the eigenvalues and eigenvectors" regardless of the field :-)

MarkMMullin · on June 29, 2018

Please make the lowest variance dimension go away to see if you can understand the problem then, and if not, rinse and repeat ?

Koncopd · on June 29, 2018

Absolutely best guide (for me) on these things is http://www.psi.toronto.edu/~andrew/papers/matrix_calculus_fo... Also "Matrix algebra" by Magnus and Abadir.

jonnydubowsky · on June 29, 2018

This thread is a glowing reminder of how effective and friendly the HN community can be, when good will is extended in both the question and the answer. Thank you to all who contributed as the comments have provided me (and many others I'm sure) with a clear and concise map of the matrix calculus full of helpful resources. My 4th of July beach reading is now complete.

cimmanom · on June 29, 2018

The explanatory style here is excellent. They explain clearly in 3 paragraphs what I was unable to get a good handle on in a semester of college math (due to awful teaching styles and textbooks), namely why partial derivatives are calculated the way they are.

bag531 · on June 29, 2018

This is really useful. Matrix calculus was one of the hardest parts of my machine learning course at university simply because it had never been covered by any of the many math classes I'd taken.

andrepd · on June 29, 2018

Too extensive and not focusing enough on the core concepts. Teaching mathematics is about making the underlying structure as clear as possible. It's a very hard thing to do effectively.

qiqitori · on June 29, 2018

This book had a really gentle explanation of the required calculus: https://www.amazon.com/Make-Your-Own-Neural-Network-ebook/dp...

It also builds some really simple Python code to create a simple (non-"deep", i.e. with just one hidden layer) neural network capable of recognizing human-drawn digits with good accuracy.

sdan · on June 29, 2018

Truly amazing work. Hoped for some nice drawings but LaTeX did it!

calebh · on June 29, 2018

There's a tool now that can do matrix calculus: http://www.matrixcalculus.org/

stablemap · on June 29, 2018

Previous discussion:

https://news.ycombinator.com/item?id=16267178

dang · on June 29, 2018

Hmm that definitely qualifies this one as a dupe, but the discussion is unusually substantive so I guess we'll leave it up.

dm7 · on June 29, 2018

an accessible and concise refreshment of material