Having read significant chunks of both ESL and Understanding Machine Learning (albeit UML much more recently) I would argue that for many readers UML is superior.
ESL gives short shrift to the computational complexity of learning, whereas UML explicitly handles both statistical and computational complexity concerns. It doesn't matter how statistically pure your algorithm is if its running time scales exponentially with your data.
All of UML's chapters are conceptually unified even when discussing different ML algorithms, with ESL being more of a grab-bag by chapter.
I am a graduate student at MIT, and can second this recommendation. It is a fantastic book for machine learning and nothing else I have seen comes close.
Just a curiosity: one of the authors also proposed Pegasos SVM [1], a nice online approximation to SVM that can be written in 15 lines of code or so.
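To give a feel for how short it is, here is a rough sketch of the Pegasos update as I understand it: stochastic sub-gradient descent on the L2-regularized hinge loss with step size 1/(lambda * t). The function names, the toy data and the parameter defaults are my own, and I left out the bias term and the optional projection step, so treat it as an illustration rather than the authors' reference implementation.

```python
import random

def pegasos_train(samples, labels, lam=0.01, n_iters=1000, seed=0):
    """Stochastic sub-gradient descent on the regularized hinge loss (no bias term)."""
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    for t in range(1, n_iters + 1):
        i = rng.randrange(len(samples))            # pick one training example at random
        x, y = samples[i], labels[i]
        eta = 1.0 / (lam * t)                      # step size 1 / (lambda * t)
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        w = [(1.0 - eta * lam) * wj for wj in w]   # shrink step from the L2 penalty
        if margin < 1.0:                           # hinge loss is active for this example
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w

def predict(w, x):
    """Sign of the linear score w . x."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1

# Toy, linearly separable data (made up for illustration)
X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
Y = [1, 1, 1, -1, -1, -1]
w = pegasos_train(X, Y)
print([predict(w, x) for x in X])  # [1, 1, 1, -1, -1, -1]
```

Each iteration touches a single example, which is what makes it usable in an online setting.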
I feel like the barrier to machine learning for me, as I've seen in many tutorials and books and is an immediate discouragement, is the massive amount of math thrown in your face. Many of us didn't just graduate, need glasses and fall asleep at 8pm on the couch when the kids are in bed... Math is this distant fragment of memory buried under years of everything not Math.
It feels like machine learning is only taught by academia, while most of the audience is average developers who want to use it in practice and play with it today.
I interview a lot of developers for ML positions at our company. The first red flag is always a lack of math. Candidates who come in with API-level competence, i.e. who can implement an ML algo using this, that or the other API without understanding the basic math behind it, always fare poorly. At least in ML, not understanding the math is pretty much like claiming expertise in riding a bicycle after watching 100s of bicycle videos on YouTube without ever riding one yourself.
This isn't the case with, say, front-end web dev, where nobody really cares whether what's under the hood is PHP or jQuery or elegantly handcrafted Elm as long as the webpage looks good and functions as advertised. You can get a lot of mileage by hiring someone who can "make webpages" without knowing what tools they use to make them. Once you get traction etc. you'll probably rewrite all of the cruft :)
With ML, if you hire somebody who can "create decision tree in python mllib" without knowing the first thing about what entropy is & how to compute it (a real candidate, unfortunately), you are simply inflicting a lot of pain on yourself and your customers. Suppose such a person is deciding whether or not to give you a loan, and he decides to construct a decision tree. He'll happily throw your zip code and credit card number into the mix, not realizing that those two features have super high entropy and that the tree will have serious overfitting issues, i.e. it will simply not generalize to unseen data. He won't realize these things because he won't know what entropy is in the first place, since he only thinks of decision trees as some black box that comes out of some ML API.
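For anyone wondering what "compute entropy" means in practice, here is a small sketch. The loan data is made up, and entropy and information_gain are my own helper names, not anything from a particular ML library. It shows why a unique identifier such as a credit card number looks like a perfect split on the training set: every group it creates is pure, so it gets the maximum possible information gain while telling you nothing about new applicants.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    groups = defaultdict(list)
    for v, y in zip(feature_values, labels):
        groups[v].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical loan data for 8 applicants
repaid      = [1, 1, 0, 0, 1, 0, 1, 0]
card_number = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]  # unique per person
has_job     = [1, 1, 0, 0, 1, 1, 0, 0]

print(entropy(repaid))                        # 1.0 (4 repaid, 4 not)
print(information_gain(card_number, repaid))  # 1.0, a "perfect" split that memorizes the data
print(information_gain(has_job, repaid))      # about 0.19
```

A tree that greedily picks the highest-gain feature would choose the card number here, which is exactly the overfitting trap described above.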
A lack of mathematical intuition is a serious problem for many people from engineering to biology to economics. It certainly plagued me throughout my engineering bachelors studies and is something I continually work to get better at.
In my opinion, physics students learn the best framework for thinking and get a very good mathematical intuition. For example, here's a problem from an introductory QM book that really threw me for a loop when I was studying:
A needle of length L is dropped at random onto a sheet of paper ruled with parallel lines a distance L apart. What is the probability that the needle will cross a line?
Since the lines are parallel you can rephrase the problem:
One end of the needle always lands in some zone between two adjacent lines, at a distance x from one border, with 0 < x < L. The other end then lies somewhere on a circle of radius L centered on the first end.
How much of the circle's 2pi boundary is outside the zone?
When x -> 0 it's 50%, since one boundary line goes through the center and the other becomes a tangent. When x = L/2 each boundary line cuts the circle at a distance L/2 from the center, which puts two 120-degree arcs outside the zone, so it's 2/3. The fraction depends on x, so you have to average it over 0 < x < L, which gives 2/pi, or about 64%.
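A quick numerical check of the circle picture is worthwhile, since the fraction of the circle outside the zone changes with x. This is only a sketch: fraction_outside and the parametrization of the zone as the strip [0, L] are my own choices. A point on the circle sits at height x + L*sin(t), and it is outside the strip when that height is above L or below 0.

```python
import math

def fraction_outside(x, L=1.0):
    """Fraction of the circle of radius L, centered at height x in the strip [0, L], lying outside the strip."""
    above = math.pi - 2 * math.asin((L - x) / L)  # arc where x + L*sin(t) > L
    below = math.pi - 2 * math.asin(x / L)        # arc where x + L*sin(t) < 0
    return (above + below) / (2 * math.pi)

print(fraction_outside(0.0))  # 0.5 at the border
print(fraction_outside(0.5))  # about 0.667 in the middle of the zone

# Average over x uniform in (0, L) with a simple midpoint rule
n = 100_000
avg = sum(fraction_outside((k + 0.5) / n) for k in range(n)) / n
print(avg, 2 / math.pi)       # both about 0.6366
```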
> A lack of mathematical intuition is a serious problem for many people from engineering to biology to economics.
This is true. I can't really say why, but after my Discrete Mathematics class a lot of my Computer Science problems became a lot easier to reason about.
I will say that after my discrete math class we really only talked about how to write a proof, and well, it really didn't help for me. (I think we were supposed to get further into stuff, but well, the class wasn't paced well, new professor, etc).
For such a problem, usually "at random" means a uniform distribution. But on the plane, there is no uniform distribution. So, the "paper" can't be the plane. So, it might be fair to ask the size of the paper and what happens with the needle near the edges? E.g., on a rectangular sheet of paper of finite size, the needle can land in a position so that it does not cross a line but would on a larger sheet of paper.
> A needle of length L is dropped at random onto a sheet of paper ruled with parallel lines a distance L apart. What is the probability that the needle will cross a line?
Thickness of line is needed right? Otherwise P approaches 100% as thickness approaches 0?
The lines are infinitely thin. Equivalently you can imagine the paper is divided into regions of width L, and the question is whether the needle will cross a region boundary (https://en.wikipedia.org/wiki/Buffon's_needle).
I don't think that page explains it very well, but I have a poor math background. I imagined notebook paper with horizontal lines spaced L apart and then the needle dropping at any angle. When the needle is vertical the probability it crosses a line is 1; when horizontal it is zero. The length of the needle L is the hypotenuse of a triangle. If we call the angle from horizontal x, the "height" of the needle is h = L sin(x) for x between 0 and pi/2.
The "lines" are like a sample of a point from a uniform distribution U with width L, and h is an interval inside U. The probability a number sampled from a distribution of width L will fall within interval h is h/L. Substituting for h gives p(cross|x) = sin(x).
Then assuming the needle is equally likely to drop at any angle, for any one angle theta we get probability density p(theta=x) = 1/(pi/2-0)= 2/pi.
The probability the needle drops at angle x AND crosses a line is the product p(theta=x) p(cross|x) = (2/pi) sin(x). As mentioned, x can range between 0 and pi/2. To get the probability the needle drops at angle x1 OR x2 OR x3, etc. and crosses, we need to sum all of these. So take the integral of (2/pi) sin(x) from 0 to pi/2. This gives 2/pi.
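If you'd rather trust a simulation than an integral, here is a rough Monte Carlo sketch of the same setup. The function name, number of drops and seed are arbitrary choices of mine. It draws the angle uniformly on [0, pi/2], draws the height of the needle's lower end uniformly within a band of width L, and counts a crossing when the needle's vertical extent L*sin(x) reaches the next line.

```python
import math
import random

def buffon_estimate(n_drops=200_000, L=1.0, seed=0):
    """Estimate the crossing probability for a needle of length L and line spacing L."""
    rng = random.Random(seed)
    crossings = 0
    for _ in range(n_drops):
        x = rng.uniform(0.0, math.pi / 2)  # angle from horizontal
        h = L * math.sin(x)                # vertical extent ("height") of the needle
        low = rng.uniform(0.0, L)          # height of the lower end within its band
        if low + h >= L:                   # upper end reaches the next line
            crossings += 1
    return crossings / n_drops

estimate = buffon_estimate()
print(estimate, 2 / math.pi)  # the estimate should land close to 2/pi
```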
Not to detract from your point that math is important, but in that example proper methodology (e.g., cross-validation), proper feature engineering, and especially domain knowledge are probably even more important. You can be aware of the strengths & weaknesses of different machine learning algorithms without being intimately familiar with the math. Ideally, ML methods are not treated as "black boxes", but some aspects are inherently black box, even if you do know the math (e.g., parameter tuning).
>> He'll happily throw in your zip code and credit card number into the mix, not realizing that those two features have super high entropy
You will be surprised: you can make significant gains by including the zip code. I've seen that happen in a competitive setting. Where you live probably contains some signal about your creditworthiness.
Having said that, of course it doesn't make sense to simply feed the raw zip code to the tree. An appropriate encoding (most people would use a one-hot encoding, though there exist better ones) of the zip code will be key to extracting signal in a robust way.
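For reference, a one-hot encoding just gives every distinct zip code its own 0/1 column. A minimal sketch in plain Python follows; one_hot_encode is a hypothetical helper of mine and the zip codes are made-up sample values. In practice you would probably reach for whatever encoder your ML library ships.

```python
def one_hot_encode(values):
    """Map each distinct category to its own 0/1 column."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1          # exactly one column is "hot" per row
        rows.append(row)
    return categories, rows

zip_codes = ["02139", "94305", "02139", "10001"]  # made-up sample, kept as strings
categories, rows = one_hot_encode(zip_codes)
print(categories)  # ['02139', '10001', '94305']
print(rows)        # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Keeping the codes as strings preserves leading zeros and stops the tree from treating them as ordered numbers. With thousands of distinct zip codes the number of columns grows quickly, which is one reason other encodings can work better.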
>> ... the tree will have serious overfitting issues
Isn't it an almost standard practise now to use an ensemble of trees, such as a random forest? Decision trees have long been known to be prone to overfitting.
I get that math is an important aspect, but being able to realise that throwing in those variables would create entropy sounds like more of a common-sense thing, no?
I guess I should still brush up on math though, it seems.
I can skip the math part - we don't all invent new algorithms - but what I really need is a large enough & gradual set of problems to solve (datasets + verification scripts). I mean, start from the simplest and teach people how to use the already available software. Machine Learning should be assimilated practically, too much theory with too little application is useless. Most of us should focus on using existing software efficiently instead of being able to implement backprop.
Kaggle's "Titanic: Machine Learning from Disaster" [0] is a great place to start. If you're anxious to dive in, go straight to "Getting Started with Excel" [1] and move on to "Getting Started with Python" [2]
Note: I'd strongly recommend the Anaconda Python distribution [3], as it has pretty much everything you need. Also, for immediate feedback on what you're doing with Python, I've fallen in love with Jupyter Notebooks (formerly IPython Notebooks) [4], which you'll have as part of the Anaconda distribution, along with all the other popular Python packages for scientific work.
Machine learning is fundamentally mathematical, so you can't expect to completely avoid it without remembering or learning at least a little bit of the maths. Trying to avoid it all won't do you any favors; it will just mean you can't understand what is going on.
On the other hand, the math you absolutely need to follow along is pretty straightforward, so you can hopefully find tutorials that emphasize the applications and graphic examples of what is going on.
Ultimately, ML is a mathematical discipline. You can ask for a gentle approach that gets you to the foot of the mountain, but "if you want to learn about nature, to appreciate nature, it is necessary to understand the language that she speaks in." If you want to be more than an amateur, there's not much substitute for getting comfortable with math at the level of, say, Kevin Murphy's book.
The good news is that the required math is fairly elementary - calculus, linear algebra, probability and statistics, all freshman or maybe sophomore-level topics - so it shouldn't be beyond reach of a motivated developer able to set aside some time to learn. MOOCs and organizing study groups with friends/co-workers can help a lot here as well.
I'm currently taking Andrew Ng's Coursera course and I'd agree it's quite accessible. In fact, if you have a solid understanding of calculus and linear algebra, you might find it a bit slow at times.
For people who are disappointed by the shallowness of it, I recommend supplementing it with the notes to his Stanford class: http://cs229.stanford.edu/materials.html. The combination worked well for me.
This was my problem; it was incredibly boring (at least for the first few classes) and the video-based nature of it meant that I had to skip around and try to pick up what was going on later, which seldom worked. I hear it's more self-paced now, so that might be a better option these days.
> so it shouldn't be beyond reach of a motivated developer able to set aside some time to learn. MOOCs and organizing study groups with friends/co-workers can help a lot here as well.
I'm in the middle of that process right now. I only took Calc I in college and that was 20 years ago, so I have decided to work my way through a Calc sequence, Differential Equations, Linear Algebra, and Probability and Statistics through a combination of MOOCs, "X For Dummies" books, Youtube videos (hello, Gilbert Strang!), Schaum's Outlines books, Khan Academy, a mammoth stack of college maths texts that I've picked up at used book stores, and questions on stats.stackexchange.com, math.stackexchange.com, learnmath.reddit.com, etc.
I'm doing the Ohio State MOOC on Calc I now on Coursera, and accompanying that with the Gilbert Strang "Highlights of Calculus" video series[1]. So far so good. I definitely think this stuff is learnable if one is willing to put in the time and work, even without going back to taking "on campus" classes at a university.
I'm a high school senior (17) and I'm currently taking it. It is ridiculously understandable and I often find myself yearning for more. But I've also had Calc 3 + Linear Algebra by now, so it's understandable that not everyone would get it. The intuition is simple, however.
As others have said, you need to at least get the gist of what's going on mathematically to be able to make sensible decisions about model selection, structure, etc.
That said, I find a lot of the introductions to the theory behind ML techniques to be very poorly written. It's often worth giving a new student a conceptual simplification before introducing a rigorous definition.
Without linear algebra and basic probability/calculus though, forget it. Luckily there are great sources to brush up on it.
Understanding the math is certainly very important, but it is harder to build expertise in the prerequisite math because the horizon is much bigger. I would recommend taking a case-study approach and learning the math you need alongside it. If you are looking for an example, take a look at this:
https://www.coursera.org/learn/ml-foundations/
I've read this book and warmly recommend it. It has a very pragmatic "no bullshit" approach and it's very mathematical and concise.
The neural networks chapter is tiny (but that's ok - that's not the focus) and some of the questions are really hard - but overall I've really enjoyed it.
It's pretty cool that the book is not only free, but they also link to courses that use it.
If you speak Hebrew you can get two different professors' takes on how to teach the material in the book, as well as lecture notes from a total of 3 professors. That's pretty neat if there's a concept you are struggling with as a student!
Glad to see another book to learn from for free. But my problem now is that there are so many books, each with a somewhat different approach and content for the same ML techniques. That's not necessarily bad, but I get somewhat confused when trying to apply a method.
EDIT: I guess the focus on the theory might help me.
My question for someone that has an intermediate level of skill in machine learning, what's the best way to dip your toes in? (Udacity, coursera, edx, PDFs, Talking Machines podcast, etc)
Find a problem to work on in a domain you find interesting. By reading published papers and trying to attack the problem, you'll be forced to pick up a lot of other knowledge not commonly discussed like: feature extraction and selection, dimensionality reduction, dealing with sparsity, common metrics for that problem, recent work, etc.
I was forced to learn a massive amount in a short period of time for work, but I'd previously watched Andrew Ng's lectures, as well as majored in Math/CS. I can also generally recommend Hinton's NN lectures, Socher's Deep learning for NLP, Andrew Ng's Machine Learning, and a few books.
"The Elements of Statistical Learning" (https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLI...) is far and away the best book I've seen.
It took me hundreds of hours to get through it, but if you're looking to understand things at a pretty deep level, I'd say it's well-worth it.
Even if you stop at chapter 3, you'll still know more than most people, and you'll have a great foundation.
Hope this helps!