An introduction to Machine Learning (docs.google.com)
302 points by antoineaugusti on Jan 17, 2016 | 36 comments



Please note that I'm not the author of the presentation. It was made by Quentin de Laroussilhe: http://underflow.fr

I had to make a copy to my Google account to keep the slides.


Worth mentioning that a Stanford Statistical Learning course [1] just started, and according to the lecturers there is a lot of overlap between the two areas.

[1] https://lagunita.stanford.edu/courses/HumanitiesSciences/Sta...


Does anyone know how much it will cost?


It's free. They have run it a couple of years in a row now, starting in January.


The linked page says "Free".


thanks!


If you are just starting out with applied machine learning, I would focus heavily on understanding bias and variance, as it will really help you succeed. I think it's (largely) what separates the sklearn kiddies from the pros.
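
Not from the slides, but here's a rough numpy sketch of the trade-off (the sine function, noise level, and polynomial degrees are made up purely for illustration): fit polynomials of increasing degree and compare training error against error on held-out data.

    # Rough sketch of the bias/variance trade-off (toy data, not from the slides).
    import numpy as np

    rng = np.random.default_rng(0)

    def make_data(n=20):
        x = np.linspace(-1, 1, n)
        y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)
        return x, y

    x_train, y_train = make_data()
    x_test, y_test = make_data()

    for degree in (1, 3, 10):
        coeffs = np.polyfit(x_train, y_train, degree)
        mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        # Low degree: both errors tend to be high (high bias, underfitting).
        # High degree: training error drops but held-out error tends to rise
        # again (high variance, overfitting).
        print(f"degree {degree:2d}  train MSE {mse_train:.3f}  test MSE {mse_test:.3f}")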


what's wrong with sklearn?


Nothing wrong; quite the opposite, scikit-learn is awesome.

I think the comment was a play on "script kiddie" (people without real security chops but who know enough to run an exploit "script" of some sort).


This really is a fantastic presentation for newcomers to the field. When I was taking these classes I found it difficult to keep all of the available algorithms organized in my mind. Here's an outline of his presentation:

Overview (5 slides)

General Concepts (9 slides)

K nearest Neighbor (6 slides)

Decision trees (6 slides)

K means (4 slides)

Gradient descent (2 slides)

Linear regression (9 slides)

Perceptron (6 slides)

Principal component analysis (6 slides)

Support vector machine (6 slides)

Bias and variance (4 slides)

Neural networks (6 slides)

Deep learning (15 slides)

I especially like the nonlinear SVM example on slides 57 and 58. It gives a nice visual of projecting data into a higher-dimensional space.
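
To play with that idea outside the slides, here's a rough numpy sketch (my own toy example, not the author's): points on two concentric circles are not linearly separable in 2D, but adding a third feature x1^2 + x2^2 lets a flat plane split them, which is the intuition behind the kernel trick.

    # Toy sketch (not from the slides): project 2D circle data into 3D
    # so that a linear separator works.
    import numpy as np

    rng = np.random.default_rng(0)

    # Inner disc (class -1) and outer ring (class +1).
    angles = rng.uniform(0, 2 * np.pi, 200)
    radii = np.concatenate([rng.uniform(0.0, 1.0, 100),   # inner
                            rng.uniform(2.0, 3.0, 100)])  # outer
    X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
    y = np.concatenate([-np.ones(100), np.ones(100)])

    # Explicit feature map phi(x1, x2) = (x1, x2, x1^2 + x2^2).
    X3 = np.column_stack([X, (X ** 2).sum(axis=1)])

    # In the new space, the plane z = 1.5^2 separates the classes perfectly.
    predictions = np.where(X3[:, 2] > 1.5 ** 2, 1, -1)
    print("accuracy in 3D:", (predictions == y).mean())  # -> 1.0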


Thanks for the great slides.

Some questions:

I'm a bit confused trying to understand "error function" vs "loss function" (going from linear regression to the perceptron). Coming from a numerical background:

- Is the term 'error function' the name of a specific function (like sin, cosine, etc.), or is it a generalized term?

- If it's a specific function (the one that looks like MSE [2]), then it's confusing, because the "error function" as a fixed/special function is erf [1], also known as the Gauss error function (which looks completely different).

- Are we using 'loss function' as a generalized term whose special case is the 'error function'? E.g., in linear regression the loss function is the 'error function' (an MSE-like function), but in the perceptron the loss function is max(0, -xy)?

- Using the final output of the perceptron for the error function makes it a "hard problem", agreed. But what about using just the function from linear regression (the MSE-like one) instead of a brand new function max(0, -xy)? (It's not very intuitive to see what's so special about max(0, -xy).)

- Also wondering why we do not use RMSE instead of MSE in linear regression. (But it might have a known explanation in statistics texts, so somewhat off-topic.)

[1] https://en.wikipedia.org/wiki/Error_function

[2] https://en.wikipedia.org/wiki/Mean_squared_error


- You can think of error, loss, and cost functions as the same. In fact, two textbooks in front of me say that the loss function is a measure of error. If "loss" is a confusing word, think of it as the "information loss" of the model -- if your model is not perfect, you lose some of the information inherent in the data.

- There is no particular function used for error and loss. Different functions can be chosen based on the model, problem type, ease of theoretical analysis, etc. In practice, the final loss function is often experimentally determined by whatever yields the best accuracy.

- The perceptron uses a different loss function because it is a binary classifier, not a regressor. In this case, because there are only two classes (1 and -1), the loss function max(0, -xy) is 0 if x and y are the same class and 1 if they are different. Then, the error function just sums these losses together. (Note this is quite similar to MSE.)

- RMSE is also valid -- adding the square root will not affect minimization. MSE is likely more common for minor reasons, such as slightly better efficiency and cleaner theoretical proofs.


"the loss function max(0, -xy) is 0 if x and y are the same class and 1 if they are different"

Not exactly, because that would mean optimizing the number of correctly classified elements. Instead you minimize the sum of abs(WX) over the misclassified examples.


In the case of these slides, the loss function is max(0, -xy) and the error function is the sum of these. So, the error function is the number of incorrectly classified examples (if x and y are different, it adds 1 to the error), which is exactly what we hope to minimize.

x=1, y=1 => max(0, -(1 * 1)) = 0.

x=1, y=-1 => max(0, -(1 * -1)) = 1.

x=-1, y=1 => max(0, -(-1 * 1)) = 1.

x=-1, y=-1 => max(0, -(-1 * -1)) = 0.


The transfer function is applied only at evaluation.

In the formulas of the slides (and in the code), for training I compute the loss of an example X and its expected target as L(XW, target). What you describe is minimizing L(transfer(XW), target), which is not easily optimizable.


In the case of perceptrons, point taken -- I agree. However, my original statement still holds. The loss and error functions presented on the slides are still valid. Whether or not they are easily optimizable, they are still examples of loss and error functions.


- Is the term 'error function' the name of a specific function (like sin, cosine, etc.), or is it a generalized term?

For backpropagation I call the "error function" a function which takes the training data and the parameters of the model and returns an empirical approximation of how good the model is, so that it can be differentiated.

- Are we using 'loss function' as a generalized term whose special case is the 'error function'? E.g., in linear regression the loss function is the 'error function' (an MSE-like function), but in the perceptron the loss function is max(0, -xy)?

I used a squared error loss to define the error function of the linear regression model. This is arbitrary, and the choice of loss depends on the meaning we want to give to prediction errors. Here a big error is penalized much more than a small one.

The way I understand the difference between error and loss is by thinking about regularization methods (where you also add a penalty to the error depending on the parameters of the model). In that case error = loss + regularization. When applying gradient descent you would differentiate the error and not the loss. The error is an empirical value that is supposed to estimate the performance of the model on unseen data, and is consequently not formally defined.
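
As a concrete (hypothetical) example of error = loss + regularization, here is a ridge-style sketch where gradient descent differentiates the whole error rather than the data-fit loss alone; the data and the regularization strength are made up:

    # Hypothetical sketch: error = loss + regularization, and gradient
    # descent steps on the derivative of the error, not of the loss alone.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    true_w = np.array([2.0, -1.0, 0.5])
    y = X @ true_w + rng.normal(0, 0.1, 100)

    lam = 0.1          # regularization strength (arbitrary)
    w = np.zeros(3)
    lr = 0.05

    for _ in range(500):
        residual = X @ w - y
        loss = np.mean(residual ** 2)           # squared-error loss (data fit)
        error = loss + lam * np.sum(w ** 2)     # error = loss + L2 regularization
        grad = 2 * X.T @ residual / len(y) + 2 * lam * w  # derivative of the error
        w -= lr * grad

    print(w)  # shrunk slightly toward 0 compared to the unregularized solution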

- Using the final output of the perceptron for the error function makes it a "hard problem", agreed. But what about using just the function from linear regression (the MSE-like one) instead of a brand new function max(0, -xy)? (It's not very intuitive to see what's so special about max(0, -xy).)

The loss function used is a hinge loss. If the point is correctly classified, no penalty is added; otherwise the penalty is the distance to the hyperplane.
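
A tiny numerical sketch of that (my own numbers, with a fixed weight vector chosen just for illustration): correctly classified points contribute zero, and misclassified points contribute a penalty that grows with how far they sit on the wrong side of the hyperplane.

    # Sketch (not from the slides): hinge-at-zero loss max(0, -y * (w . x))
    # computed on raw scores, not on the thresholded output.
    import numpy as np

    w = np.array([1.0, -2.0])          # hypothetical hyperplane w . x = 0

    points = np.array([[ 2.0, 0.5],    # correctly classified as +1
                       [ 0.5, 1.0],    # misclassified, close to the boundary
                       [-1.0, 2.0]])   # misclassified, far on the wrong side
    labels = np.array([1, 1, 1])

    scores = points @ w                        # raw scores w . x
    losses = np.maximum(0, -labels * scores)   # 0 if correct, |w . x| if wrong

    print(scores)   # [ 1.  -1.5 -5. ]
    print(losses)   # [ 0.   1.5  5. ]  -- penalty grows with distance from the plane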


The reason they call it the error function in perceptron learning is that it relates to how the perceptron is taught a correction for an error. Loss functions are more general, and "loss" is usually the word people use when they're talking about optimization problems.


RMSE and MSE are monotonic transformations of each other, so minimizing one is equivalent to minimizing the other. You can think of linear regression as minimizing RMSE if you like; it's just cleaner to do the math without the extra square root.
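
A quick toy check (numbers made up): sweep a single slope parameter and both MSE and RMSE bottom out at the same value, since the square root is monotonic.

    # Toy check: MSE and RMSE are minimized by the same parameter value,
    # because sqrt is monotonically increasing.
    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0])
    y = np.array([0.1, 1.9, 4.1, 5.9])          # roughly y = 2x

    slopes = np.linspace(0.0, 4.0, 401)
    mse = np.array([np.mean((y - s * x) ** 2) for s in slopes])
    rmse = np.sqrt(mse)

    print(slopes[mse.argmin()], slopes[rmse.argmin()])  # same slope for both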


Pretty good summary of what you learn in your first machine learning class in college.


Is there an online course for this I could take?


Andrew Ng's popular Machine Learning course goes over most of the topics in the slides: https://www.coursera.org/learn/machine-learning

Linear and logistic regression, gradient descent, clustering, support vector machines, bias and variance (one of the slides was taken from the course), neural networks, etc...


You know... I believe I started this class once and didn't finish it due to time constraints. I think it's time to try again...


This "Statistical Learning" course has just started on Stanford's online platform last week:

https://lagunita.stanford.edu/courses/HumanitiesSciences/Sta...


I second this question. I couldn't find one on Coursera or Academic Earth.


I liked the UW Coursera class that gave a broad overview of these topics with some applications: https://www.coursera.org/learn/ml-foundations

It's part of a Machine Learning Specialization on Coursera (5 courses + a capstone project) which goes deeper on some areas after the foundations course: https://www.coursera.org/specializations/machine-learning

I am taking this specialization and I have learned a lot so far. The material seems like it's at exactly the right level of depth (balances giving a high level overview of the field, with enough depth in specific areas to understand how things work and be able to apply them). Disclaimer: I work at Dato, and the CEO of Dato is also one of the instructors of this course.


Nobody concerned about plagiarism here? I am pretty sure I've seen a number of the slides and graphics elsewhere. Correct attributions, however, seem to be missing.


I did those slides for a talk at school at the very last minute and did not expect them to be republished. I've requested edit rights on the document and I'll fix this asap.


Yes, thank you. I'm hoping to build an ANN this summer and don't have the luxury of taking an actual class.

Does anyone have any other resources?


You should look up MarI/O on YouTube; it may be a good starting point for you.


Thanks! I'll check it out.


That was a really good introduction :) Sort of like an executive summary: all the "why we care" and some of the terms you might want to look up to actually learn the details.


Thanks for sharing this!


Is there a corresponding video where the slides are presented?


Nope, sorry. This presentation was given by Quentin de Laroussilhe in Paris at EPITA recently.


Thank you.



