If you know basic undergrad probability/statistics/linear algebra (matrices, vectors, eigenvalues), you can do CMU's graduate course in ML, which is intended to prepare PhD students to understand research papers in the field.
CMU also has an 'Applied Machine Learning' undergrad course that is unfortunately fully paywalled, but they use the text: Witten, I. H. & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, second edition.
And what about after that? In my experience at least, I can highly recommend Stanford's EE364a Convex Optimization. http://web.stanford.edu/class/ee364a/
When I was an undergrad at Berkeley, one particularly well known research lab would only look at your resume if you took that course online. Warning: that class is not for the faint of heart. And be good at linear algebra.
None of these classes are for the faint of heart if you're actually enrolled in them and operating under crazy deadlines with thousands of dollars on the line. This is clearly for our casual interest in, and hacking around with, ML research as a hobby.
It is gorgeous material. Large parts of it are from theorems of the alternative, linear programming, Jensen's inequality, R. T. Rockafellar, etc., but Boyd makes it all a clean whole.
That's the upside! The downside is that, IMHO, rarely does any one person need more than a small part of the whole book, and the parts that someone needs are likely also available in the subject they are working with.
So, at least download Boyd's book and use it for reference: if you ever have a question about convexity, then start with Boyd!
The title of this page feels like a reference to Mohri et al.'s excellent Foundations of Machine Learning. I’d recommend both it and Shai Shalev-Shwartz for VC theory/Rademacher complexity sorts of statistical ML.
However, I’m not crazy about this summary. It’s less foundational than it is a survey. I probably dislike the format and formatting more than anything, but I would not recommend this over other resources.
This course is complementary to Mohri's excellent book and course. Many students at NYU take both courses, in either order (https://davidrosenberg.github.io/ml2018/ and https://cs.nyu.edu/~mohri/ml17/). Mohri's course builds a foundation for proving performance guarantees (yes, using tools such as VCdim and Rademacher complexity). This course tries to be practical, but not superficial. We do a careful study of multiple examples of the four fundamental components of an ML method: loss function, regularization, hypothesis space, and optimization method. (In a probabilistic setting, regularization becomes a prior and loss becomes a likelihood.) Framed in this way, it's usually much easier to understand or invent new methods. And within this framework, we absolutely try to survey as many methods as we have time to look at carefully.
Hi David! Thank you for your thorough response. I appreciate it. And I understand why one would choose to cover as much breadth as you do. I don’t think there’s an easy way to cover all the basics without feeling like a survey, but keeping track of core concepts across applications was what helped me understand how they were connected.
Yeah, this space feels pretty saturated already, and people seem to use the same topics / presentation / ordering over and over. What a ton of duplicate effort!
It'd be nice to see a course try to put its own spin on "machine learning" and 1) present standard topics in an unusual way and 2) include topics that intro ML students might not normally see.
Take a peek at our class [0]! We present a bunch of topics not covered in undergraduate courses (at least to our knowledge), such as proximal gradient descent, random features, etc., without referencing any probability. The course is aimed mostly at sophomores and juniors in any STEM field, with the only prerequisite being the introductory linear algebra course EE103. [1] All of the material (minus solutions) is available online for both courses.
I'm probably highly biased, but I'd like to say this is a fairly fresh take which departs heavily from the usual CS229 (Ng's course) presentation style and order since it's meant for a completely different audience (and was, to be fair, written this past quarter, unlike 229 which was written perhaps 15 years ago).
I used Stanford’s materials for CS229 as a reference in two of the three graduate ML classes I’ve taken. I’ve been quite impressed, and I don’t doubt ee103 is also quite well done.
There are problems, but I’m afraid that we took them down since we need to rewrite a few before next year. Sorry! I’m sure poking around the site might yield some results, if you’re careful enough ;) (at least for 104, someone else is teaching 103 and I’m not sure what all they’re doing with the course, but all of the main problems are available on the book’s website).
Here are some of the things that I think are distinctive about the class (although certainly all of these are taught in some other class somewhere):

- discussion of approximation error, estimation error, and optimization error, rather than the vaguer "bias/variance" trade-off (common in more theoretical classes, such as Mohri's, but not common in practically oriented classes);

- full treatment of gradient boosting, one of the most successful ML algorithms in use today (most courses stop with AdaBoost, which is rarely used, for reasons discussed in the course);

- much more emphasis on conditional probability modeling than is typical (you give me an input, I give you a probability distribution over outcomes -- useful for anomaly detection and prediction intervals, among other things);

- explanation (including pictures) of what happens with ridge, lasso, and elastic net in the [very common in practice] case of correlated features (I haven't seen this in another class, though I'm sure others have done it somewhere);

- guided derivation (in homework) of when the penalty and constraint forms of regularization are equivalent, which gives you much more flexibility in how you formulate your problem (I know this is uncommon because it took me a very long time to find a reference);

- multiclass classification as an introduction to structured prediction (idea taken from Shalev-Shwartz and Ben-David's book).

On the homework side, you'd code neural networks in a computation graph framework written from scratch in numpy; we're pretty relentless about implementing almost every major ML method we discuss from scratch in the homework.
I looked over the whole list of topics. They have a page of questions about prerequisites, and I could answer all of the questions and was fully comfortable with each. But the questions have to do with probability and optimization, and my applied math Ph.D. was in those fields, with research in stochastic optimal control.
I can begin to see some of why Bloomberg is interested in the material: Maybe with all the economic and stock market data they collect, they can build some surprisingly useful predictive machine learning models. Okay. Maybe one good result will be that Michael Bloomberg will give some more money to Johns Hopkins!
But otherwise I was disappointed:
(1) Put simply and bluntly, it's all about just one now very old topic -- regression analysis. Well, when old regression analysis doesn't fit very well, then try logistic regression, ridge regression, regression trees, other forms of regression, other forms of curve fitting, e.g., neural networks.
Or, look, guys, it's all just empirical curve fitting. Ptolemy tried empirical curve fitting. He used his epicycles. They didn't fit very well. Then, later, from work of Kepler, a falling apple, etc., Newton guessed that (A) there was a law of motion, force equals mass times acceleration, and (B) there was a force directly proportional to the product of two masses and inversely proportional to the square of the distance between them. That fit the data great!
Lesson: Lots of things change continuously. Usually the changes are differentiable, that is, have a well defined tangent. Well, that tangent is linear. So at least locally, lots of things are linear. So, for lots of things, a promising first cut approach is to do a linear fit. For more, neural networks can approximate anything continuous, and lots of things are continuous. So, net, curve fitting has some promise and utility. But, still, curve fitting is just guessing without any real basis in science, e.g., couldn't find or replace what Newton did.
Really, the now classic texts in regression analysis start with something the machine learning curve fitting does not: The classic texts assume that there really is a linear equation, that we would have the equation exactly except for some errors in the data, and then do some nice applied math to show how to get the errors down and get a good approximation to the equation that has already been assumed to exist. The assumptions can vary, stronger or weaker, but in the theory there was little or no role for just empirical fitting. Well, machine learning is charging ahead without that assumption that a linear equation exists. And, wonder of wonders, apparently often now such an equation, even as a good approximation, doesn't exist. So, we are back to empirical curve fitting, struggling like Ptolemy. We already know: Some successes are possible, but like Ptolemy we face some severe limitations.
(2) Okay, when simple regression doesn't fit very well, we keep trying? Okay. Say, we try logistic regression, ridge regression, L1 or L2 regularization regression, regression trees, boosting, ..., neural networks, etc. Uh, which is better, L1 regularization or L2 regularization? Uh, this sounds like throwing stuff against the wall until something appears to stick. Sure, that can work at times, but are we really satisfied with that? Wouldn't we want some more solid reasons for the tool we pick? Lots of places elsewhere in applied math, applied probability, and applied statistics we do have solid reasons.
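To make the L1-vs-L2 question concrete, here is a minimal numpy sketch (toy data and names of my own invention, not from the course) of the L2 case, ridge regression, which has a closed form. With two nearly collinear features, plain least squares produces wild coefficients, while ridge keeps them tame:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small ill-conditioned design: two nearly collinear features.
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=n)     # the truth uses only x1

# Plain least squares: coefficients blow up on collinear data.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge (L2) has a closed form: w = (X'X + lam*I)^{-1} X'y.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

The L1 case (lasso) has no such closed form and needs an iterative solver, which is one concrete difference between the two choices.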
With so many efforts to patch up regression, maybe we might suspect that for a lot of problems regression is not the right tool?
(3) There is a lot more to applied statistics, applied probability, etc. and possibly of value for applications, maybe including Bloomberg's customers, than empirical curve fitting. Commonly this work has equal justification to be called machine learning because it also takes in data, estimates some parameters, builds a model, and gives results. Moreover, the better cases work with clear assumptions so that when the assumptions hold we know we have something solid, and some of the methods use meager assumptions that are relatively realistic in practice. The Bloomberg course is just empirical curve fitting, nearly all versions of regression analysis, and omits all the rest. On the applied math shelves of the research libraries, regression is only a tiny fraction of the whole. Where's the rest?
Nice comment, echoes a lot of my feelings about ML. I have a question. You write
> (2) Okay, when simple regression doesn't fit very well, we keep trying? Okay. Say, we try logistic regression, ridge regression, L1 or L2 regularization regression, regression trees, boosting, ..., neural networks, etc. Uh, which is better, L1 regularization or L2 regularization? Uh, this sounds like throwing stuff against the wall until something appears to stick. Sure, that can work at times, but are we really satisfied with that? Wouldn't we want some more solid reasons for the tool we pick? Lots of places elsewhere in applied math, applied probability, and applied statistics we do have solid reasons.
I'm wondering what these "solid reasons" could be? Some sort of experience based on past data? An example would be helpful.
Let's see: Pick a system that we know to be linear. Get a lot of pairs of real inputs and outputs and find the coefficients of the linear function that relates the inputs to the outputs. Then given a new input, can say what the output would be.
Okay, the function (system) between a violin on the stage at Carnegie Hall and a seat near the roof is linear. So, I'm typing quickly here, do regression with a recording at the stage and at the seat and look for the coefficients in the convolution. Then given an oboe, we can say what it will sound like in the seat.
Hooke's law with small deflections is linear. So, take as independent variables the forces on a space frame and as dependent variables the deflections, and estimate all the spring stiffness values.
Take 200 recipes for tomato sauce all from the same 10 ingredients, for each recipe measure the weight of each ingredient and the weight of the protein in the final sauce and estimate the protein in each of the 10 ingredients. Then for any new tomato sauce recipe, weigh the ingredients and get the protein in the final sauce.
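The tomato sauce example can be sketched in a few lines of numpy (all numbers are made up for illustration): simulate 200 recipes over 10 ingredients, fit by least squares, and the per-ingredient protein fractions come back out.

```python
import numpy as np

rng = np.random.default_rng(6)

# 200 recipes x 10 ingredients: weight of each ingredient in each recipe.
recipes = rng.uniform(0.0, 2.0, size=(200, 10))

# True (unknown to us) protein fraction of each ingredient.
protein_frac = rng.uniform(0.0, 0.3, size=10)

# Measured protein in each finished sauce, with small measurement error.
measured = recipes @ protein_frac + rng.normal(scale=0.01, size=200)

# Least squares recovers the per-ingredient protein fractions.
est, *_ = np.linalg.lstsq(recipes, measured, rcond=None)

# Predict the protein in a brand-new recipe from its ingredient weights.
new_recipe = rng.uniform(0.0, 2.0, size=10)
pred = new_recipe @ est
```

This works precisely because we know the system is linear: total protein really is a weighted sum of the ingredients.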
> I'm wondering what these "solid reasons" could be?
For the classic assumptions for regression, we know that a linear function exists and that our data has additive errors that have mean zero and constant variance and are Gaussian.
Then we can get an F ratio test on the regression, t-tests on the coefficients, and confidence intervals on the predicted values. This is just the standard stuff in the classic derivations in several of the books I listed.
And with some weaker assumptions, there are still some such results we can get.
This is just for regression.
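That standard machinery can be sketched with numpy alone (simulated data; this is the textbook F ratio and coefficient t statistics, not any particular package's output):

```python
import numpy as np

rng = np.random.default_rng(1)

# The classic setup: a linear model really exists, errors are
# i.i.d. Gaussian with mean zero and constant variance.
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])          # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS estimates

resid = y - X @ beta
p = 1                                          # non-intercept predictors
df_resid = n - 2
s2 = resid @ resid / df_resid                  # error variance estimate

# F ratio on the regression: explained vs residual mean square.
ss_total = np.sum((y - y.mean()) ** 2)
ss_resid = resid @ resid
F = ((ss_total - ss_resid) / p) / (ss_resid / df_resid)

# t statistics on the coefficients: estimate / standard error.
cov = s2 * np.linalg.inv(X.T @ X)
t_stats = beta / np.sqrt(np.diag(cov))
```

Comparing F and the t statistics against the F and t distributions with the right degrees of freedom gives the tests and confidence intervals from the classic texts.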
There's a lot more in applied statistics, that is, where we make some assumptions that seem accurate enough in practice, get some theorems, and benefit from the conclusions of the theorems.
For an example, also in this thread I wrote about estimating how long some submarines would last. The assumptions were explicit and, on a nice day, somewhat credible. If we swallow the assumptions, then we have to take the conclusions seriously.

I've done other problems in applied statistics -- have a problem, collect some data, see what assumptions we can make; from the assumptions, we have some theorems; from the theorems, we have some conclusions powerful for the problem. If we swallow the assumptions, trust the software, ..., then we have to take the conclusions seriously. So, we get to check the software and argue about the assumptions.
For machine learning as in the Bloomberg course, we have some training data and some test/validation data. Fit to the training data. Check with the test data. Assume that the real world situation doesn't change, apply the model, and smile on the way to the bank.
I guess, okay when it works. But:
(A) How many valuable real applications are there? I.e., regression has been around with solid software for decades, and I've yet to see the yachts of the regression practitioners -- maybe there are some now, but what is new might mostly be hype. Or was the problem that the SAS founders were not very good at sales? Same for SPSS (now pushed by IBM), Matlab, R, etc.?
(B) Wouldn't we also want confidence intervals on the predictions? Okay, maybe there are some resampling/bootstrap ways to get those, maybe.
(C) There's more to applied statistics than empirical curve fitting, and commonly there we can have "solid reasons". For more on applied statistics, what I've done is only a drop in the bucket (ocean), but the research libraries are awash in examples.
Graycat and I have a history of conversations on HN regarding the differences between statistics and machine learning approaches.
ML is a term that came from Russian probabilists and statisticians. ML is applied probability, optimization and algorithms. A traditional statistician will be strong in the former but not necessarily in the other two.
One can of course take the 'what sticks to the wall approach' but it would be a mischaracterization to state that a principled approach does not exist in ML literature. There are theorems stating when to try one over the other and then what to expect.
The difference in stats and ML can be seen in the nature of the theorems in their body of literature.
Let's take the probabilistic setting of ML, because this matches the assumptions a sophisticated statistician would make. ML also has techniques where you don't need a stochastic model; those settings are similar to adversarial, game-theoretic setups that do not involve a stochastic sampling process.
The assumptions:
We have a joint distribution over random variables X,Y and have drawn a sample of size n from it. We have another sample from the same joint distribution but the Ys have been removed and we want to get those back.
The baby statistician's way:
God came to me in my dream and told me everything about a function f(X, \theta) that would recover the Ys. He conveyed an infinite amount of information but left out a finite number of parameters (settings of knobs) for me to figure out. I was told that the specific \theta lives in the set \Theta.
The mature statistician's way:
Almost the same as above, except that \theta would be living in an infinite dimensional space. In other words, he can work with a less powerful god who leaves out an infinite amount of information.
The theorems: A statistician would be interested in theorems where they show that their method manages to recover the \theta that god left out.
In other words they would like to find an estimator \hat{\theta} that for an infinitely sized initial sample converges to the true \theta. The crowning accomplishments are when these theorems prove that this happens with probability 1. Weaker ones are those that state that this happens with as high a probability one wants (but never quite reaching 1).
ML way:
A machine learner is not interested in theorems that characterize how close \hat{\theta} is to the true \theta. Furthermore, asymptotically we are all dead, so he is interested in finite sample results. His point would be that all this god mumbo-jumbo and \theta is a piece of fiction we invented anyway. No one will ever be able to see these, even in principle, so he couldn't care less about those. He would care about theorems that show how the function value (and not its parameters) converges uniformly to the best function in a class of functions \cal F_n of his choosing, using finitely many samples. He would also be interested in finding a sequence of function classes so that one can consider richer and richer supersets of function classes as he gets more and more data. The regularization business falls out of the function classes he wants to look at. A closely related statistical viewpoint would be that of sieves.
This is not an entirely correct characterization, because there existed (and exist) probabilists and statisticians who were interested in prediction theorems rather than parameter recovery theorems, but they were far, far fewer in number. An example would be Prof. Dawid.
Nice comment, " the now classic texts in regression analysis start with something the machine learning curve fitting does not...". What classic texts would you recommend starting with?
A relatively good introductory text on statistics, starting without either calculus or linear algebra, with nice progress into experimental design and analysis of variance, that is, multivariate statistics with discrete data, from some serious experts in that field, from the University of Iowa, for some serious reasons (how to maximize corn yields considering soil chemistry, water, seed variety, plowing techniques, fertilizer, etc.), long commonly used for undergraduates in the social sciences, e.g., educational statistics, agriculture, etc., is (with my mark-up for TeX):

George W.\ Snedecor and William G.\ Cochran, {\it Statistical Methods, Sixth Edition.\/}
A good first, no-toy text on regression, with minimal prerequisites:

N.\ R.\ Draper and H.\ Smith, {\it Applied Regression Analysis.\/}
Some good books on regression and its usual generalizations, e.g., IIRC, factor analysis, i.e., principal components, discriminant analysis, etc.:

Maurice M.\ Tatsuoka, {\it Multivariate Analysis: Techniques for Educational and Psychological Research.\/}

Donald F.\ Morrison, {\it Multivariate Statistical Methods: Second Edition.\/}

William W.\ Cooley and Paul R.\ Lohnes, {\it Multivariate Data Analysis.\/}
A mathematically relatively serious text on regression:

C.\ Radhakrishna Rao, {\it Linear Statistical Inference and Its Applications:\ \ Second Edition.\/}
For multivariate statistics with discrete data, consider

George W.\ Snedecor and William G.\ Cochran, {\it Statistical Methods, Sixth Edition.\/}

Stephen E.\ Fienberg, {\it The Analysis of Cross-Classified Data.\/}

Yvonne M.\ M.\ Bishop, Stephen E.\ Fienberg, Paul W.\ Holland, {\it Discrete Multivariate Analysis:\ \ Theory and Practice.\/}

Shelby J.\ Haberman, {\it Analysis of Qualitative Data, Volume 2, New Developments.\/}
The classic on the math of analysis of variance:

Henry Scheff\'e, {\it Analysis of Variance.\/}
Broadly, in all of this, we are trying to analyze data on, call it, several variables, make predictions, etc.

In all of this, and as in the title {\it The Analysis of Cross-Classified Data\/} above, suppose we have in mind random variables Y and X.
What is a random variable? Go outside. Measure something. Call that the value of random variable Y. What you measured was one value of possibly many that you might have measured. Considering all those possible values, there is a cumulative distribution F_Y such that for any real number y we have the probability that random variable Y is <= real number y:

P(Y <= y) = F_Y(y).

So, F_Y(y) is defined for all real numbers y, goes to 0 in the limit as y goes to minus infinity and to 1 in the limit as y goes to plus infinity. So, as we move real number y from left to right, F_Y(y) increases monotonically. On a nice day, function F_Y is differentiable, and the derivative from calculus

f_Y(y) = d/dy F_Y(y)

is the probability density of real random variable Y.
Here's the standard way to discover something about Y, in particular about its cumulative distribution F_Y: We can imagine having random variables Y_1, Y_2, ... that are, in the sense of probability, independent and have the same cumulative distribution as Y. Then for positive integer n and for real number y, by the law of large numbers (the weak version has an easy proof), in the limit as n grows large, as accurately as we please, the fraction of the values

Y_1, Y_2, ..., Y_n

that are <= y is F_Y(y). So, via such simple random sampling, for any real number y we can estimate F_Y(y), the cumulative distribution of Y.
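That sampling estimate takes only a few lines to check numerically (my own toy setup: Y is standard normal, so the true F_Y(0) is exactly 1/2):

```python
import numpy as np

rng = np.random.default_rng(2)

# Draw a large i.i.d. sample from Y's distribution (here: standard normal).
n = 100_000
sample = rng.normal(size=n)

def ecdf(y, sample):
    """Fraction of sample values <= y: the law-of-large-numbers estimate of F_Y(y)."""
    return np.mean(sample <= y)

# At y = 0 the true standard normal CDF is exactly 1/2.
est = ecdf(0.0, sample)
```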
For a little more, under meager assumptions that hold nearly universally in practice, if we take the ordinary grade school average of

Y_1, Y_2, ..., Y_n

as n increases we will approximate the average or expected value of Y, denoted by E[Y].

To define the expected value, we can use some calculus and the cumulative distribution F_Y, but for now let's just use our intuition about averages and move along.
Now suppose we are also given random variable X. Maybe the values of X are just real numbers, some 10 real numbers, 20 values, values from the set {1, 2, 3}, the last three weeks 100 times a second of the NYSE price of Microsoft, or full details on the atmosphere of earth every microsecond for the past 5 billion years. That is, for the values of X we can accept a lot of generality. Still more generality is possible, but that would take us on a detour for a while.

For our point here, let's assume that X takes on just discrete values or that we have just rounded off the values and forced them to be discrete. In practice we will have only finitely many discrete values.
Now we want to use X to predict Y.
So, sure, much of machine learning is to construct a model, maybe with regression trees or neural networks, to make this prediction, but here we will show a simpler way that is always the most accurate possible whenever we have enough data. How 'bout that!

This simpler way is just old cross tabulation.

Net, over a wide range of real cases trying to predict Y from X, we should just use cross tabulation unless we don't have enough data. Or, the main reason for just empirical curve fitting using regression linear models or neural network continuous models is that we don't have enough data just to use cross tabulation.

For a preview of a coming attraction, we will notice in nearly all of regression and neural networks big concerns about overfitting. Well, cross tabulation doesn't have that problem. How 'bout that!
Let real number x be some value of random variable X. Suppose we take the cumulative distribution of Y when X = x (X is a random variable that might take on many different values; x is just some real number). E.g., maybe when X = x = 3 (X is a random variable; x is just the real number 3) we find the conditional probability that Y <= y given that X = x, that is, we find P(Y <= y|X = x) = F_{Y|X}(y) (Y is a random variable; y is just some real number; Y|X is a subscript).
Or we imagine a thick chocolate cake with one edge the Y axis and another edge the X axis. The surface of the cake is the joint probability density of random variables X and Y. At X = x, we cut the cake parallel to the Y axis and perpendicular to the X axis. We look at the cut surface -- that's essentially the conditional probability density of Y given X = x.
Note: In an advanced course in probability, we can show that it makes sense to consider the density at individual values of x one at a time -- the key to doing so is the subject of measure theory used as the foundation for probability theory, as in the 1933 paper of A. Kolmogorov.
Well, with that cut surface, we take its area and divide the surface by that area, so that the cut surface is scaled to have area 1 and is a probability density -- we are being really intuitive here but essentially correct.
We now have the conditional probability density of random variable Y given random variable X when X = x for real number x.

Or, suppose Y is height and X is weight. With X = 110, we get the probability density distribution of height for people who weigh 110.

So, we are on the way to using weight to predict height -- no joke, this is close to fully seriously the most accurate way to proceed in the context.
If X is not just a number but two numbers, weight and gender, say, 0 for female and 1 for male, then, with weight 110 and gender female, we get the distribution of height for 110 pound females. If X has three components, with the third 1 for cheerleaders and 0 otherwise, then we can predict height given weight of female cheerleaders. So, we are into predicting one random variable, height, from three components, weight, gender, and cheerleader or not, of random variable X. Sure, the component weight is also a random variable, and similarly for gender and cheerleader or not. So, if you will, we are predicting random variable Y height from three random variables or three components of the one random variable X -- your choice.
Since we can get a conditional distribution, we can take the expectation of the distribution and get a conditional expectation. So, the conditional expectation of Y given that X = x is written E[Y|X = x]. This is correct, but with details we could spend an hour here, so let's just move along.
Well, the conditional expectation

E[Y|X = x]

is a function of the real number x. We can also regard it more simply as a function of the random variable X and write this as E[Y|X], the conditional expectation of random variable Y given (the values of) random variable X. So, for some function f with domain the values of X, we have f(X) = E[Y|X]. So, under meager assumptions that essentially always hold in practice, f(X) is also a random variable.
It turns out, nicely enough, that the expectation of E[Y|X] is the expectation of Y itself:

E[E[Y|X]] = E[f(X)] = E[Y]
We want to show that f(X) = E[Y|X] is the least squares (minimum squared error) approximation of Y possible from X. So f(X) = E[Y|X] is the best predictor of Y, better than anything we could hope to do with regression, neural networks, anything linear, anything nonlinear, essentially anything at all.
For the proof, we start with something simple: We show that a = E[Y] minimizes E[(Y - a)^2]. That is, we are finding the single number a that best approximates real random variable Y in the least squares sense. Well, we have

E[(Y - a)^2]
= E[Y^2 - 2Ya + a^2]
= E[Y^2] - 2aE[Y] + a^2
= E[Y^2] + E[Y]^2 - 2aE[Y] + a^2 - E[Y]^2
= E[Y^2] + (E[Y] - a)^2 - E[Y]^2

which we minimize with a = E[Y].

Or, for one interpretation, the minimum rotational moment of inertia is for rotation about the center of mass.
So, for our main concern, suppose we want to use the data we have, X, to approximate Y. So, we want a real valued function g with domain the possible values of X so that g(X) approximates Y.

For the most accurate approximation, we want to minimize

E[(Y - g(X))^2]

Claim: For g(X) we want

g(X) = E[Y|X]

So, g(X) is the best non-linear least squares approximation to Y.
Proof: We start by using one of the properties of conditional expectation and then continue with just simple algebra, much as we did for E[(Y - a)^2] just above:

E[(Y - g(X))^2]
= E[ E[Y^2 - 2Yg(X) + g(X)^2|X] ]
= E[ E[Y^2|X] - 2g(X)E[Y|X] + g(X)^2 ]
= E[ E[Y^2|X] + E[Y|X]^2 - 2g(X)E[Y|X] + g(X)^2 - E[Y|X]^2 ]
= E[ E[Y^2|X] + (E[Y|X] - g(X))^2 - E[Y|X]^2 ]

which is minimized with

g(X) = E[Y|X]

Done.
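The claim is easy to check numerically. In this small simulation (a toy setup of my own), X is discrete, E[Y|X] is deliberately nonlinear in X, and the conditional mean predictor beats the best linear fit in mean squared error:

```python
import numpy as np

rng = np.random.default_rng(3)

# X takes a few discrete values; Y depends on X nonlinearly plus noise.
n = 200_000
X = rng.integers(0, 4, size=n)          # values in {0, 1, 2, 3}
f = np.array([0.0, 1.0, 0.0, 1.0])      # E[Y|X=x], deliberately nonlinear in x
Y = f[X] + rng.normal(scale=0.3, size=n)

# The conditional-expectation predictor: its MSE is just the noise variance.
mse_cond = np.mean((Y - f[X]) ** 2)

# The best linear predictor a + b*X, fit by least squares.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
mse_lin = np.mean((Y - A @ coef) ** 2)
```

Here mse_cond comes out near the irreducible noise variance 0.09, while the linear fit pays an extra penalty for the nonlinearity it cannot capture.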
And how do we use this? We just discretize X and use cross tabulation, which directly estimates E[Y|X]. To be more clear, given real number x, to predict Y when X = x, we use cross tabulation to estimate E[Y|X = x] = f(x).

The good news is what we have proven: We have made the best prediction, in the least squares sense, of Y possible from our data on X.
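A toy sketch of that cross tabulation estimate (the weight/height numbers are invented): bin X, average Y within each bin, and the bin averages converge to E[Y|X = x].

```python
import numpy as np

rng = np.random.default_rng(4)

# Discretized weight (8 bins) and height with a nonlinear conditional mean.
n = 50_000
bins = rng.integers(0, 8, size=n)
true_mean = 60.0 + 6.0 * np.sqrt(np.arange(8))   # E[height | bin], made-up numbers
height = true_mean[bins] + rng.normal(scale=2.0, size=n)

# Cross tabulation: the average height within each weight bin.
xtab = np.array([height[bins == b].mean() for b in range(8)])

# Predict a new person's height from their bin by table lookup.
pred_for_bin_3 = xtab[3]
```

No model class was chosen and nothing could overfit a functional form; the only requirement is enough data in every cell of the table.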
The bad news is that the amount of data we need grows exponentially in the number of different components in X. I.e., we considered three variables in X: weight, gender, and cheerleader. The amount of data we need grows as the number of discrete values of X, which grows exponentially in the number of components in X.
If somehow we KNOW that Y is a linear function of the components of X, then we might consider using regression. If we are doing just empirical curve fitting and have enough data, which in practice likely means relatively few components in X, then we can consider using just cross tabulation. But if X has thousands of components, then the exponential explosion in the amount of data required on X likely means that we will have to set aside cross tabulation, look for a way to use, maybe, a few dozen components of X, or go ahead and use the machine learning material in the Bloomberg course.
Uh, if we are trying to predict people, there is an old hint -- people are no more than about 14 dimensional beings. That is, given 1000 variables on each of 10 million people, using principal components we should be able to reproduce the 1000 variables quite accurately using just 14 principal components obtained from the 1000 with just a linear transformation. Then use cross tabulation on the 14 variables -- maybe.
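A small numpy sketch of that idea (scaled down to 50 observed variables driven by 5 latent factors rather than 1000 by 14, to keep it quick): principal components via the SVD of the centered data recover the observed variables almost exactly from the few components.

```python
import numpy as np

rng = np.random.default_rng(5)

# 10,000 "people", 50 observed variables driven by only 5 latent factors.
n, d, k = 10_000, 50, 5
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
data = latent @ mixing + 0.01 * rng.normal(size=(n, d))

# Principal components via SVD of the centered data.
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Keep the top k components (a linear transformation) and reconstruct.
scores = centered @ Vt[:k].T                 # the k principal components
recon = scores @ Vt[:k] + data.mean(axis=0)  # back to the 50 variables
rel_err = np.linalg.norm(recon - data) / np.linalg.norm(data)
```

The large gap between the k-th and (k+1)-th singular values is what signals that only k components carry real structure.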
My solution never mentioned Bayes! So, don't have to "be a Bayesian" to use what I wrote.
I just used conditional expectation which, of course, with full details, is from the Radon-Nikodym theorem in measure theory as in, say, Rudin, Real and Complex Analysis with a nice, novel proof from von Neumann.
You seem to have a preference for an approach in which you assume certain things are true about the world (e.g. y is a linear function of x), and then you derive some optimal prediction function, based on that assumption, under some definition of optimal. And that seems fine. In that example, you'd end up with the same results if you decided to restrict your search for a prediction function to a hypothesis space containing only linear functions -- not because you think the world obeys a linear function, but because you happen to like working with linear functions. I do agree that you can get insight into a method by knowing when it's the optimal method. We do talk about conditional probability models in Lecture 17, where we can assume that distribution of y has a specific form given x (although again we frame it as restricting your search space, rather than as an assumption about the world).
About the "throwing stuff against the wall until something appears to stick." First of all, I don't entirely object to this approach, in the sense that I don't think it's dangerous, so long as you follow the standard procedures of machine learning. And it's where somebody would be if, for example, they went through Lecture 1 on Black Box Machine Learning. But the other 29 lectures are building towards more than that. For example, in Lecture 6 and 7 we get a pretty careful understanding of how L1 and L2 regularization behave in different situations. In Lecture 19, we connect L1 and L2 regularization to various prior beliefs you have about the values of the true coefficients (in a Bayesian framework). We do examine the hypothesis spaces of trees, boosted trees, and neural networks, and consider the tradeoffs (piecewise constant vs smooth, trainable by gradient methods vs not). Yes, there is absolutely plenty of "just try it" in machine learning. Most of the theory of machine learning (generalization bounds, etc) is about when it's ok to estimate performance on future data based on performance on data you have now. We don't have to believe the world obeys a certain model for this to work, we only have to believe that the world will behave the same way in the future as it does now.
It's unfortunate that there wasn't more time in the class for factor analysis, although we do have a thorough treatment of the EM algorithm (Lecture 27), which is what you'd need for that. I used to give a similar argument about 'crosstab' and the curse of dimensionality (https://davidrosenberg.github.io/mlcourse/Archive/2015/Lectu...). What other methods would you have liked to see in the course? To scope it, the course is focused on making good prediction on future data.
> We do talk about conditional probability models in Lecture 17, where we can assume that distribution of y has a specific form given x (although again we frame it as restricting your search space, rather than as an assumption about the world).
With what I wrote, just use cross tabulation justified as the best least squares estimate just from my short derivation based on conditional expectation. Don't have to mention Bayes.
> assume that distribution of y has a specific form given x
With what I wrote, don't have to do that. To be quite precise, do need to assume that random variable Y has an expectation, but in practice that is a meager assumption. Otherwise in what I wrote, essentially don't have to make any assumptions at all about distributions.
Here I have a remark to the statistics community, especially the beginning students and their teachers, a remark old for me: Yes, we have random variables. Yes, each random variable, even the multidimensional ones, has a cumulative distribution. Yes, it's often not too much to assume that we have a probability density. Yes, there are some important probability densities, exponential, Poisson, ..., especially Gaussian. Yes, some of these densities have some amazing properties -- e.g., for the Gaussian, sample mean and variance are sufficient statistics (see a famous 1940s paper by Halmos and Savage and a later paper on lack of stability by E. Dynkin).
So, from that, a suggestion accepted implicitly is that in work in applied statistics early on we should seek to find "the distribution". If the usual suspects don't fit, then we should try some flexible alternatives and try to make them fit.
If you want some spreadsheet program to draw you a histogram, okay. With a good program, you can get a histogram for two dimensional data. If you strain with your eyes and some tricky software, you have an outside shot at three dimensions. For more than three dimensions, you are in Death Valley in July out'a gas and walking -- not quite hopeless yet but not in a good situation.
Sorry, on finding the distribution, really, essentially NOPE -- don't try that, basically can't do that. Blunt reality check: Yes, distributions exist but, no, usually can't find them and, actually, mostly work without knowing them. Maybe all you know is that variance is finite! This situation holds mostly true for real valued random variables. For multivariate valued random variables, e.g., vector valued, as the dimension rises above 2, are getting into trouble; before the dimension gets to be a dozen, are lost at sea; anything above a few dozen are lost in deep outer space without a paddle and hopeless.
> We don't have to believe the world obeys a certain model for this to work, we only have to believe that the world will behave the same way in the future as it does now.
Yes, classic texts on regression, and some of what J. Tukey wrote about how regression has commonly been used in practice, should have made this point clearer and gone ahead and been clearer about overfitting, etc.
Also, with this approach, given some positive integer n and some data (y_i, x_i), i = 1, 2, ..., n, maybe if we keep looking, e.g., fit with ln(x_i), sin(x_i), exp(x_i), ..., eventually might get a good fit. Then do we believe that the system behaved this way in the past so with meager assumptions will do so in the future? I confess, this is a small, philosophical point.
Better points:
(A) Just from the course, for any model building, we have a dozen, maybe more, efforts at TIFO -- try it and find out. That's a bit wasteful of time -- right, automate that, have the computers do all the usual dozen. Hmm.
(B) We are so short on assumptions, we will have to be short on consequences of theorems about what we have or might have or might do. Again, we are like Ptolemy instead of Newton.
Yes, I neglected to mention the famous and appropriate "curse of dimensionality".
> What other methods would you have liked to see in the course?
As I wrote, there's a lot on the shelves of the research libraries. No one course can cover all that might be useful for solving real problems.
So, another approach is first to pick the problem and then pick the solution techniques. Even for some solution techniques that might be, even have been, useful, tough to say that they should be in a course. So, maybe what should be in a course is a start, lots of powerful, general views, that enable, given a particular problem, picking a suitable solution technique. And, even with this broad approach, it might be productive to have some specializations.
I have a gut feeling that some of the computer science community took regression a bit too seriously.
Sure, I really like and respect Leo Breiman and can respect his work on classification and regression trees (do have his book but never bothered to use or get the software), but I like his text Probability published by SIAM a lot more, have it, have studied it, along with Neveu, Chung, Loeve, Skorohod, etc.
So, to suggest a prediction technique not much related to regression? Okay: Blue has some subsurface ballistic missile (SSBN) boats, i.e., submarines that while submerged can fire intercontinental ballistic missiles, each with several nuclear warheads that can be independently targeted.
These SSBNs hide in the oceans of the world. To help remain hidden, they execute some approximation to Brownian motion. E.g., even if Red gets some data on where a Blue SSBN is, a day or so later Red will have much less good data on where it is.
If Red can find an SSBN, then sinking it is easy enough -- just explode a nuke over it. But if sink one, then that starts a nuclear war, and the remaining SSBNs might shoot. So, Red could try to find and sink all the SSBNs at once -- finding them all, all at the same time, is unlikely, and mounting a search effort that would promise to do that would be provocative.
But, for considering world war scenarios, maybe there would be a nuclear war but limited just to sea. Then could sink SSBNs one at a time as find them. In this scenario, how long might they last?
Getting a good answer is your mission, and you have to accept it! Gee, we could try regression analysis, maybe neural networks?
Or: How to find things at sea was considered back in WWII -- that was an important question then, also. There was a report OEG-56 by B. Koopman. The report argued, with some clever applied math, that with an okay approximation the encounter rates were as in a Poisson process with the encounter rate a known algebraic function, in the report, in terms of the area of the ocean (for real oceans, neglect the details of the shape), speed of the target, speed of the searcher, and detection radius, all IIRC.
So, ..., can argue that for the war at sea, encounters are like various Poisson processes. Altogether the Red-Blue inventory declines like a multi-dimensional, continuous time, discrete state space (huge state space) Markov process subordinated to a Poisson process (for a good start, see the E. Cinlar Introduction to Stochastic Processes).
So, given the speeds, radii, and what happens -- one dies, other dies, both die, neither die -- one on one, Red-Blue encounters, have all the data needed.
Now with that data, we can make some predictions. Sure, there's a closed form solution, but the "curse of dimensionality" gets in the way. But, Monte Carlo and the strong law of large numbers to the rescue: Write software for a few days and run it for a few minutes, run off 1000 sample paths, average, and get the prediction with some decently good confidence intervals.
I did that. Later the work was sold to a leading intelligence agency. I could tell you which one, but then I'd have to ...!
"Look, Ma, no regression analysis!".
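A toy sketch of that Monte Carlo idea (every number here is invented; this is nothing like the actual model): Red encounters each surviving boat as a Poisson process, so the time to the next sinking is exponential with rate proportional to the boats remaining; average many sample paths and the strong law gives the prediction with a confidence interval.

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 0.01   # assumed encounter rate, per boat per day (invented)
boats = 20   # assumed starting inventory (invented)
paths = 1000

def last_boat_time(rng):
    """One sample path: time until the last boat is sunk."""
    t, remaining = 0.0, boats
    while remaining > 0:
        # Next sinking: exponential with rate lam * (boats remaining).
        t += rng.exponential(1.0 / (lam * remaining))
        remaining -= 1
    return t

samples = np.array([last_boat_time(rng) for _ in range(paths)])
mean = samples.mean()
# 95% confidence half-width from the CLT.
half = 1.96 * samples.std(ddof=1) / np.sqrt(paths)
```

The real problem has a huge multi-dimensional state space and one-on-one encounter outcomes, but the shape of the computation -- a few days of software, a few minutes of runtime, 1000 sample paths, average -- is the same.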
I've got some more, but this post is getting long!
Lesson: There's a lot more to applied statistics than empirical curve fitting.
Hehe ok -- I also love Breiman's Probability book. It's really a standout on ergodic theory. And Breiman et al.'s book on Trees is surprisingly rich, talking about all sorts of stuff besides trees.
As much as I respect Breiman, I think he latched on too hard to his pet theory that all that ensembles do is reduce variance, and by doing so missed out on what boosting does.
Yeah, random forests work really well, but they are layers and layers of hacks, rules of thumb, and intuition piled on top of each other. I can't claim with a straight face that any of them follows from solid principles.
Graycat and I have a history of discussing the differences between stats and ML here on HN. I just added a comment upstream in the thread.
I imagine Breiman was just talking about bagging-style parallel ensembles, when he was talking about variance reduction, not boosting-style sequential ensembles. Not long before he died, he was still actively trying to figure out why AdaBoost “works”. Don’t think he claimed to really understand that. He had experimental results that disputed the “it’s just maximizing the margin” explanation.
Saw the comments above — are you from a stats or ML background, or neither?
You can take a look at his technical reports from 2001 onwards; lots of very interesting material, and lots of back and forth between Freund, Schapire, and Breiman in those. The reports continue to be hosted on Breiman's home page.
For ergodic theory, there were a few really good lectures, and I have good notes, in a course I took from a star student of E. Cinlar.
For Breiman Probability:
(1) I liked his start, nice, simple, intuitive, on how the law of large numbers works. Of course the super nice proof is from the martingale convergence theorem, but Breiman's start is nice.
(2) His book has the nicest start on diffusions I have seen.
(3) Once in some of my work I ran into a martingale. So, then, when I wanted to review martingale theory, I went to Breiman. Nice.
(4) Since I was interested in stochastic optimal control, I was interested in measurable selection and regular conditional distributions, and I got those from Breiman -- thanks Leo!!!! Yes, IIRC, there is the relevant Sierpinski exercise in Halmos, Measure Theory and also in Loeve, Probability, but it was Breiman who gave me what I really needed.
(5) Generally Breiman is easy to read. He writes like he is trying to help the reader learn.
I am new to machine learning and have difficulty understanding the equations. I finished 2 courses from Andrew Ng's Machine Learning class on Coursera; I understand the flow and the concepts and knew how to write those equations/algorithms, but it always bugs me that I don't understand those equations. Do you guys have any suggestion on where I should start for learning those equations/maths online?
If you want to keep up with the maths, the main prerequisites for more rigorous ML are typically undergrad-level calculus, linear algebra, and a bit of basic probability and statistics.
For calculus, google "MIT 18.01", "MIT 18.02", (and "MIT 18.03" if you like), which are all freely available on youtube. You should be comfortable with single-variable calculus, and at least familiar with multi-variable techniques.
For linear algebra, try "MIT 18.06", which is Gilbert Strang's MIT course. Or try 3blue1brown's "The essence of linear algebra" series, which is the best explanation I've ever seen of many concepts, but it is shorter and less in-depth than a full course.
For basic statistics, try Khan Academy's "AP Statistics" sequence.
While I think 3b1b videos are great for developing a better understanding of the purpose of certain concepts, I don't think there's any way around manually doing the problems with, say, Strang's Linear Algebra book, since that's where you spend time trying to apply these concepts to problems.
Agreed. Nothing beats practice when it comes to math. I know that I spend far too much time watching course videos, and nowhere near enough working through problem sets.
Working through some of Strang's problems has also helped me. 3blue1brown is a great introduction to give you intuition, but you cannot commit the skills to long-term memory without struggling through problems.
What is your math experience/background? In any case, for the average case, this course is probably the best for those who want a slightly deeper understanding of machine learning than offered by Andrew Ng's courses: https://work.caltech.edu/lectures.html
I also really like the Abu-Mostafa course from caltech you link, and their book. If you want to get a taste of generalization bounds and statistical learning theory (e.g. VC dimension), he gives the gentlest introduction I've seen.
I can say at a very basic level ... I didn't really focus on studying at an early age. I can still understand the equations but don't really know why and how to implement them, and in a very complex situation I can't make sense of the calculation ... Every calculation that Andrew Ng teaches is new to me; even the matrix, vector, and sum-over notation are new. Even though Andrew Ng did say "if you don't understand the math it's ok," I still love to understand the math for my machine learning career; that's why I seek help.
Some intuition about minima (gradient descent), linear regression, etc. will get you to a certain level -- which admittedly may be a bit hard if you've never had the math. But as others have suggested, you probably need the equivalent of a few semesters of calculus and linear algebra (which I never took in a formal way, though I did some variants of pre-MatLab).
Several of the comments in this thread clarify what the Bloomberg machine learning work is about. Then with the clarification, maybe I see some issues with the work.
We can start with just the simplest case, ordinary, plain old regression with just one independent variable. So, for some positive integer n, we have pairs of real numbers (y_i, x_i) for i = 1, 2, ..., n. Here we are trying to predict the y_i from the x_i; we are trying to build a model that will predict y from a corresponding x where x is not in the data. Our model is to be a straight line, e.g., in high school form
y = ax + b
Okay, doing this, with the classic assumptions, we can draw a graph of the data, the fitted line, and the confidence (prediction) interval at each real number x.
For some intuition, suppose the x_i are all between 0 and 5. Maybe then the fit is quite good, the line is close to the data, and the confidence intervals are small.
But, IIRC, roughly or exactly, the confidence interval curves are actually hyperbolas. So, while the upper and lower curves are close for x between 0 and 5, for x outside the interval [0,5] the curves can grow far apart. So, if we are interested in the predictions of the model for, say, x = 20, the confidence intervals may be very wide, enough for us to conclude that, even though we have a straight line that fits our data closely, still our model is useless at x >= 20.
So, this little example illustrates a broad point about such curve fitting: The model might work well for independent variable x (likely a vector of several components) close to the training data but commonly be awful otherwise.
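The widening can be seen numerically with made-up data, using the classical prediction-interval half-width s * t * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx); the sqrt term is the hyperbola, narrow over the training range and wide far outside it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: x in [0, 5], y roughly linear in x.
n = 30
x = rng.uniform(0, 5, n)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, n)

# Least-squares fit y = a*x + b.
a, b = np.polyfit(x, y, 1)

# Pieces of the classical prediction interval.
resid = y - (a * x + b)
s = np.sqrt(resid @ resid / (n - 2))          # residual standard error
xbar, Sxx = x.mean(), ((x - x.mean()) ** 2).sum()

def half_width(x0, t=2.05):  # t ~ t_{0.975, n-2}
    return s * t * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)

# Narrow inside the training range, much wider far outside it.
w_inside, w_far = half_width(2.5), half_width(20.0)
```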
How serious this situation is can vary a lot depending on the application. E.g., if are interested only in the values of y for x a lot like the training data, e.g., in the interval [0,5], then maybe don't care about the y value or the confidence interval when x = 20.
But quite broadly there are applications where what we want such models to tell us is the value of y for some x not close to what we have seen in our training data.
Uh, let's see: IIRC Bloomberg is selling stock market and economic data, often in nearly real time, to investors, some of whom are traders and make trades quickly, within a few seconds, based on the data from Bloomberg.
I'm no up-to-date expert on just what the Bloomberg customers are doing, but from 20,000 feet or so up, maybe the situation is something like this; broadly the investors vary:
(A) Some investors want a portfolio constructed much like in the work of H. Markowitz or W. Sharpe. In simple terms, they want the portfolio to have good expected return with low standard deviation of return and maybe, then, buy on margin to raise the rate of return while still having the risk relatively low.
(B) Some investors are interested in relationships between stocks and options -- e.g., the Black-Scholes work is an example of this. IIRC, a more general case is some stochastic process, maybe Brownian motion, reaching a boundary and a first exit. The exit has a value, so the investment problem is a boundary value problem. IIRC can design and attempt to evaluate exotic options with such ideas.
(C) Some investors are just stock pickers and buy when they sense a sudden rise in price.
But a theme in (A)-(C) is that the investors are looking for something unusual. So, in the model building, the unusual may have been unusual in the training data and the testing data. In that case, without more assumptions, theories, or whatever, the prediction of the model for unusual input data may be poor.
That is, it appears that the model building techniques promise that the model will do poorly in just the application cases of greatest interest to investors -- the unusual cases that are not well represented in the training and test data.
So, maybe first cut some of what is needed is some anomaly detection.
So, we could use more information about the systems we are trying to model. A linearity assumption is one such. In Newton's second law and law of gravity, we can check that for falling apples. Next we can try on the planets in our solar system and, nicely enough, see that it works. And then we can be pretty sure for a rocket at Mach 15 headed for stationary orbit, etc. But with just empirical curve fitting, apparently mostly we don't have such additional information.
IIRC L. Breiman's first interest in empirical curve fitting was for clinical medical data. So, maybe in that data he was trying to predict some disease but the independent variable data he was using was common in his training and testing data. I.e., he wasn't really looking for exploiting some anomaly for some once in 20 years way to get rich quick.
So the extrapolation-type problem you describe (an input not near any of your training examples) is an issue. Unless you have a world model you believe in (i.e. you've done some science -- not just statistics), hard to know if your prediction function works out there where you’ve never seen any examples. If you’ve seen some data out there, but relatively fewer than you see in deployment, then importance weighting or other approaches from covariate shift / domain adaptation could help.
Anomaly detection is definitely another important area, but I struggle to pull together a coherent unit on the topic. One issue is that it’s difficult to define precisely, at least partly because everybody means something a little bit different by it.
Also, based on classical hypothesis testing, I think that to some extent you have to know what you’re looking for to be able to detect it (ie to have power against the alternatives/anomalies you care about)... For that reason, I think it’s hard to separate anomaly detection from more general risk analysis/assessment, because you need to know the type of thing you care about finding.
So, for anomaly detection, before evaluating the model at x, might want to know if x would be an anomaly in the training data x_i, i = 1, 2, ..., n. Sure, x is likely a vector with several to many components.
An anomaly detector should be at least as good as a statistical hypothesis test.
So, for the null hypothesis, assume that x is distributed like the training data.
Okay, except we don't really know the distribution of the training data.
"Ma! Help! What am I supposed to do now???"
So, we need a statistical hypothesis test that is both multi-dimensional and distribution-free.
Let's see: In ergodic theory we consider transformations that are measure preserving .... Yup, can have a group (as in abstract algebra) of those, sum over the group, ..., and calculate the significance level of the test and, thus, get a real hypothesis test, multi-dimensional and distribution-free. For some of the details of the test, there are lots of variations, i.e., options, knobs to turn.
Detection rate? Hmm. Depends ...! Don't have data enough to use the Neyman-Pearson approach, but in a curious but still relevant sense the detection rate is the highest possible.
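The details of graycat's actual test aren't given here, but a simple stand-in in the same spirit is a rank test built on exchangeability: under the null that the new point x is distributed like the training points, all n+1 points are exchangeable, so the rank of x's nearest-neighbor distance among all n+1 leave-one-out distances is uniform. That gives an exact significance level that is multi-dimensional and distribution-free.

```python
import numpy as np

rng = np.random.default_rng(0)

def anomaly_p_value(train, x):
    """Exact distribution-free p-value for 'x is distributed like train'."""
    pts = np.vstack([train, x])                              # (n+1, d)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    stat = d.min(axis=1)        # nearest-neighbor distance for each point
    # Fraction of points at least as extreme as x; uniform rank under the null.
    return (stat >= stat[-1]).mean()

train = rng.normal(size=(200, 5))
p_typical = anomaly_p_value(train, rng.normal(size=5))   # a typical point
p_outlier = anomaly_p_value(train, np.full(5, 10.0))     # an obvious anomaly
```

No distributional form is assumed anywhere; only exchangeability under the null does the work.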
I just call this work statistics, but maybe it would also qualify according to some definitions as machine learning. But my work is not merely heuristic and has nothing to do with regression analysis or neural networks. So, again, my work is an example that there can be more to machine learning than empirical curve fitting.
So, before applying an empirically fitted model at x, want x to be distributed like the training data and at least want an hypothesis test not to reject the null hypothesis that x is so distributed.
More generally, if are looking for anomalies in the data, say, a rapid real time stream, when see an anomaly, investigate further. In this case, an anomaly detector is a first cut filter, an alarm, to justify further investigation.
Looking back on what I did, I suspect that more could be done and that some of what I did could be done better.
Of course, my interests now are my startup. Yes, there the crucial core is some applied math I derived.
Maybe I'll use my anomaly detection work for real-time monitoring for zero-day problems in security, performance, failures, etc. in my server farm.
As a very general but crude and blunt approach to show that the hypothesis tests were not trivial, used the result of S. Ulam that Le Cam called "tightness" as in P. Billingsley, Convergence of Probability Measures. When doing both multi-dimensional and distribution-free, are nearly way out in the ozone so get pushed into some abstract techniques! Meow!
Okay, I started watching your video lectures at your 15 City Sense Data and through your application of maximum likelihood estimation.
So, you drag out some common distributions, e.g., Poisson, negative binomial, beta, and look for the best "fit" for your data. Here you have little or no reason to believe that the data has any of those distributions. So, the work is all rules of thumb, intuitive heuristics, crude considerations, e.g., discrete or continuous and what the "support" is, etc. Then the quality is added on later -- see if the results are good on the test data. If not, then try again with more heuristics. If so, then use the "model" in practice. Cute. Apparently you are calling this two-step technique -- (i) use heuristics and then (ii) validate with the test data -- "machine learning". Okay.
But there is a more traditional approach: When we use, e.g., the Poisson distribution, it is because we have good reason to believe that the data really is Poisson distributed. Maybe we just estimate the one parameter of the distribution and then continue on. It will be good also to check with some test data, but really, logically it is not necessary. Moreover, we may have some predictions for which we have no test data.
An example of this traditional approach where we have no test data and, really, no "training" data in the sense you are assuming, is the work I outlined for predicting the lifetime of submarines in a special scenario of global nuclear war limited to sea. There we got a Poisson process from some work of Koopman. But we had no data on sinking submarines in real wars, not for "training" or "testing". We can do similar things for, say, arrivals at a Web site, because from the renewal theorem we have good reason to believe that the arrivals, over, say, 10 minutes at a time, form a Poisson process. Then we could estimate the arrival rate and, if we wish, do an hypothesis test for anomalies where the null hypothesis was that the arrivals were as in our Poisson process, all without any heuristic fitting and detecting anomalies that were not in any data we had observed so far.
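A sketch of that web-arrivals example with synthetic counts (the Poisson form is assumed on theoretical grounds, per the renewal-theorem argument, not fitted heuristically): the MLE of the one free parameter, the rate, is just the sample mean, and the anomaly test rejects when a new count lands far enough out in the upper tail.

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)

# Hypothetical observed 10-minute arrival counts (synthetic stand-in data).
counts = rng.poisson(lam=12.0, size=500)
rate_hat = counts.mean()        # Poisson MLE of the rate: the sample mean

def upper_tail(k, lam):
    """P(N >= k) for N ~ Poisson(lam)."""
    return 1.0 - sum(exp(-lam) * lam**j / factorial(j) for j in range(k))

# Null hypothesis: arrivals are Poisson with rate rate_hat.  A new count of
# 30 in 10 minutes gives this p-value -- small p flags an anomaly, even
# though no count that large appears anywhere in the observed data.
p_new = upper_tail(30, rate_hat)
```

The point of the example: one parameter estimated from data, everything else from the math assumption, and the test can flag anomalies never seen before.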
So, generally we have two approaches:
(1) Machine learning where we get test data, use heuristics to build a model, test the model on test data, and declare success when we pass the test.
(2) Math assumptions where we have reason to know the distributions and form of a model, and use possibly much less data to determine the relatively few free parameters in the model and then just apply the model (testing also if we have some suitable test data -- in the submarine model, we did not).
A problem for approach (1) machine learning is that we should be fairly sure that when we apply the model at point x, point x is distributed as in the training data. So, essentially we are just interpolating.
E.g., if we have positive integers n and d, the set of real numbers R, training data pairs (y_i, x_i) for i = 1, 2, ..., n where y_i in R and x_i in R^d, and fit a model that predicts each y_i from each x_i, then to apply the model at x in R^d, we should know that x is distributed like the x_i.
So, we are assuming that the x_i have a "distribution", that is, are independent, identically distributed (i.i.d.) "samples" from some distribution. And our model is a function f: R^d --> R. Then to apply our model at x in R^d we evaluate f(x). We hope that our f(x) will be a good approximation of the y in R that would correspond to our x in R^d. So, again, to do this, we should check whether x in R^d is distributed like the x_i in R^d.
I did see your work on anomaly detection, but it was much simpler and really not in the same context as an hypothesis test with null hypothesis that the x and x_i are i.i.d. in R^d.
I can believe that at times approach (1) can work and be useful in practice, but (A) a lot of data is needed (e.g., can't apply this approach to the problem of the submarine lifetimes) and (B) are limited in the values of x that can be used in the model (can't do an hypothesis test to detect anomalous arrivals beyond any seen before at a Web site).
From all I could see in your lectures, you neglected approach (2); readers should be informed that something like (2) is missing.
(Includes recorded lectures) https://sites.google.com/site/10715advancedmlintro2017f/lect...