If you know basic undergrad probability/statistics/linear algebra (matrices, vectors, eigenvalues), you can do CMU's graduate course in ML, which is intended to prepare PhD students to understand research papers in the field.
CMU also has an 'Applied Machine Learning' undergrad course that is unfortunately fully paywalled, but they use the text: Witten, I. H. & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, second edition.
And what about after that? In my experience at least, I can highly recommend Stanford's EE364a Convex Optimization. http://web.stanford.edu/class/ee364a/
When I was an undergrad at Berkeley, one particularly well known research lab would only look at your resume if you took that course online. Warning: that class is not for the faint of heart. And be good at linear algebra.
None of these classes are for the faint of heart if you're actually enrolled in them and operating under crazy deadlines with thousands of dollars on the line. This is clearly for our casual interest in, and hacking around with, ML research as a hobby.
It is gorgeous material. Large parts of it are from theorems of the alternative, linear programming, Jensen's inequality, R. T. Rockafellar, etc., but Boyd makes it all a clean whole.
That's the upside! The downside is that, IMHO, rarely does any one person need more than a small part of the whole book, and the parts that someone needs are likely also available in the subject they are working with.
So, at least download Boyd's book and use it for reference: if you ever have a question about convexity, then start with Boyd!
The title of this page feels like a reference to Mohri et al.'s excellent Foundations of Machine Learning. I’d recommend both it and Shai Shalev-Shwartz for VC theory/Rademacher complexity sorts of statistical ML.
However, I’m not crazy about this summary. It’s less foundational than it is a survey. I probably dislike the format and formatting more than anything, but I would not recommend this over other resources.
This course is complementary to Mohri's excellent book and course. Many students at NYU take both courses, in either order (https://davidrosenberg.github.io/ml2018/ and https://cs.nyu.edu/~mohri/ml17/). Mohri's course builds a foundation for proving performance guarantees (yes, using tools such as VCdim and Rademacher complexity). This course tries to be practical, but not superficial. We do a careful study of multiple examples of the four fundamental components of an ML method: loss function, regularization, hypothesis space, and optimization method. (In a probabilistic setting, regularization becomes a prior and loss becomes a likelihood.) Framed in this way, it's usually much easier to understand or invent new methods. And within this framework, we absolutely try to survey as many methods as we have time to look at carefully.
Hi David! Thank you for your thorough response. I appreciate it. And I understand why one would choose to cover as much breadth as you do. I don’t think there’s an easy way to cover all the basics without feeling like a survey, but keeping track of core concepts across applications was what helped me understand how they were connected.
Yeah, this space feels pretty saturated already, and people seem to use the same topics / presentation / ordering over and over. What a ton of duplicate effort!
It'd be nice to see a course try to put its own spin on "machine learning" and 1) present standard topics in an unusual way and 2) include topics that intro ML students might not normally see.
Take a peek at our class [0]! We present a bunch of topics not covered in undergraduate courses (at least to our knowledge), such as proximal gradient descent, random features, etc., without referencing any probability. The course is aimed mostly at sophomores and juniors in any STEM field, with the only prerequisite being the introductory linear algebra course EE103. [1] All of the material (minus solutions) is available online for both courses.
I'm probably highly biased, but I'd like to say this is a fairly fresh take which departs heavily from the usual CS229 (Ng's course) presentation style and order since it's meant for a completely different audience (and was, to be fair, written this past quarter, unlike 229 which was written perhaps 15 years ago).
I used Stanford’s materials for CS229 as a reference in two of the three graduate ML classes I’ve taken. I’ve been quite impressed, and I don’t doubt ee103 is also quite well done.
There are problems, but I’m afraid that we took them down since we need to rewrite a few before next year. Sorry! I’m sure poking around the site might yield some results, if you’re careful enough ;) (at least for 104, someone else is teaching 103 and I’m not sure what all they’re doing with the course, but all of the main problems are available on the book’s website).
Here are some of the things that I think are distinctive about the class (although certainly all of these are taught in some other class somewhere):

- discussion of approximation error, estimation error, and optimization error, rather than the vaguer "bias/variance" trade-off (common in more theoretical classes, such as Mohri's, but not common in practically oriented classes);

- full treatment of gradient boosting, one of the most successful ML algorithms in use today (most courses stop with AdaBoost, which is rarely used, for reasons discussed in the course);

- much more emphasis on conditional probability modeling than is typical (you give me an input, I give you a probability distribution over outcomes -- useful for anomaly detection and prediction intervals, among other things);

- explanation (including pictures) of what happens with ridge, lasso, and elastic net in the [very common in practice] case of correlated features (I haven't seen this in another class, though I'm sure others have done it somewhere);

- guided derivation (in homework) of when the penalty and constraint forms of regularization are equivalent, which gives you much more flexibility in how you formulate your problem (I know this is uncommon because it took me a very long time to find a reference);

- multiclass classification as an introduction to structured prediction (idea taken from Shalev-Shwartz and Ben-David's book).

On the homework side, you'd code neural networks in a computation graph framework written from scratch in numpy; we're pretty relentless about implementing almost every major ML method we discuss from scratch in the homework.
I looked over the whole list of topics. They have a page of questions about prerequisites, and I could answer all of the questions and was fully comfortable with each. But the questions have to do with probability and optimization, and my applied math Ph.D. was in those fields, with research in stochastic optimal control.
I can begin to see some of why Bloomberg is interested in the material: Maybe with all the economic and stock market data they collect, they can build some surprisingly useful predictive machine learning models. Okay. Maybe one good result will be that Michael Bloomberg will give some more money to Johns Hopkins!
But otherwise I was disappointed:
(1) Put simply and bluntly, it's all about just one now very old topic -- regression analysis. Well, when old regression analysis doesn't fit very well, then try logistic regression, ridge regression, regression trees, other forms of regression, other forms of curve fitting, e.g., neural networks.
Or, look, guys, it's all just empirical curve fitting. Ptolemy tried empirical curve fitting. He used his epicycles. They didn't fit very well. Then, later, from work of Kepler, a falling apple, etc., Newton guessed that (A) there was a law of motion, force equals mass times acceleration, and (B) there was a force directly proportional to the product of two masses and inversely proportional to the square of the distance between them. That fit the data great!
Lesson: Lots of things change continuously. Usually the changes are differentiable, that is, have a well defined tangent. Well, that tangent is linear. So at least locally, lots of things are linear. So, for lots of things, a promising first cut approach is to do a linear fit. For more, neural networks can approximate anything continuous, and lots of things are continuous. So, net, curve fitting has some promise and utility. But, still, curve fitting is just guessing without any real basis in science, e.g., couldn't find or replace what Newton did.
Really, the now classic texts in regression analysis start with something the machine learning curve fitting does not: The classic texts assume that there really is a linear equation, that we would have the equation exactly except for some errors in the data, and then do some nice applied math to show how to get the errors down and get a good approximation to the equation that has already been assumed to exist. The assumptions can vary, stronger or weaker, but in the theory there was little or no role for just empirical fitting. Well, machine learning is charging ahead without that assumption that a linear equation exists. And, wonder of wonders, apparently often now such an equation, even as a good approximation, doesn't exist. So, we are back to empirical curve fitting, struggling like Ptolemy. We already know: Some successes are possible, but like Ptolemy we face some severe limitations.
(2) Okay, when simple regression doesn't fit very well, we keep trying? Okay. Say, we try logistic regression, ridge regression, L1 or L2 regularization regression, regression trees, boosting, ..., neural networks, etc. Uh, which is better, L1 regularization or L2 regularization? Uh, this sounds like throwing stuff against the wall until something appears to stick. Sure, that can work at times, but are we really satisfied with that? Wouldn't we want some more solid reasons for the tool we pick? Lots of places elsewhere in applied math, applied probability, and applied statistics we do have solid reasons.
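To make the L1-vs-L2 question concrete, here is a minimal numpy sketch (toy data and names of my own invention, not from the course) of the L2 case, ridge regression, which has a closed form. With two nearly collinear features, plain least squares produces wild coefficients, while ridge keeps them tame:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small ill-conditioned design: two nearly collinear features.
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=n)     # the truth uses only x1

# Plain least squares: coefficients blow up on collinear data.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge (L2) has a closed form: w = (X'X + lam*I)^{-1} X'y.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
```

The L1 case (lasso) has no such closed form and needs an iterative solver, which is one concrete difference between the two choices.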
With so many efforts to patch up regression, maybe we might suspect that for a lot of problems regression is not the right tool?
(3) There is a lot more to applied statistics, applied probability, etc. and possibly of value for applications, maybe including Bloomberg's customers, than empirical curve fitting. Commonly this work has equal justification to be called machine learning because it also takes in data, estimates some parameters, builds a model, and gives results. Moreover, the better cases work with clear assumptions so that when the assumptions hold we know we have something solid, and some of the methods use meager assumptions that are relatively realistic in practice. The Bloomberg course is just empirical curve fitting, nearly all versions of regression analysis, and omits all the rest. On the applied math shelves of the research libraries, regression is only a tiny fraction of the whole. Where's the rest?
Nice comment, echoes a lot of my feelings about ML. I have a question. You write
> (2) Okay, when simple regression doesn't fit very well, we keep trying? Okay. Say, we try logistic regression, ridge regression, L1 or L2 regularization regression, regression trees, boosting, ..., neural networks, etc. Uh, which is better, L1 regularization or L2 regularization? Uh, this sounds like throwing stuff against the wall until something appears to stick. Sure, that can work at times, but are we really satisfied with that? Wouldn't we want some more solid reasons for the tool we pick? Lots of places elsewhere in applied math, applied probability, and applied statistics we do have solid reasons.
I'm wondering what these "solid reasons" could be? Some sort of experience based on past data? An example would be helpful.
Let's see: Pick a system that we know to be linear. Get a lot of pairs of real inputs and outputs and find the coefficients of the linear function that relates the inputs to the outputs. Then given a new input, can say what the output would be.
Okay, the function (system) between a violin on the stage at Carnegie Hall and a seat near the roof is linear. So, I'm typing quickly here, do regression with a recording at the stage and at the seat and look for the coefficients in the convolution. Then given an oboe, we can say what it will sound like in the seat.
Hooke's law with small deflections is linear. So, take as independent variables the forces on a space frame and as dependent variables the deflections, and estimate all the spring stiffness values.
Take 200 recipes for tomato sauce all from the same 10 ingredients, for each recipe measure the weight of each ingredient and the weight of the protein in the final sauce and estimate the protein in each of the 10 ingredients. Then for any new tomato sauce recipe, weigh the ingredients and get the protein in the final sauce.
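The tomato sauce example can be sketched in a few lines of numpy (all numbers are made up for illustration): simulate 200 recipes over 10 ingredients, fit by least squares, and the per-ingredient protein fractions come back out.

```python
import numpy as np

rng = np.random.default_rng(6)

# 200 recipes x 10 ingredients: weight of each ingredient in each recipe.
recipes = rng.uniform(0.0, 2.0, size=(200, 10))

# True (unknown to us) protein fraction of each ingredient.
protein_frac = rng.uniform(0.0, 0.3, size=10)

# Measured protein in each finished sauce, with small measurement error.
measured = recipes @ protein_frac + rng.normal(scale=0.01, size=200)

# Least squares recovers the per-ingredient protein fractions.
est, *_ = np.linalg.lstsq(recipes, measured, rcond=None)

# Predict the protein in a brand-new recipe from its ingredient weights.
new_recipe = rng.uniform(0.0, 2.0, size=10)
pred = new_recipe @ est
```

This works precisely because we know the system is linear: total protein really is a weighted sum of the ingredients.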
> I'm wondering what these "solid reasons" could be?
For the classic assumptions for regression, we know that a linear function exists and that our data has additive errors that have mean zero and constant variance and are Gaussian.
Then we can get an F ratio test on the regression, t-tests on the coefficients, and confidence intervals on the predicted values. This is just the standard stuff in the classic derivations in several of the books I listed.
And with some weaker assumptions, there are still some such results we can get.
This is just for regression.
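That standard machinery can be sketched with numpy alone (simulated data; this is the textbook F ratio and coefficient t statistics, not any particular package's output):

```python
import numpy as np

rng = np.random.default_rng(1)

# The classic setup: a linear model really exists, errors are
# i.i.d. Gaussian with mean zero and constant variance.
n = 200
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])          # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS estimates

resid = y - X @ beta
p = 1                                          # non-intercept predictors
df_resid = n - 2
s2 = resid @ resid / df_resid                  # error variance estimate

# F ratio on the regression: explained vs residual mean square.
ss_total = np.sum((y - y.mean()) ** 2)
ss_resid = resid @ resid
F = ((ss_total - ss_resid) / p) / (ss_resid / df_resid)

# t statistics on the coefficients: estimate / standard error.
cov = s2 * np.linalg.inv(X.T @ X)
t_stats = beta / np.sqrt(np.diag(cov))
```

Comparing F and the t statistics against the F and t distributions with the right degrees of freedom gives the tests and confidence intervals from the classic texts.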
There's a lot more in applied statistics, that is, where we make some assumptions that seem accurate enough in practice, get some theorems, and benefit from the conclusions of the theorems.
For an example, also in this thread I wrote about estimating how long some submarines would last. The assumptions were explicit and, on a nice day, somewhat credible. If we swallow the assumptions, then we have to take the conclusions seriously.

I've done other problems in applied statistics -- have a problem, collect some data, see what assumptions we can make; from the assumptions, we have some theorems; from the theorems, we have some conclusions powerful for the problem. If we swallow the assumptions, trust the software, ..., then we have to take the conclusions seriously. So, we get to check the software and argue about the assumptions.
For machine learning as in the Bloomberg course, we have some training data and some test/validation data. Fit to the training data. Check with the test data. Assume that the real world situation doesn't change, apply the model, and smile on the way to the bank.
I guess, okay when it works. But:
(A) How many valuable real applications are there? I.e., regression has been around with solid software for decades, and I've yet to see the yachts of the regression practitioners -- maybe there are some now, but what is new might mostly be hype. Or was the problem that the SAS founders were not very good at sales? Same for SPSS (now pushed by IBM), Matlab, R, etc.?
(B) Wouldn't we also want confidence intervals on the predictions? Okay, maybe there are some resampling/bootstrap ways to get those, maybe.
(C) There's more to applied statistics than empirical curve fitting, and commonly there we can have "solid reasons". For more on applied statistics, what I've done is only a drop in the bucket (ocean), but the research libraries are awash in examples.
Graycat and I have a history of conversations on HN regarding the differences between statistics and machine learning approaches.
ML is a term that came from Russian probabilists and statisticians. ML is applied probability, optimization and algorithms. A traditional statistician will be strong in the former but not necessarily in the other two.
One can of course take the 'what sticks to the wall approach' but it would be a mischaracterization to state that a principled approach does not exist in ML literature. There are theorems stating when to try one over the other and then what to expect.
The difference in stats and ML can be seen in the nature of the theorems in their body of literature.
Let's take the probabilistic setting of ML, because this matches the assumptions a sophisticated statistician would make. ML also has techniques where you don't need a stochastic model; those settings are similar to adversarial, game-theoretic setups that do not involve a stochastic sampling process.
The assumptions:
We have a joint distribution over random variables X,Y and have drawn a sample of size n from it. We have another sample from the same joint distribution but the Ys have been removed and we want to get those back.
The baby statistician's way:
God came to me in my dream and told me everything about a function f(X, \theta) that would recover the Ys. He conveyed an infinite amount of information but left out a finite number of parameters (settings of knobs) for me to figure out. I was told that the specific \theta lives in the set \Theta.
The mature statistician's way:
Almost the same as above, except that \theta would be living in an infinite dimensional space. In other words, he can work with a less powerful god who leaves out an infinite amount of information.
The theorems: A statistician would be interested in theorems where they show that their method manages to recover the \theta that god left out.
In other words they would like to find an estimator \hat{\theta} that for an infinitely sized initial sample converges to the true \theta. The crowning accomplishments are when these theorems prove that this happens with probability 1. Weaker ones are those that state that this happens with as high a probability one wants (but never quite reaching 1).
ML way:
A machine learner is not interested in theorems that characterize how close \hat{\theta} is to the true \theta. Furthermore, asymptotically we are all dead, so he is interested in finite sample results. His point would be that all this god mumbo-jumbo and \theta is a piece of fiction we invented anyway. No one will ever be able to see these, even in principle, so he couldn't care less about those. He would care about theorems that show how the function value (and not its parameters) converges uniformly to the best function in a class of functions \cal F_n of his choosing, using finitely many samples. He would also be interested in finding a sequence of function classes so that one can consider richer and richer supersets of function classes as he gets more and more data. The regularization business falls out of the function classes he wants to look at. A closely related statistical viewpoint would be that of sieves.
This is not an entirely correct characterization, because there existed (and exist) probabilists and statisticians who were interested in prediction theorems rather than parameter recovery theorems, but they were far, far fewer in number. An example would be Prof. Dawid.
Nice comment, " the now classic texts in regression analysis start with something the machine learning curve fitting does not...". What classic texts would you recommend starting with?
A relatively good introductory text on statistics, starting without either calculus or linear algebra, with nice progress into experimental design and analysis of variance, that is, multivariate statistics with discrete data, from some serious experts in that field, from the University of Iowa, for some serious reasons (how to maximize corn yields considering soil chemistry, water, seed variety, plowing techniques, fertilizer, etc.), long commonly used for undergraduates in the social sciences, e.g., educational statistics, agriculture, etc., is (with my mark-up for TeX):

George W.\ Snedecor and William G.\ Cochran, {\it Statistical Methods, Sixth Edition.\/}
A good first, no-toy text on regression, with minimal prerequisites:

N.\ R.\ Draper and H.\ Smith, {\it Applied Regression Analysis.\/}
Some good books on regression and its usual generalizations, e.g., IIRC, factor analysis, i.e., principal components, discriminant analysis, etc.:

Maurice M.\ Tatsuoka, {\it Multivariate Analysis: Techniques for Educational and Psychological Research.\/}

Donald F.\ Morrison, {\it Multivariate Statistical Methods: Second Edition.\/}

William W.\ Cooley and Paul R.\ Lohnes, {\it Multivariate Data Analysis.\/}
A mathematically relatively serious text on regression:

C.\ Radhakrishna Rao, {\it Linear Statistical Inference and Its Applications:\ \ Second Edition.\/}
For multivariate statistics with discrete data, consider

George W.\ Snedecor and William G.\ Cochran, {\it Statistical Methods, Sixth Edition.\/}

Stephen E.\ Fienberg, {\it The Analysis of Cross-Classified Data.\/}

Yvonne M.\ M.\ Bishop, Stephen E.\ Fienberg, Paul W.\ Holland, {\it Discrete Multivariate Analysis:\ \ Theory and Practice.\/}

Shelby J.\ Haberman, {\it Analysis of Qualitative Data, Volume 2, New Developments.\/}
The classic on the math of analysis of variance:

Henry Scheff\'e, {\it Analysis of Variance.\/}
Broadly, in all of this, we are trying to analyze data on, call it, several variables, make predictions, etc.

In all of this, and as in the title {\it The Analysis of Cross-Classified Data\/} above, suppose we have in mind random variables Y and X.
What is a random variable? Go outside. Measure something. Call that the value of random variable Y. What you measured was one value of possibly many that you might have measured. Considering all those possible values, there is a cumulative distribution F_Y such that for any real number y we have the probability that random variable Y is <= real number y:

P(Y <= y) = F_Y(y).

So, F_Y(y) is defined for all real numbers y, goes to 0 in the limit as y goes to minus infinity and to 1 in the limit as y goes to plus infinity. So, as we move real number y from left to right, F_Y(y) increases monotonically. On a nice day, function F_Y is differentiable, and the derivative from calculus

f_Y(y) = d/dy F_Y(y)

is the probability density of real random variable Y.
Here's the standard way to discover something about Y, in particular about its cumulative distribution F_Y: We can imagine having random variables Y_1, Y_2, ... that are, in the sense of probability, independent and have the same cumulative distribution as Y. Then for positive integer n and for real number y, by the law of large numbers (the weak version has an easy proof), in the limit as n grows large, as accurately as we please, the fraction of the values

Y_1, Y_2, ..., Y_n

that are <= y is F_Y(y). So, via such simple random sampling, for any real number y we can estimate F_Y(y), the cumulative distribution of Y.
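That sampling estimate takes only a few lines to check numerically (my own toy setup: Y is standard normal, so the true F_Y(0) is exactly 1/2):

```python
import numpy as np

rng = np.random.default_rng(2)

# Draw a large i.i.d. sample from Y's distribution (here: standard normal).
n = 100_000
sample = rng.normal(size=n)

def ecdf(y, sample):
    """Fraction of sample values <= y: the law-of-large-numbers estimate of F_Y(y)."""
    return np.mean(sample <= y)

# At y = 0 the true standard normal CDF is exactly 1/2.
est = ecdf(0.0, sample)
```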
For a little more, under meager assumptions that hold nearly universally in practice, if we take the ordinary grade school average of

Y_1, Y_2, ..., Y_n

as n increases we will approximate the average or expected value of Y, denoted by E[Y].

To define the expected value, we can use some calculus and the cumulative distribution F_Y, but for now let's just use our intuition about averages and move along.
Now suppose we are also given random variable X. Maybe the values of X are just real numbers, some 10 real numbers, 20 values, values from the set {1, 2, 3}, the last three weeks 100 times a second of the NYSE price of Microsoft, or full details on the atmosphere of earth every microsecond for the past 5 billion years. That is, for the values of X we can accept a lot of generality. Still more generality is possible, but that would take us on a detour for a while.

For our point here, let's assume that X takes on just discrete values or that we have just rounded off the values and forced them to be discrete. In practice we will have only finitely many discrete values.
Now we want to use X to predict Y.
So, sure, much of machine learning is to construct a model, maybe with regression trees or neural networks, to make this prediction, but here we will show a simpler way that is always the most accurate possible whenever we have enough data. How 'bout that!

This simpler way is just old cross tabulation.

Net, over a wide range of real cases trying to predict Y from X, we should just use cross tabulation unless we don't have enough data. Or, the main reason for just empirical curve fitting using regression linear models or neural network continuous models is that we don't have enough data just to use cross tabulation.

For a preview of a coming attraction, we will notice in nearly all of regression and neural networks big concerns about overfitting. Well, cross tabulation doesn't have that problem. How 'bout that!
Let real number x be some value of random variable X. Suppose we take the cumulative distribution of Y when X = x (X is a random variable that might take on many different values; x is just some real number). E.g., maybe when X = x = 3 (X is a random variable; x is just the real number 3) we find the conditional probability that Y <= y given that X = x, that is, we find P(Y <= y|X = x) = F_{Y|X}(y) (Y is a random variable; y is just some real number; Y|X is a subscript).
Or we imagine a thick chocolate cake with one edge the Y axis and another edge the X axis. The surface of the cake is the joint probability density of random variables X and Y. At X = x, we cut the cake parallel to the Y axis and perpendicular to the X axis. We look at the cut surface -- that's essentially the conditional probability density of Y given X = x.
Note: In an advanced course in probability, we can show that it makes sense to consider the density at individual values of x one at a time -- the key to doing so is the subject of measure theory used as the foundation for probability theory, as in the 1933 paper of A. Kolmogorov.
Well, with that cut surface, we take its area and divide the surface by that area, so that the cut surface is scaled to have area 1 and is a probability density -- we are being really intuitive here but essentially correct.
We now have the conditional probability density of random variable Y given random variable X when X = x for real number x.

Or, suppose Y is height and X is weight. With X = 110, we get the probability density distribution of height for people who weigh 110.

So, we are on the way to using weight to predict height -- no joke, this is close to fully seriously the most accurate way to proceed in the context.
If X is not just a number but two numbers, weight and gender, say, 0 for female and 1 for male, then, with weight 110 and gender female, we get the distribution of height for 110 pound females. If X has three components, with the third 1 for cheerleaders and 0 otherwise, then we can predict height given weight of female cheerleaders. So, we are into predicting one random variable, height, from three components, weight, gender, and cheerleader or not, of random variable X. Sure, the component weight is also a random variable, and similarly for gender and cheerleader or not. So, if you will, we are predicting random variable Y height from three random variables or three components of the one random variable X -- your choice.
Since we can get a conditional distribution, we can take the expectation of the distribution and get a conditional expectation. So, the conditional expectation of Y given that X = x is written E[Y|X = x]. This is correct, but with details we could spend an hour here, so let's just move along.
Well, the conditional expectation

E[Y|X = x]

is a function of the real number x. We can also regard it more simply as a function of the random variable X and write this as E[Y|X], the conditional expectation of random variable Y given (the values of) random variable X. So, for some function f with domain the values of X, we have f(X) = E[Y|X]. So, under meager assumptions that essentially always hold in practice, f(X) is also a random variable.
It turns out, nicely enough, that the expectation of E[Y|X] is the expectation of Y itself:

E[E[Y|X]] = E[f(X)] = E[Y]
We want to show that f(X) = E[Y|X] is the least squares (minimum squared error) approximation of Y possible from X. So f(X) = E[Y|X] is the best predictor of Y, better than anything we could hope to do with regression, neural networks, anything linear, anything nonlinear, essentially anything at all.
For the proof, we start with something simple: We show that a = E[Y] minimizes E[(Y - a)^2]. That is, we are finding the single number a that best approximates real random variable Y in the least squares sense. Well, we have

E[(Y - a)^2]
= E[Y^2 - 2Ya + a^2]
= E[Y^2] - 2aE[Y] + a^2
= E[Y^2] + E[Y]^2 - 2aE[Y] + a^2 - E[Y]^2
= E[Y^2] + (E[Y] - a)^2 - E[Y]^2

which we minimize with a = E[Y].

Or, for one interpretation, the minimum rotational moment of inertia is for rotation about the center of mass.
So, for our main concern, suppose we want to use the data we have, X, to approximate Y. So, we want a real valued function g with domain the possible values of X so that g(X) approximates Y.

For the most accurate approximation, we want to minimize

E[(Y - g(X))^2]

Claim: For g(X) we want

g(X) = E[Y|X]

So, g(X) is the best non-linear least squares approximation to Y.
Proof: We start by using one of the properties of conditional expectation and then continue with just simple algebra, much as we did for E[(Y - a)^2] just above:

E[(Y - g(X))^2]
= E[ E[Y^2 - 2Yg(X) + g(X)^2|X] ]
= E[ E[Y^2|X] - 2g(X)E[Y|X] + g(X)^2 ]
= E[ E[Y^2|X] + E[Y|X]^2 - 2g(X)E[Y|X] + g(X)^2 - E[Y|X]^2 ]
= E[ E[Y^2|X] + (E[Y|X] - g(X))^2 - E[Y|X]^2 ]

which is minimized with

g(X) = E[Y|X]

Done.
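The claim is easy to check numerically. In this small simulation (a toy setup of my own), X is discrete, E[Y|X] is deliberately nonlinear in X, and the conditional mean predictor beats the best linear fit in mean squared error:

```python
import numpy as np

rng = np.random.default_rng(3)

# X takes a few discrete values; Y depends on X nonlinearly plus noise.
n = 200_000
X = rng.integers(0, 4, size=n)          # values in {0, 1, 2, 3}
f = np.array([0.0, 1.0, 0.0, 1.0])      # E[Y|X=x], deliberately nonlinear in x
Y = f[X] + rng.normal(scale=0.3, size=n)

# The conditional-expectation predictor: its MSE is just the noise variance.
mse_cond = np.mean((Y - f[X]) ** 2)

# The best linear predictor a + b*X, fit by least squares.
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
mse_lin = np.mean((Y - A @ coef) ** 2)
```

Here mse_cond comes out near the irreducible noise variance 0.09, while the linear fit pays an extra penalty for the nonlinearity it cannot capture.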
And how do we use this? We just discretize X and use cross tabulation, which directly estimates E[Y|X]. To be more clear, given real number x, to predict Y when X = x, we use cross tabulation to estimate E[Y|X = x] = f(x).

The good news is what we have proven: We have made the best prediction, in the least squares sense, of Y possible from our data on X.
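A toy sketch of that cross tabulation estimate (the weight/height numbers are invented): bin X, average Y within each bin, and the bin averages converge to E[Y|X = x].

```python
import numpy as np

rng = np.random.default_rng(4)

# Discretized weight (8 bins) and height with a nonlinear conditional mean.
n = 50_000
bins = rng.integers(0, 8, size=n)
true_mean = 60.0 + 6.0 * np.sqrt(np.arange(8))   # E[height | bin], made-up numbers
height = true_mean[bins] + rng.normal(scale=2.0, size=n)

# Cross tabulation: the average height within each weight bin.
xtab = np.array([height[bins == b].mean() for b in range(8)])

# Predict a new person's height from their bin by table lookup.
pred_for_bin_3 = xtab[3]
```

No model class was chosen and nothing could overfit a functional form; the only requirement is enough data in every cell of the table.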
The bad news is that the amount of data we need grows exponentially in the number of different components in X. I.e., we considered three variables in X: weight, gender, and cheerleader. The amount of data we need grows as the number of discrete values of X, which grows exponentially in the number of components in X.
If somehow we KNOW that Y is a linear function of the components of X, then we might consider using regression. If we are doing just empirical curve fitting and have enough data, which in practice likely means relatively few components in X, then we can consider using just cross tabulation. But if X has thousands of components, then the exponential explosion in the amount of data required on X likely means that we will have to set aside cross tabulation, look for a way to use, maybe, a few dozen components of X, or go ahead and use the machine learning material in the Bloomberg course.
Uh, if we are trying to predict people, there is an old hint -- people are no more than about 14 dimensional beings. That is, given 1000 variables on each of 10 million people, using principal components we should be able to reproduce the 1000 variables quite accurately using just 14 principal components obtained from the 1000 with just a linear transformation. Then use cross tabulation on the 14 variables -- maybe.
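A small numpy sketch of that idea (scaled down to 50 observed variables driven by 5 latent factors rather than 1000 by 14, to keep it quick): principal components via the SVD of the centered data recover the observed variables almost exactly from the few components.

```python
import numpy as np

rng = np.random.default_rng(5)

# 10,000 "people", 50 observed variables driven by only 5 latent factors.
n, d, k = 10_000, 50, 5
latent = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
data = latent @ mixing + 0.01 * rng.normal(size=(n, d))

# Principal components via SVD of the centered data.
centered = data - data.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Keep the top k components (a linear transformation) and reconstruct.
scores = centered @ Vt[:k].T                 # the k principal components
recon = scores @ Vt[:k] + data.mean(axis=0)  # back to the 50 variables
rel_err = np.linalg.norm(recon - data) / np.linalg.norm(data)
```

The large gap between the k-th and (k+1)-th singular values is what signals that only k components carry real structure.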
My solution never mentioned Bayes! So, don't have to "be a Bayesian" to use what I wrote.
I just used conditional expectation which, of course, with full details, is from the Radon-Nikodym theorem in measure theory as in, say, Rudin, Real and Complex Analysis with a nice, novel proof from von Neumann.
You seem to have a preference for an approach in which you assume certain things are true about the world (e.g. y is a linear function of x), and then you derive some optimal prediction function, based on that assumption, under some definition of optimal. And that seems fine. In that example, you'd end up with the same results if you decided to restrict your search for a prediction function to a hypothesis space containing only linear functions -- not because you think the world obeys a linear function, but because you happen to like working with linear functions. I do agree that you can get insight into a method by knowing when it's the optimal method. We do talk about conditional probability models in Lecture 17, where we can assume that distribution of y has a specific form given x (although again we frame it as restricting your search space, rather than as an assumption about the world).
About the "throwing stuff against the wall until something appears to stick." First of all, I don't entirely object to this approach, in the sense that I don't think it's dangerous, so long as you follow the standard procedures of machine learning. And it's where somebody would be if, for example, they went through Lecture 1 on Black Box Machine Learning. But the other 29 lectures are building towards more than that. For example, in Lecture 6 and 7 we get a pretty careful understanding of how L1 and L2 regularization behave in different situations. In Lecture 19, we connect L1 and L2 regularization to various prior beliefs you have about the values of the true coefficients (in a Bayesian framework). We do examine the hypothesis spaces of trees, boosted trees, and neural networks, and consider the tradeoffs (piecewise constant vs smooth, trainable by gradient methods vs not). Yes, there is absolutely plenty of "just try it" in machine learning. Most of the theory of machine learning (generalization bounds, etc) is about when it's ok to estimate performance on future data based on performance on data you have now. We don't have to believe the world obeys a certain model for this to work, we only have to believe that the world will behave the same way in the future as it does now.
It's unfortunate that there wasn't more time in the class for factor analysis, although we do have a thorough treatment of the EM algorithm (Lecture 27), which is what you'd need for that. I used to give a similar argument about 'crosstab' and the curse of dimensionality (https://davidrosenberg.github.io/mlcourse/Archive/2015/Lectu...). What other methods would you have liked to see in the course? To scope it, the course is focused on making good prediction on future data.
> We do talk about conditional probability models in Lecture 17, where we can assume that distribution of y has a specific form given x (although again we frame it as restricting your search space, rather than as an assumption about the world).
With what I wrote, just use cross tabulation justified as the best least squares estimate just from my short derivation based on conditional expectation. Don't have to mention Bayes.
> assume that distribution of y has a specific form given x
With what I wrote, don't have to do that. To be quite precise, do need to assume that random variable Y has an expectation, but in practice that is a meager assumption. Otherwise in what I wrote, essentially don't have to make any assumptions at all about distributions.
Here I have a remark to the statistics community, especially the beginning students and their teachers, a remark old for me: Yes, we have random variables. Yes, each random variable, even the multidimensional ones, has a cumulative distribution. Yes, it's often not too much to assume that we have a probability density. Yes, there are some important probability densities, exponential, Poisson, ..., especially Gaussian. Yes, some of these densities have some amazing properties -- e.g., for the Gaussian, sample mean and variance are sufficient statistics (see a famous 1940s paper by Halmos and Savage and a later paper on lack of stability by E. Dynkin).
So, from that, a suggestion accepted implicitly is that in work in applied statistics early on we should seek to find "the distribution". If the usual suspects don't fit, then we should try some flexible alternatives and try to make them fit.
If you want some spreadsheet program to draw you a histogram, okay. With a good program, you can get a histogram for two dimensional data. If you strain with your eyes and some tricky software, you have an outside shot at three dimensions. For more than three dimensions, you are in Death Valley in July out'a gas and walking -- not quite hopeless yet but not in a good situation.
Sorry, on finding the distribution, really, essentially NOPE -- don't try that, basically can't do that. Blunt reality check: Yes, distributions exist but, no, usually can't find them and, actually, mostly work without knowing them. Maybe all you know is that variance is finite! This situation holds mostly true for real valued random variables. For multivariate valued random variables, e.g., vector valued, as the dimension rises above 2, are getting into trouble; before the dimension gets to be a dozen, are lost at sea; anything above a few dozen are lost in deep outer space without a paddle and hopeless.
> We don't have to believe the world obeys a certain model for this to work, we only have to believe that the world will behave the same way in the future as it does now.
Yes, classic texts on regression, and some of what J. Tukey wrote about how regression has commonly been used in practice, should have made this point clearer and gone ahead and been clearer about overfitting, etc.
Also, with this approach, given some positive integer n and some data (y_i, x_i), i = 1, 2, ..., n, maybe if we keep looking, e.g., fit with ln(x_i), sin(x_i), exp(x_i), ..., eventually might get a good fit. Then do we believe that the system behaved this way in the past so with meager assumptions will do so in the future? I confess, this is a small, philosophical point.
Better points:
(A) Just from the course, for any model building, we have a dozen, maybe more, efforts at TIFO -- try it and find out. That's a bit wasteful of time -- right, automate that, have the computers do all the usual dozen. Hmm.
(B) We are so short on assumptions, we will have to be short on consequences of theorems about what we have or might have or might do. Again, we are like Ptolemy instead of Newton.
Yes, I neglected to mention the famous and appropriate "curse of dimensionality".
> What other methods would you have liked to see in the course?
As I wrote, there's a lot on the shelves of the research libraries. No one course can cover all that might be useful for solving real problems.
So, another approach is first to pick the problem and then pick the solution techniques. Even for some solution techniques that might be, even have been, useful, tough to say that they should be in a course. So, maybe what should be in a course is a start, lots of powerful, general views, that enable, given a particular problem, picking a suitable solution technique. And, even with this broad approach, it might be productive to have some specializations.
I have a gut feeling that some of the computer science community took regression a bit too seriously.
Sure, I really like and respect Leo Breiman and can respect his work on classification and regression trees (do have his book but never bothered to use or get the software), but I like his text Probability published by SIAM a lot more, have it, have studied it, along with Neveu, Chung, Loeve, Skorohod, etc.
So, to suggest a prediction technique not much related to regression? Okay: Blue has some subsurface ballistic missile (SSBN) boats, i.e., submarines that while submerged can fire intercontinental ballistic missiles, each with several nuclear warheads that can be independently targeted.
These SSBNs hide in the oceans of the world. To help remain hidden, they execute some approximation to Brownian motion. E.g., even if Red gets some data on where a Blue SSBN is, a day or so later Red will have much less good data on where it is.
If Red can find an SSBN, then sinking it is easy enough -- just explode a nuke over it. But if sink one, then that starts a nuclear war, and the remaining SSBNs might shoot. So, Red could try to find and sink all the SSBNs at once -- finding them all, all at the same time, is unlikely, and mounting a search effort that would promise to do that would be provocative.
But, for considering world war scenarios, maybe there would be a nuclear war but limited just to sea. Then could sink SSBNs one at a time as find them. In this scenario, how long might they last?
Getting a good answer is your mission, and you have to accept it! Gee, we could try regression analysis, maybe neural networks?
Or: How to find things at sea was considered back in WWII -- that was an important question then, also. There was a report OEG-56 by B. Koopman. The report argued, with some clever applied math, that with an okay approximation the encounter rates were as in a Poisson process with the encounter rate a known algebraic function, in the report, in terms of the area of the ocean (for real oceans, neglect the details of the shape), speed of the target, speed of the searcher, and detection radius, all IIRC.
So, ..., can argue that for the war at sea, encounters are like various Poisson processes. Altogether the Red-Blue inventory declines like a multi-dimensional, continuous time, discrete state space (huge state space) Markov process subordinated to a Poisson process (for a good start, see the E. Cinlar Introduction to Stochastic Processes).
So, given the speeds, radii, and what happens -- one dies, other dies, both die, neither die -- one on one, Red-Blue encounters, have all the data needed.
Now with that data, we can make some predictions. Sure, there's a closed form solution, but the "curse of dimensionality" gets in the way. But, Monte Carlo and the strong law of large numbers to the rescue: Write software for a few days and run it for a few minutes, run off 1000 sample paths, average, and get the prediction with some decently good confidence intervals.
I did that. Later the work was sold to a leading intelligence agency. I could tell you which one, but then I'd have to ...!
"Look, Ma, no regression analysis!".
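A toy sketch of that Monte Carlo idea (every number here is invented; this is nothing like the actual model): Red encounters each surviving boat as a Poisson process, so the time to the next sinking is exponential with rate proportional to the boats remaining; average many sample paths and the strong law gives the prediction with a confidence interval.

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 0.01   # assumed encounter rate, per boat per day (invented)
boats = 20   # assumed starting inventory (invented)
paths = 1000

def last_boat_time(rng):
    """One sample path: time until the last boat is sunk."""
    t, remaining = 0.0, boats
    while remaining > 0:
        # Next sinking: exponential with rate lam * (boats remaining).
        t += rng.exponential(1.0 / (lam * remaining))
        remaining -= 1
    return t

samples = np.array([last_boat_time(rng) for _ in range(paths)])
mean = samples.mean()
# 95% confidence half-width from the CLT.
half = 1.96 * samples.std(ddof=1) / np.sqrt(paths)
```

The real problem has a huge multi-dimensional state space and one-on-one encounter outcomes, but the shape of the computation -- a few days of software, a few minutes of runtime, 1000 sample paths, average -- is the same.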
I've got some more, but this post is getting long!
Lesson: There's a lot more to applied statistics than empirical curve fitting.
Hehe ok -- I also love Breiman's Probability book. It's really a standout on ergodic theory. And Breiman et al.'s book on Trees is surprisingly rich, talking about all sorts of stuff besides trees.
As much as I respect Breiman, I think he latched on too hard to his pet theory that all that ensembles do is reduce variance, and by doing so missed out on what boosting does.
Yeah, random forests work really well, but they are layers and layers of hacks, rules of thumb, and intuition piled on top of each other. I can't claim with a straight face that any of them follows from solid principles.
Graycat and I have a history of discussing the differences between stats and ML here on HN. I just added a comment upstream in the thread.
I imagine Breiman was just talking about bagging-style parallel ensembles, when he was talking about variance reduction, not boosting-style sequential ensembles. Not long before he died, he was still actively trying to figure out why AdaBoost “works”. Don’t think he claimed to really understand that. He had experimental results that disputed the “it’s just maximizing the margin” explanation.
Saw the comments above — are you from a stats or ML background, or neither?
You can take a look at his technical reports from 2001 onwards; lots of very interesting material, and lots of back and forth between Freund, Schapire, and Breiman in those. The reports continue to be hosted on Breiman's home page.
For ergodic theory, there were a few really good lectures, and I have good notes, in a course I took from a star student of E. Cinlar.
For Breiman Probability:
(1) I liked his start, nice, simple, intuitive, on how the law of large numbers works. Of course the super nice proof is from the martingale convergence theorem, but Breiman's start is nice.
(2) His book has the nicest start on diffusions I have seen.
(3) Once in some of my work I ran into a martingale. So, then, when I wanted to review martingale theory, I went to Breiman. Nice.
(4) Since I was interested in stochastic optimal control, I was interested in measurable selection and regular conditional distributions, and I got those from Breiman -- thanks Leo!!!! Yes, IIRC, there is the relevant Sierpinski exercise in Halmos, Measure Theory and also in Loeve, Probability, but it was Breiman who gave me what I really needed.
(5) Generally Breiman is easy to read. He writes like he is trying to help the reader learn.
I am new to machine learning and have difficulty understanding the equations. I finished 2 courses from Andrew Ng's Machine Learning class on Coursera; I understand the flow and the concepts and knew how to write those equations/algorithms, but it always bugs me that I don't understand those equations. Do you guys have any suggestion on where I should start for learning those equations/maths online?
If you want to keep up with the maths, the main prerequisites for more rigorous ML are typically undergrad-level calculus, linear algebra, and a bit of basic probability and statistics.
For calculus, google "MIT 18.01", "MIT 18.02", (and "MIT 18.03" if you like), which are all freely available on youtube. You should be comfortable with single-variable calculus, and at least familiar with multi-variable techniques.
For linear algebra, try "MIT 18.06", which is Gilbert Strang's MIT course. Or try 3blue1brown's "The essence of linear algebra" series, which is the best explanation I've ever seen of many concepts, but it is shorter and less in-depth than a full course.
For basic statistics, try Khan Academy's "AP Statistics" sequence.
While I think 3b1b videos are great for developing a better understanding of the purpose of certain concepts, I don't think there's any way around manually doing the problems with, say, Strang's Linear Algebra book, since that's where you spend time trying to apply these concepts to problems.
Agreed. Nothing beats practice when it comes to math. I know that I spend far too much time watching course videos, and nowhere near enough working through problem sets.
Working through some of Strang's problems has also helped me. 3blue1brown is a great introduction to give you intuition, but you cannot commit the skills to long-term memory without struggling through problems.
What is your math experience/background? In any case, for the average case, this course is probably the best for those who want a slightly deeper understanding of machine learning than offered by Andrew Ng's courses: https://work.caltech.edu/lectures.html
I also really like the Abu-Mostafa course from caltech you link, and their book. If you want to get a taste of generalization bounds and statistical learning theory (e.g. VC dimension), he gives the gentlest introduction I've seen.
I can say at a very basic level ... I didn't really focus on studying at an early age. I can still understand the equations but don't really know why and how to implement them, and in a very complex situation I can't make sense of the calculation ... Every calculation that Andrew Ng teaches is new to me; even the matrix, vector, and sum-over notation are new. Even though Andrew Ng did say "if you don't understand the math it's ok," I still love to understand the math for my machine learning career; that's why I seek help.
Some intuition about minima (gradient descent), linear regression, etc. will get you to a certain level -- which admittedly may be a bit hard if you've never had the math. But as others have suggested, you probably need the equivalent of a few semesters of calculus and linear algebra (which I never took in a formal way, though I did some variants of pre-MatLab).
Several of the comments in this thread clarify what the Bloomberg machine learning work is about. Then with the clarification, maybe I see some issues with the work.
We can start with just the simplest case, ordinary, plain old regression with just one independent variable. So, for some positive integer n, we have pairs of real numbers (y_i, x_i) for i = 1, 2, ..., n. Here we are trying to predict the y_i from the x_i; we are trying to build a model that will predict y from a corresponding x where x is not in the data. Our model is to be a straight line, e.g., in high school form
y = ax + b
Okay, doing this, with the classic assumptions, we can draw a graph of the data, the fitted line, and the confidence (prediction) interval at each real number x.
For some intuition, suppose the x_i are all between 0 and 5. Maybe then the fit is quite good, the line is close to the data, and the confidence intervals are small.
But, IIRC, roughly or exactly, the confidence interval curves are actually hyperbolas. So, while the upper and lower curves are close for x between 0 and 5, for x outside the interval [0,5] the curves can grow far apart. So, if we are interested in the predictions of the model for, say, x = 20, the confidence intervals may be very wide, enough for us to conclude that, even though we have a straight line that fits our data closely, still our model is useless at x >= 20.
So, this little example illustrates a broad point about such curve fitting: The model might work well for independent variable x (likely a vector of several components) close to the training data but commonly be awful otherwise.
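The widening can be seen numerically with made-up data, using the classical prediction-interval half-width s * t * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx); the sqrt term is the hyperbola, narrow over the training range and wide far outside it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: x in [0, 5], y roughly linear in x.
n = 30
x = rng.uniform(0, 5, n)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, n)

# Least-squares fit y = a*x + b.
a, b = np.polyfit(x, y, 1)

# Pieces of the classical prediction interval.
resid = y - (a * x + b)
s = np.sqrt(resid @ resid / (n - 2))          # residual standard error
xbar, Sxx = x.mean(), ((x - x.mean()) ** 2).sum()

def half_width(x0, t=2.05):  # t ~ t_{0.975, n-2}
    return s * t * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)

# Narrow inside the training range, much wider far outside it.
w_inside, w_far = half_width(2.5), half_width(20.0)
```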
How serious this situation is can vary a lot depending on the application. E.g., if are interested only in the values of y for x a lot like the training data, e.g., in the interval [0,5], then maybe don't care about the y value or the confidence interval when x = 20.
But quite broadly there are applications where what we want such models to tell us is the value of y for some x not close to what we have seen in our training data.
Uh, let's see: IIRC Bloomberg is selling stock market and economic data, often in nearly real time, to investors, some of whom are traders and make trades quickly, within a few seconds, based on the data from Bloomberg.
I'm no up-to-date expert on just what the Bloomberg customers are doing, but from 20,000 feet or so up, maybe the situation is something like this; broadly the investors vary:
(A) Some investors want a portfolio constructed much like in the work of H. Markowitz or W. Sharpe. In simple terms, they want the portfolio to have good expected return with low standard deviation of return and maybe, then, buy on margin to raise the rate of return while still having the risk relatively low.
(B) Some investors are interested in relationships between stocks and options -- e.g., the Black-Scholes work is an example of this. IIRC, a more general case is some stochastic process, maybe Brownian motion, reaching a boundary and a first exit. The exit has a value, so the investment problem is a boundary value problem. IIRC can design and attempt to evaluate exotic options with such ideas.
(C) Some investors are just stock pickers and buy when they sense a sudden rise in price.
But a theme in (A)-(C) is that the investors are looking for something unusual. So, in the model building, the unusual may have been unusual in the training data and the testing data. In that case, without more assumptions, theories, or whatever, the prediction of the model for unusual input data may be poor.
That is, it appears that the model building techniques promise that the model will do poorly in just the application cases of greatest interest to investors -- the unusual cases that are not well represented in the training and test data.
So, maybe first cut some of what is needed is some anomaly detection.
So, we could use more information about the systems we are trying to model. A linearity assumption is one such. In Newton's second law and law of gravity, we can check that for falling apples. Next we can try on the planets in our solar system and, nicely enough, see that it works. And then we can be pretty sure for a rocket at Mach 15 headed for stationary orbit, etc. But with just empirical curve fitting, apparently mostly we don't have such additional information.
IIRC L. Breiman's first interest in empirical curve fitting was for clinical medical data. So, maybe in that data he was trying to predict some disease but the independent variable data he was using was common in his training and testing data. I.e., he wasn't really looking for exploiting some anomaly for some once in 20 years way to get rich quick.
So the extrapolation-type problem you describe (an input not near any of your training examples) is an issue. Unless you have a world model you believe in (i.e. you've done some science -- not just statistics), hard to know if your prediction function works out there where you’ve never seen any examples. If you’ve seen some data out there, but relatively fewer than you see in deployment, then importance weighting or other approaches from covariate shift / domain adaptation could help.
Anomaly detection is definitely another important area, but I struggle to pull together a coherent unit on the topic. One issue is that it’s difficult to define precisely, at least partly because everybody means something a little bit different by it.
Also, based on classical hypothesis testing, I think that to some extent you have to know what you’re looking for to be able to detect it (ie to have power against the alternatives/anomalies you care about)... For that reason, I think it’s hard to separate anomaly detection from more general risk analysis/assessment, because you need to know the type of thing you care about finding.
So, for anomaly detection, before evaluating the model at x, might want to know if x would be an anomaly in the training data x_i, i = 1, 2, ..., n. Sure, x is likely a vector with several to many components.
An anomaly detector should be at least as good as a statistical hypothesis test.
So, for the null hypothesis, assume that x is distributed like the training data.
Okay, except we don't really know the distribution of the training data.
"Ma! Help! What am I supposed to do now???"
So, we need a statistical hypothesis test that is both multi-dimensional and distribution-free.
Let's see: In ergodic theory we consider transformations that are measure preserving .... Yup, can have a group (as in abstract algebra) of those, sum over the group, ..., and calculate the significance level of the test and, thus, get a real hypothesis test, multi-dimensional and distribution-free. For some of the details of the test, there are lots of variations, i.e., options, knobs to turn.
Detection rate? Hmm. Depends ...! Don't have data enough to use the Neyman-Pearson approach, but in a curious but still relevant sense the detection rate is the highest possible.
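The details of graycat's actual test aren't given here, but a simple stand-in in the same spirit is a rank test built on exchangeability: under the null that the new point x is distributed like the training points, all n+1 points are exchangeable, so the rank of x's nearest-neighbor distance among all n+1 leave-one-out distances is uniform. That gives an exact significance level that is multi-dimensional and distribution-free.

```python
import numpy as np

rng = np.random.default_rng(0)

def anomaly_p_value(train, x):
    """Exact distribution-free p-value for 'x is distributed like train'."""
    pts = np.vstack([train, x])                              # (n+1, d)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    stat = d.min(axis=1)        # nearest-neighbor distance for each point
    # Fraction of points at least as extreme as x; uniform rank under the null.
    return (stat >= stat[-1]).mean()

train = rng.normal(size=(200, 5))
p_typical = anomaly_p_value(train, rng.normal(size=5))   # a typical point
p_outlier = anomaly_p_value(train, np.full(5, 10.0))     # an obvious anomaly
```

No distributional form is assumed anywhere; only exchangeability under the null does the work.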
I just call this work statistics, but maybe it would also qualify according to some definitions as machine learning. But my work is not merely heuristic and has nothing to do with regression analysis or neural networks. So, again, my work is an example that there can be more to machine learning than empirical curve fitting.
So, before applying an empirically fitted model at x, want x to be distributed like the training data and at least want an hypothesis test not to reject the null hypothesis that x is so distributed.
More generally, if are looking for anomalies in the data, say, a rapid real time stream, when see an anomaly, investigate further. In this case, an anomaly detector is a first cut filter, an alarm, to justify further investigation.
Looking back on what I did, I suspect that more could be done and that some of what I did could be done better.
Of course, my interests now are my startup. Yes, there the crucial core is some applied math I derived.
Maybe I'll use my anomaly detection work for real-time monitoring for zero-day problems in security, performance, failures, etc. in my server farm.
As a very general but crude and blunt approach to show that the hypothesis tests were not trivial, used the result of S. Ulam that Le Cam called "tightness" as in P. Billingsley, Convergence of Probability Measures. When doing both multi-dimensional and distribution-free, are nearly way out in the ozone so get pushed into some abstract techniques! Meow!
Okay, I started watching your video lectures at your 15 City Sense Data and through your application of maximum likelihood estimation.
So, you drag out some common distributions, e.g., Poisson, negative binomial, beta, and look for the best "fit" for your data. Here you have little or no reason to believe that the data has any of those distributions. So, the work is all rules of thumb, intuitive heuristics, crude considerations, e.g., discrete or continuous and what the "support" is, etc. Then the quality is added on later -- see if the results are good on the test data. If not, then try again with more heuristics. If so, then use the "model" in practice. Cute. Apparently you are calling this two-step technique -- (i) use heuristics and then (ii) validate with the test data -- "machine learning". Okay.
But there is a more traditional approach: When we use, e.g., the Poisson distribution, it is because we have good reason to believe that the data really is Poisson distributed. Maybe we just estimate the one parameter of the distribution and then continue on. It will be good also to check with some test data, but really, logically it is not necessary. Moreover, we may have some predictions for which we have no test data.
An example of this traditional approach where we have no test data and, really, no "training" data in the sense you are assuming, is the work I outlined for predicting the lifetime of submarines in a special scenario of global nuclear war limited to sea. There we got a Poisson process from some work of Koopman. But we had no data on sinking submarines in real wars, not for "training" or "testing". We can do similar things for, say, arrivals at a Web site, because from the renewal theorem we have good reason to believe that the arrivals, over, say, 10 minutes at a time, form a Poisson process. Then we could estimate the arrival rate and, if we wish, do an hypothesis test for anomalies where the null hypothesis was that the arrivals were as in our Poisson process, all without any heuristic fitting and detecting anomalies that were not in any data we had observed so far.
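A sketch of that web-arrivals example with synthetic counts (the Poisson form is assumed on theoretical grounds, per the renewal-theorem argument, not fitted heuristically): the MLE of the one free parameter, the rate, is just the sample mean, and the anomaly test rejects when a new count lands far enough out in the upper tail.

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)

# Hypothetical observed 10-minute arrival counts (synthetic stand-in data).
counts = rng.poisson(lam=12.0, size=500)
rate_hat = counts.mean()        # Poisson MLE of the rate: the sample mean

def upper_tail(k, lam):
    """P(N >= k) for N ~ Poisson(lam)."""
    return 1.0 - sum(exp(-lam) * lam**j / factorial(j) for j in range(k))

# Null hypothesis: arrivals are Poisson with rate rate_hat.  A new count of
# 30 in 10 minutes gives this p-value -- small p flags an anomaly, even
# though no count that large appears anywhere in the observed data.
p_new = upper_tail(30, rate_hat)
```

The point of the example: one parameter estimated from data, everything else from the math assumption, and the test can flag anomalies never seen before.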
So, generally we have two approaches:
(1) Machine learning where we get test data, use heuristics to build a model, test the model on test data, and declare success when we pass the test.
(2) Math assumptions where we have reason to know the distributions and form of a model, and use possibly much less data to determine the relatively few free parameters in the model and then just apply the model (testing also if we have some suitable test data -- in the submarine model, we did not).
A problem for approach (1) machine learning is that we should be fairly sure that when we apply the model at point x, point x is distributed as in the training data. So, essentially we are just interpolating.
E.g., if we have positive integers n and d, the set of real numbers R, training data pairs (y_i, x_i) for i = 1, 2, ..., n where y_i in R and x_i in R^d, and fit a model that predicts each y_i from each x_i, then to apply the model at x in R^d, we should know that x is distributed like the x_i.
So, we are assuming that the x_i have a "distribution", that is, are independent, identically distributed (i.i.d.) "samples" from some distribution. And our model is a function f: R^d --> R. Then to apply our model at x in R^d we evaluate f(x). We hope that our f(x) will be a good approximation of the y in R that would correspond to our x in R^d. So, again, to do this, we should check whether x in R^d is distributed like the x_i in R^d.
I did see your work on anomaly detection, but it was much simpler and really not in the same context as an hypothesis test with null hypothesis that the x and x_i are i.i.d. in R^d.
I can believe that at times approach (1) can work and be useful in practice, but (A) a lot of data is needed (e.g., can't apply this approach to the problem of the submarine lifetimes) and (B) are limited in the values of x that can be used in the model (can't do an hypothesis test to detect anomalous arrivals beyond any seen before at a Web site).
From all I could see in your lectures, you neglected approach (2); readers should be informed that something like (2) is missing.
(Includes recorded lectures) https://sites.google.com/site/10715advancedmlintro2017f/lect...