So the extrapolation-type problem you describe (an input not near any of your training examples) is an issue. Unless you have a world model you believe in (i.e., you've done some science, not just statistics), it's hard to know whether your prediction function works out there where you've never seen any examples. If you've seen some data out there, but relatively little compared to what you see in deployment, then importance weighting or other approaches from covariate shift / domain adaptation could help.
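
Here's a minimal sketch of the importance-weighting idea under covariate shift (not taken from the course): estimate the density ratio between the deployment and training input distributions with a probabilistic classifier, then pass the resulting weights to any learner that accepts sample weights. The function name, the logistic-regression choice, and the X_train / X_deploy arrays are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def covariate_shift_weights(X_train, X_deploy):
        # Train a classifier to distinguish deployment inputs (label 1) from
        # training inputs (label 0); its odds estimate the density ratio
        # p_deploy(x) / p_train(x), after correcting for the class sizes.
        X = np.vstack([X_train, X_deploy])
        z = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_deploy))])
        clf = LogisticRegression(max_iter=1000).fit(X, z)
        p = clf.predict_proba(X_train)[:, 1]       # P(deploy | x) at each training point
        odds = p / (1.0 - p)
        return odds * (len(X_train) / len(X_deploy))   # density-ratio weights

    # The weights then go into any estimator that accepts sample_weight, e.g.
    # some_regressor.fit(X_train, y_train, sample_weight=covariate_shift_weights(X_train, X_deploy)).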

Anomaly detection is definitely another important area, but I struggle to pull together a coherent unit on the topic. One issue is that it’s difficult to define precisely, at least partly because everybody means something a little bit different by it.

Also, based on classical hypothesis testing, I think that to some extent you have to know what you're looking for to be able to detect it (i.e., to have power against the alternatives/anomalies you care about)... For that reason, I think it's hard to separate anomaly detection from more general risk analysis/assessment, because you need to know the type of thing you care about finding.

In any case, I made an attempt at anomaly detection: there's https://bloomberg.github.io/foml/#lecture-15-citysense-proba... which is simply about building a conditional probability model and flagging behavior as anomalous if it has low probability or probability density under the model. I also used to have one-class SVMs in a homework (https://davidrosenberg.github.io/mlcourse/Archive/2017/Homew... Problem 11).
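
The lecture builds conditional probability models; the toy sketch below shows only the flagging step, with a single unconditional density, made-up 2-D "behavior" data, and an arbitrary 1% cutoff — all assumptions for illustration.

    import numpy as np
    from scipy.stats import gaussian_kde

    # Pretend 2-D behavior features (e.g., hour of day, activity level).
    rng = np.random.default_rng(0)
    normal_behavior = rng.multivariate_normal([12.0, 5.0], [[4.0, 0.0], [0.0, 1.0]], size=2000)

    # Fit a density to the historical data and flag the lowest-density 1% as a working threshold.
    density = gaussian_kde(normal_behavior.T)
    cutoff = np.quantile(density(normal_behavior.T), 0.01)

    def flag_anomaly(x):
        # True if x has lower estimated density than 99% of the historical observations.
        return density(np.asarray(x, dtype=float).reshape(2, 1))[0] < cutoff

    print(flag_anomaly([12.0, 5.0]), flag_anomaly([3.0, 25.0]))   # typical vs. far-out behavior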


So, for anomaly detection, before evaluating the model at x, might want to know if x would be an anomaly in the training data x_i, i = 1, 2, ..., n. Sure, x is likely a vector with several to many components.

An anomaly detector should be at least as good as a statistical hypothesis test.

So, for the null hypothesis, assume that x is distributed like the training data.

Okay, except we don't really know the distribution of the training data.

"Ma! Help! What am I supposed to do now???"

So, we need a statistical hypothesis test that is both multi-dimensional and distribution-free.

Let's see: In ergodic theory we consider transformations that are measure preserving .... Yup, can have a group (as in abstract algebra) of those, sum over the group, ..., and calculate the significance level of the test and, thus, get a real hypothesis test, multi-dimensional and distribution-free. For some of the details of the test, there are lots of variations, i.e., options, knobs to turn.
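
The construction above isn't spelled out, but the most familiar instance of the "sum over a group of transformations that preserve the null distribution" recipe is an ordinary permutation test: under the null the pooled sample is exchangeable, so permuting labels preserves the null distribution, and averaging over (a random subset of) the permutation group gives a distribution-free p-value in any dimension. A sketch, with an invented cross-versus-within distance statistic:

    import numpy as np

    def permutation_p_value(train, new, n_perm=999, seed=0):
        # Distribution-free two-sample test: are the `new` points distributed
        # like the `train` points?  Under H0 the pooled sample is exchangeable,
        # so label permutations preserve the null distribution and the
        # Monte Carlo p-value below is valid by construction.
        rng = np.random.default_rng(seed)
        pooled = np.vstack([train, new])
        n = len(train)

        def statistic(labels):
            a, b = pooled[labels == 0], pooled[labels == 1]
            cross = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).mean()
            within = np.linalg.norm(a[:, None, :] - a[None, :, :], axis=-1).mean()
            return cross - within        # large when the two groups look different

        labels = np.concatenate([np.zeros(n, int), np.ones(len(new), int)])
        observed = statistic(labels)
        count = sum(statistic(rng.permutation(labels)) >= observed for _ in range(n_perm))
        return (1 + count) / (1 + n_perm)

Whatever statistic gets plugged in only changes which alternatives the test has power against, which echoes the earlier point about needing to know what you are looking for.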

Detection rate? Hmm. Depends ...! Don't have data enough to use the Neyman-Pearson approach, but in a curious but still relevant sense the detection rate is the highest possible.

I just call this work statistics, but maybe it would also qualify according to some definitions as machine learning. But my work is not merely heuristic and has nothing to do with regression analysis or neural networks. So, again, my work is an example that there can be more to machine learning than empirical curve fitting.

So, before applying an empirically fitted model at x, want x to be distributed like the training data and at least want an hypothesis test not to reject the null hypothesis that x is so distributed.

More generally, if we are looking for anomalies in the data, say, a rapid real-time stream, then when we see an anomaly, we investigate further. In this case, an anomaly detector is a first-cut filter, an alarm, to justify further investigation.

Looking back on what I did, I suspect that more could be done and that some of what I did could be done better.

Of course, my interest now is my startup. Yes, there the crucial core is some applied math I derived.

Maybe I'll use my anomaly detection work for real-time monitoring for zero-day problems in security, performance, failures, etc. in my server farm.


As a very general but crude and blunt approach to show that the hypothesis tests were not trivial, I used the result of S. Ulam that Le Cam called "tightness" as in P. Billingsley, Convergence of Probability Measures. When doing both multi-dimensional and distribution-free, are nearly way out in the ozone so get pushed into some abstract techniques! Meow!


Okay, I started watching your video lectures, at your Lecture 15 on the CitySense data, and got through your application of maximum likelihood estimation.

So, you drag out some common distributions, e.g., Poisson, negative binomial, beta, and look for the best "fit" to your data. Here you have little or no reason to believe that the data has any of those distributions. So, the work is all rules of thumb, intuitive heuristics, crude considerations, e.g., discrete or continuous and what the "support" is, etc. Then the quality is added on later -- see if the results are good on the test data. If not, then try again with more heuristics. If so, then use the "model" in practice. Cute. Apparently you are calling this two-step technique -- (i) use heuristics and then (ii) validate with the test data -- "machine learning". Okay.
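
For concreteness, a minimal sketch of what that two-step procedure looks like in code, with made-up count data, two candidate families, and a crude method-of-moments stand-in for a proper maximum likelihood fit of the negative binomial:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    counts = rng.negative_binomial(n=5, p=0.3, size=2000)   # pretend per-window event counts
    train, test = counts[:1500], counts[1500:]

    # Candidate 1: Poisson, lambda by maximum likelihood (the sample mean).
    lam = train.mean()
    poisson_ll = stats.poisson.logpmf(test, lam).sum()

    # Candidate 2: negative binomial, fit by matching mean and variance
    # (a stand-in for a proper MLE of the two parameters).
    m, v = train.mean(), train.var()
    p_hat = m / v
    r_hat = m * p_hat / (1 - p_hat)
    nbinom_ll = stats.nbinom.logpmf(test, r_hat, p_hat).sum()

    # Step (ii): keep whichever candidate scores better on held-out data.
    print(f"held-out log-likelihood  Poisson: {poisson_ll:.1f}  NegBin: {nbinom_ll:.1f}")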

But there is a more traditional approach: When we use, e.g., the Poisson distribution, it is because we have good reason to believe that the data really is Poisson distributed. Maybe we just estimate the one parameter of the distribution and then continue on. It will be good also to check with some test data, but really, logically it is not necessary. Moreover, we may have some predictions for which we have no test data.

An example of this traditional approach where we have no test data and, really, no "training" data in the sense you are assuming, is the work I outlined for predicting the lifetime of submarines in a special scenario of global nuclear war limited to sea. There we got a Poisson process from some work of Koopman. But we had no data on sinking submarines in real wars, not for "training" or "testing". We can do similar things for, say, arrivals at a Web site, because from the renewal theorem we have good reason to believe that the arrivals, over, say, 10 minutes at a time, form a Poisson process. Then we could estimate the arrival rate and, if we wish, do an hypothesis test for anomalies where the null hypothesis is that the arrivals are as in our Poisson process, all without any heuristic fitting and detecting anomalies that were not in any data we had observed so far.
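
For the Web-site arrivals example, a minimal sketch of that estimate-the-rate-then-test idea (every number here is invented):

    from scipy import stats

    # Hypothetical history: 14,400 arrivals observed over 100 ten-minute windows.
    windows, total_arrivals = 100, 14_400
    lam = total_arrivals / windows            # estimated arrivals per ten-minute window

    def arrival_anomaly_p_value(count):
        # Upper-tail p-value under H0: count ~ Poisson(lam).
        # stats.poisson.sf(count - 1, lam) = P(N >= count).
        return stats.poisson.sf(count - 1, lam)

    print(arrival_anomaly_p_value(150))   # an ordinary window
    print(arrival_anomaly_p_value(220))   # a wildly busy window

Note that this can flag a window busier than anything in the historical record, precisely because the test comes from the assumed Poisson model rather than from observed extremes.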

So, generally we have two approaches:

(1) Machine learning, where we get training and test data, use heuristics to build a model, test the model on the test data, and declare success when we pass the test.

(2) Math assumptions where we have reason to know the distributions and form of a model, and use possibly much less data to determine the relatively few free parameters in the model and then just apply the model (testing also if we have some suitable test data -- in the submarine model, we did not).

A problem for approach (1) machine learning is that we should be fairly sure that when we apply the model at point x, point x is distributed as in the training data. So, essentially we are just interpolating.

E.g., if we have positive integers n and d, the set of real numbers R, and training data pairs (y_i, x_i) for i = 1, 2, ..., n where y_i in R and x_i in R^d, and we fit a model that predicts each y_i from each x_i, then to apply the model at x in R^d, we should know that x is distributed like the x_i.

So, we are assuming that the x_i have a "distribution", that is, are independent, identically distributed (i.i.d.) "samples" from some distribution. And our model is a function f: R^d --> R. Then to apply our model at x in R^d we evaluate f(x). We hope that our f(x) will be a good approximation of the y in R that would correspond to our x in R^d. So, again, to do this, we should check whether x in R^d is distributed like the x_i in R^d.
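
A crude per-point version of that check (much weaker than a full hypothesis test; the choice of k, the Euclidean distance, and the leave-one-out ranking are all illustrative assumptions) is to ask whether x sits unusually far from the training cloud compared with how far the training points sit from each other:

    import numpy as np

    def in_distribution_score(X_train, x, k=5):
        # Rank-based score, roughly a p-value: small when the query point x is
        # farther from its k-th nearest training point than most training points
        # are from theirs, i.e. when applying f(x) would be extrapolation.
        d = np.linalg.norm(X_train - x, axis=1)
        score_x = np.sort(d)[k - 1]

        D = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=-1)
        np.fill_diagonal(D, np.inf)                 # leave-one-out
        scores_train = np.sort(D, axis=1)[:, k - 1]

        return (1 + np.sum(scores_train >= score_x)) / (1 + len(X_train))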

I did see your work on anomaly detection, but it was much simpler and really not in the same context as an hypothesis test with null hypothesis that the x and x_i are i.i.d. in R^d.

I can believe that at times approach (1) can work and be useful in practice, but (A) a lot of data is needed (e.g., can't apply this approach to the problem of the submarine lifetimes) and (B) are limited in the values of x that can be used in the model (can't do an hypothesis test to detect anomalous arrivals beyond any seen before at a Web site).

From all I could see in your lectures, you neglected approach (2); readers should be informed that something like (2) is missing.


I also really like the Abu-Mostafa course from Caltech that you link, and their book. If you want to get a taste of generalization bounds and statistical learning theory (e.g. VC dimension), he gives the gentlest introduction I've seen.


Funny - I had the same thoughts about Boyd and Vandenberghe’s book, which is why I compiled this “extreme abridgment” of what you need for the class: https://davidrosenberg.github.io/mlcourse/Notes/convex-optim...


Nice


Nice -- you just provided the solution to Homework 1, Problem 3.1 (https://davidrosenberg.github.io/mlcourse/Homework/hw1.pdf).


My solution never mentioned Bayes! So, don't have to "be a Bayesian" to use what I wrote.

I just used conditional expectation which, of course, with full details, is from the Radon-Nikodym theorem in measure theory as in, say, Rudin, Real and Complex Analysis with a nice, novel proof from von Neumann.


Yes, of course. A “Bayes prediction function” has nothing to do with being Bayesian. Bayes had a lot of things named after him ;)


You seem to have a preference for an approach in which you assume certain things are true about the world (e.g. y is a linear function of x), and then you derive some optimal prediction function, based on that assumption, under some definition of optimal. And that seems fine. In that example, you'd end up with the same results if you decided to restrict your search for a prediction function to a hypothesis space containing only linear functions -- not because you think the world obeys a linear function, but because you happen to like working with linear functions. I do agree that you can get insight into a method by knowing when it's the optimal method. We do talk about conditional probability models in Lecture 17, where we can assume that distribution of y has a specific form given x (although again we frame it as restricting your search space, rather than as an assumption about the world).

About the "throwing stuff against the wall until something appears to stick." First of all, I don't entirely object to this approach, in the sense that I don't think it's dangerous, so long as you follow the standard procedures of machine learning. And it's where somebody would be if, for example, they went through Lecture 1 on Black Box Machine Learning. But the other 29 lectures are building towards more than that. For example, in Lecture 6 and 7 we get a pretty careful understanding of how L1 and L2 regularization behave in different situations. In Lecture 19, we connect L1 and L2 regularization to various prior beliefs you have about the values of the true coefficients (in a Bayesian framework). We do examine the hypothesis spaces of trees, boosted trees, and neural networks, and consider the tradeoffs (piecewise constant vs smooth, trainable by gradient methods vs not). Yes, there is absolutely plenty of "just try it" in machine learning. Most of the theory of machine learning (generalization bounds, etc) is about when it's ok to estimate performance on future data based on performance on data you have now. We don't have to believe the world obeys a certain model for this to work, we only have to believe that the world will behave the same way in the future as it does now.

It's unfortunate that there wasn't more time in the class for factor analysis, although we do have a thorough treatment of the EM algorithm (Lecture 27), which is what you'd need for that. I used to give a similar argument about 'crosstab' and the curse of dimensionality (https://davidrosenberg.github.io/mlcourse/Archive/2015/Lectu...). What other methods would you have liked to see in the course? To scope it, the course is focused on making good predictions on future data.


> We do talk about conditional probability models in Lecture 17, where we can assume that distribution of y has a specific form given x (although again we frame it as restricting your search space, rather than as an assumption about the world).

With what I wrote, just use cross tabulation, justified as the best least squares estimate by my short derivation based on conditional expectation. Don't have to mention Bayes.

> assume that distribution of y has a specific form given x

With what I wrote, don't have to do that. To be quite precise, do need to assume that random variable Y has an expectation, but in practice that is a meager assumption. Otherwise, in what I wrote, essentially don't have to make any assumptions at all about distributions.
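
The derivation being referred to isn't reproduced in the thread, but the standard argument (assuming Y has a finite second moment, slightly more than just an expectation, so that the squared error is finite) is a one-line decomposition for any candidate predictor f:

    E\big[(Y - f(X))^2\big]
      = E\big[(Y - E[Y \mid X])^2\big] + E\big[(E[Y \mid X] - f(X))^2\big]
      \ge E\big[(Y - E[Y \mid X])^2\big]

The cross term vanishes because E[Y - E[Y | X] | X] = 0 while E[Y | X] - f(X) is a function of X alone. So f(X) = E[Y | X] is the least-squares-optimal predictor, and when X ranges over finitely many cells, the cross tabulation of within-cell sample means is its natural estimate.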

Here I have a remark to the statistics community, especially the beginning students and their teachers, a remark old for me: Yes, we have random variables. Yes, each random variable, even the multidimensional ones, has a cumulative distribution. Yes, it's often not too much to assume that we have a probability density. Yes, there are some important probability densities, exponential, Poisson, ..., especially Gaussian. Yes, some of these densities have some amazing properties -- e.g., for the Gaussian, sample mean and variance are sufficient statistics (see a famous 1940s paper by Halmos and Savage and a later paper on lack of stability by E. Dynkin).

So, from that, an implicitly accepted suggestion is that, early on in applied statistics work, we should seek to find "the distribution". If the usual suspects don't fit, then we should try some flexible alternatives and try to make them fit.

If you want some spreadsheet program to draw you a histogram, okay. With a good program, you can get a histogram for two dimensional data. If you strain with your eyes and some tricky software, you have an outside shot at three dimensions. For more than three dimensions, you are in Death Valley in July out'a gas and walking -- not quite hopeless yet but not in a good situation.

Sorry, on finding the distribution, really, essentially NOPE -- don't try that, basically can't do that. Blunt reality check: Yes, distributions exist but, no, usually can't find them and, actually, mostly work without knowing them. Maybe all you know is that variance is finite! This situation holds mostly true for real valued random variables. For multivariate valued random variables, e.g., vector valued, as the dimension rises above 2, are getting into trouble; before the dimension gets to be a dozen, are lost at sea; anything above a few dozen are lost in deep outer space without a paddle and hopeless.
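
A quick back-of-the-envelope check of that claim (the 10 bins per axis below is an arbitrary assumption): histogram-style density estimation needs on the order of bins-to-the-power-d cells, so even enormous samples leave nearly every cell empty once d is past a handful.

    # Number of histogram cells with 10 bins per axis, as the dimension grows.
    for d in (1, 2, 3, 6, 12, 30):
        print(f"d={d:>2}  cells={10**d:.0e}")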

> We don't have to believe the world obeys a certain model for this to work, we only have to believe that the world will behave the same way in the future as it does now.

Yes, classic texts on regression and some of what J. Tukey wrote about how regression has commonly been used in practice should have made this point more clear and gone ahead and been more clear about overfitting, etc.

Also, with this approach, given some positive integer n and some data (y_i, x_i), i = 1, 2, ..., n, maybe if we keep looking, e.g., fit with ln(x_i), sin(x_i), exp(x_i), ..., eventually might get a good fit. Then do we believe that the system behaved this way in the past so with meager assumptions will do so in the future? I confess, this is a small, philosophical point.

Better points:

(A) Just from the course, for any model building, we have a dozen, maybe more, efforts at TIFO -- try it and find out. That's a bit wasteful of time -- right, automate that, have the computers do all the usual dozen. Hmm.

(B) We are so short on assumptions, we will have to be short on consequences of theorems about what we have or might have or might do. Again, we are like Ptolemy instead of Newton.

Yes, I neglected to mention the famous and appropriate "curse of dimensionality".

> What other methods would you have liked to see in the course?

As I wrote, there's a lot on the shelves of the research libraries. No one course can cover all that might be useful for solving real problems.

So, another approach is first to pick the problem and then pick the solution techniques. Even for some solution techniques that might be, even have been, useful, tough to say that they should be in a course. So, maybe what should be in a course is a start, lots of powerful, general views, that enable, given a particular problem, picking a suitable solution technique. And, even with this broad approach, it might be productive to have some specializations.

I have a gut feeling that some of the computer science community took regression a bit too seriously.

Sure, I really like and respect Leo Breiman and can respect his work on classification and regression trees (do have his book but never bothered to use or get the software), but I like his text Probability published by SIAM a lot more, have it, have studied it, along with Neveu, Chung, Loeve, Skorohod, etc.

So, to suggest a prediction technique not much related to regression? Okay: Blue has some subsurface ballistic missile (SSBN) boats, i.e., submarines that while submerged can fire intercontinental ballistic missiles, each with several nuclear warheads that can be independently targeted.

These SSBNs hide in the oceans of the world. To help remain hidden, they execute some approximation to Brownian motion. E.g., even if Red gets some data on where a Blue SSBN is, a day or so later Red will have much less good data on where it is.

If Red can find an SSBN, then sinking it is easy enough -- just explode a nuke over it. But if Red sinks one, then that starts a nuclear war, and the remaining SSBNs might shoot. So, Red could try to find and sink all the SSBNs at once -- finding them all, all at the same time, is unlikely, and mounting a search effort that would promise to do that would be provocative.

But, for considering world war scenarios, maybe there would be a nuclear war but limited just to sea. Then Red could sink SSBNs one at a time as it finds them. In this scenario, how long might they last?

Getting a good answer is your mission, and you have to accept it! Gee, we could try regression analysis, maybe neural networks?

Or: How to find things at sea was considered back in WWII -- that was an important question then, also. There was a report, OEG-56, by B. Koopman. The report argued, with some clever applied math, that to an okay approximation the encounter rates were as in a Poisson process, with the encounter rate given in the report as a known algebraic function of the area of the ocean (for real oceans, neglect the details of the shape), the speed of the target, the speed of the searcher, and the detection radius, all IIRC.

So, ..., can argue that for the war at sea, encounters are like various Poisson processes. Altogether the Red-Blue inventory declines like a multi-dimensional, continuous time, discrete state space (huge state space) Markov process subordinated to a Poisson process (for a good start, see the E. Cinlar Introduction to Stochastic Processes).

So, given the speeds, the radii, and what happens in one-on-one Red-Blue encounters -- one dies, the other dies, both die, neither dies -- we have all the data needed.

Now with that data, we can make some predictions. Sure, there's a closed form solution, but the "curse of dimensionality" gets in the way. But, Monte Carlo and the strong law of large numbers to the rescue: Write software for a few days and run it for a few minutes, run off 1000 sample paths, average, and get the prediction with some decently good confidence intervals.
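
As a toy illustration of that Monte Carlo scheme (every number below -- force levels, encounter rate, outcome probabilities -- is invented, and the real model had a huge state space rather than two counts):

    import numpy as np

    rng = np.random.default_rng(0)

    def force_lifetime(n_blue=10, n_red=20, rate_per_pair=0.01):
        # One sample path: encounters arrive as a Poisson process whose total rate
        # is (remaining Blue-Red pairs) * rate_per_pair per day; each encounter
        # removes the Blue boat, the Red searcher, or both, with made-up probabilities.
        # Returns the time at which one side's inventory is exhausted.
        t = 0.0
        while n_blue > 0 and n_red > 0:
            total_rate = n_blue * n_red * rate_per_pair
            t += rng.exponential(1.0 / total_rate)     # time to the next encounter
            u = rng.random()
            if u < 0.5:
                n_blue -= 1                            # Blue boat sunk
            elif u < 0.8:
                n_red -= 1                             # Red searcher sunk
            else:
                n_blue -= 1
                n_red -= 1                             # both lost
        return t

    paths = np.array([force_lifetime() for _ in range(1000)])
    mean = paths.mean()
    half = 1.96 * paths.std(ddof=1) / np.sqrt(len(paths))
    print(f"mean lifetime ~= {mean:.0f} days  (95% CI +/- {half:.0f})")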

I did that. Later the work was sold to a leading intelligence agency. I could tell you which one, but then I'd have to ...!

"Look, Ma, no regression analysis!".

I've got some more, but this post is getting long!

Lesson: There's a lot more to applied statistics than empirical curve fitting.


Hehe ok -- I also love Breiman’s Probability book. It’s really a standout on ergodic theory. And Breiman et al.’s book on trees is surprisingly rich, talking about all sorts of stuff besides trees.


As much as I respect Breiman, I think he latched on too hard to his pet theory that all that ensembles do is reduce variance, and by doing so missed out on what boosting does.

Yeah, random forests work really well, but they are layers and layers of hacks, rules of thumb, and intuition piled one on top of the other. I can't claim with a straight face that any of them follows from solid principles.

Graycat and I have a history of discussing the differences between stats and ML here on HN. I just added a comment, upstream in the thread.


I imagine Breiman was just talking about bagging-style parallel ensembles, when he was talking about variance reduction, not boosting-style sequential ensembles. Not long before he died, he was still actively trying to figure out why AdaBoost “works”. Don’t think he claimed to really understand that. He had experimental results that disputed the “it’s just maximizing the margin” explanation.

Saw the comments above — are you from a stats or ML background, or neither?


I am more ML than stats. BTW, Breiman believed the same for boosting. Later he got a little unsure. You will find this in his writings on boosting.


Interesting -- if you've got a link, please post it.


You can take a look at his technical reports from 2001 onwards; there's a lot of very interesting material, and a lot of back and forth between Freund, Schapire, and Breiman in those. The reports continue to be hosted on Breiman's home page.


For ergodic theory, there were a few really good lectures, and I have good notes, in a course I took from a star student of E. Cinlar.

For Breiman Probability:

(1) I liked his start, nice, simple, intuitive, on how the law of large numbers works. Of course the super nice proof is from the martingale convergence theorem, but Breiman's start is nice.

(2) His book has the nicest start on diffusions I have seen.

(3) Once in some of my work I ran into a martingale. So, then, when I wanted to review martingale theory, I went to Breiman. Nice.

(4) Since I was interested in stochastic optimal control, I was interested in measurable selection and regular conditional distributions, and I got those from Breiman -- thanks Leo!!!! Yes, IIRC, there is the relevant Sierpinski exercise in Halmos, Measure Theory and also in Loeve, Probability, but it was Breiman who gave me what I really needed.

(5) Generally Breiman is easy to read. He writes like he is trying to help the reader learn.


Here are some of the things that I think are distinctive about the class (although certainly all of these are taught in some other class somewhere):

- discussion of approximation error, estimation error, and optimization error, rather than the vaguer “bias / variance” trade-off (common in more theoretical classes, such as Mohri's, but not common in practically oriented classes);

- a full treatment of gradient boosting, one of the most successful ML algorithms in use today (most courses stop with AdaBoost, which is rarely used, for reasons discussed in the course);

- much more emphasis on conditional probability modeling than is typical (you give me an input, I give you a probability distribution over outcomes — useful for anomaly detection and prediction intervals, among other things);

- an explanation (including pictures) of what happens with ridge, lasso, and elastic net in the [very common in practice] case of correlated features (see the sketch after this list; I haven't seen this in another class, though I'm sure others have done it somewhere);

- a guided derivation (in homework) of when the penalty form and constraint form of regularization are equivalent, which gives you much more flexibility in how you formulate your problem (I know this is uncommon because it took me a very long time to find a reference);

- multiclass classification as an introduction to structured prediction (idea taken from Shalev-Shwartz and Ben-David's book).

On the homework side, you’d code neural networks in a computation graph framework written from scratch in numpy; in fact, we're pretty relentless about implementing almost every major ML method we discuss from scratch in the homework.
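
On the correlated-features point, here is a tiny illustration (not from the course materials; the data, the alpha values, and the correlation level are arbitrary assumptions) of the qualitative behavior: ridge tends to share weight across near-duplicate features, while lasso tends to pick one and zero out the other.

    import numpy as np
    from sklearn.linear_model import Ridge, Lasso

    # Two highly correlated features; only the first one truly drives y.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=500)
    x2 = x1 + 0.05 * rng.normal(size=500)        # near-duplicate of x1
    X = np.column_stack([x1, x2])
    y = 3.0 * x1 + rng.normal(size=500)

    print("ridge:", Ridge(alpha=1.0).fit(X, y).coef_)   # tends to split the weight across both
    print("lasso:", Lasso(alpha=0.1).fit(X, y).coef_)   # tends to keep one and zero the other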


This course is complementary to Mohri's excellent book and course. Many students at NYU take both courses, in either order (https://davidrosenberg.github.io/ml2018/ and https://cs.nyu.edu/~mohri/ml17/). Mohri's course builds a foundation for proving performance guarantees (yes, using tools such as VCdim and Rademacher complexity). This course tries to be practical, but not superficial. We do a careful study of multiple examples of the four fundamental components of an ML method: loss function, regularization, hypothesis space, and optimization method. (In a probabilistic setting, regularization becomes a prior and loss becomes a likelihood.) Framed in this way, it's usually much easier to understand or invent new methods. And within this framework, we absolutely try to survey as many methods as we have time to look at carefully.


Hi David! Thank you for your thorough response. I appreciate it. And I understand why one would choose to cover as much breadth as you do. I don’t think there’s an easy way to cover all the basics without feeling like a survey, but keeping track of core concepts across applications was what helped me understand how they were connected.

