Statistical Data Mining Tutorials (2005) (autonlab.org)
56 points by ValentineSmith on June 6, 2015 | 5 comments



Some of his math notation is not so good.

His 22 slides on game theory go on and on but never make the really simple solution clear: it's just a small linear programming problem. One could knock it off on one slide, two or three if one wanted to be verbose. I did exactly that when I taught linear programming in college and in an MBA program.
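
To make the "it's just linear programming" point concrete, here is a minimal sketch, mine and not from the slides, of solving a two-player zero-sum matrix game with SciPy's linprog; the payoff matrix (matching pennies) is an assumption chosen only for illustration:

    # Hedged sketch: the row player's optimal mixed strategy in a
    # zero-sum matrix game is the solution of a small LP:
    #   maximize v  subject to  A^T x >= v*1,  sum(x) = 1,  x >= 0.
    import numpy as np
    from scipy.optimize import linprog

    A = np.array([[ 1.0, -1.0],    # made-up payoffs: matching pennies
                  [-1.0,  1.0]])
    m, n = A.shape

    c = np.concatenate([np.zeros(m), [-1.0]])     # minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])     # -A^T x + v <= 0
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(m), [0.0]])[None, :]  # sum(x) = 1
    b_eq = [1.0]
    bounds = [(0, None)] * m + [(None, None)]     # x >= 0, v free

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds)
    print(res.x[:m], res.x[-1])   # -> [0.5 0.5], game value 0.0

For matching pennies the LP returns the familiar answer: play each side with probability 1/2 for a game value of 0, and the column player's strategy comes out of the dual.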

More generally, a large fraction of these topics, and a larger fraction of the more basic tools, belong to what was long called the mathematical sciences, where the work was generally done more carefully: in particular, the mathematics of operations research, along with pure and applied statistics.

He ends up with genetic algorithms and simulated annealing. Gee, I encountered such a problem only once: Some guys had a resource allocation problem and had formulated it as a 0-1 integer linear program with 40,000 constraints and 600,000 variables. They had tried simulated annealing, ran it for days, and stopped with a result whose objective function value was an unknown distance from the optimal value.

I saw an approach via Lagrangian relaxation, which really needs most of a nice course in optimization, wrote some software, and got a feasible solution with objective function value guaranteed to be within 0.025% of optimality. My software ran for 905 seconds on an old 90 MHz PC.

For that bound of 0.025%: Lagrangian relaxation maintains both a lower bound and an upper bound on the optimal value of the objective function and, as the relaxation proceeds, lowers the upper bound and raises the lower bound. When the two bounds are close enough for the context, take the best feasible solution found so far and call the work done.

I'd type in the basics here except I'd really need TeX.
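
For readers who do have TeX handy, here is a hedged sketch of those basics in minimization form; the notation is mine, not the commenter's:

    % z* = min { c'x : Ax <= b, x in X }, where X carries the "easy"
    % constraints (including integrality) and Ax <= b the "hard" ones.
    \[
      L(\lambda) \;=\; \min_{x \in X}\; c^{\top}x + \lambda^{\top}(Ax - b),
      \qquad \lambda \ge 0 .
    \]
    % Weak duality: for every \lambda >= 0 and every feasible x',
    \[
      L(\lambda) \;\le\; z^{*} \;\le\; c^{\top}x' .
    \]
    % Subgradient steps on \lambda raise the lower bound L(\lambda);
    % each improved feasible solution lowers the upper bound c'x'.
    % Stop when (c^{\top}x' - L(\lambda)) / |L(\lambda)| is within the
    % tolerance the context demands (0.025\% in the run above).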

The resource allocation problem was optimization, just optimization, and needed just some of what had long been known in optimization. Simulated annealing didn't look very good, and it wasn't.

Optimization, going back to mathematical programming (unconstrained and constrained optimization, the Kuhn-Tucker conditions, linear programming, network linear programming, integer programming, dynamic programming, etc.), was a well-developed field starting in the late 1940s, with a lot of the work rock solid by 1980.

Good work has come from Princeton, Johns Hopkins, Cornell, Waterloo, Georgia Tech, University of Washington, etc.


It's hard to take your reply seriously as it comes across as a pissing contest. The post is from 2005, and the author is Andrew W. Moore - Dean of CompSci at Carnegie Mellon and previously a VP of Engineering at Google. But by all means, continue pissing while the rest of us appreciate the free content made available to us.


If you want to go by names and titles, one of my Ph.D. dissertation advisors was J. Cohon. Since you are well acquainted with CMU ...!

But I'm judging Moore's materials based on the materials, not his employment history.

I have nothing against Moore; it's not about Moore or me. Instead, it's about what Moore wrote.

Google and CMU CS aside, sorry to tell you, or maybe it's good news, think of the good news: instead of that A. Moore material, there is much, much higher quality material going way back, e.g., already in place by, say, 1970. There's G. Dantzig, R. Gomory, R. Bellman, G. Nemhauser, Ford and Fulkerson, P. Wolfe (e.g., the Wolfe dual in quadratic programming), R. Bixby, H. Kuhn, A. Tucker (professor of the professor who chaired my Graduate Board orals), D. Bertsekas, J. von Neumann, J. Nash, R. Rockafellar, W. Cunningham, and many, many more. None of these people is in computer science.

E.g., there are stacks of books on multivariate statistics with linear discriminant analysis; there are log-linear models for categorical data analysis; there are controlled Markov processes and continuous-time stochastic optimal control, with, say, measurable selection, scenario aggregation, etc.; there's lots of material on resampling and the bootstrap (I have published a paper on essentially that topic); there are sufficient statistics from the Radon-Nikodym theorem; and much more.
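
As a concrete instance of the resampling item, here is a minimal bootstrap sketch, mine and not from any of the books below, computing a percentile confidence interval for a mean on synthetic data:

    # Hedged sketch: percentile bootstrap CI for the mean.
    # The sample is synthetic; every number here is illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=10.0, scale=2.0, size=200)

    B = 10_000
    boot_means = np.empty(B)
    for b in range(B):
        # Resample with replacement, same size as the original sample.
        boot_means[b] = rng.choice(sample, size=sample.size,
                                   replace=True).mean()

    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    print(f"mean = {sample.mean():.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")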

Okay, just in regression and multivariate statistics, just from my bookshelf, there's:

William W. Cooley and Paul R. Lohnes, 'Multivariate Data Analysis', John Wiley and Sons, New York, 1971.

Maurice M. Tatsuoka, 'Multivariate Analysis: Techniques for Educational and Psychological Research', John Wiley and Sons, 1971.

C. Radhakrishna Rao, 'Linear Statistical Inference and Its Applications: Second Edition', ISBN 0-471-70823-2, John Wiley and Sons, New York, 1967.

N. R. Draper and H. Smith, 'Applied Regression Analysis', John Wiley and Sons, New York, 1968.

Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone, 'Classification and Regression Trees', ISBN 0-534-98054-6, Wadsworth & Brooks/Cole, Pacific Grove, California, 1984.

And their mathematical notation is quite precise.

In simple terms, what Moore is doing in those notes is mostly, though not all, some very fine old wine, corrupted, and put in new bottles with new labels. E.g., the 22 pages on game theory without even mentioning the simple linear programming solution is a gentleman's D- work.

Readers would be seriously misled and ill-served not to hear that the Moore material is inferior, really, not good. We're talking a grade of C, a gentleman's B-.

Think of the good news: There's much, much better material long on the shelves of the research libraries, although rarely filed under computer science.

That's just the way it is. People here on HN should be aware of that situation.


These slide tutorials are excellent: engaging and friendly but still rigorous enough that they can be used as reference materials. They're a great companion to "Introduction to Statistical Learning" and "The Elements of Statistical Learning" by Hastie, Tibshirani, et al. The author of these tutorials is Andrew Moore, Dean of the School of Computer Science at Carnegie Mellon.


Thanks for mentioning those books! I'll check 'em out. I agree wholeheartedly with your assessment. I was really blown away by the clarity of the slides. Glad others can enjoy them too.



