Statistics vs. Machine Learning (brenocon.com)
59 points by Jagat on Dec 16, 2012 | 27 comments



From one of the comments in the article:

"Economists, of course, broke away hard from mainline stats a while ago, calling it “econometrics” and reinventing names for EVERYTHING, plus throwing in a bizarre obsession with the method of moments. In terms of intellectual arrogance and needless renaming/duplication, economists are much worse than computer scientists and engineers."

This guy couldn't be more right about [most] [academic] economists. I laughed at the renaming of terms part because it is so true (classic example is RSS/SSR/ESS/SSE). Taking a course in the stats department alongside an econometrics course was bound to confuse any student. But just ask the economist, they'll tell you which one is right. :)


> I laughed at the renaming of terms part because it is so true (classic example is RSS/SSR/ESS/SSE). Taking a course in the stats department alongside an econometrics course was bound to confuse any student.

Come on, this is silly. Calling things RSS vs. SSR might confuse an undergrad, but I've taken both classes (stats intro to linear regression + econ intro to linear regression) at the same time and saw barely a difference in the material. The stats class had a greater emphasis on finite-sample properties under normality and the econ class on asymptotics. Not a huge change, and the differences were complementary.

We both know that the "bizarre obsession with [generalized] method of moments" is because moments come out of models with rational agents, not distributions.


I've taken PhD Econometrics where we touched zero data and it was 100% theoretical over the 9 months, so I don't think we are talking about the same thing. I don't like spewing out economically technical nonsense on HN, so I wasn't going to go into the method of moments part of that comment as it applies to the current state of econometric theory.

In econometric theory, method of moments comes up most prominently in the form of the generalized method of moments (GMM). The idea behind GMM is one in competition with maximum likelihood, the point being that GMM doesn't force you to make arbitrary assumptions about the true probability distribution of the data purely to implement the model. This is obviously attractive, because then the results of our model won't be jeopardized just because one of our assumptions was false. In other words, GMM provides a way to estimate the parameters of a model without making assumptions about the population.

Oh, and the topic of rational agents is not relevant here. This is a purely statistical/philosophical argument.

But there goes the economist in me again... I'm sorry.


I've taken just as many stats classes that don't touch data; I don't think either of us wants to argue pedagogy on HN (fuck, I didn't even want to spell it and it's likely I didn't), but I don't think that there's a huge difference in how econ and stats departments teach the same material or use terminology. (There are big differences in the material selected, obviously.)

This statement can't be true: "GMM provides a way to estimate the parameters of a model without making assumptions about the population." You need assumptions, just different assumptions. Frequently those assumptions involve agent rationality, but not always (after all, MLE is a special case of GMM).


"MLE is a special case of GMM"

You have it backwards. MLE is actually a special case of GMM.

I'm done here.


You seem to have just repeated the quote? Presumably you mean "GMM is a special case of MLE"?


The mistake is on my side, I apologize. I read what the poster said backwards myself. :) The repeated quote is a true statement: MLE is a special case of GMM.


GMM requires a weights matrix, which implicitly does the same thing as distributional assumptions do in MLE (and you can replicate most MLE models in GMM by an appropriate choice of weights matrix).
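
To spell out why (my own sketch of the standard argument, not the parent's wording): take the MLE score as the vector of moment conditions,

    \[
      g_i(\theta) = \nabla_\theta \log f(y_i \mid x_i; \theta),
      \qquad
      \frac{1}{n}\sum_{i=1}^{n} g_i(\hat\theta) = 0 .
    \]

There are exactly as many conditions as parameters, so the system is exactly identified and any positive-definite weights matrix yields the same solution: the maximum likelihood estimator.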


As a Bayesian, this sounds very intriguing. Could you expand upon GMM/provide a good reference?


Good reference: For most things in econometrics, the publicly available lecture notes by Jeff Wooldridge and Guido Imbens are excellent. I don't recall their GMM notes specifically, but that'd be a good place to start (just google for it).

Super-quick explanation: Write down a model, and a bunch of conditions that should be true at the correct parameter values.

As a trivial example, the conditions might be that the errors/residuals are orthogonal to the explanatory variables. Put mathematically,

    E[X' * epsilon] = 0

where X is the matrix of explanatory data and epsilon is your vector of errors. Since the errors are a function of the parameters (beta), you can find the optimal parameter values by solving your moment conditions.

You frequently have more conditions than parameters, so the conditions can't all be true at once. Then the computer tries to get the conditions as close to true as possible... where closeness is defined by how you weigh the importance of each individual condition.

Ideally the conditions arise transparently from the model. Executed poorly, it can seem ad-hoc.
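
To make that concrete, here is a minimal sketch in Python/NumPy (my illustration, not from the comment above; all names are made up): it uses the sample analogue of E[X' * epsilon] = 0 as the moment conditions and minimizes a weighted quadratic form in them. With an identity weights matrix and exactly as many conditions as parameters, this just reproduces OLS.

    import numpy as np
    from scipy.optimize import minimize

    def gmm_objective(beta, X, y, W):
        # sample analogue of the moment conditions E[X' * epsilon] = 0
        g = X.T @ (y - X @ beta) / len(y)
        # "closeness to zero" of the moments, weighted by W
        return g @ W @ g

    # toy data
    rng = np.random.default_rng(0)
    n = 500
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(size=n)

    W = np.eye(X.shape[1])   # identity weights; exactly identified here
    res = minimize(gmm_objective, x0=np.zeros(3), args=(X, y, W))
    print(res.x)             # close to beta_true (and to the OLS estimate)

With more conditions than parameters, the choice of W is what decides which conditions get priority, which is the point about the weights matrix above.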

A quick google search should give better explanations than my response in the comments :)


GMM's popularity among econometricians isn't motivated by anything related to rational agents. GMM is popular in econometrics because it is a natural expression of IV estimators.

IV regressions aren't popular among statisticians or machine learnists, so this isn't an issue for them.

Even in structural econometrics (e.g. BLP), they are using moments to deal with endogeneity... the rationality angle is a red herring.


I'm mostly familiar with macro and finance applications; I'll take your word for it that it's not strictly coming from agent rationality in other subfields. Perhaps I should have said "unforecastability", which would have covered natural experiments and IV as well.


Oops. I wasn't thinking about macro when I disagreed with your earlier claim. Your explanation sounds consistent with my distant recollection of the field (though I'd also take your word for it.)

I shouldn't have disagreed so strongly in the first place.


Maybe the fact that moment conditions are often first order conditions for some agent's optimization problem is why rationality was mentioned, but I'm just speculating.


That may have been why it was mentioned, but within micro applications, that is getting the relationship flipped.

Micro models estimated from agents' optimization problems are more frequently estimated using MLE (John Rust's GMC bus paper being a canonical example, though this is still true in the current literature, e.g. Nevo, or Bajari and Hong).

Micro models estimated from aggregated data frequently lack agent-level optimization, and those are the models more frequently estimated with GMM (examples here would include the hundreds of papers based on Berry, Levinsohn and Pakes).

So, this explanation doesn't seem to hold in micro contexts.

Though, as the previous commenter pointed out, macro is quite a bit different, and your explanation is probably correct there.


Philipp K. Janert, author of «Data Analysis with Open Source Tools», spends a few pages explaining how he perceives this «difference».

From his point of view, Machine Learning is a fake science: fragile, secretive, and overly specific techniques for big problems, which depend on secret parameters that have never been published. Those parameters will be supplied to you, for a price, by the inventor-researchers' companies.

On the other hand, statistics is real science, where everything is published and studied by a whole community. It is a science that has accumulated hundreds of years of experience and that offers all its knowledge in any university. The methods offered by statistics are broadly applicable, robust, and open.

And I think he has a point in this reasoning.

PS: Statistics works; ask anyone in the hard sciences. Its contributions have been essential to science over the last few centuries. Machine Learning was bashed (like old AI) because it never offered real solutions or helped advance our understanding of anything. Machine Learning is a tool, not a science, one that tries to cope with the limitations of our knowledge. That makes it a very convenient tool for engineers and problem solvers, as numerical methods are, but it also means that its results share the problems of numerical methods.


I disagree with a few things about your comment. Your criticism of machine learning feels off-base, and is too specific to describe such a wide field. Who sells parameters? What does that even have to do with whether it is a science? I can think of other fields where every detail of an experiment isn't spelled out in every paper.

It's hard to separate machine learning and statistics because so much of machine learning derives directly from statistics. Motivation is probably the most important distinction; machine learning is applied statistics. I'd say it's a mix of science (the scientific method plays a big part in model building for example), engineering, and math. Statistics is first and foremost a branch of mathematics, not science; the scientific method does not play a role in the vast majority of the field.


> machine learning is applied statistics

It really is hard to separate ML & statistics - any competent practitioner of ML appreciates the statistical achievements that made Machine Learning methods possible. And statisticians must understand that to help automate decision-making systems, using learning methods/boosting is a viable option.

The debate around nomenclature (ML/stats/AI) seems limited to the academic community. Most data scientists I've met tend to accumulate a repertoire of tools from different fields, rather than side with either the Machine Learning or the Statistical community.


As a computer science major working in a top statistics department, I don't see such a sharp divergence. Statistics does not magically work well in the hard sciences, and machine learning has worked very well in many fields. I don't know if you consider biology a hard science, but biostatistics has many pitfalls that studies regularly fall into, and highly regarded statisticians can come to opposite conclusions on the same data.

Statistical theory is really applied probability, often called mathematical statistics; math majors would feel right at home. In fact, math majors look down on statistics majors for lack of rigor, and they have a point: two years of classes in a five-year PhD program isn't enough to perfect one's knowledge of measure theory. But the more important problem is that stats training leaves out enough computing that it hurts the work coming out of stats departments.

This gets back to the difference between machine learning and statistics. Machine learning research embraces all fields of engineering: approximation and estimation, numerical analysis, optimization, plus statistics. Since so many scientific advances can be attributed to computational improvements, it is natural that the more computationally oriented fields are ahead of the less computational ones. The LASSO has been all the rage in statistics recently, and it largely relies on work in convex programming. And the signal processing community in EE and CS is leagues ahead of statisticians in the sophistication and scale of the problems they can tackle. Computational statistics is an attempt to remedy the computational shortcomings of traditional statistics, but we have yet to see a visible impact in terms of high-impact work from statistics departments.

Having said all that, computer science departments do have the problem of not fully understanding the statistical foundations of machine learning methods. But this is not the case for CS at prestigious schools such as Berkeley, Stanford, and MIT. Work coming out of these places is theoretically sound yet application oriented. One only needs to look at NIPS papers to appreciate the breadth and depth of expertise in the machine learning community.

For a good read on bridging statistics and machine learning, see the paper by the inventor of random forests, Leo Breiman, "Statistical Modeling: The Two Cultures." It is a well-regarded paper by a renowned probabilist and statistician who cares about the utility of statistics as used in the real world.


> Fragile, secretive and specific techniques ... parameters will be supplied to you for a price by the inventors-researchers' companies

This seems a strange/dated/paranoid view of machine learning. Perhaps it was true historically (do you have any references?), but it doesn't ring true for the field as I've seen it these days.

Hyper-parameter selection can be tricky, and some papers do handwave about it when evaluating models, although this kind of flaw is increasingly picked up on by reviewers I think.

At any rate you'll find a lot of useful literature on hyper-parameter optimisation techniques especially for the more popular and general ML models. It's recognised as an important and interesting (albeit sometimes hard and fiddly) problem, not something to be swept under the rug, and not the stuff of conspiracies.


Most of this is true in a strict sense, and it's disappointing that it is presented in such a judgmental way.

Machine learning typically focuses on prediction. There are lots of business problems where prediction is the #1 goal, and ML is great in these circumstances.

Statistics typically focuses on understanding and summarizing data/findings. This is frequently closer to the needs of scientists.

Accepting that classical statistics has contributed more to science than machine learning doesn't make it better. That's like saying "A pipe-wrench is a better tool than a hammer, just ask any plumber."

The fields have a lot of similarities, but different use cases. Most claims that one is "better" come from a tight focus on specific use cases.


The site isn't loading for me, here's the text-only cache: http://webcache.googleusercontent.com/search?q=cache:http://...


When you remove the steering, both groups simply do statistical modelling. From my observations, the main difference is that stats people make simpler models and test well, but fall victim to the mathematical tendency to build on overly optimistic assumptions; CS people stay closer to the data, make more complex models, and do poor verification (but thus report better accuracies). Regardless of actual performance, stats has a traditional monopoly in biomedicine (minus the neuro* stuff) and CS in engineering.


Maybe a moderator could change the link title to highlight that it's from 2008. Not that the date necessarily makes it irrelevant, but it's not really "news".


I've taken courses from people in a few of these overlapping but historically-different camps recently, e.g.:

- Frequentist statisticians
- Bayesian statisticians
- Old-school AI researchers
- Statistical learning theorists
- Bayesian machine learners
- Engineers working on optimisation with noisy data
- Information retrieval folks
- ...

I'm really keen to see these guys starting to talk to each other and unify more of what they're doing around statistics (Bayesian and frequentist, parametric and non-parametric, generative and discriminative) as the common language and framework. Hopefully expanding the horizons of statistics a bit as a field in the process.

I imagine it'll take a while longer though, and some of the differences in terminology and talking-past-each-other can be a bit maddening in the meantime for those learning. What would be really nice would be a course offering a broad, well-rounded introduction to the various philosophies of modelling data: their histories, interactions and overlaps, differing goals, strengths and weaknesses. It can be hard to get a sense of this when most introductory courses are taught by someone who's implicitly from one camp or another, even if (as is usually the case) they're not overly ideological about it.

One criticism of (some, not all!) statisticians is that they can seem to have a strange and rather limiting fear of computation which leads them to give undue preference to computationally simple models, even when the dataset isn't big enough to make compute time an issue.

I can see an argument for pedagogical reasons why one would rather not teach (or learn!) the details of fiddly optimisation algorithms -- but having a basic literacy in optimisation can free you up to treat the algorithm for optimising your objective as a black box to some extent. Feeling less "guilty" about this (whether the compute time itself, or remembering all the details of the optimisation algorithm) can be quite freeing in allowing you to think about the modelling itself in a more powerful, modular framework. This seems to be where machine learning has gotten a big advantage.

On the other hand machine learners can be frustrating in the way they reinvent statistical terminology and methods. Also in a gratuitous tendency to skip the modelling stage and go straight to inventing different objective functions to optimise, or even straight to the algorithms. Leading to opaque black-box methods which (due to the lack of probabilistic motivation) make it harder to reason about uncertainty in a principled way.

With the increasing popularity of Bayesian machine learning I think this is less of an issue, though; it's bridging the gap between the two camps. One can also find a lot of more modern ML research using Statistical Learning Theory as a nice framework for principled frequentist analysis of their models.


Machine learning is more accurately described (in my view) as an interdisciplinary region that involves computation and statistics both. It's when you have to use statistics to push the bounds of computational work, or CS to tease out statistical relationships that require so much data that the size of the data itself is part of the problem.

When I think of "statistics", I think of problems where there are a few well-studied parametric approaches, in large part because there aren't enough data (in most cases) to do anything extremely complicated. If your data set is small and you need to build a generally adequate model, you can often use linear regression to get something good enough, and that may be the best you can do, because model simplicity is often valuable in its own right (low risk of overfitting).

Many of the more non-parametric machine learning algorithms (e.g. neural nets) are computationally intense and often a bitch to debug. They work best in situations where (a) you have a lot of data, but (b) you have no idea what the structure of the relationship should be.

Parametric models perform extremely well when the structure is known. You have inputs X1, X2, X3 and Y, and you know that a linear model will work well. You fit one, and it captures 65% of the variance. Good. That may be enough.

What do you use, however, if 65% isn't good enough, or if there's a special case that becomes economically important? You can look at the residuals. They may be random noise. If they're truly "random", then the linear model is the absolute best you can get. That may not be the case, however. There might be a region of the X2-X3 space where atypically high Y-values occur for a structural reason. These atypical points may be connected to an X4 that would otherwise be discarded (because it had no general correlation with Y).
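
As a rough illustration of that residual-inspection step (a toy example with made-up data and a hypothetical hidden interaction, nothing from a real dataset): fit the linear model, then check whether residuals are systematically large in one corner of the X2-X3 plane.

    import numpy as np

    # toy data with a hidden interaction the linear model misses
    rng = np.random.default_rng(1)
    n = 2000
    X = rng.normal(size=(n, 3))        # columns: X1, X2, X3
    y = (1.0 + 2.0 * X[:, 0] - X[:, 1]
         + 3.0 * (X[:, 1] > 1) * (X[:, 2] > 1)   # structural bump in one region
         + rng.normal(size=n))

    # ordinary least squares on the linear terms only
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta

    # mean residual in the "suspect" corner of X2-X3 space vs. everywhere else
    corner = (X[:, 1] > 1) & (X[:, 2] > 1)
    print(resid[corner].mean(), resid[~corner].mean())   # a large gap hints at missing structure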

I think that machine learning approaches start to be worth their additional complexity in these cases where (a) there are important relationships yet to be discovered-- no one knows about them!-- and (b) those are, or might be, economically critical.


We can prevent overfitting by adding a regularization term to the expression we are minimizing, such as the sum of the squares of all coefficients scaled by a factor.

Also, if you do testing (using a separate test dataset to determine how well your model works on unseen inputs) you can determine if you are overfitting (learning even the noise present in the data - which is detrimental) or underfitting (not learning enough from the data - which is detrimental, too). In the end it's a sweet spot, and many times the features number in the hundreds or thousands, so you can't analyze by hand.
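
A minimal sketch of both ideas (an L2/ridge-style penalty plus a held-out test set), using NumPy only; the data and the penalty weights tried in `lam` are purely illustrative, not recommendations.

    import numpy as np

    rng = np.random.default_rng(2)
    n, p = 200, 50
    X = rng.normal(size=(n, p))
    beta_true = np.zeros(p)
    beta_true[:5] = 1.0                  # only a few features actually matter
    y = X @ beta_true + rng.normal(size=n)

    # split into train and test to detect over/underfitting
    X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

    def ridge_fit(X, y, lam):
        # minimize ||y - X b||^2 + lam * ||b||^2 (closed form)
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    for lam in [0.0, 1.0, 100.0]:
        b = ridge_fit(X_tr, y_tr, lam)
        mse_te = np.mean((y_te - X_te @ b) ** 2)
        print(lam, round(mse_te, 3))     # too small -> overfit; too large -> underfit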

Automatic feature selection and disentangling is an amazing new advancement that came 7 years ago with the deep learning papers. Watch lectures on Restricted Boltzmann Machines by Geoffrey Hinton and Andrew Ng for this. It's what allowed Google to achieve the best speech recognition and image recognition results ever recorded.





