The prior can generally only be understood in the context of the likelihood (arxiv.org)
94 points by selimthegrim on Aug 25, 2017 | 20 comments


Nice article, in the sense that I would assign it to students, although I think it presents something of a strawman and doesn't really introduce anything that hasn't been discussed at length elsewhere.

It provides a nice introduction to the types of objective priors, hinting at their advantages and disadvantages, and a nice discussion of why priors matter. Incidentally, I agree that "subjective" and "objective" are poor labels for priors--better would be something like "estimand-predictive" and "inferential-property" priors, respectively.

I'm not really sure why they suggest objective priors violate the assumption that the prior not use information from the data. The model, and thus many objective priors, can be specified given the design, without any knowledge of what the data looks like. This is the strawman part.

I get the sense Gelman has been wrestling with objective priors the last few years.


Odd, because I read the abstract of the paper as a claim for a fundamental advance in statistics: systematic and objective determination of priors in Bayesian analysis. I can't speak to whether the paper succeeds in delivering this, as I am very dense and it takes me weeks to understand things, but if they have identified mechanisms that allow complex priors to be constructed that minimize overfitting:

1. I will do a little dance

2. I will learn how to construct such priors

3. I will attempt to apply this in practice to see what happens.

I don't see overfitting risk as a strawman; I see it as a nasty business that means production models can't be trusted.


A lot of what they discuss is in the literature on reference priors, if not in other literature on objective priors as well.

It's a little complex for a comment on HN, but IMHO the best formalization of overfitting is in the literature on minimum description length and the related information-theoretic literature (https://en.wikipedia.org/wiki/Minimum_description_length ; the Wikipedia page is a little off on some things, but the general points are probably about right).
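To make the two-part-code idea concrete, here is a toy sketch of my own (essentially BIC, which is a coarse approximation to the NML code length): choose the polynomial degree that minimizes parameter bits plus residual bits.

    # Toy two-part MDL for choosing a polynomial degree (a sketch).
    # Total description length ~ (k/2)*log2(n) bits for k parameters
    # plus (n/2)*log2(RSS/n) bits to encode the residuals (Gaussian code).
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    x = np.linspace(-1, 1, n)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.3, n)  # true degree 2

    def description_length(degree):
        coeffs = np.polyfit(x, y, degree)
        rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
        k = degree + 1
        return 0.5 * n * np.log2(rss / n) + 0.5 * k * np.log2(n)

    print(min(range(1, 10), key=description_length))  # tends to print 2

A high-degree fit drives the RSS down but pays for it in parameter bits, which is the overfitting trade-off stated in code-length terms.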

The relationship between MDL/NML and Bayesian statistics is a little complicated -- Barron, Roos, and Watanabe have a nice recent paper about it (https://arxiv.org/abs/1401.7116) -- but the short story is that there's a certain equivalence (or at least a very close relationship) between MDL/NML and Bayesian inference with reference priors (i.e., the "capacity-achieving prior" in IT parlance).

So Bayesian inference with reference priors minimizes risk of overfitting in a very technical minimax sense.

The problem is that reference priors are in general very difficult to construct, although that is changing rapidly, and they have been worked out for certain important cases (this paper provides an interesting new approach, with a nice overview of recent work: https://arxiv.org/abs/1704.01168). The Jeffreys prior is a form of the reference prior for models meeting certain constraints.
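For one concrete case: with a Bernoulli likelihood the Fisher information is I(theta) = 1/(theta(1-theta)), so the Jeffreys prior, proportional to sqrt(I(theta)), is the Beta(1/2, 1/2) density. A quick numerical check (my own sketch):

    # Jeffreys prior for Bernoulli(theta) is proportional to sqrt(I(theta)),
    # with Fisher information I(theta) = 1/(theta*(1-theta)); up to the
    # normalizing constant pi this is exactly the Beta(1/2, 1/2) density.
    import numpy as np
    from scipy.stats import beta

    theta = np.linspace(0.01, 0.99, 99)
    jeffreys_unnorm = np.sqrt(1.0 / (theta * (1.0 - theta)))
    ratio = jeffreys_unnorm / beta.pdf(theta, 0.5, 0.5)
    print(np.allclose(ratio, np.pi))  # True: same shape, constant ratio pi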

A lot of the issues Gelman et al. touch on have been written about in various places in the MDL/NML/reference prior/information theory literature.


> but IMHO the best formalization of overfitting is in the literature on minimum description length

It's always surprised me how few people know about MDL, considering it's pretty darn close to what we might consider a "universal predictor" (granted, the difficulty is that MDL in its general, Kolmogorov-complexity form is uncomputable). Even among the data scientists I know, very few understand what overfitting really is (and thus cross-validation and regularization are merely tools in what seems like the black-box "art" of model selection).

But the concepts of MDL/MML and Kolmogorov complexity are very deep and fundamental, to such an extent that I think the path to true AGI will rely much more heavily on algorithmic information theory than on neural networks.


Yeah, I have a similar reaction; I'm surprised MDL and algorithmic statistics aren't better known. As you say, it's very fundamental stuff. It seems so fundamental to me that I just sort of assume, without really thinking about it or questioning it, that it will eventually become more prominent.

My guesses as to why it's been slow to be adopted so far are that (1) it's relatively new, in the grand scheme of things, (2) certain things about it are really challenging to everyone, and (3) it has a certain perspective on inference that can be alien to a lot of people.


Also (4) it doesn't really work or make sense.

The data isn't necessarily representative of the domain theory. In a lot of domains it can't be, because the domain is so large that you can't capture the whole of it in a tractable training set - images, for example. Other data doesn't capture the domain because it was generated in a regime that isn't operating when the model is used - bull runs vs. bear runs in the markets, for example. Bayesian analysis is attractive because we can include informative priors that capture our knowledge that, in circumstances outside of the data, other determining behaviours exist. This is also one of the reasons deep networks can outperform support vector machines: deep networks can learn to prefer domain theories that are not the minimal descriptive one.

The other interesting thing about MDL is: where does the idea that the minimal theory is the right theory come from?

Most people say "oh, it's Occam's razor," but where did Occam's razor come from - who was Mr (Fr.) Occam? Well, William of Ockham was a 14th-century Franciscan philosopher at Oxford, working in the tradition of Scotus, and the razor was pressed into service in arguments about the Trinitarian God... and this is why we prefer the idea that "entities must not be multiplied beyond necessity": it lets you say you have a Trinity because the Universe ABSOLUTELY cannot work without it, and that's why you have three and not two and not four.

I am happy with all this, but why should we think it's a good way to do machine learning? After all, there are lots of examples of theories that are simple but don't work as well as complex alternatives - Newtonian gravity vs. general relativity is a good one.


He (and the Stan team) has been my main source of 'default priors' for when I don't have a strong opinion and am mainly looking for regularization-style properties (most of the time).


Many of the Gelman-style default "objective" priors aren't so hard to interpret as "I am skeptical that this parameter is non-zero".


Gelman has been pretty adamant about the idea that there is probably always some real correlation between almost any two measurements (that is why he advocates his type M and type S error concepts).

Perhaps you are confusing "skeptical the parameter is far from zero" with "skeptical the parameter is non-zero". It makes a big difference.
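One way to see the "far from zero" reading (a sketch of my own, not Gelman's or the Stan team's code): a normal(0, tau^2) prior on a regression slope makes the MAP estimate a ridge estimate, pulled toward zero but essentially never landing exactly on it.

    # MAP estimation with a normal(0, tau^2) prior on the slope is ridge
    # regression: the posterior mode shrinks toward 0 but is never exactly 0.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    x = rng.normal(size=n)
    y = 0.1 * x + rng.normal(size=n)  # small but genuinely non-zero slope

    def map_slope(tau):
        # Closed form for one predictor with noise sd 1:
        # beta_map = sum(x*y) / (sum(x^2) + 1/tau^2)
        return np.sum(x * y) / (np.sum(x * x) + 1.0 / tau**2)

    for tau in (10.0, 1.0, 0.1):
        print(tau, map_slope(tau))  # shrinks toward, never onto, zero

With a tight prior (tau = 0.1) the estimate is heavily shrunk; with a diffuse one (tau = 10) it is close to least squares.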


> With the above prior and likelihood, the posterior for β is a product of independent Gaussians with unit variance and mean given by the least squares estimator of β. The problem is that standard concentration of measure inequalities show that this posterior is not uniformly distributed in a unit ball around the ordinary least squares estimator but rather is exponentially close in the number of coefficients to a sphere of radius 1 centered at the estimate.

Of course the posterior is not uniformly distributed in a unit ball (the density is higher at the center of the ball, and extends beyond the limits of the ball). But the fact that it is exponentially close to a sphere doesn’t have anything to do with that! If the posterior were uniformly distributed in a unit ball, it would also be exponentially close to a sphere.
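That both distributions concentrate on a thin shell is easy to check numerically; here's a quick sketch of mine (the dimension and radii are illustrative, not the paper's exact scaling):

    # In high dimension d, both a standard Gaussian and a uniform draw
    # from the unit ball concentrate near a thin shell: the Gaussian near
    # radius sqrt(d), the uniform ball near radius 1.
    import numpy as np

    rng = np.random.default_rng(2)
    d, m = 1000, 10000

    gauss = rng.normal(size=(m, d))
    print(np.std(np.linalg.norm(gauss, axis=1) / np.sqrt(d)))  # ~0.02

    # Uniform in the unit ball: uniform direction, radius U^(1/d).
    dirs = gauss / np.linalg.norm(gauss, axis=1, keepdims=True)
    ball = dirs * rng.uniform(size=(m, 1)) ** (1.0 / d)
    print(np.mean(np.linalg.norm(ball, axis=1) > 0.99))  # ~1.0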



> For a fully informative prior for δ, we might choose normal with mean 0 because we see no prior reason to expect the population difference to be positive or negative and standard deviation 0.001 because we expect any differences in the population to be small, given the general stability of sex ratios and the noisiness of the measure of attractiveness.

That’s a very strong prior. To put it into context (if I did my calculations right): assuming that every single child from very attractive parents in the study was a girl (let’s say 600 out of 600), while for the rest we have the expected distribution (i.e. 1150 girls out of 2400), the estimate of the difference would still be only 0.1%.
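For reference, the normal-normal shrinkage arithmetic behind that number (the 600/600 and 1150/2400 figures are the hypothetical above; I'm assuming p = 0.5 binomial variance for the standard error):

    # Posterior mean under normal-normal conjugacy:
    #   post_mean = d * tau^2 / (tau^2 + se^2)
    d = 600/600 - 1150/2400      # observed difference, ~0.521
    se2 = 0.25/600 + 0.25/2400   # sampling variance of the difference
    tau2 = 0.001**2              # prior variance (sd 0.001)
    print(d * tau2 / (tau2 + se2))  # ~0.001, i.e. about 0.1%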


> Stein's trick was to notice that the point µ = 0 has the property that if y is sufficiently close to it,

It’s not clear from this presentation, but shrinkage can be done toward any arbitrary point, not just toward the origin.
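A minimal sketch of that point (my own, not from the article): the James-Stein estimator with the shrinkage target nu left as a free parameter. Any fixed target dominates the raw MLE in average risk when d >= 3, but a target near the truth helps far more.

    # James-Stein shrinkage toward an arbitrary point nu, for y ~ N(mu, I)
    # in dimension d >= 3:  mu_hat = nu + (1 - (d-2)/||y - nu||^2)(y - nu)
    import numpy as np

    def james_stein(y, nu):
        resid = y - nu
        return nu + (1.0 - (len(y) - 2) / np.dot(resid, resid)) * resid

    rng = np.random.default_rng(3)
    d, reps = 20, 2000
    mu = rng.normal(5.0, 1.0, size=d)   # true means sit near 5, not 0
    targets = {"mle": None, "js_0": np.zeros(d), "js_5": np.full(d, 5.0)}
    risks = {name: 0.0 for name in targets}
    for _ in range(reps):
        y = mu + rng.normal(size=d)
        for name, nu in targets.items():
            est = y if nu is None else james_stein(y, nu)
            risks[name] += np.sum((est - mu) ** 2) / reps
    print(risks)  # mle > js_0 > js_5: a target near the truth helps most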


Bayes assumes/requires conditional independence of observations, which is only sometimes the case.

For example:

- Are the positions of the Earth and the Moon conditionally independent? No.

- In the phrase "the dog and the cat", are "and" and "the" independent? No.

- In a biological system, are we to assume conditional independence? We should not.

https://en.wikipedia.org/wiki/Conditional_independence
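To make that first point concrete: the standard Bayesian update multiplies in one likelihood factor per observation, which is only licensed when the observations are conditionally independent given the parameter. A minimal Beta-Bernoulli sketch (my own illustration):

    # The textbook update assumes p(x_1..x_n | theta) = prod_i p(x_i | theta).
    # Beta-Bernoulli: under that assumption the posterior depends on the
    # data only through the counts of successes and failures.
    data = [1, 0, 1, 1, 0, 1, 1, 1]   # coin flips
    a, b = 1.0, 1.0                   # Beta(1, 1) prior
    for x in data:                    # one factor per observation
        a, b = a + x, b + (1 - x)
    print(a, b, a / (a + b))          # posterior Beta(7, 3), mean 0.7

If the flips were serially correlated, that factorization, and hence the update, would be wrong; you would need a likelihood that models the dependence (a Markov chain, say).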

...

"Efficient test for nonlinear dependence of two continuous variables" https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4539721/

- In no particular sequence: CANOVA, ANOVA, Pearson, Spearman, Kendall, MIC, Hoeffding


From https://plato.stanford.edu/entries/logic-inductive/ :

> It is now generally held that the core idea of Bayesian logicism is fatally flawed—that syntactic logical structure cannot be the sole determiner of the degree to which premises inductively support conclusions. A crucial facet of the problem faced by Bayesian logicism involves how the logic is supposed to apply to scientific contexts where the conclusion sentence is some hypothesis or theory, and the premises are evidence claims. The difficulty is that in any probabilistic logic that satisfies the usual axioms for probabilities, the inductive support for a hypothesis must depend in part on its prior probability. This prior probability represents how plausible the hypothesis is supposed to be based on considerations other than the observational and experimental evidence (e.g., perhaps due to relevant plausibility arguments). A Bayesian logicist must tell us how to assign values to these pre-evidential prior probabilities of hypotheses, for each of the hypotheses or theories under consideration. Furthermore, this kind of Bayesian logicist must determine these prior probability values in a way that relies only on the syntactic logical structure of these hypotheses, perhaps based on some measure of their syntactic simplicities. There are severe technical problems with getting this idea to work. Moreover, various kinds of examples seem to show that such an approach must assign intuitively quite unreasonable prior probabilities to hypotheses in specific cases (see the footnote cited near the end of section 3.2 for details). Furthermore, for this idea to apply to the evidential support of real scientific theories, scientists would have to formalize theories in a way that makes their relevant syntactic structures apparent, and then evaluate theories solely on that syntactic basis (together with their syntactic relationships to evidence statements). Are we to evaluate alternative theories of gravitation (and alternative quantum theories) this way?


>"This prior probability represents how plausible the hypothesis is supposed to be based on considerations other than the observational and experimental evidence (e.g., perhaps due to relevant plausibility arguments)."

I guess I don't know how "Bayesian logicism" differs from "Bayesian probability", but this is totally false in the latter case. The prior is just supposed to be independent of the current data (e.g. devised before it was collected). In practice, information almost always leaks into the prior and model via tinkering. That is why a priori predictions are so important for proving you are onto something.


Bayesian logicism is the logic derived from Bayesian probability.

Magic numbers are an anti-pattern: which constants take which values, and why, should be justified, OR it should be shown that a non-expert-biased form converges regardless.


The use of the term "prior probability" in that paragraph is not consistent with its use in Bayesian probability, so something is wrong.

Also, I am not sure what magic numbers you are referring to.


Arbitrary priors are magic numbers.

Is there a frequentist statistic that can be used in a deterministic function to determine which arbitrary priors to use?


What does Bayes say when we swap A and B?



