> I wouldn't consider estimating, say, the mean length of a population of fish contrived (unbiased estimate: x-bar). Nor would I consider estimating the probability of an event based on observations of the event happening or not happening contrived (unbiased estimate: p-hat = #successes/#trials).
Sure, maybe not contrived; my point is that flat priors may work in many "typical" textbook stats problems, but they are one of many choices, and that choice is important to be explicit about and not sweep under the rug. Because if your entire life is measuring sample means, fine, you're never going to need to think about this very much and life will be nice. But when one fine day you decide to do something more complex, these are the land mines that you shouldn't really ignore.
> These kinds of simple estimation problems and the associated statistical tests account for probably 90% of statistical practice. Dismissing them as contrived is silly.
Whether it's 90% is totally dependent on the types of problems you do. I don't mean to dismiss them; you're right that for many problems MLE is just fine. I meant to illustrate that "unbiased" comes with many caveats, and that in many real scenarios flat priors are not ok.
> More generally, MLE estimates are always (under regularity conditions) asymptotically unbiased even if not unbiased for a finite sample. This means that the amount of bias decreases to zero as the sample size increases, no matter what the parameterization is.
Is this not true of the MAP for most priors? Gaussian/Laplace priors will have this property too, since priors become asymptotically less important the more data you have. If your prior is zero over some of the support, you're out of luck but this doesn't strike me as a good argument for MLE > MAP or for using flat priors everywhere. When we have infinite data, sure, priors are irrelevant, but we live in the real world where data is not infinite.
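For what it's worth, here's a minimal sketch of that asymptotic point, under toy assumptions I'm making up (Normal data with known variance, a conjugate Gaussian prior deliberately centered in the wrong place): the MAP's pull toward the prior mean shrinks like 1/n, just as you'd expect.

```python
import numpy as np

# Toy sketch (made-up numbers): x_i ~ N(theta, sigma^2) with sigma known and a
# Gaussian prior theta ~ N(mu0, tau^2). The MAP estimate is a precision-weighted
# average of the sample mean and the prior mean, so its pull toward mu0 decays
# like O(1/n) -- the same asymptotics described above for the MLE.
rng = np.random.default_rng(0)
theta_true, sigma = 2.0, 1.0
mu0, tau = 0.0, 0.5   # a deliberately "wrong" prior

for n in (10, 100, 10_000):
    x = rng.normal(theta_true, sigma, size=n)
    mle = x.mean()
    map_est = (n * mle / sigma**2 + mu0 / tau**2) / (n / sigma**2 + 1 / tau**2)
    print(f"n={n:6d}  MLE={mle:.4f}  MAP={map_est:.4f}  pull={mle - map_est:+.4f}")
```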
> Finally, there is very often a natural parameterization for any given problem. If you're interested in the arithmetic mean of a population, there's no reason to use a log-scale parameterization.
Sure, agree that parametrization isn't a problem a lot of the time, but it is something important to be mindful of, and this points towards, again, not forgetting that you are always using a prior and that you should think about whether or not that prior makes sense.
> Why worry about bias in other parameterizations when you can just use the natural parameterization, where the estimator is unbiased? Again, I don't think such scenarios are contrived: a very large proportion of statistical analyses deal with simple measurements in Euclidian (or very nearly Euclidian--we can typically ignore, for example, relativistic effects) spaces: real world dimensions, time, etc.
Yea, I mean sure: for easy problems, parametrization is obvious. That's kind of tautological. But sometimes it's not obvious, or sometimes for computational reasons you need to work with a log(theta) instead of theta, etc. If you're a frequentist and you're thinking life is great because you don't need to worry about priors, you're wrong and sooner or later you will get into trouble; be it a parametrization issue or something else, priors are not just something you can completely ignore. It's like saying "I always drive without looking in my rearview mirror" -- ok, great, you will be fine a lot of the time, but eventually one day you will change lanes on the highway at the exact wrong time, and you will really regret your habit of not looking in your mirror.
> If you're a Bayesian and very concerned about parameterization effects you can also use a Jeffreys prior, which is parameterization-invariant. Notably, for the mean of a Normal distribution, the Jeffreys prior is... the flat prior!
Yep, totally agree, I have no problem with Jeffreys priors (when they make sense), and that's all well and good. Just to clarify: I am not saying "don't use flat priors" -- flat priors are extremely reasonable and a good idea in many cases. My point is that flat priors are still priors, and you are still making a statement by using them: "let's assume all possible values of theta are equally likely a priori". Sometimes we don't really believe that, but it's useful to see the implications of making this assumption. And sometimes priors are extremely important (e.g. we want a time-dependent measurement of a Poisson rate, like conversions per dollar of ad spend, and conversions are relatively rare: priors are your friend here, e.g. a GP prior = Cox process or something else, even if this prior is an operational assumption).
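To make the rare-event point concrete, here's a stand-in example I'm making up (not the Cox-process setup above, just the simplest conjugate Gamma-Poisson case): with only a couple of observed events, the choice of prior visibly moves the posterior; with thousands of events it would barely matter.

```python
from scipy import stats

# Hypothetical numbers: 2 conversions observed over $1,000 of ad spend.
# Posterior for the Poisson rate under an (improper) flat prior on the rate
# vs. a weakly informative Gamma(shape=0.5, rate=500) prior.
events, exposure = 2, 1_000.0
flat_post = stats.gamma(a=events + 1, scale=1 / exposure)
info_post = stats.gamma(a=events + 0.5, scale=1 / (exposure + 500))
print("posterior mean (flat prior):        ", flat_post.mean())
print("posterior mean (informative prior): ", info_post.mean())
print("95% interval (flat prior):          ", flat_post.interval(0.95))
print("95% interval (informative prior):   ", info_post.interval(0.95))
```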
> Yes and no. The Bayesian and frequentist approaches answer different mathematical questions, but they are used by humans to answer the same human questions, such as "do these two populations have the same mean?"
Yes, agreed.
> Indeed this is one of the primary reasons Lindley's paradox arises: the Bayesian model comparison (using marginal likelihoods or Bayes factors) gets tricked by the diffuse prior, while the frequentist model comparison (using null hypothesis testing) does not.
Ah lord, but this is a terrible justification for using null hypothesis rejection: we're almost always choosing a very simplistic distribution (e.g. Gaussian) to do this, and reducing the question to "we reject H0 because it's very unlikely" is part of the reason why there's a replication crisis in, e.g., the social sciences: researchers are taught this simplistic picture without any of the necessary nuance ('here are the assumptions we make, and under these assumptions + H0, it is a little bit unlikely that we would have observed x'). That's a recipe for disaster. Is it not much better to discuss the full posterior, "degrees of belief" and to be explicit about all of our uncomfortable prior assumptions? I prefer Bayesian model selection over null hypothesis rejection 100% of the time, especially because "Bayesian model selection" is the only logical way to do model selection, the only caveat is that it depends on reasonable prior assumptions and these are the hard part (but again, at least it is explicit!).
Also, the Lindley's "paradox" example certainly seems contrived: we believe there's a 50% chance that p = 0.5 exactly?? I just don't understand that type of analysis. Come up with a prior, derive your posterior, and decide the answer to your question yourself (what is the chance that p=0.5 exactly? Well, it's exactly 0%. How much more likely is it that p=0.5036 vs p=0.5? That's a better question...). By contrived, I mean that it appears designed to exploit the fact that Bayesian stats will automatically prefer simpler models, especially one with 0 degrees of freedom that is relatively close to the right answer, but that's a Good Thing (TM).
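For anyone following along, here's roughly the setup being argued about, with counts I've made up to land in the interesting regime: the null-hypothesis test rejects p = 0.5, while the point-mass-plus-uniform prior produces a Bayes factor that favors p = 0.5.

```python
from scipy import stats

# Hypothetical data chosen to sit in the Lindley regime: large n, p-hat near 0.5.
# H0: p = 0.5 exactly (prior mass 0.5); H1: p ~ Uniform(0, 1) (prior mass 0.5).
n, k = 100_000, 50_400   # made-up counts, p-hat = 0.504

# Frequentist two-sided test of p = 0.5:
z = (k - n * 0.5) / (n * 0.25) ** 0.5
p_value = 2 * stats.norm.sf(abs(z))

# Bayes factor: under a Uniform(0, 1) prior the marginal likelihood is 1/(n+1).
bf_01 = stats.binom.pmf(k, n, 0.5) * (n + 1)

print(f"p-value = {p_value:.4f}   (rejects H0 at the 0.05 level)")
print(f"BF_01   = {bf_01:.1f}      (evidence in favor of H0)")
```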
Both frequentist stats and Bayesian stats are easy to abuse: Bayesian stats gives a false sense of comfort because people don't worry enough about their choice of prior, but at least Bayesian stats is explicit about the prior! I won't say that hypothesis testing is complete garbage, but it is quite dangerous, and frankly dishonest, to reduce things to a p-value and pretend that's the end of the discussion.
> But when one fine day you decide to do something more complex, these are the land mines that you shouldn't really ignore.
> in many real scenarios flat priors are not ok.
> eventually one day you will change lanes on the highway at the exact wrong time, and you will really regret your habit of not looking in your mirror.
Can you give some examples where frequentists hit these alleged flat-prior landmines? I am admittedly a Bayesian by training, not a frequentist, so perhaps it's just my ignorance showing, but I'm not aware of any such situations.
Frequentist statistics generally relies on performance guarantees (bounds on the false positive error rate for tests, in particular, and coverage for confidence intervals) which are derived under the lack-of-explicit-prior, so as far as I can tell they should be doing fine. I'd be interested in seeing examples where frequentist analyses fail because of the (implicit) flat prior.
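As a quick illustration of the kind of guarantee I mean (just a toy simulation with made-up parameters): the textbook 95% t-interval for a Normal mean hits its nominal coverage, and no prior appears anywhere in the construction.

```python
import numpy as np
from scipy import stats

# Toy coverage check: 95% t-intervals for the mean of N(mu, 4), n = 20 per sample.
rng = np.random.default_rng(1)
mu, n, trials = 3.0, 20, 20_000
covered = 0
for _ in range(trials):
    x = rng.normal(mu, 2.0, size=n)
    half = stats.t.ppf(0.975, df=n - 1) * x.std(ddof=1) / n**0.5
    covered += (x.mean() - half) <= mu <= (x.mean() + half)
print(f"empirical coverage: {covered / trials:.3f}")   # should be close to 0.95
```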
> we're almost always choosing a very simplistic distribution (e.g. Gaussian) to do this
The Gaussian distribution is a marvelous thing. The central limit theorem is, in my humble opinion, one of the most beautiful and surprising results in mathematics.
> Is it not much better to discuss the full posterior, "degrees of belief" and to be explicit about all of our uncomfortable prior assumptions?
Perhaps I'm just cynical, but I'd say probably not. A Bayesian decision process is still a decision process and still subject to all the problems that the frequentist decision process (null hypothesis significance testing) is subject to: inflated family-wise error rates, p-hacking (except with Bayes factors rather than p-values), publication bias, and so on. At best getting everyone to do Bayesian analyses might be roughly equivalent to getting everyone to use a lower default significance threshold, like 0.005 instead of 0.05 (which prominent statisticians have advocated for).
> I prefer Bayesian model selection over null hypothesis rejection 100% of the time, especially because "Bayesian model selection" is the only logical way to do model selection, the only caveat is that it depends on reasonable prior assumptions and these are the hard part (but again, at least it is explicit!).
Sadly there's a trap in Bayesian model selection (often called Bartlett's paradox, though it's essentially the same thing as Lindley's paradox) which can be difficult to spot. No names out of respect to the victim, but several years ago I saw a very experienced Bayesian statistician who has published papers about Lindley's paradox fall prey to this. Explicit priors didn't help him at all. He would not have fallen into it if he had used a frequentist model selection method, though there are other problems with that.
> Also, the Lindley's "paradox" example certainly seems contrived:
And here we are again calling a statistical test that thousands of people do every day "contrived." You already know how I feel about that.
Yes, it's a very simple example, because that helps illustrate what's happening. Lindley's paradox can happen in arbitrarily complex models, any time you're doing model selection.
> By contrived, I mean that it appears designed to exploit the fact that Bayesian stats will automatically prefer simpler models, especially one with 0 degrees of freedom that is relatively close to the right answer, but that's a Good Thing (TM).
Preferring simpler models is not exactly what's going on in Lindley's paradox, at least not the way that most people talk about Bayes factors preferring simpler models (e.g. by reference to the k*ln(n) term in the Bayesian Information Criterion). The BIC is based on an asymptotic equivalence and drops a constant term. That constant term is actually what is primarily responsible for Lindley's paradox, and has only an indirect relationship to the complexity of the model.
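For reference, here is a sketch of where that constant sits, using the standard Laplace expansion of the log marginal likelihood (k parameters, n i.i.d. samples, theta-hat the MLE, I_1 the per-observation Fisher information); this is just a restatement of a textbook approximation, not anything specific to the example above.

```latex
% Laplace expansion of the log marginal likelihood:
\[
\ln p(D \mid M) \;\approx\;
  \underbrace{\ln p(D \mid \hat\theta) - \tfrac{k}{2}\ln n}_{\text{what BIC keeps}}
  \;+\;
  \underbrace{\ln p(\hat\theta) + \tfrac{k}{2}\ln 2\pi
              - \tfrac{1}{2}\ln\bigl|I_1(\hat\theta)\bigr|}_{\text{the $O(1)$ term BIC drops}}
\]
% For a diffuse prior of width tau, \ln p(\hat\theta) behaves like -\ln\tau per
% parameter, so the dropped term can be made arbitrarily negative: the
% Lindley/Bartlett mechanism.
```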
> Can you give some examples where frequentists hit these alleged flat-prior landmines? I am admittedly a Bayesian by training, not a frequentist, so perhaps it's just my ignorance showing, but I'm not aware of any such situations.
You're probably right: I myself am also a Bayesian by training, as you can probably guess, but I went through the usual statistics education from a frequentist standpoint, and once I learned Bayesian statistics it was almost an epiphany: much more intuitive and understandable than the frequentist interpretation (but that's just me). In all honesty, I think good frequentist statisticians and good Bayesian statisticians have nothing to worry about, since both should know exactly what they are doing and saying, as well as the limitations of their analysis.
I wouldn't put myself in either the "good frequentist" or "good Bayesian" categories, by the way; I am just an imperfect practitioner, and I think that's the case for most people. My argument against frequentist statistics for the masses is a practical one: I found myself getting into much more trouble, and having much less insight into what I was doing, with a frequentist background than I do when working from a Bayesian standpoint, and I see many imperfect frequentist statisticians like myself running into the same problems I used to (mostly ignoring priors when they shouldn't, or thinking a flat prior is always uninformative, etc.), though I admit that is a wholly subjective experience. I never once thought about priors before learning Bayesian stats, and I find that many people I meet with a frequentist background also forget the significance of priors, because they too are imperfect practitioners.
> Frequentist statistics generally relies on performance guarantees (bounds on the false positive error rate for tests, in particular, and coverage for confidence intervals) which are derived under the lack-of-explicit-prior, so as far as I can tell they should be doing fine. I'd be interested in seeing examples where frequentist analyses fail because of the (implicit) flat prior.
Yea, I totally agree, I just find that statistics is important in many more contexts than just this. While you can do this sort of thing from a Bayesian perspective (using Jeffreys priors or whatever the situation calls for), in my experience frequentists have a tough time departing from this type of analysis once they start diving into areas where priors are important (unless they are also familiar with Bayesian stats!).
> The Gaussian distribution is a marvelous thing. The central limit theorem is, in my humble opinion, one of the most beautiful and surprising results in mathematics.
Agree with you, but the CLT doesn't always help you. You may not always be interested in the statistics of averages in the limit of many samples. When you are, I agree the CLT is a godsend.
> Perhaps I'm just cynical, but I'd say probably not. A Bayesian decision process is still a decision process and still subject to all the problems that the frequentist decision process (null hypothesis significance testing) is subject to: inflated family-wise error rates, p-hacking (except with Bayes factors rather than p-values), publication bias, and so on. At best getting everyone to do Bayesian analyses might be roughly equivalent to getting everyone to use a lower default significance threshold, like 0.005 instead of 0.05 (which prominent statisticians have advocated for).
I disagree here. Discussing the full posterior forces you not to reduce the analysis to a single number like a significance threshold, and to acknowledge that there is actually a wide range of possibilities for different parameter values; that matters when your posterior isn't nice and unimodal, etc. I don't disagree that sometimes (well, many times) the significance threshold is all you really care about (e.g. "is this treatment effective, yes or no"), but that's still a subset of where statistics is used in the wild. E.g. try doing cosmology with just frequentist statistics (actually, do not do that, you may be physically attacked at conferences).
But again, I want to emphasize that doing Bayesian stats can also give you a false sense of confidence in your results. I don't mean to say Bayesians are right and frequentists are wrong or anything; I just mean that sometimes priors are important and sometimes they aren't, and I personally find that I have an easier time understanding when to use different priors in a Bayesian framework than in a frequentist one.
> Sadly there's a trap in Bayesian model selection (often called Bartlett's paradox, though it's essentially the same thing as Lindley's paradox) which can be difficult to spot. No names out of respect to the victim, but several years ago I saw a very experienced Bayesian statistician who has published papers about Lindley's paradox fall prey to this. Explicit priors didn't help him at all. He would not have fallen into it if he had used a frequentist model selection method, though there are other problems with that.
Like you say, there are problems with both approaches, but my point is that when the prior is explicit, we can all argue about its effects on the result, or lack thereof. Explicit priors don't "help" you, but they force you to make your assumptions explicit and part of the discussion. If you're only ever using flat priors, it's easy to forget that they're there.
> And here we are again calling a statistical test that thousands of people do every day "contrived." You already know how I feel about that.
I don't mean to be flippant about it or dismissive, I mean exactly what I said:
contrived: "deliberately created rather than arising naturally or spontaneously."
Which test in Lindley's paradox are you referring to when you say thousands of people use it every day? Just the null rejection, or is there another part of it you're referring to?
> Yes, it's a very simple example, because that helps illustrate what's happening. Lindley's paradox can happen in arbitrarily complex models, any time you're doing model selection.
My point isn't that it's simple; my point is that it's incredibly awkward and unrealistic, and not representative of how a Bayesian statistician would answer the question "is p = 0.5?", which is a very strange question to begin with. The "prior" here treats it as equally likely that p = 0.5 exactly and that p != 0.5; if that's genuinely your assumption, fine, but my point is that it's a very bizarre and unrealistic one. Maybe it seems realistic to a frequentist, but not to me at all. If someone were doing this analysis, I would expect a weird answer to a weird question.
> Preferring simpler models is not exactly what's going on in Lindley's paradox,
Exactly! I'm not sure what is going on in Lindley's paradox to be honest; I don't understand the controversy here: the question poses a very strange prior that seems designed to look perfectly reasonable to a frequentist but not to a Bayesian. But I suppose this is an important point about the way priors can fool you!
> at least not the way that most people talk about Bayes factors preferring simpler models (e.g. by reference to the k*ln(n) term in the Bayesian Information Criterion). The BIC is based on an asymptotic equivalence and drops a constant term.
I'm with you so far, and the BIC is a good asymptotic result, but I'm talking about the full solution here (which is rarely practical), the one that doesn't drop the constant term.
> That constant term is actually what is primarily responsible for Lindley's paradox, and has only an indirect relationship to the complexity of the model.
I mean, I think we're splitting hairs here? Maybe? My point was that Bayesian model selection won't make up for a strange prior, but given the right priors, Bayesian model selection just makes sense to me. Again, though, this is the important limitation of most Bayesian analyses: the prior can do strange things, especially the one used in the Lindley's paradox example on the Wikipedia page.
But honestly, if you think I'm missing some important part of Lindley's paradox, please do elaborate. I had not heard of it before you mentioned it, and I'm still confused as to why it's considered something "deep", but I assume that just means I'm missing something important.