Bayesian vs. frequentist seems to have little to do with the underlying issue.
If clinical trialists use p-values wrong, how is moving to Bayesian methods going to be less misused and misunderstood?
The real issue is the established practice in the research field. It's hard to introduce new methods if peer reviewers are not familiar with them, or if everyone in the field has problems with such a basic concept as p-values. Researchers tend to apply the statistical tools they have learned mechanically, and peer review accepts them mechanically. New tools need more thought, not less.
> If clinical trialists use p-values wrong, how is moving to Bayesian methods going to be less misused and misunderstood?
Bayesian methods have the advantage that, while they can still be misused and misunderstood, at least when used correctly they tell us what we want to know and are easy to interpret (a posterior probability is exactly what you think it is) whereas p-values are hard to interpret correctly (a low p-value can imply a large effect size, a large sample size, or in the face of publication bias, lord knows what).
That said, I do think that the advantage to Bayesian thinking in research would mostly not be about the methods but about the attitude: aside from a couple of weirdos pushing Bayes factors, Bayesian statisticians almost universally communicate results using credible intervals (HPD) instead of dichotomizing the evidence into significant or not significant. Frequentist confidence intervals will get you most of the way there and could completely replace p-values, but if you're going to advocate for better statistics and uproot established practices, might as well go all the way and encourage better methods and better ways of communicating results at the same time.
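To illustrate the point above that a low p-value can come from a large effect or merely from a large sample, here is a minimal sketch. It assumes a one-sample, two-sided z-test with a known standard deviation; the function name `two_sided_p` and the effect sizes and sample sizes are made up for the illustration.

```python
from math import erfc, sqrt

def two_sided_p(effect_in_sd, n):
    """Two-sided p-value of a one-sample z-test when the observed mean lies
    `effect_in_sd` standard deviations from the null value (sd assumed known)."""
    z = abs(effect_in_sd) * sqrt(n)   # test statistic grows with sqrt(n)
    return erfc(z / sqrt(2))          # equals 2 * (1 - Phi(z))

# A large effect with few samples and a tiny effect with many samples
# produce the same p-value; the tiny effect with few samples does not.
for effect, n in [(0.5, 20), (0.05, 2_000), (0.05, 100)]:
    print(effect, n, round(two_sided_p(effect, n), 4))
```

The first two rows print the same p-value even though the effects differ by a factor of ten, which is why the p-value alone says little about effect size.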
> By using posterior credible intervals, we might reject the
null, but by using Bayes’ rule directly we see that this rejection is made prematurely as
there is no decrease in the plausibility of the zero point.
This could totally come up if you have a non factorising multivariate posterior distribution, and you want some sound way to summarise it when you can't reason from the marginals alone.
The post starts out discussing the drawbacks of frequentist methods compared to Bayesian ones and ends up talking about Fisher's likelihood school of thought.
The post argues that likelihood is basically Bayes, but it's actually Fisher's school of thought, which is a halfway point, or can be seen as a compromise, between the frequentist and Bayesian approaches.
A good book that talks about this, which I've been meaning to read, is In All Likelihood: Statistical Modelling and Inference Using Likelihood by Yudi Pawitan.
I've read the first chapter and it's a very interesting read. I might skip bayes and go straight to likelihood, coming from the frequentist school.
> A good book that talks about this, which I've been meaning to read, is In All Likelihood: Statistical Modelling and Inference Using Likelihood by Yudi Pawitan.
It's good but it's mostly a technical introduction to statistical inference (i.e. the mathematics that make statistics work), and the remarks about the differences between different schools of statistics are mostly short asides. It's not a great recommendation for non-statisticians who want more insight into the different schools of statistics.
Also, both frequentist and Bayesian methods rely on likelihoods as an intermediary, so I guess you could say it's a "compromise" but only in a very uninteresting way.
The likelihood L(theta|data) is simply P(data|theta) viewed as a function of theta, and to get a posterior probability you then simply apply Bayes' theorem to get P(theta|data) ~ P(data|theta) * P(theta) (with ~ meaning "proportional to"), and this last factor is your prior.
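As a rough sketch of that likelihood-times-prior step, here is a grid approximation for a coin with unknown heads probability theta. The data (7 heads in 10 flips), the flat prior, and the 101-point grid are all assumptions chosen for illustration.

```python
from math import comb

heads, flips = 7, 10                       # hypothetical data
grid = [i / 100 for i in range(101)]       # candidate values of theta

def likelihood(theta):
    """L(theta | data) = P(data | theta) under a binomial model."""
    return comb(flips, heads) * theta**heads * (1 - theta)**(flips - heads)

prior = [1.0 for _ in grid]                # flat prior P(theta), an assumption
unnormalized = [likelihood(t) * p for t, p in zip(grid, prior)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]   # normalize so it sums to 1

mle = max(grid, key=likelihood)            # where the likelihood alone peaks
post_mean = sum(t * p for t, p in zip(grid, posterior))
print(mle, round(post_mean, 3))
```

With a flat prior the posterior has the same shape as the likelihood; swapping in a non-flat `prior` list is the only change needed to see the prior pull the answer around.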
You can't "skip bayes and go straight to likelihood", because as you said it's only halfway there. Fisher tried to go farther with his fiducial approach, by the way.
Frequentist and Bayesian statistics are different tools that answer different questions. Arguing whether one is better than the other is like arguing whether one should drink water or eat bread.
I also think that a lot of the criticism against Frequentist inference comes from people who have experienced trouble applying a clean Frequentist solution in a more complex case and thus switched to Bayesian statistics as a replacement that is easier to handle.
> Frequentist and Bayesian statistics are different tools that answer different questions. Arguing whether one is better than the other is like arguing whether one should drink water or eat bread.
Astronomy and astrology also answer different questions, it's just that astrology doesn't answer the questions we're interested in, and it doesn't even do a great job with the questions it does purport to answer.
When quantifying uncertainty, the only quantity of interest is the posterior probability, which Bayesian methods provide and frequentist methods don't. Period. The only reason we accept a likelihood instead is because it can provide a reasonable approximation and might be faster or easier to calculate, and the only reason we'd accept a p-value over the full likelihood function is when we need a quick-and-dirty metric of the importance of a variable or an intervention. These are all valid reasons, and heck, few things are more useful than a quick (frequentist) OLS regression, but to then conclude that frequentist and Bayesian methods are entirely equal and complementary is disingenuous.
To me, the popularity of frequentist approaches in the literature means that I see many misapplications and erroneous applications to get that holy p<0.05.
Bayesian thinking is not that popular with these people so it's not 'tainted' yet.
It's a bit like the popularity of Excel - we see many people complain about Excel's automated changing of strings to dates, for example. If we'd all switch to R to fix that problem everybody would complain about stringsAsFactors=T instead.
> comes from people who have experienced trouble applying a clean Frequentist solution in a more complex case and thus switched to Bayesian statistics as a replacement that is easier to handle.
Sounds to me like a good reason and a good outcome. Don't you think?
Probability is a weird concept. It's much more philosophical than it appears at first glance, but much less philosophical than something like the interpretations of quantum mechanics.
Ultimately, what is a "probability"? It's a number that can be used for making predictions about the future. Neither Bayesian methods nor frequentist methods are "ideal" in the sense of predicting (computing) the future — in fact, results from algorithmic information theory put the best bounds on what we can "hope to predict". Very loosely speaking, the best predictor is the shortest program that reproduces the data. But this is uncomputable in general, which means we use something like minimum message length (MML) or minimum description length (MDL) — which are also sometimes uncomputable, but a bit more manageable at least.
In some situations, Bayesian methods can be shown to be equivalent to MDL, in the sense that sizeof(model) + sizeof(parameters) + sizeof(residual data) is a log reformulation of Bayes theorem.
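One way to write that correspondence down, assuming idealized code lengths of -log2(probability) bits (an assumption not stated above) and using M for the model, theta for its parameters, and D for the data as illustrative notation:

```latex
\begin{aligned}
P(M,\theta \mid D) &= \frac{P(D \mid M,\theta)\,P(\theta \mid M)\,P(M)}{P(D)} \\
-\log_2 P(M,\theta \mid D) &=
  \underbrace{-\log_2 P(M)}_{\text{sizeof(model)}}
  \;\underbrace{-\,\log_2 P(\theta \mid M)}_{\text{sizeof(parameters)}}
  \;\underbrace{-\,\log_2 P(D \mid M,\theta)}_{\text{sizeof(residual data)}}
  \;+\;\log_2 P(D)
\end{aligned}
```

Since log2 P(D) does not depend on the model or its parameters, minimizing the total description length picks the same (M, theta) as maximizing the posterior.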
Wikipedia does IMO an excellent job explaining the notion of a confidence interval. On the other hand, I find the idea of being "73% certain" that something will happen much harder to understand. A percentage implies a ratio, but Bayesians never explain what the numerator and denominator are.
The interpretation of the probability is the same as in frequentist statistics, except you're making statements about the model resulting from your assumptions and data, instead of some hypothetical experiment. I suppose the Bayesian approach is more about building the model whereas the frequentist approach is more about selecting the best model out of several.
>The interpretation of the probability is the same as in frequentist statistics
Not at all. Frequentists cannot define a probability on whether it will rain in a location on a given day. They will respond that such a probability is meaningless. Bayesians can, however, give a meaning to it.
True, but the way a Bayesianist (?) will assign meaning to it involves creating a model, based on some assumptions, which will return a probability. The Bayesian notion of probability is equivalent to the frequentist notion of probability for experiments done on that model. In that sense they are the same.
One way to think of it is in terms of the expected value of betting on being correct. Even if the bet only happens once ever, I still need a way to map how confident I am to how much I think the bet is worth. If I am willing to pay up to $0.73 for a bet that pays out $1 when I'm correct then I am 73% certain.
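The arithmetic behind that betting argument can be sketched as follows. The function name `expected_profit` and the alternative confidence levels are hypothetical, chosen only to show where the break-even point sits.

```python
def expected_profit(price, payout, p_correct):
    """Expected profit of paying `price` for a bet that returns `payout` if correct."""
    return p_correct * payout - price

# At a price of $0.73 for a $1 payout, the bet breaks even exactly when
# my probability of being correct is 0.73; below that I lose on average.
for p in (0.60, 0.73, 0.90):
    print(p, round(expected_profit(0.73, 1.00, p), 2))
```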
I should say that the Bayesian point of view is about thinking that all your computed probabilities are only the result of partial information; more data means changing those probabilities using the new information and Bayes' Theorem.
In the Bayesian way of thinking you are always wondering if there is another source of information that can enhance your probabilities. Like in the random forest method, you try to gain new knowledge from many sources.
"I really like penalized maximum likelihood estimation (which is really empirical Bayes) but once we have a penalized model all of our frequentist inferential framework fails us. No one can interpret a confidence interval for a biased (shrunken; penalized) estimate."
Not sure I completely understand it, but good point.
Correlation, no matter how "strong", is not causation. Causation must be proved experimentally. Statistical sampling does not constitute a valid experiment. Statistical inference is not equivalent to logical inference, and the two cannot be used interchangeably. Unproven statistical models are no better than astrological calculations or metaphysics.
Bayesian statistics is a fairly sensible approach to certain types of statistical problems, so I'm not sure why its proponents always seem to talk about it as if they're pushing some new religious movement with a lot of nudity, or Amway.
In my first stats lecture at university the lecturer informed us we would not learn about Bayesian statistics and if we'd wanted to do that we should have gone to York or something.
Why York in particular? I mean, was the subtext "York, which is our ancient enemy" or "York, which everyone knows is a grossly inferior university" or what?
Nobody seems to be capable of explaining this properly. It's like monad tutorials, they explain what happens while mistakenly thinking they are telling you why it happens. I keep trying to fit this idea into my head and I can't because the information is not given.
- Where did this difference come from? When did it develop?
- What are the basic premises that a Bayesian believes that a frequentist doesn't, and vice versa? Reason it all the way through front and back.
- What does the B/F's model look like? What are the pieces they use, how are they arranged, what are the dependencies, how does causality flow?
- Why are the choices made by one invalid for the other's model? Where do they agree deliberately despite this?
- What are the consequences in the real world? Give me a real example on why this difference matters? "Real" meaning I don't care about dice, I care about engineering and science.
Instead you get some bullshit about fitting a distribution you don't understand to a model you can't see, while relying on understanding the nuances between words like probability and likelihood which is what you are trying to learn in the first place. Plus I swear the numbers agree in 99% of the "examples" given, with some handwaving "but it's different" to excuse it.
Fucking explanations, how do they work? Not in academia.
They're two different academic traditions for what constitutes Good Statistics. They're originally rooted in the philosophical dispute over whether to treat probabilities as frequencies of random outcomes ("frequentist") or as degrees of plausibility ("Bayesian").
In actual fact, a well-trained frequentist knows exactly how and when to use Bayes' rule for gambling, and a well-trained Bayesian knows exactly how and when to publish a paper with a p-value.
The really important difference is over how a whole field expresses its consensus or tradition about what constitutes strong evidence or a plausible theory. A Bayesian would like researchers to elicit priors before experiments (which express something like what reviewers' expectations will be about the experiment), and then calculate posterior distributions after experiments. We could then trade off "weak" and "strong" experiments against prior beliefs, while also reducing publication bias's pernicious effect on statistical strength -- or so Bayesians claim. Bayesian methods are also usually more computationally intensive and can make use of small sample sizes.
Frequentists had a lot of disagreements with that sort of thing, and so Neyman-Pearson and Fisher and the like developed a whole lot of statistical methods that don't rely on ever treating a probability as a belief. They preferred to differentiate clearly between a frequency of experimental outcomes, and what researchers think. They figured that Bayesian "priors" were subjective, biased, and untrustworthy. Also, quite importantly, their methods involved a lot less rote computation and instead made use of impressively large experimental samples.
Depending on which tradition you were raised in, and which philosophers of science you side with, you can argue until the end of the world about which one's better. My advice? Use whatever your field demands you use to publish, but be Bayesian on the inside.
I'm not a statistician, and have only studied frequentist statistics (I assume that's the standard taught in introductory stats courses in school).
Like the person at the root of this thread, I have struggled with explanations on why Bayesian is so great. The answers that worry me tend to be along the lines of "Well, suppose you want the probability for event X (typically a "one-off" event). Frequentist statistics cannot give you an answer (one-off events have no distribution to speak of). But with Bayesian statistics, I can compute a probability for it!"
Yes, but as someone else has pointed out, what the heck do you mean by "probability"? Frequentist statistics is fairly clear on the definition. The whole argument given above seems like he is happy he has some mechanism to get an answer, with little thought about whether he is asking a meaningful question.
Which is why your comment resonates with me:
>They preferred to differentiate clearly between a frequency of experimental outcomes, and what researchers think. They figured that Bayesian "priors" were subjective, biased, and untrustworthy.
I don't want an answer that's dependent on how the person thought. That definitely comes across as subjective to me.
>I don't want an answer that's dependent on how the person thought. That definitely comes across as subjective to me.
Then I think you'll be somewhat disappointed when you learn more about philosophy of science and the core debates over methodology. The biggest problem is: nothing is purely objective. Everything involves assumptions of some sort, otherwise we run head-on into the Problem of Induction, white ravens, No Free Lunch Theorems (on the more machine-learny side), and other such problems.
>Yes, but as someone else has pointed out, what the heck do you mean by "probability"? Frequentist statistics is fairly clear on the definition. The whole argument given above seems like he is happy he has some mechanism to get an answer, with little thought about whether he is asking a meaningful question.
I don't think frequentist statistics are very clear here at all! A p-value, after all, is a likelihood, which frequentist statisticians insist is not a probability, but which the math clearly says is a conditional probability. So when you get a p<0.05 finding, it never means, "We actually ran this experiment under a control hypothesis N times, for some large N, and fewer than five came out this way." It's a measure of counterfactual outcomes, conditional on an assumption which we pretend to expect to be true. When the p-value is small, we then pretend to be surprised, and pretend to make an interesting inference.
I say "pretend" because an ordinary NHST is mathematically equivalent to a Bayesian credible hypothesis test with a uniform prior over the hypotheses. Performing the frequentist test involves pretending to believe that uniform prior, even though you probably actually set up the experiment in order to obtain a significant p-value.
In the end, the NHST is a chiefly social practice, and the p-value is chiefly social evidence. It's a way of convincing peer reviewers to accept (that is, subjectively believe) that you did a real experiment, when they would otherwise skeptically believe that you made it all up (which, unfortunately, some researchers have been known to do!).
Bayesian methods don't get rid of this subjective, social component to science and make everything "objective", any more than you can do that by hiring Mr. Spock to do your statistics. Bayesian methods drag the subjective, social component of prior elicitation out into the sunlight where everyone involved has to acknowledge it. They also give you numbers that are actually about the experiment you really did, as opposed to measuring your experiment against an infinity of counterfactual experiments you never really performed.
(And also they're easier with small sample sizes, their results are more intuitive to interpret, and generative models are more intuitive to think about than test statistics.)
All that said, I totally have used frequentist statistics (took a very similar class to yours) when called upon to do so. Fighting a philosophy-of-statistics holy war against your higher-ups in the workplace hierarchy is a really bad idea, so however nice Bayesianism or frequentism might sound, sometimes you buckle down and do what ships products and publishes papers.
Your criticism of p-value usage is legitimate. However, this is not core to frequentist statistics.
When I first encountered p-values, even with a frequentist mindset, I saw the huge problem that one could have with them. Many frequentists do not like p-values. I wouldn't be surprised if most actual frequentist statisticians (not those in fields like medicine, psychology, etc) do not like p-value usage.
Attacking p-values is not a valid argument against frequentist statistics.
I'll also add that it seems that many Bayesians are really dying for a number, and because frequentist stats doesn't give it to them, they reach for another tool that will - but with little thought about the validity of the tool. I'm not here to defend frequentist statistics, but just because it doesn't give all the answers, that does not mean that some other tool that does give some answers is correct.
It is just as open to abuse as p-values are. I suppose if a Bayesian says he used Bayesian approaches because it made sense given his problem, that's fine (and in my mind, he is just being a statistician, not a Bayesian). The self-identified Bayesians I always encounter don't fall into that mold. They fall into the category of "Look what I can compute that I could not with frequentist statistics" - but any attempt I make to understand what that number means fails - they cannot explain it either, beyond "this is how I feel".
I'm not really trying to make an argument against frequentist statistics and for Bayesian ones. I'm more trying to point out what each style exposes (by printing it in your papers) or conceals (by leaving it semi-consciously understood from that one class in grad school).
Strongly disagree, tbh. Picking one side or the other in this debate is silly. Don't "be Frequentist" so as to avoid Bayesian model building techniques since you'll end up stuck all the time and don't "be Bayesian" so as to look down upon simple, workable, un-motivated estimation procedures with good performance.
I didn't mean "look down upon ... workable .. procedures with good performance." I meant a more commonsense sort of "private Bayesianism", where you maintain a healthy skepticism of things that have always failed before, and a healthy reliance on things that have always worked before, even when public scientific discourse purports to show you very strong non-Bayesian evidence.
For example, back in my MSc days, I would run a whole lot of metrics on our dataset, and look for patterns. Sometimes I would find a strong, interesting pattern, and go try to tell my advisor about it. He would ask me to double-check my code for bugs, rerun things, and see if the pattern was still there. Often, it wasn't.
My advisor was nobody's Bayesian, a frequentist (and even a user of purely descriptive statistics, oftentimes) through and through.
So to me, "Bayesian on the inside" ends up meaning, "at least Bayesian enough to look for experimental errors." This attitude has helped me a lot in debugging difficult snafus in industry, too.
The Bayesian believes that probability represents our beliefs about the world. The Frequentist believes that probabilities merely represent the long term frequency counts of events.
>The Bayesian believes that probability represents our beliefs about the world.
But what if our beliefs differ?
On several occasions, while tutoring friends who were taking introductory probability, they'd be posed with a HW problem. They would compute the answer in two different ways, and occasionally get two different answers. Both methods seemed correct to them, but they were not - one was always wrong. I used to argue with them about their reasoning on the incorrect answer, but it didn't help much.
What did help? Just doing the homework problem in real life, with a reasonable number of samples. It could be literally in real life or through a computer simulation. The result would always closely agree with one of the answers.
That's why I like frequentist statistics. It gives me a way to settle the answer outside of my own belief system.
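A minimal sketch of settling such a dispute by simulation. The homework problem here is hypothetical (the comment above does not name one): among two-child families with at least one boy, what fraction have two boys? Two plausible-looking lines of reasoning give 1/2 and 1/3.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
trials = 200_000
at_least_one_boy = 0
both_boys = 0

for _ in range(trials):
    # Each child is independently a boy or a girl with equal probability
    # (a modelling assumption of this toy problem).
    kids = (random.choice("BG"), random.choice("BG"))
    if "B" in kids:
        at_least_one_boy += 1
        if kids == ("B", "B"):
            both_boys += 1

estimate = both_boys / at_least_one_boy
print(round(estimate, 3))  # compare against the candidate answers 1/2 and 1/3
```

The simulated fraction lands near one candidate and far from the other, which is exactly the kind of tie-breaker described above.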
If you have different beliefs then you'll have different probability distributions. There's nothing wrong with that.
Subject to a few technical requirements (basically absolute continuity of priors), it's a theorem that your posteriors will eventually converge as more evidence is gathered.
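Here is a small sketch of that convergence, assuming a coin-flip model with conjugate Beta priors. The two priors, the "true" success rate, and the labels `optimist` and `pessimist` are invented for the example; both priors put weight on the whole interval (0, 1), which is the absolute-continuity condition in this setting.

```python
import random

random.seed(1)
true_p = 0.3                                          # hypothetical true success rate
priors = {"optimist": (8, 2), "pessimist": (1, 9)}    # Beta(a, b) parameters

successes = 0
gaps = {}
for n in range(1, 10_001):
    successes += random.random() < true_p             # one more Bernoulli observation
    if n in (10, 100, 1_000, 10_000):
        # The posterior is Beta(a + successes, b + failures); its mean is below.
        means = {name: (a + successes) / (a + b + n)
                 for name, (a, b) in priors.items()}
        gaps[n] = abs(means["optimist"] - means["pessimist"])
        print(n, {k: round(v, 3) for k, v in means.items()})
```

The two posterior means start far apart and end up nearly indistinguishable, even though neither person ever changed their prior.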
> That's why I like frequentist statistics. It gives me a way to settle the answer outside of my own belief system.
Can you explain this? To me this makes no sense - as a Bayesian I run simulations too.
Part of the problem is that Bayesian vs. frequentist is one of those things like MWI vs. Copenhagen or the Oxford comma: a certain group of people read a thing and decide having a stance on X makes them part of an in-group. They then flaunt that opinion despite never actually being a statistician/quantum physicist/grammarian or ever actually running up against a situation where either option matters in their real life.
The most obvious thing that nobody seems to ever explain is that a statistician can use both frequentist and Bayesian methods. In fact, most good ones do. Frequentist methods are generally better for finding the needle in the haystack, while Bayesian methods are generally better at proving that it's actually a needle and not a piece of painted hay.
Re: Where did the difference come from, that's down to different interpretations of probability. The frequentist interpretation says that probability describes the world, whereas the Bayesian interpretation describes our beliefs. Here's another common misconception: You don't need to subscribe to one interpretation to the exclusion of the other. People who use the Copenhagen interpretation of quantum mechanics (a frequentist formulation if ever there was one) will also speak of fractional belief (the definition of Bayesian probability). It is important to be clear about which interpretation you're using at any one time, but you don't need to tie yourself to one interpretation, and it doesn't need to be part of your identity or world-view.
I agree, and suspect many people (outside of the statisticians who have had the time and space to digest the philosophical underpinnings) who claim to love Bayesian methods do so because they have been told it's the right thing to love. There are a lot of hand-wavy explanations out there that tell you what each side believes, but I have yet to see something that truly ELI5s it.
Well, maybe I'm an exception since I am a statistician, but Bayesian techniques allow me to do things which are simply impossible with frequentist tools. To be fair, I use both approaches just about every day.
"I use both approaches just about every day" is the only sane answer here. Different tools for different jobs. What would you think if you met carpenters who described themselves as "Hammerists" or "Sawsallitarians"?
Sure, but there's nothing which would prevent you from addressing that from a Bayesian perspective. In fact, Bayesian particle filtering techniques would probably be a great tool for "on-line" quality assurance.
I'd think they just made it clear how to choose one or the other based on whether I wanted something built and/or destroyed, as opposed to whether I just wanted it cut apart. ;)
Well, if you're looking for a TL;DR to describe all of the differences between Bayesian and Frequentist statistics, while also giving you a history of the theoretical development of each, and you want it to be self contained... you're going to have a bad time.
There are separate theoretical foundations, which can be confusing since both Bayesians and Frequentists use the same probability theory. A short explanation of the foundational difference is that Bayesians and Frequentists interpret probability in different ways.
To a Frequentist, a probability is nothing more and nothing less than a long run frequency: the proportion of times you expect an event to occur if a random experiment is conducted many times. This proportion is usually conceived of as a true, but unknown, constant. A good Frequentist thus can't describe "the probability that you have cancer", because you either have cancer, or you do not. If you want to see what kind of constraints this places on frequentist descriptions of real-world phenomena, look up the definition of the frequentist confidence interval.
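The long-run reading of a confidence interval can be sketched by repeating an experiment many times. This assumes normally distributed measurements with a known standard deviation; the true mean, sample size, and number of repetitions are made-up values.

```python
import random

random.seed(2)
true_mu, sigma, n = 5.0, 2.0, 25     # hypothetical true mean, known sd, sample size
z = 1.96                             # two-sided 95% normal quantile
half_width = z * sigma / n ** 0.5

experiments = 10_000
covered = 0
for _ in range(experiments):
    sample = [random.gauss(true_mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    # The true mean is a fixed constant; the interval is what varies per experiment.
    covered += (xbar - half_width) <= true_mu <= (xbar + half_width)

coverage = covered / experiments
print(coverage)
```

The "95%" describes how often the procedure captures the fixed true mean across repetitions, not the probability that any single computed interval contains it.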
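To a Bayesian, by contrast, a probability expresses a degree of belief. The Bayesian position rests on two claims:
1. Your beliefs should be describable as probability distributions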
2. You should update your beliefs when observing new evidence using Bayes' rule
There are solid theoretical justifications for both of these statements.
To a Bayesian, therefore, it is perfectly sensible to talk about the "probability that you have cancer", because there is uncertainty about the phenomenon.
This discussion is, however, almost completely orthogonal to the "applied" implications of choosing a Bayesian or a Frequentist approach to statistical inference. Some thoughts:
1. Bayesian procedures tend to be more computationally intensive
2. Non-degenerate Bayesian prior distributions have the effect of "shrinking" parameter estimates towards some null value, which has benefits in high dimensional problems (see: frequentist Lasso and ridge regression)
3. Bayesian inference makes it easy to think about problems in a conditional fashion (e.g., "if I knew what X was, I would know how Y would behave; if I knew what Y was, I would know how Z would behave"). This makes it quite easy to specify intuitive, yet complex, models of interesting phenomena.
4. There are conceptual advantages to thinking about things as probability distributions.
5. Eliciting prior distributions is hard, but it is also work that any good statistician should be doing (at least informally) regardless of whether they're a frequentist or a Bayesian.
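The shrinkage effect in point 2 can be sketched with the simplest conjugate case: a normal likelihood with known standard deviation and a normal prior centered on the null value. The function name `posterior_mean` and all the numbers are assumptions for illustration.

```python
def posterior_mean(xbar, n, sigma, prior_mean, prior_sd):
    """Posterior mean of a normal mean with known sigma and a normal prior.

    It is a precision-weighted average of the sample mean and the prior mean,
    so the estimate is pulled ("shrunk") from xbar towards prior_mean.
    """
    data_precision = n / sigma**2
    prior_precision = 1 / prior_sd**2
    return ((data_precision * xbar + prior_precision * prior_mean)
            / (data_precision + prior_precision))

xbar = 2.0  # hypothetical sample mean
for n in (1, 10, 1000):
    print(n, round(posterior_mean(xbar, n, sigma=1.0, prior_mean=0.0, prior_sd=1.0), 3))
```

The pull towards the prior mean is strong with one observation and nearly gone with a thousand. With a Gaussian prior on regression coefficients, the same mechanism gives the ridge penalty.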
> if you're looking for a TL;DR to describe all of the differences between Bayesian and Frequentist statistics, while also giving you a history of the theoretical development of each, and you want it to be self contained... you're going to have a bad time.
Yes a million times! This problem is mirrored IMO in many domains requiring somewhat complicated math. You end up with an explanation of many layers of concepts flattened into one very hard to grok pancake.
Really though this is a false dichotomy, as it's perfectly possible to be both a Bayesian and a Frequentist by using a loss function which sums over both one's hypothesis set and across possible datasets.
More often than not, a good explanation is very simple, if not trivial. The way I see it, the two points of view are exactly that - points of view. They are both valid approximations to what happens in reality, and there are areas in which one of them works better than the other. The whole controversy looks to me ridiculously similar to the one described in Gulliver's Travels where philosophers were engaged in an endless argument about from which side to crack an egg.
From a more technical perspective all this comes down to a simple fact that some consider probabilities within the framework of Information Theory, while others prefer to use a standalone axiomatic foundation.
- Where did this difference come from? When did it develop?
I'm not sure of the specifics but it (a) appears to be a fundamental dichotomy on ways to practice "finding a good model" given statistical mathematical foundations and (b) has been heavily politicized historically.
A lot of statistical historical practice is developing a good general purpose way of finding a good statistical model and proving that it works pretty well under some assumptions. Historically, Bayesian methods were considered taboo (perhaps because we generally lacked the ability to compute them) and so most papers were Frequentist. Very historically (Gauss) Bayesian methods were often used to generate some of the first statistical models used in physics.
- What are the basic premises that a Bayesian believes that a frequentist doesn't, and vice versa? Reason it all the way through front and back.
In basic mathematics both sides share the same beliefs, but in practice they favor different means to construct and evaluate models. See my other answer for many more details, but essentially Frequentists evaluate their models by seeing how much they diverge from reality and Bayesians evaluate them by comparing relative likelihoods of models given what they observe. This leads to wide variations in the means of constructing, elaborating, and talking about models.
- What does the B/F's model look like? What are the pieces they use, how are they arranged, what are the dependencies, how does causality flow?
A Frequentist's model can be literally anything. You might legitimately consider "the minute of the day that the mailman arrived" an estimator for "the expected time when stock A will beat out stock B three months from now" and then you use Frequentist methods to evaluate how your estimator performs. You'll also likely conclude that this estimator is terrible.
A Bayesian's model typically flows from a "generative story" which results in a massively parameterizable model which covers a huge swath of potential realities and then the Bayesian goes looking in that space for the "most probable" model.
Frequentists can use Bayesian methods if they like. Bayesians can evaluate their "most probable" models using Frequentist evaluations if they like. Good statisticians do all of the above.
- Why are the choices made by one invalid for the other's model? Where do they agree deliberately despite this?
We both want to travel from Boston to SF. I fly and get there quickly, you drive and have a great road trip. We both arrive at approximately the same place but our methods and experiences differ. For sufficiently short trips they're even identical.
More to the point, Frequentists and Bayesians disagree about their mechanisms for getting to good models. Really dogmatic Frequentists and Bayesians can disagree about "the meaning of probability" but as far as I'm concerned this has much more to do with decision theoretic policy making and education rather than mathematics.
- What are the consequences in the real world? Give me a real example on why this difference matters? "Real" meaning I don't care about dice, I care about engineering and science.
Let's say you want to model an engineering problem statistically. Frequentist methods will probably end up requiring some leaps of logic and clever tricks to get to the best result but they will also end up with at least a few algorithms you could run on constrained hardware. Bayesian methods will be easier to "plug and chug" in many parts (though they still require a lot of finesse) but the final result will almost invariably require a fast computer.
I'd compare it to integration. One school of thought is that if you're pretty clever you can integrate many things by exploiting their structure to find the antiderivative. Another school of thought is "I can answer most practical questions here through numerical integration at the end of the day, so why bother finding an antiderivative?"
Both essentially work, but you face very different challenges on each road, and some problems can be much easier from one perspective or the other. If you're really good you have both of these tools in your toolbelt and think carefully about when to pull each one out.
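To make the integration analogy concrete, here is a rough sketch on a toy problem of my own choosing: estimating a coin's heads probability after 7 heads in 10 flips, starting from a uniform prior. The "clever" route uses the known closed form for the posterior mean; the "brute force" route just integrates numerically on a grid. The data, the prior, and the function names are all assumptions for illustration.

```python
def posterior_mean_closed_form(heads, flips):
    # Uniform prior + binomial likelihood gives a Beta(heads+1, tails+1)
    # posterior, whose mean is known analytically.
    return (heads + 1) / (flips + 2)


def posterior_mean_numeric(heads, flips, grid_size=100_000):
    # Midpoint-rule integration of the unnormalized posterior
    # p^heads * (1-p)^tails over p in (0, 1).
    tails = flips - heads
    step = 1.0 / grid_size
    numerator = 0.0
    normalizer = 0.0
    for i in range(grid_size):
        p = (i + 0.5) * step
        weight = p ** heads * (1.0 - p) ** tails
        numerator += p * weight
        normalizer += weight
    return numerator / normalizer


exact = posterior_mean_closed_form(7, 10)
approx = posterior_mean_numeric(7, 10)
print(exact, approx)
```

Both routes land on essentially the same number here. The closed form only exists because this particular model is convenient; the grid version keeps working (slowly) when it isn't.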
> understanding the nuances between words
like probability and likelihood
Go into a lab, do an experiment, call that
one trial, measure a number, call that
number (the value at this trial of)
random variable X.
We might want the average value, expected
value, or expectation of X denoted by
E[X].
Under meager assumptions, if we take a
sequence of independent samples of X, then
their average will converge to E[X]; this
is the law of large numbers.
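A quick simulation sketch of that convergence, assuming X is uniform on [0, 1] (my choice, so that E[X] = 0.5) and using a fixed seed so the run is repeatable:

```python
import random


def running_average(samples):
    # Average of the first k samples, for k = 1..len(samples).
    averages = []
    total = 0.0
    for k, value in enumerate(samples, start=1):
        total += value
        averages.append(total / k)
    return averages


rng = random.Random(0)  # fixed seed, purely for repeatability
# X is uniform on [0, 1], so E[X] = 0.5.
samples = [rng.random() for _ in range(100_000)]
averages = running_average(samples)
for k in (10, 1_000, 100_000):
    print(k, averages[k - 1])
```

The printed averages should drift toward 0.5 as k grows.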
We might be interested in the event,
call it A, when X > 1.
We might want the probability of A, that
is, P(A) = P(X > 1).
For random variable X, we can define its
cumulative distribution: For real
number x,
F_X(x) = P(X <= x)
[Here we are using TeX notation where F_X is
F with a subscript X.]
Then with calculus and meager assumptions,
the probability density of X is
f_X(x) = d/dx F_X(x)
With meager assumptions, calculus and
f_X(x) can give us the expectation E[X].
The likelihood of X = x is just f_X(x),
that is, the value of the density at x.
For the Gaussian distribution, the maximum
likelihood is at the central peak of the
density which is also the expectation.
In some approaches to statistical
estimation, we have some data and seek the
estimate x that maximizes the likelihood
of getting the data we actually did get.
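As a small sketch of that idea, assume the data are Gaussian with known standard deviation 1 and unknown mean mu (both assumptions mine, as are the made-up measurements). A crude grid search for the mu that maximizes the likelihood lands on the sample mean:

```python
import math


def gaussian_log_likelihood(data, mu, sigma=1.0):
    # Sum of log densities of N(mu, sigma^2) evaluated at each data point.
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )


data = [2.1, 1.9, 2.4, 2.0, 1.6]  # made-up measurements
# Crude grid search over candidate means from 0.00 to 4.00.
candidates = [i / 100 for i in range(401)]
mle_mu = max(candidates, key=lambda mu: gaussian_log_likelihood(data, mu))
sample_mean = sum(data) / len(data)
print(mle_mu, sample_mean)
```

In practice one would use calculus or an optimizer rather than a grid; the grid just keeps the sketch short.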
Given events A and B,
we can define the conditional probability
of event A given event B by
P(A|B) = P(A and B) / P(B)
So, if we think of events as geometric
regions and their probabilities as their
areas (actually part of a serious
approach), then P(A|B) is the fraction of
B that is also A.
Then, since P(A and B) = P(B|A) P(A),
P(A|B) = P(B|A) P(A) / P(B)
is Bayes Rule.
If we do experiments and believe from
whatever prior to the experiment that we
have a meaningful estimate of P(B) or
P(A|B), then maybe we are being Bayesian.
More generally knowing that event B
occurred we can regard that as
information we have obtained, and what
that information says about event A is
just P(A|B).
Then, if events A and B are independent,
event B gives us no more information about
event A and we have
P(A|B) = P(A)
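For a concrete check, here is a sketch that enumerates two fair dice (an example of my choosing) and computes conditional probabilities straight from the definition P(A|B) = P(A and B) / P(B), once for an independent pair of events and once for a dependent pair:

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    # Probability of an event given as a predicate on an outcome.
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def cond_prob(a, b):
    # P(A|B) = P(A and B) / P(B)
    return prob(lambda o: a(o) and b(o)) / prob(b)


def first_is_six(o):
    return o[0] == 6


def second_is_six(o):
    return o[1] == 6


def sum_at_least_ten(o):
    return o[0] + o[1] >= 10


# Independent events: knowing B does not change P(A).
print(prob(first_is_six), cond_prob(first_is_six, second_is_six))
# Dependent events: knowing the first die is 6 changes P(sum >= 10).
print(prob(sum_at_least_ten), cond_prob(sum_at_least_ten, first_is_six))
```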
So, if we are interested in event A and
its probability P(A) and suddenly are told
that event B occurred, then for event A we
now want the updated view P(A|B).
Using the measure theory foundations of
probability and the Radon-Nikodym theorem
of measure theory, under meager
assumptions we can define for random
variables X and Y
E[Y|X]
which is a function, say, f(X), of random
variable X and the best non-linear least
squares estimate of Y among all functions
of X.
This measure theory approach also lets us
define
E[Y|Z]
for an infinite set Z of random variables.
This definition is useful, e.g., in the
Poisson process where each increment of
time to the next arrival is independent of
all previous increments, Markov processes,
a stochastic process adapted to a
history, etc.
I just answered one of the user's questions. The user's question I answered didn't get involved in frequentist versus Bayesian and, thus, neither did my answer.
I started my post by quoting the user's question; that's the question I answered.
I never used the word frequentist and made only minimal use of the word Bayesian. I avoided all political fights.
Note: Possibly of special interest to Bayesians, I touched on E[Y|Z] for an infinite collection of random variables Z. This will be important in conditioning (the core of Bayesian methods) in statistics of stochastic processes.
Also of interest in conditioning, I did mention that if events A and B are independent, then B gives no more information about A because
P(A|B) = P(A)
Anyone working with conditioning needs to know this concept.
More generally, I mentioned the sense in which conditioning gives the best possible non-linear least squares estimate; so, here we begin to see the power of conditioning, of which Bayes Rule is the most elementary case.
Also you may find that my touching on the Radon-Nikodym theorem is a first step to high end versions of Bayesian, e.g., the old idea of sequential testing (A. Wald) and the concepts of stopping times, optimal stopping, the strong Markov property, etc. I wrote out an earlier, longer response, which I didn't post, that did go back to sigma algebras, measurability, etc. I did omit measurable selection, sufficient statistics, etc. For such concepts, the Radon-Nikodym theorem and sigma algebras are crucial, and my post may be the only one here that mentioned either.
Also, comparing my response to others, you may find that my response was comparatively clear, precise, understandable, for such a short post without poorly defined or undefined terms, correct, and from a mature view.
By the way, I hold a Ph.D. in applied math from one of the world's best and best known research universities. My research was on stochastic optimal control and passed an oral exam from a Member, US National Academy of Engineering. I've published as sole author peer-reviewed original research in mathematical statistics. Once I did a statistical estimation of expected revenue growth for the BoD of FedEx; my work got two Board representatives from investor General Dynamics to change their mind and stay and, thus, saved FedEx. I've worked in statistical consulting in finance, marketing, etc., including in computing and statistical consulting for the faculty at Georgetown University. My work in statistical power spectral estimation got my company sole source on an important contract from the US Navy. Once I did a Monte Carlo estimation of a statistical estimation I did of the survivability of the US SSBN fleet under a special scenario of global nuclear war limited to sea -- the US Navy was pleased. My work passed review from J. Keilson, a world class expert in statistics.
Maybe, instead of what I wrote, some readers were looking for something else.
An example from Leonard Mlodinow: imagine a friend doesn't pick up the phone when you call. Now, you might think that maybe they're upset at you, because if they are indeed upset, the probability that they wouldn't pick up the phone is very high. On the other hand, there might be many other reasons your friend doesn't pick up the phone – battery's dead, they went on an impromptu holiday, they're having a bad day, they didn't hear it ring. Frequentist statistics deals in the first kind of probabilities (the probability of seeing what you saw given a particular hypothesis) whereas Bayesian statistics is the other way around (the probability of a particular hypothesis given what you saw).
Given that the alternative hypothesis is usually defined as the complement of the null hypothesis (HA = 1 - H0) this doesn't make much of a difference, though.
I find many people have trouble expressing a reasonable hypothesis / null-hypothesis pair. In fact, I'd bet that a good chunk of folks would try to make "upset" be the dependent variable in the phone call scenario.
My favorite explanation is that frequentist methods answer the question "If I assume a model, A, what is the likelihood this data came from it?" while bayesian methods answer "Given this data, what is the model?"
Frequentist methods rarely directly answer the question we actually have. But they're generally far easier to compute. Bayesian methods are often much more intuitive but are far more complicated and less performant.
Meh, most Bayesian techniques still assume a model. It's more like:
Assuming a model M characterized by parameters T and giving rise to data Y, what is P(T|Y,M)
To be sure, you can compare the probability of models as well, and there are Bayesian semiparametric techniques, but models are still really important.
Which is why no working statistician is really 100% Bayesian (intractable in many cases) or 100% frequentist (obviously wrong in some cases). We all use Bayes' Rule (don't need to be "a Bayesian" to do that) and we all are forced to do Neyman-Pearson-style power calculations now and then (holds nose). Even the latter have their uses, in preventing the worst of the worst abuses of frequentist techniques (it's not frequentism that is inherently bad per se, it's the profound abuses that turn it into a magnet for bad science; sometimes a Bayesian formulation of a problem is simply intractable).
Frequentists like to give you a perfect answer the first time by computing everything at once, seeing the world like pure math. For example, given an unbiased six-sided die, the probability of rolling a 6 on the first throw is 1/6.
Bayesians like to walk in the park, and see step by step how things go. The more they walk, the more accurate the result will be. For example, track how often a 1 comes up: at step 5, with 1st roll: 1, 2nd: 6, 3rd: 3, 4th: 1, 5th: 2, you will have (1 + 0 + 0 + 1 + 0) / 5 = 0.4. At step 6, with a roll of 5, you will have (1 + 0 + 0 + 1 + 0 + 0) / 6 = 0.333. The answer gets closer and closer to the true answer after each roll, each step. Ultimately, with enough rolls, Bayesians will start to give you an answer close to the frequentists' one.
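The arithmetic above appears to be a running frequency of rolling a 1 (that reading of the 1s and 0s is my interpretation). A minimal sketch that reproduces the numbers for the same six rolls:

```python
def running_frequency(rolls, face):
    # Fraction of rolls so far that landed on `face`, after each roll.
    hits = 0
    frequencies = []
    for n, roll in enumerate(rolls, start=1):
        if roll == face:
            hits += 1
        frequencies.append(hits / n)
    return frequencies


rolls = [1, 6, 3, 1, 2, 5]  # the six rolls from the example
freqs = running_frequency(rolls, face=1)
print([round(f, 3) for f in freqs])
```

Six rolls is far too few to say anything about convergence to 1/6; the sketch only shows the step-by-step bookkeeping.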
They start to diverge on the issue of what a probability actually is: frequentists see them as long run averages, and Bayesians see them as degrees of beliefs.
If you have a coin that comes up heads 60% of the time, a frequentist looks at that as, "as the number of times I flip my coin goes towards infinity the proportion of heads I get goes to 60%." A Bayesian thinks, "absent other evidence on how the coin gets flipped, I'm about 60% sure the outcome of the flip will be heads."
This lets Bayesians talk about the probability of single events, like basketball games, where frequentists can't.
Bayesians also see conditional probabilities everywhere. A conditional probability says, "well, if I know something about the situation, I should include it in my beliefs." Circumstances matter. It doesn't make a ton of sense to talk about the chances I get hit by a car. They vary wildly depending on whether I'm standing on the highway or eating in my kitchen.
Another basketball example. The chances that the Spurs win change dramatically depending on whether they play the Warriors or the Kings. I might say "they have a 90% chance of winning given they play the Kings, but a 40% chance of winning given they play the Warriors."
I also need a likelihood function: what are the odds I saw my data given my hypothesis is true? If I got hit by a car, what are the chances I was standing in the street? Given the Spurs won, what are the chances they played the Warriors?
We use something called Bayes Rule which allows us to pile on more and more information on something we call a "prior belief," what we thought about our hypothesis before we saw our data. As we pile on data, we expect to change our beliefs. We become more sure of what we thought, maybe we become less sure, maybe we can totally change our minds.
I want to use Bayes' example since I think it's so good. Imagine you came out of Plato's cave and saw the sun rise. You'd think, that's weird, I bet that doesn't happen again. The sun goes down, and you spend some time in the dark. The next morning the sun comes up again. Now you're less sure that sunrises are fluke events. Plus you found some people who aren't freaking out about the whole "big ball of fire in the sky" thing. Maybe now you don't expect the sun not to rise tomorrow. Maybe it will, maybe it won't. As you see more and more sunrises, eventually you get to the point where you are extremely confident that the sun rises every morning. You saw more sunrises and updated your beliefs.
We need one last piece of information: a prior. That's the initial belief your cave-escaping self had that sunrises are weird and you probably won't see another one. We can estimate priors through population data--percentage of games the Spurs won against the Warriors--or we could just make them up. This is just our belief about the truth of our hypothesis; the chance I get hit by a car regardless of where I am.
We take all this and put it into Bayes' rule, a blender that gives us the probability that our hypothesis is true given we saw our data. We can use this as a new prior too.
One last example. I'm 70% sure the earth is round. I see a picture of the horizon taken from a hot air balloon and I think there's a 90% chance that I would see that if the earth were truly round. Without going into the calculation, I'm now give or take 80% sure the earth is round. I saw data to support my hypothesis and my belief got stronger.
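For anyone who does want the calculation: the two numbers given (70% prior, 90% chance of the photo if the earth is round) are not quite enough, because Bayes' rule also needs the chance of seeing that photo if the earth were not round. The sketch below assumes 50% for that missing number, which is purely my guess, and it does come out at roughly 80%.

```python
def bayes_update(prior, p_data_if_true, p_data_if_false):
    # Bayes' rule for a binary hypothesis:
    # P(H|D) = P(D|H) P(H) / (P(D|H) P(H) + P(D|not H) P(not H))
    evidence = p_data_if_true * prior + p_data_if_false * (1.0 - prior)
    return p_data_if_true * prior / evidence


# 0.50 is an assumed value for P(photo | earth not round), not from the post.
posterior = bayes_update(prior=0.70, p_data_if_true=0.90, p_data_if_false=0.50)
print(round(posterior, 3))
```

A smaller value for the assumed 0.50 would push the posterior higher; a value of 0.90 would leave it at 70%, since the photo would then be equally likely either way.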
Why do Bayesian analysis? Because someone once published a study that says that frogs can sense earthquakes some time before they occur. That may be true, but I'm skeptical. My skeptical prior would only get moved slightly to become less skeptical, but it would still need more information, a replication of the study by other people, to actually convince me.
This guy/girl nailed it. It took me a while to realize the distinction really exists in the method constraints imposed on the process of reasoning within each of these paradigms. I found it helpful to think of it as two different computational models. Although they often agree on the outcome, the process of arriving at an answer differs quite significantly.
They both yield the same results when your knowledge of the given probabilities is exact
But Frequentists will look at a "top down" view, Bayesians will look "bottom-up", more importantly, as based on Bayes's theorem, they will look at "if a then b" kind of probabilities.
That XKCD is a joke and not how frequentists view statistics.
The confusion comes from the way you read papers. A frequentist looks at a paper as evidence but not truth. A Bayesian, on the other hand, gets to the same place with slightly different math.
The joke is about priors selection. I'd say it's spot on, since, as you say, the only reason people are not fooled like that is because they evaluate the frequentist results as evidence in an ad hoc Bayesian model.
The truth is that nobody thinks exclusively in frequentist or Bayesian terms. But that's not comics-grade material, and mixing them would hide instead of surface their differences.
The problem is it's a single sample. If the output was Yes, Yes, Yes, Yes, Yes, No, Yes, then you can do frequentist statistics, but when the output is just 'Yes' then the sample size is one.
What's the standard deviation of a sample size of one?
Now, the standard counterexample is a composite statistic. Like: roll 100 six-sided dice, get a total of 600, and conclude they are not fair. But, importantly, there is a standard deviation assumed in the experiment and there was more than one dice roll. However, if you combine that with something else then your sample size drops back to one.
The p-value is defined as p=P(evidence seen | null hypothesis). The standard deviation is only relevant if it is required to compute that number. You can run NHST on distributions without a standard deviation, e.g. a Cauchy distribution.
You might need a standard deviation if you want to do some naive Z-test based on the CLT approximation (since the normal distribution requires a standard deviation), but that's not what XKCD was describing. XKCD was describing an exact test using the true distribution.
True distribution is not what they were using. They assumed the dice had zero bias, which is never true for any physical system.
I can say this is a die and therefore it should have distribution X in theory. But that does not mean its actual distribution is X without testing. Further, even after testing, nothing says the distribution will be unchanged.
Note: The above may seem pedantic, but it has significant real world implications.
All you're saying is models are imperfect. That's true. That doesn't mean you need a standard deviation or more than 1 sample.
In this case, the true odds of rolling 2x6 would have to actually be 1/20 (or higher) rather than the modeled 1/36 to invalidate this test. Do you find it plausible that the bias in 2 dice is that high?
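A back-of-the-envelope sketch of that point, assuming the setup as I read it from this thread: the detector gives a false "yes" only on a double six, so under the null a lone "yes" has p-value equal to P(double six). The 0.05 threshold and the assumption of two identical, independent dice are mine.

```python
import math
from fractions import Fraction

ALPHA = Fraction(1, 20)  # conventional 0.05 significance threshold (assumed)

# With fair dice, P(double six) = 1/36, which is the p-value of one "yes".
p_double_six_fair = Fraction(1, 6) ** 2
rejects_with_fair_dice = p_double_six_fair < ALPHA

# Per-die P(six) at which P(double six) would reach ALPHA,
# assuming two identical, independent dice.
p_six_needed = math.sqrt(float(ALPHA))

print(p_double_six_fair, rejects_with_fair_dice)
print(round(p_six_needed, 4), round(float(Fraction(1, 6)), 4))
```

So each die would need to show a six a bit over 22% of the time instead of about 16.7% before the test stopped rejecting at that threshold.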
In that specific case yes, because there was no dice roll it was just a comic.
In a wider context that single data point is evidence that the detector was tripped or was not tripped. But, unlike a Bayesian, the frequentist does not say they then know the actual probability involved, and they don't update their priors, because to do it correctly you need to pick a p-value threshold and a model before doing the test.
Significant: https://xkcd.com/882/ makes a similar mistake by assuming a frequentist would accept that study design before running the tests. Multiple tests require more evidence, though when multiple groups are involved and not all publish you do get this problem.
Which is its own problem. A Bayesian is often happy to look at any data that agrees with their own interpretation, which is why it's not useful for papers.
The idea that A causes cancer is ridiculous. Collect data: well, the A group has 10x as much cancer, but that's ridiculous, so I conclude there is no relationship between A and cancer.
This is a very interesting comment for someone like me who has little knowledge of statistics. The stack exchange answer shows how stupid Bayesians can be, and the XKCD shows how stupid Frequentists can be.
Yet I find this particular criticism of Bayesians not fully convincing. The Bayesian approach is to take the existing knowledge (where the phone was often lost in the past) with new knowledge (where the beeping is coming from) to come up with probabilities (where to look for the new phone). This seems to me to be the correct approach in general, it's just that in the case of the phone the new knowledge almost entirely outweighs the existing knowledge.
Statistics is about finding a model of the world that we can trust. A model in this circumstance must be one that makes predictions about the world and therefore we trust it when we expect that its predictions will be largely correct—or at least more correct than any other model we have.
Frequentists and Bayesians disagree on the processes for building and evaluating models. Their techniques are often complementary and are, in current and historical practice, almost always used together by professionals. I would call them two sides of the same coin, although some take philosophical perspectives which are more dogmatic.
The theoretical foundation which divides them (confusingly called Bayes' Law) states that there is a relationship between "the probability of seeing some event when a model is true" and "the probability of a model being true given some event happened". In short, Frequentists tend to build their processes off of the first notion and Bayesians off the second.
In practice, Frequentist methods build their models via whatever tools they like. These often include basic optimization tools for picking the best set of "parameters" of a model. They then evaluate the performance of these models by asking "how unlikely was reality given this model was true?" and rejecting models which fail to predict what happened.
Bayesian methods are more fixed but also more dramatic in their ways of constructing models. They tend to create vast models with many moving parts using what's known as a "generative story". This is acceptable since they use the data they observe to compute "probability of truth" for all possible permutations of their model. This is considered a final result since someone might want to ask "how much more likely is model A to be true than model B?" but Bayesians will also at this point use optimization techniques to find "the most probable model".
In many cases these two approaches arrive at the same places. In times that they do not they provide interesting questions about what we really mean when we say that we "trust a model" and this leads to endless discussion. It's also often the case that "avowed Frequentists" have historically used Bayesian methods to discover their basic models and then evaluated those models in a Frequentist fashion for publication (Fisher was known to do this). This arose because at a certain point in statistical history Bayesian methods were not socially acceptable. Finally, it's a pretty good idea for Bayesians to evaluate their models in Frequentist forms in order to have more ways to discuss how their models perform.
Probably the last and most practical difference between the two is that Frequentist methods are often built taking into account their time and space complexity. Frequentists are more likely to evaluate the performance of various extremely simple estimation techniques and pick the best. The Bayesian process nearly always results in an extraordinarily difficult to evaluate integration problem that requires modern computers to get results out of. That said, Frequentists often arrive at their best results via "strokes of genius" while Bayesians can usually chug through any modeling problem and arrive with a decent (again, computational only) model.
I'd also like to point out that there's a huge policy impact of one method over the other. Since Frequentist and Bayesian methods disagree in how we should construct and evaluate models, almost all of scientific practice is impacted. In practice it seems that good science can arise from either method, but the duality here leads to a great deal of policy confusion, since things that are taken as holy by many scientific practitioners are opened to questioning when you introduce a complementary method.
You can also think about this in terms of guarantees. Frequentist methods can give you confidence about the long-run performance of a method, by controlling the familywise error or false discovery rate. In other words, this procedure will only be wrong $\alpha$ percent of the time. Bayesian methods don't really give you that, but they may give you a more coherent summary of the state of the world right now.
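A small simulation sketch of that long-run guarantee, under assumptions of my own: data that really are N(0, 1) (so the null is true), a two-sided z-test with known sigma, and a 5% level. Repeating the procedure many times, the fraction of false rejections should sit near 5%.

```python
import math
import random


def z_test_rejects(sample, sigma=1.0, z_crit=1.96):
    # Two-sided z-test of H0: mean = 0 with known sigma at the 5% level.
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return abs(z) > z_crit


rng = random.Random(1)  # fixed seed, purely for repeatability
trials = 4000
rejections = sum(
    z_test_rejects([rng.gauss(0.0, 1.0) for _ in range(30)])
    for _ in range(trials)
)
false_positive_rate = rejections / trials
print(false_positive_rate)
```

Note that the guarantee is about the procedure over many repetitions; it says nothing about whether any single rejection is right.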