The title is worse than the actual post. You don't always need Monte Carlo. You need integration or summation (a distinction that can be abstracted away), and for some forms Monte Carlo is a way to evaluate those integrals approximately. As the post indicates, there often are cheaper alternatives, especially when one is willing to tolerate some inaccuracy.
In case the author is reading this: I really meant that the article is much better than what the title might indicate. I realized that my original comment could be read as if it were casting shade.
Beats the trapezoid rule hands down in higher dimensions though; in fact that was the main selling point of MC over other quadrature methods. MC error goes down as O(1/sqrt(n)); with quasi-MC one can get it down to roughly O(1/n).
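A quick way to see those rates for yourself (a minimal sketch, assuming SciPy's `scipy.stats.qmc` module and a made-up 2-D integrand whose exact integral is 1/4): estimate the same integral with plain Monte Carlo and with a Halton sequence, and compare the errors as n grows.

```python
import numpy as np
from scipy.stats import qmc  # quasi-random (Halton) sequences

def f(pts):
    # simple 2-D integrand on the unit square; exact integral is 1/4
    return pts[:, 0] * pts[:, 1]

exact = 0.25
rng = np.random.default_rng(0)
for n in [1_000, 10_000, 100_000]:
    mc_est = f(rng.random((n, 2))).mean()                                  # plain Monte Carlo
    qmc_est = f(qmc.Halton(d=2, scramble=True, seed=0).random(n)).mean()   # quasi-Monte Carlo
    print(n, abs(mc_est - exact), abs(qmc_est - exact))

# The MC error shrinks roughly like 1/sqrt(n); the Halton estimate shrinks
# closer to 1/n (up to log factors) for a smooth integrand like this one.
```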
In this particular case you don't have to choose between Monte Carlo and using a normal sum approximation. You can just do a non-stochastic numerical integration (e.g. simple Riemann integration or whatever other method you want to use), since it's only a double integral, and thus the dimensionality isn't so high that you need to resort to stochastic Monte Carlo.
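For instance, here is a minimal deterministic sketch (assuming SciPy for the Beta pdf; the Beta(2,13)/Beta(3,11) parameters are the ones quoted elsewhere in this thread): a midpoint Riemann sum over a grid on the unit square.

```python
import numpy as np
from scipy.stats import beta

# Midpoint Riemann sum for P(B > A) with A ~ Beta(2, 13), B ~ Beta(3, 11).
m = 2000                                   # grid resolution per axis
x = (np.arange(m) + 0.5) / m               # midpoints in (0, 1)
dx = 1.0 / m
pa = beta.pdf(x, 2, 13)                    # density of A on the grid
pb = beta.pdf(x, 3, 11)                    # density of B on the grid
mask = x[None, :] > x[:, None]             # indicator of the region {b > a}
prob = (pa[:, None] * pb[None, :] * mask).sum() * dx * dx
print(prob)   # ~0.73 for these parameters, in the ballpark of the ~71% in the post,
              # with no randomness involved
```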
Let me take a stab at answering. And I'll answer 2 questions, not 1.
1. Why do we need Bayesian estimation? Can't we all just use MLE and live in harmony? If MLE was good enough for Gauss and Laplace and Fisher, why isn't it good enough for me? Answer: for many problems the likelihood function is unbounded. In fact the cases where the likelihood function is bounded, so that the MLE makes sense, are few and far between. But if you think of the likelihood as a function with a hump somewhere, you have several choices for a single number that best describes it: the mode, the mean, or the median, for example. MLE corresponds to the mode; can we use the mean instead? Doesn't it even make better sense to use the mean rather than the mode? Sure it does, and if you use the mean you become a Bayesian. But can the likelihood be unbounded and still have a finite mean, you ask? Yes, it happens quite often, sufficiently often that in practice people don't worry about it.
Once you start using the mean instead of the mode (and can therefore proudly call yourself a Bayesian), you get a number of side benefits: you get parameter uncertainty estimation for free, and you can start multiplying your likelihood by various functions that make your life easier. You can interpret those functions as regularizers, or priors; you can even start talking about expert knowledge and delve into philosophical meanderings. No need to do that; there are enough people on the internet who do.
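A tiny concrete version of the mode-vs-mean point (a hedged sketch; the 3-successes-in-20-trials data are made up, and I use a flat Beta(1,1) prior so the posterior is just the normalized likelihood): the posterior mode reproduces the MLE k/n, the posterior mean is (k+1)/(n+2), and the interval estimate comes along for free.

```python
from scipy.stats import beta

k, n = 3, 20                          # made-up data: 3 successes in 20 trials
a, b = k + 1, n - k + 1               # posterior is Beta(a, b) under a flat prior
mode = (a - 1) / (a + b - 2)          # posterior mode = MLE = k/n
mean = a / (a + b)                    # posterior mean = (k+1)/(n+2)
lo, hi = beta.ppf([0.025, 0.975], a, b)   # parameter uncertainty "for free"
print(f"MLE (mode) = {mode:.3f}, posterior mean = {mean:.3f}, "
      f"95% interval = ({lo:.3f}, {hi:.3f})")
```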
2. Why Monte Carlo? Because calculating means of high-dimensional distributions is not easy. In most cases you are hit with the curse of dimensionality. Separately, the distributions you work with are often known only up to a multiplicative constant. So the problem is: how do you estimate the mean of a vector (v1, v2, ..., vn) when you know its distribution only up to a constant? The answer is Markov chain Monte Carlo (MCMC), and if you learn that, you get special powers.
So why do Bayesians need Monte Carlo? Because they like to be like super-heroes and have special powers.
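To make the "known only up to a constant" point concrete, here is a minimal random-walk Metropolis sketch (generic textbook MCMC, not anyone's production code; the target density is made up): the accept/reject step only uses a ratio of densities, so the normalizing constant never appears.

```python
import numpy as np

def log_unnorm(v):
    # some unnormalized log-density over a vector v = (v1, ..., vn);
    # made up for illustration -- we never need its normalizing constant
    return -0.5 * np.sum((v - 1.0) ** 2) - 0.1 * np.sum(v ** 4)

def metropolis(log_p, dim, n_samples, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    v = np.zeros(dim)
    samples = np.empty((n_samples, dim))
    for i in range(n_samples):
        proposal = v + step * rng.standard_normal(dim)   # random-walk proposal
        # accept/reject using only the *ratio* of densities, so the
        # unknown normalizing constant cancels out
        if np.log(rng.random()) < log_p(proposal) - log_p(v):
            v = proposal
        samples[i] = v
    return samples

draws = metropolis(log_unnorm, dim=5, n_samples=50_000)
print(draws[10_000:].mean(axis=0))   # estimate of the mean, after burn-in
```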
The title seems to assume that Bayesian statisticians do not know Monte Carlo methods. But in fact, Monte Carlo methods are part of Bayesian statistics; they are in any decent Bayesian stats textbook.
It kind of does though? I.e. I’d expect that article to be about why you need to keep adding water to the human body (by drinking water), whereas an article about “why the human body is mostly water” would explain why it is composed of water rather than why it needs more water.
I think maybe a more interesting question is how to explain the extreme disconnect between data volume and confidence in the two examples compared. The election has a huge number of polls, analysts and subject matter experts focused on it. And yet our ability to make predictions about it is comparable to an A/B test where 30 observations have been made total?
The only sense I can make of this is that:
- Predictions about stochastic dynamic systems are hard, especially when there can be these exogenous variables intervening from out of nowhere.
- Competitive situations are hard to predict, especially if they change based on the measurements produced about them (i.e. the campaigns strategize based on polling info). This effectively makes those measurements less predictive.
To me, this suggests that all the polling and analysis are a waste of resources and attention. The expected value of information is extremely low. We should kick off campaigns on Halloween, talk about them for a couple days, and vote immediately. All the pollsters and analysts can be more usefully deployed towards studying other systems.
This seems to be assuming that the value of all this coverage and analysis is in the correctness of the prediction.
It seems more likely that the media companies profit enormously from all of this coverage in spite of (and perhaps even because of) the low certainty provided by this polling and analysis.
Somewhat of a side question, but how would one go about finding jobs where knowing Monte Carlo methods is a primary requirement? It seems most companies want a software developer who knows a bit of this, rather than a statistician/mathematician who knows a bit of software development.
I work as a mathematician/tools engineer in casino gaming (class 2/bingo based slot machines), so I think I fit this description (math first, software second).
People that have this skillset (knowledge of Monte Carlo methods for statistical analysis/verification) are useful in this sphere.
An example problem we deal with is as follows:
We have some arbitrary histogram (a list of slot game awards and their probabilities), and we wish to map it to a particular game of bingo (N ball calls out of M balls in total, on a 5x5 bingo card). The constraint is that the list of bingo patterns used for awards is ordered: the patterns are evaluated top-down, the first pattern that matches is awarded, and evaluation stops.
Getting the optimal fit in the general case theoretically requires exponential time, as the chain of P(this pattern | not that pattern, and not the other pattern, and not ...) expands into an exponential number of terms. Each time a new pattern is fit to an element of the histogram, the probabilities of all remaining candidate patterns change.
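To give a flavor of the computation involved (a hedged, made-up sketch, not our actual tooling: the ball counts, the cell-index pattern encoding, and the patterns themselves are all invented for illustration), the brute-force baseline is to simulate games and evaluate the ordered pattern list top-down:

```python
import random

M, N = 75, 30                                 # 75 balls total, 30 called per game (made up)
CARD = random.Random(0).sample(range(1, M + 1), 25)   # a 5x5 card: 25 distinct numbers

# Ordered list of patterns, each a set of card-cell indices (0..24);
# the first matching pattern is awarded and evaluation stops (top-down).
PATTERNS = [
    set(range(25)),                           # full card
    {0, 6, 12, 18, 24},                       # diagonal
    {0, 1, 2, 3, 4},                          # top row
]

def simulate(trials=100_000, seed=1):
    rng = random.Random(seed)
    hits = [0] * len(PATTERNS)
    for _ in range(trials):
        called = set(rng.sample(range(1, M + 1), N))
        marked = {i for i, num in enumerate(CARD) if num in called}
        for idx, pat in enumerate(PATTERNS):  # top-down evaluation
            if pat <= marked:
                hits[idx] += 1
                break                         # stop at the first match
    return [h / trials for h in hits]

print(simulate())   # effective probability of each pattern being the one awarded
```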
There are a few ways of tackling this problem more efficiently. If you can solve it with good enough accuracy in linear time, with a good math/CS background, you'd be useful to any company playing in this space.
In general what you’re looking for just doesn’t exist. In niche cases it might, and your best bet would be government / military / academic research labs. Those jobs pay comparatively worse but usually have much better work/life balance & job security and enable you to focus on specializations.
In industry, you just won’t find this. Statistics is a small domain specialist component of a much larger software ecosystem. You don’t require advanced statistics, you just might want to try it or derive a special benefit from it, and none of that will be worth it unless it efficiently connects to the regular software development life cycle, architecture patterns, testing, and maintenance processes that the larger system requires.
Someone who can do “a little software development” would be a liability in that scenario, no matter what their other domain specializations are.
That’s why you’ll find most data scientists and ML engineers professionally are highly skilled at software engineering. They approach statistical modeling tasks from the point of view of standard software development life cycle tasks where the components just happen to involve statistical models or techniques.
I worked as a quant for several years at the start of my career. Being a quant is 95% software architecture and engineering and 5% stats / SQL / spreadsheets and analytics.
I assume the results of `betamaxinv` are not unique, so perhaps there need to be constraints, or it should return a family of results. I don't know. I'm trying to wrap my head around how OP got A ~ Beta(2,13) and B ~ Beta(3,11) without just brute-forcing numbers into `betamax`.
Yes he did, in two parts: easy non-brute-force ways exist (as in the Monte Carlo integration in R; he argues why it isn't brute force), and a non-easy analytical solution also exists (when at least one set of parameters is integer, in the linked Cross Validated post).
He also explains that easy analytical solutions often don't exist for seemingly easy questions, such as this one. That's why Bayesian statistics needs Monte Carlo.
You are right, I'm the idiot and didn't read the question carefully...
About the actual question: I don't think it's very well specified in this form. You could have A ~ Beta0, where Beta0 is very concentrated around, say, 0.5, and then find a B ~ Beta1 where Beta1 has 71% of its weight above 0.5. That would be a pretty good approximation, in fact as close as you want if you increase the sample size of A. I can't verify this, but I think finding Beta1 should be easy. Obviously this is not what you are looking for, though, because it would require A to have a much larger sample size than B. I guess we should add some restriction, like similar sample sizes.
I think you mean Newton-Cotes quadrature, but be careful. “Newton’s method” will 100% of the time be taken to mean the Newton-Raphson root finding method instead, which is widely taught as “Newton’s method” (while in quadrature it’s always referred to as Newton-Cotes).
That makes sense in the analytic case, but you mean in the monte-carlo case? I presume just 4D Newton's method on the final integral over the pdf_X to get the value we're looking for?
It makes sense, I'm just surprised the blog never comes out and says "you have to do a brute force search to get the parameters for the likelihoods A ~ Beta(2,13) and B ~ Beta(3,11) for P(B > A) to have about the right value".
Your comment sounds very confused, can you elaborate? What do you mean “in the analytics case, but you mean in the monte carlo case?” I can’t parse that sentence as a response to the parent comment.
Further what do you mean by
> “ the blog never comes out and says "you have to do a brute force search to get the parameters for the likelihoods A ~ Beta(2,13) and B ~ Beta(3,11) for P(B > A) to have about the right value".”
This is not a good example of where Monte Carlo methods are needed, because solving this problem with Betas can be achieved by summation of the joint posterior over the region where one rate exceeds the other.
I think there is something left out of the post, but it's hard to put my finger on it. In a vacuum, 71% does sound like a relatively low amount of evidence in favor of Biden, but when you put it next to most elections, which are only won by a thin margin, it sounds like a very impressive amount of evidence. Further, considering the amount of polarization in the USA, it seems like if 71% of respondents favor Biden, there is little probability of them switching to Trump.
I agree that 71% is a low amount of evidence, but election forecasts really make my brain work to rationalize them. I think it's helpful to remember that forecasts are produced _after_ weighting them by states' electoral college shares.
They are, and I guess most people are too, unconsciously. That's why not winning while having a predicted 71% chance somehow becomes such a big thing. You hear "election" and "71%" in the same sentence and you immediately think that's a huge margin.
This, I believe, is exactly why those predictions are so misunderstood. Most pollsters usually give the expected vote share of each candidate. In this regard, a 60/40 split between two candidates in the polls means that the first candidate is nearly guaranteed to win the election. And for most elections, the split usually is much closer (a 53/47 split is already considered a "wide" margin). So year after year you see splits like 55/45, 52/48, 57/43, ... and the expected candidate always wins.
So when someone analysing an election comes up with a 70/30 split, the number seems incredibly high: there is no way we are getting this one wrong. It takes at least a second look to see that this guy is talking about something different and that the race actually is quite close this time.
> Our forecast is up!!! It gives Joe Biden a 71% chance of winning and Donald Trump a 29% chance
I think the confusing thing here is that at first glance it looks like "Joe Biden will get 71% of the votes," which would be a landslide victory. If it said "there is as much chance of Joe Biden winning as rolling a 1 to 7 on a 10-sided die," which means almost exactly the same thing, no one would be surprised if Trump won, because we are all familiar with dice rolls with probabilities in the 0.1-0.9 range.
Definitely not an expert but there's a fundamental hurdle to this.
One way to think of this is to realize how easy it is to multiply numbers but how much more work it takes to divide numbers.
For something like automatic differentiation, you're essentially applying the chain rule for partial derivatives repeatedly. This is analytically pretty straightforward for most applications: all you need is an analytical derivative for each of the simple functions your more complex function is composed of (e.g. a neural network).
For integration, the analogue of the chain rule is integration by substitution [1]. The toolbox for solving integration problems is more limited than for differentiation. You run into issues where the answer cannot even be expressed using standard mathematical notation [2]. Sometimes you get lucky and the answer can be expressed via an alternating Taylor series so you can estimate the answer within some margin of error [3].
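To make the contrast concrete, here is a minimal forward-mode autodiff sketch with dual numbers (a generic illustration, not tied to any particular library): each primitive carries its own derivative rule and the chain rule composes them mechanically; there is no analogous local rule that assembles the antiderivatives of the pieces into the antiderivative of the whole.

```python
import math

class Dual:
    """A number carrying a value and a derivative; the chain/product rules
    are applied automatically every time two Duals are combined."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)  # product rule

def sin(x):
    # a primitive with a known symbolic derivative
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x):
    return sin(x * x) + x            # f(x) = sin(x^2) + x

x = Dual(1.5, 1.0)                   # seed dx/dx = 1
y = f(x)
print(y.val, y.dot)                  # value and exact derivative 2x*cos(x^2) + 1
```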
Stan is a piece of software that runs state-of-the-art MCMC methods to basically just compute fancy integrals. A Stan model will take an order of magnitude more time to run than a simple neural network via something like PyTorch on the same dataset. But they answer different questions.
Slightly tangential fact I heard recently on a podcast [0]: historically the problem of integration preceded that of differentiation. But because differentiation is so much easier, we teach that to kids first. It's one of the many ways that schools obscure the intuition of calculus and turn it into so many formulae to be memorised.
You're assuming that you want to express the integral as an elementary function, or as a member of another narrow class of functions. It's only in that case that the lack of a chain rule is a problem. What you actually want is the ability to evaluate the integral to within epsilon numerically, which is a much more flexible problem.
But then again, it's not clear what the OP meant by "automatic integration" anyway.
Can you please elaborate on "you never get the expression"?
When you plug variable values into a symbolic derivative you just get a value at the end: d/dx x^2 = 2x, and if x = 0.5 the derivative is 1. The same is true for symbolic integrals. For most practical applications, we don't really care about the full symbolic expression; we just want the answer, or at least a good approximation. This post uses a specific example of the difference between two Beta distributions. We want to get that 0.71. It is very hard to "automatically" make that happen.
Just to clarify: This result follows directly from the definition of the derivative:
f'(x) = lim_{e->0} (f(x+e) - f(x))/e
If you're able to express f(x+e) in the form f(x) + y*e (to first order in e), then it follows that y is the derivative.
It also should be noted that auto-diff doesn't let you skip the rules of differentiation; you're using the same calculation as you would to show that e.g. d/dx x^n = n*x^(n-1).
Closed form integration is fundamentally different.
What's the integral of exp(sqrt(1 + (tan^(3/2) x)^2))?
We only know a handful of forms that can be integrated in closed form, and it's down to our creativity to discover new forms that can be integrated (same deal with solving differential equations, and the reasons are the same).
The forms that we know how to integrate can be done by a computer. CAS tools will do that for you. For example Mathematica.
The reason why auto-diff is exact and amazing is because (1) the derivative operation is "decomposable" through the chain rule and (2) we've found the symbolic derivative of the basic functions. There's not a mathematical trick in auto-diff which we recently discovered. It's mainly a reformulation of the properties of the derivative which turns out to be useful for computers.
Finding an "auto-integrate method" would probably involve finding a way of calculating the integral in a decomposable way, and that indeed would be amazing, but I don't really see that happening any time soon.
on why integration is a fundamentally harder problem than differentiation. New techniques would be required to programmatically solve integrals analytically as well as current differentiation programs do, but it would be exciting to see.
I actually have an idea about how it would work that side steps all of this. I need more time to develop it but watch this space. But I agree with the local vs global relationship, in fact the whole trick is about handling this.
The NUTS algorithm in hybrid (Hamiltonian) Monte Carlo would be pretty close to this.
The problem is that integration suffers from the curse of dimensionality much worse than differentiation.
Differentiation is a local activity. You approximate a derivative near any point. Integration is non-local, you must aggregate a function over a span of points in the domain space.
As the dimensions get higher, differentiation merely has to apply its local approximation along each successive dimension at a point; it only adds to the computation cost.
Integration requires exponentially more data (for a given level of accuracy) because adding a new dimension enlarges the entire integration domain space that must be ranged over, and worse you generally need all that new “volume” of the enlarged space to be covered evenly to avoid bias.
So there are some fundamental asymmetries between differentiation and integration that play a role in why “automatic integration” isn’t as straightforwardly feasible as for differentiation.
The selling point of Monte Carlo is that regardless of dimension the estimation error is going to go down as 1/sqrt(n). That hides the dimension part in the 'constants' and the sampling routine.
Usually one cannot afford to visit all of the space. One visits high density regions preferentially.
Yes, agreed, I just meant why we won’t see “automatic quadrature” in the same sense as “automatic differentiation”. It’s always going to rely on adaptive sampling, for exactly this reason of dimensionality.
Integration is well behaved numerically, but poorly behaved algebraically.
Differentiation is poorly behaved numerically, but well behaved algebraically.
Therefore if one wants to do "automatic integration", one has to approach it in a radically different way to autodiff. And arguably, such a method does already exist; it's just very inefficient.
> This assumes a little too much about how automatic differentiation would work.
The use of 'would' implies that automatic differentiation has not been established. Also, if something is 40+ years old, then it follows that its foundation is even older.
Because there is no known algorithm for deciding equality of expressions. It becomes automatic if we assume an oracle. Since deciding equality is an open problem, and the answer is probably "no, there isn't a decidable equality algorithm," that is as good an answer as we are going to get.
You are assuming that we care about expressing the integral as a function. Given that with autodiff we don't actually get the function expression, presumably with auto-integration we also wouldn't get it.
In automatic differentiation we depend on the fact that a closed form elementary solution exists even if we don't calculate it directly.
For integration we need to prove that such an object exists for each expression we want to integrate before we try integrating, which as far as I'm aware is a problem equivalent to the one solved by the Risch algorithm.
It's been over a decade since I studied it, but from memory you could also end up in infinite loops when generating the closed-form solution if it doesn't exist. Which, with as much hand-waving as possible, is ultimately equivalent to the kind of closed-form output an auto-integration routine would have to produce.
In a sense, probabilistic programming languages are automatic (approximate) integrators. Programs can represent arbitrary measures, and the back-ends provide various methods of estimating integrals with respect to them.
And how would the similarity work? The derivative is defined at a point, it makes sense to calculate f(x) and f’(x) together. What integral are you talking about precisely?
Integration is a computable operator ([1]), so arguably such a thing does already exist. The difficulty though is in computing it efficiently, especially in high dimensions.