I really wanted to like this article, because I came accross the term MCMC quite a bit, but never knew what it was exactly.
I'm slightly disappointed, though. Almost all of it is dedicated to introducing prerequisite terminology (prior, posterior, likelihood, markov chains) which probably a lot of readers will already be familiar with, and then the actual explanation of MCMC is just this:
> To begin, MCMC methods pick a random parameter value to consider. The simulation will continue to generate random values (this is the Monte Carlo part), but subject to some rule for determining what makes a good parameter value. The trick is that, for a pair of parameter values, it is possible to compute which is a better parameter value, by computing how likely each value is to explain the data, given our prior beliefs. If a randomly generated parameter value is better than the last one, it is added to the chain of parameter values with a certain probability determined by how much better it is (this is the Markov chain part).
"subject to some rule", "it is possible to"... I feel like this is really glossing over the actual explanation of how this works. Actually, I still have no idea why there is a Markov chain. What is the structure of the chain? Where does it come from and why can't we just sample the parameter without using a chain?
Anyway, I appreciated most of the article and it started out really promising. It would be great if the author could try to expand on that still-baffling part ;)
It's a Markov chain because you use the previous sample to sample the next. A chain of random variables where each variable depends on the previous one (and only the previous one) is by definition a Markov chain.
The theory of Markov chains also provides a useful theorem, which is that (under suitable conditions) the distribution of each sample converges to a particular 'limit' distribution. In fact it converges to the so called 'stationary' distribution. It's called stationary because if you start the Markov Chain at a sample from the stationary distribution then all following samples of the chain will have the exact same distribution.
For the finite case, if you put the probabilities of transitioning from state i to state j in a matrix M then the stationary distribution is a vector p such that Mp = p (or pM = p depending on what convention is used).
The MCMC methods consist of trying to design a Markov chain which has a useful stationary distribution. This allows you to sample the stationary distribution by sampling the Markov chain.
I humbly submit my article as an alternative. I clearly explain why there is no alternative, and give a concrete description of the markov chain. Though after the initial description of what problem MCMC is solving, I do dip into the math.
Your article is interesting. But I see a problem in your exposition, in MCMC we don't know the probabilities of individual events. For example let p_n proportional to 1/n^3. We know that 1/n^3 converges but we don't know the sum. The key factor here is that we can compute the transition probabilities without computing the sum of the series. In the continuous variable setting we can compute the transition probabilities without computing difficult integrals. I hope you can incorporate this key factor to improve the quality of your article.
Let X be the random variable that takes values in natural numbers n>1 with p(X=k)=A/k^3 with constant A such that sum p(X=k)=1, you can't apply MCMC with ypur explanation since you don't have an oracle for p(x=k). To apply MCMC you only need an oracle for the quotient of probabilies, here the transition probabilities are k^3/j^3 and you don't need to compute the unknown constant A.
I don't think this distinction is key. You're talking about a detail of the Metropolis-Hastings algorithm, while I'm defining a problem that Metropolis-Hastings is one of many algorithms to solve. Sure, Metropolis-Hastings might not need an oracle, but providing an oracle doesn't make the sampling problem any easier, and the way I wrote it makes it clearer why the sampling problem is hard. "Avoiding a difficult integral" is an engineering detail.
On one hand, mathematically speaking, the problem of sampling a finite state system when the probabilities of each state are known is a trivial one, as you know very well since you have shown that method in your posts (computing cumulative probabilities ...)
On the other side, your blog is about computation and math, and in the realm of computation the complexity of algorithms is a key factor. Mixing time and burning time of markov chains are also computationally important.
The curse of dimensionality is a cause of Monte Carlo methods to be used instead of other numerical methods in high dimensional problems.
You should communicate to your audience that the advances in stochastic sampling and its application to science has revolutionize the field not because a breathrough in maths but because great key insights that allow us to explore multidimensional problems effectively with the help of computers.
Another important factor is that MCMC is used extensively in Bayesian statistics, but to explain that main application of MCMC in your blog it is necessary to introduce concepts like Bayer's theorem, likelihood, posterior density function, and prior probabilities.
There are some blogs in which all those concepts are illustrated and code in R or python is provided. Perhaps MCMC requires more than one post to be adequately explained.
To sum up, I am a follower of your posts and I appreciate the effort you take to show the main ingredients and the python code. My only desire is for you to continue improving the content of your blog for us to follow and enjoy.
Merry Christmas to you and your family from a follower in Spain.
>>Actually, I still have no idea why there is a Markov chain. What is the structure of the chain? Where does it come from and why can't we just sample the parameter without using a chain?
If the parameter can be sampled from the posterior directly, then no chain is usually needed, just sample it. However, most often you can not directly sample from the posterior. Note that this is not necessarily because "posterior is hard to analytically compute". Most times you know the analytical expression for the posterior distribution (up to the normalization constant) but this does not mean you can generate samples from it easily.
So, instead, you construct a Markov chain, from which you can sample easily, and which has the property that in the limit its distribution coincides with your posterior. Then you sample the chain for a long time an hope for the best.
Any chain for which you can show that its stationary distribution is the posterior works in principle. But some chains are better than others. There are also some generic ways to construct such chains, most famous are the Gibbs and the Metropolis-Hastings samplers.
Finally, do not sample blogs by "wikipedia researchers". Sample books like "Bayesian Data Analysis" by Gelman et al. They will not lie to you.
You can not sample the distribution directly when “it is impossible to solve for analytically”. There are a few methods to overcome this issue, in particular “MCMC methods allow us to estimate the shape of a posterior distribution in case we can’t compute it directly”.
You only need to be able to compute the relative values of the function you want to integrate (in this case a probability density) at the current point and the proposed destination. Where is this function coming from is not relevant to understand MCMC methods, they can be applied in many problems unrelated to Bayesian statistics.
I think probably the single most intuitive way of implementing a Bayesian model for a programmer is ABC: https://en.wikipedia.org/wiki/Approximate_Bayesian_computati... because it is generative and inherently algorithmic. I wonder if there is a good tutorial way of starting with ABC and developing into MCMC and other more efficient methods? It's easy to transition from simple Monte Carlo methods to ABC rejection sampling, but I'm not sure how to link up ABC and MCMC (in part because I don't grok MCMC myself well enough to explain it).
This is like describing a painting to a blind person. Describing "a dust-covered sunflower" as a "yellow flower" doesn't much help the blind person and certainly doesn't help the sighted.
It's silly to say something like "A non mathematical introduction to this mathematical concept". What the author really means is that he's not using equations in his introduction. Far too many people think equations are what mathematics is solely about, and titles like this perpetuate that stereotype.
This post would have been better if it actually used equations. A few short equations are easier to understand that a bunch of lengthy paragraphs. In fact P(A|B) = P(B|A)P(A)/P(B) is compact and not hard to understand, but there are entire books written about it!
I agree that equations should have been introduced, but I disagree that your example is not hard to understand, which is the point. I'd rather read about the concepts than look at esoteric equations.
It's silly to say something like "Far too many people think equations are what mathematics is solely about". What you really mean is that far too many people think series of statements containing numbers and variables are what mathematics is solely about, and comments like this perpetuate that stereotype (ie. all math statements containing numbers and variables are equations)
And that's why arguing about word meanings is not a good idea. I think referring to written math as just math is reasonable, and referring to it as equations also is.
Why call it a "zero math introduction" to a math topic? Kind of twists my guts a bit to read that. Seems to suggest some of a wannabe or love-hate relationship with math that you don't want to know any math but you find this math topic interesting?
One of my gripes with the current education system is that it trains the pupil to simply regurgitate knowledge without ensuring that the pupil truly understands the topic - in this case MCMC methods.
Articles such as this one are a great resource to help what would be a mindless regurgitator actually understand what the whole point of an MCMC method is for. The 'why' to the 'what' essentially. And you don't need math to explain that.
There's an argument that you need math to explain it because it is math -- just not what we're taught to think of as "math".
One of my gripes with the current education system is that it makes it hard to recognize when math appears in situations not explicitly involving numbers, equations, and matrices.
It reminds me of a text that Feynman wrote about when he gave classes in Brazil, the students knew all the theory and the math behind it, but they could not recognize the phenomenon when it happened in the real world.
So I guess that is not only in math, but in the entire corpus of knowledge, our system makes us prepared for grading tests, not for applying the knowledge.
He told a room full of students that a French curve (that extremely curvy shape used in technical drawing) has the remarkable property that the tangent of the lowest point is always horizontal.. and they all believed him.
Well it's true, isn't it? I think that point was that they were astounded that the curve had that feature because they didn't understand why it was always horizontal. They thought it was magic.
Uh the point seemed that they knew the theory (of differentiation, tangents, trig etc) but were absolutely helpless in practically applying it. Couldn't relate the theory they had learnt to what they knew of the world, at all. Didn't think 'But the lowest point of ANY curve is horizontal!'
The hand points at the Moon. The dog barks at the hand.
Math is confused with arithmetic, which is one tiny subset of math, and notation, which isn't math at all but is merely a tool mathematicians use to communicate more clearly because some concepts are too bulky when you attempt to express them using words. The essence of math is careful reasoning, following step-by-step processes to derive new truths from existing ones. If you do that, you're doing math, and if you do it without arithmetic or notation, people will say you're doing math without math.
I wonder if you meant to post "There's an argument you need math, because math is logic.", which is tangentially true, but that's not the point i'm making.
> There's an argument that you need math to explain it because it is math
* More specifically, it's a mathematical idea, so, in order to explain it, it would be impossible not to use other mathematical ideas. These other ideas exist throughout the article and are [incorrectly] separated from what the average person thinks when they hear "math" but it's math, nonetheless. For instance, the many graphs throughout the article should be recognized as math.
I consider it "intuition". The real math behind this is inescapable and formal. For someone like me, any article that says zero-math equates to having some intuitive description of whats going on, which is gold. I'll eventually get to the math pieces, but having an intuitive understanding of the topic makes it much easier to grok the math later on
A brilliant mentor of mine(who went to MIT for his undergraduate and graduate degrees if that matters to anyone) used to decry lectures that were too mathematical rather than being more intuitive. I feel the same as it's far easier to develop the mathematics after having been given only the intuition than to develop intuition after having been given only the mathematics. Obviously there should be a balance but I prefer that my graduate mathematics courses err on the side of being too intuitive rather than too mathematical because I can generally figure out the maths on my own after the lecture.
Perhaps "math" is overloaded to sometimes refer to the actual doing of mathematics and at other times to the act of communicating mathematical ideas - which is where symbols/logic/rigour begin to feature.
As for "intuition", I usually seek refuge in Jeff Raskin's theorem "intuitive = familiar" - i.e. you don't lose the content of a statement made using the word intuition if you replaced it with the word familiar .. but now suddenly it is less "mystical" and clearer. Plus it points. To the subjectivity of intuition. - that what is familiar to one may not be familiar to another and so there are different "intuitive" explanations.
So "intuitive understanding of topic" = "understanding a topic based on what you're already familiar with". Explains why quantum mechanics doesn't feel "intuitive". There is nothing in our experience that makes the phenomena "familiar" to us.
Who exactly is the target audience for this? A lot of developers I’ve met who hated math had the skills for it. If you can master a new programming paradigm in a few weeks than you can do math, it’s not the monster all theses writers make it out to be.
I would guess people like me then. I love math but find looking at math the way that math is normally presented is exhausting. When it's pictures and expressed in a real tangible way I can get the concepts very easily. Higher level math is too abstract for me to understand by looking at equations.
Disclosure: I loved how math was normally presented: Layer upon layer of abstraction, connected by proofs. Hog heaven.
But most people hate it that way. So I wonder if a better way to present statistics is to start with a random number generator and graphs. In fact, despite my love of equations, these days I haul out the random number generator anyway, to check whether my equations make sense.
I agree with your point. I'm not a programmer or mathematician, but I enjoy learning math that's new to me.
But when my brain tries to learn math I haven't seen before, I usually can't extract intuition from the formula alone (compare this article with Wikipedia math pages). Short of playing around with the equations, I get a better understanding from articles like this when the topic is still new to me. There's less inertia to understand something like this, so I learn faster.
I don't think of math as a monster, but I appreciate simplification. The title is a bit click-baitey though.
Many social scientists are adopting Bayesian methods. However, they don't always have much of a background in college math. These kinds of guides afford them the intuition necessary to begin implementing related methods in their research.
> A lot of developers I’ve met who hated math had the skills for it. If you can master a new programming paradigm in a few weeks than you can do math, it’s not the monster all theses writers make it out to be.
Let me start off by saying that, as someone whose research is applied mathematics, I agree with you that math does not need to be "scary" if the pedagogy is tailored to the right level and style.
That being said...I think you're being a bit optimistic (or cavalier?), and maybe swinging too far to the other end of the spectrum. I don't believe programming has much overlap with mathematics at all. It's very close to applied logic, but I think even then it's specifically much more like an engineering discipline than a mathematical one. At the higher levels of computer science (like complexity theory) I see a much broader overlap with mathematics, but you use very little of that in typical programming.
I think the skills that make someone a good programmer and the skills that make someone a good mathematician are essentially orthogonal: I've met mathematicians who are excellent programmers (and vice versa), but I personally think that's more because they have learned to keep one discipline "out of the way" of the other one. I don't think most computer scientists or mathematicians actually make very good programmers (and vice versa) because there is little actual overlap in their skillsets.
Once you get past calculus (and maybe linear algebra), math becomes primarily about proofs, not computation. Proving abstractions in math and programming software do not have a significant overlap. I think most people can reasonably learn an undergraduate math curriculum (say, up to abstract algebra and something like elementary number theory), and I think most people can reasonably learn programming. But I don't see any realistic skill transference between these two, i.e. learning one won't make you learn the other any faster.
I'm not trying to claim one subject is obviously harder than the other, I'm just saying that it might give someone false confidence to have them dive into complex math just because they have experienced success in learning new programming frameworks or paradigms every year for their career. That is setting them up for failure - it takes significant mathematical maturity to read through a textbook and understand it without a teacher, just as it takes some domain-specific maturity to optimize learning a new programming language from a textbook or documentation.
People who want to learn math should be prepared to spend 5 - 10 minutes reading each page of a math textbook to really understand the material, followed by a few hours on each chapter's problems. This is absolutely doable for an autodidact, but I think my framing is a bit more realistic than yours. Even if they're innately capable of learning the math, they should be prepared for it to be essentially as difficult as learning programming for the very first time.
It's more the style. Often the technical jargon is arcane and there's usually not a clear distinction between when the author is using a very exacting mathematical term or just a generalized english term unless you are knees deep and spend a large number of your waking hours in a very specific field.
Then there's the naming ... P(X) and we're in probability and X is a random variable but if we are using just "p" then we're in physics and talking about momentum. If it's "P" then it's power, unless it's P(x) and then we're talking about geometry. And then you can italicize it, bold it, put a hat or a dot on it or under it, make the braces square, straight, or curly, and you get something else entirely; I mean completely different fields. Sometimes the same thing is used in multiple fields and then you need substitution syntax when using the two together. Great system!
So if you have a job where you are juggling 6 or so disciplines and you see a jumble of stylized letters in an equation along with a description, it's a fairly absurd system especially when the author assumes that all the readers know what they mean when they say i+p(j)/k or that when they use common everyday words which, in that particular branch of mathematics are actually very specific technical terms.
I often read things and think "what on earth is this person saying?" and then have to go back and decrypt this terribly designed mathematical language everyone uses that we are all supposed to say is a glorious and perfect interface. It's not, it's god awful and horrendous. The vast majority of humanity run screaming from it and can't interface with it to work through even simple concepts that they probably already know.
It's the modern version of ancient latin and it's become equally heretical to insist that we must create a better, more humane, more consistent, more discoverable, more flexible interface to describe the world that isn't just vestigial symbols from far-flung authors spanning 3000 years thrown together in a huge dumpster fire.
Every discipline has the problem that it's jargon is impenetrable to outsiders. Try showing someone with zero programming knowledge a text talking about "string search" and "pointer chasing" and they will be just as confused.
Mathematics is only different because it is useful in many other fields, so you get lots of non-mathematicians trying to make use of some result in isolation. When they don't understand the explanation, they blame it on the unfamiliar words and notation, but often that's just a symptom of not understanding the concepts. Anecdote: I have been taking a course taught in Chinese, which I don't speak very well, so all mathematical jargon is new to me. But just seeing how they were used let me recognize the words for familiar mathematical concepts.
Of course some mathematical writing is just bad, but usually mathematicians will then agree that it's bad. But if some formula comes with an explanation using words you don't understand, there is no way around looking up definitions until you understand the prerequisites. There is no language you could translate mathematics into to make it magically more understandable, unless the translation always prepends an introductory textbook.
I don't think it's heretical to long for better notation, I just don't think it's going to help much. But everyone is free to make up their own symbols, so feel free to go ahead. If it's really better, people will adopt it, just as they adopted superscripts for powers instead of one-off symbols, Leibniz notation instead of Newton's, and so on.
If you feel this way, I encourage you to consider learning a formal verification language like Coq. I've found that reading maths in Coq or the like (quite a lot of maths has already been formalised in some kind of proof assistant) is ultimately easier as although it's more verbose, the writer is forced to specify everything clearly and unambiguously, and checking the exact meaning of any term is only a command away.
I have not read the article, but a title like Zero-math introduction to something may attract those who hate math, do not want to work on something involving math, or are too lazy to re-familiarize themselves with it. What they gain from reading such articles is short-lived. They cannot put it to use. This leads to disappointment and a disturbing trend towards math anxiety.
As a kid, I tried to keep away from math not because I hated it but because my math teacher did not know how to teach it. It was my conclusion then that they had not learned the subject thoroughly. As an adult, I decided to take matters into my own hands and started learning it by myself.
Though I agree with the theme of what you said (mainly that you shouldn't run away from the math), I don't think it's as mutually exclusive as you paint it. Its always good to get different view points of the same topic, and a lot of the times not having it be do math heavy lets you have nice intuition behind the topic. A good example is 3blue1brown's videos. He gives great intuition without sacrificing rigour in his presentation, while not filling his videos with the daunting formulas and such.
I recommend the example in the book Doing Bayesian Data Analysis, in this book to motivate MCMC : A politicean want to visit all of US states and each with frequency proportional to the total population of the state. He takes decisions to visit a new state comparing the population of the current and the proposed state, He don't know in advance the total population of every state, but applying MCMC he archieves his goal. Here you may think that computing the total population of the US is a simple detail, but in other problem the computation involves a very difficult multivariable integral with thousand of variables, that in practice is impossible to calculate.
It has occurred to me that the discussion has shifted towards programming vs mathematics, instead of the actual discussion proposed by the OP to focus on MCMC.
This happens a lot here, focus is lost very easily.
The first thing I saw when I loaded the page was a chart with numbers. If charts are no longer math, then someone has certainly redefined the term math.
mcmc treatment in hopcroft, kannan is really really very good. highly recommended. there are a large number of articles on the web e.g. at math-intersection-programming which are based on that. but, i would really be wary of any 'zero-math' thingy for that.
I'm slightly disappointed, though. Almost all of it is dedicated to introducing prerequisite terminology (prior, posterior, likelihood, markov chains) which probably a lot of readers will already be familiar with, and then the actual explanation of MCMC is just this:
> To begin, MCMC methods pick a random parameter value to consider. The simulation will continue to generate random values (this is the Monte Carlo part), but subject to some rule for determining what makes a good parameter value. The trick is that, for a pair of parameter values, it is possible to compute which is a better parameter value, by computing how likely each value is to explain the data, given our prior beliefs. If a randomly generated parameter value is better than the last one, it is added to the chain of parameter values with a certain probability determined by how much better it is (this is the Markov chain part).
"subject to some rule", "it is possible to"... I feel like this is really glossing over the actual explanation of how this works. Actually, I still have no idea why there is a Markov chain. What is the structure of the chain? Where does it come from and why can't we just sample the parameter without using a chain?
Anyway, I appreciated most of the article and it started out really promising. It would be great if the author could try to expand on that still-baffling part ;)