> Let D be a distribution over a finite set X.
In statistics, the space is almost always continuous (i.e. non-finite), typically some moderate-to-high-dimensional Euclidean space (or subset thereof).
> You are given black-box access to the probability distribution function f(x) which outputs the probability of drawing x \in X according to D.
If you know the distribution function, you can do inverse transform sampling. MCMC is typically used when the density function is only known up to proportionality (which seems to be what he actually means).
> Design an efficient randomized algorithm A which outputs an element of X so that the probability of outputting x is approximately f(x).
Now, here's the problem: MCMC doesn't actually give you this. What you get instead is a sequence of dependent random elements whose distribution (given some mild conditions) converges to the target distribution.
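To make both points concrete, here is a minimal random-walk Metropolis sketch (my own toy example, not from the post): it only ever evaluates an unnormalized density, and what it produces is a dependent chain whose long-run distribution approximates the target, rather than independent exact draws.

```python
import numpy as np

# Toy 1-D target, known only up to a normalizing constant.
def unnormalized_density(x):
    return np.exp(-0.5 * x**2) * (2.0 + np.sin(3.0 * x))

rng = np.random.default_rng(0)
x = 0.0
chain = []
for _ in range(50_000):
    proposal = x + rng.normal(scale=1.0)   # symmetric random-walk proposal
    # Accept with probability min(1, f(proposal)/f(x)); the unknown constant cancels.
    if rng.random() < min(1.0, unnormalized_density(proposal) / unnormalized_density(x)):
        x = proposal
    chain.append(x)                        # on rejection, the current state is repeated

# The samples are correlated, and only the long-run distribution of the chain
# approximates the normalized target, so an initial burn-in is usually discarded.
samples = np.array(chain[5_000:])
```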
The author draws a distinction between computing the expected value of a random variable and computing the volume of a convex set, but from the point of view of measure theory, both are computations of integrals. So maybe the definition given in the encyclopedia wasn't so bad, given the right prior knowledge.
Well, come on. Lots of things are integrals. The Riemann zeta function is an integral. But I don't think MCMC has anything to say about that.
There is no obvious "measure" in the convex set example, so it's not clear how MCMC applies. And there are no good MCMC convergence results, so even if you apply it, it's not obvious you'd be able to prove anything.
If you think of the set as being defined by an oracle (a function that reports whether its input is inside or out), then it looks like it will require exponentially many probes (in the dimensionality "d" of the space) to get its volume. Basically, a curse of dimensionality argument -- you have to sample independently along each of "d" axes. But it turns out there is an MCMC that is polynomial-time. (Now, granted, IIRC it is like n^17, but...)
The Lebesgue measure could be a candidate for the measure you are asking for. Creating a bounding box around the convex set in question and then rescaling, we get a probability measure (the measure of the whole box is 1). Certain random walks inside this bounding box will converge to this scaled Lebesgue measure. I think this is the main idea behind the convex volume estimation algorithms.
The oracle you mention is the indicator function of the convex set. Instead of integrating this indicator function directly against Lebesgue measure to get the volume, we can use MCMC to estimate the "expectation" of the oracle.
It's not obvious if we can realize the Riemann zeta function as an integral over a finite measure space, but if we could, I think MCMC could be used to evaluate it.
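Going back to the convex-set example, here is a deliberately naive sketch of the "expectation of the oracle" idea (mine, with a made-up membership oracle for the unit ball and an assumed bounding box [-1, 1]^d): a Metropolis random walk targeting the uniform distribution on the box, with the time-average of the oracle estimating vol(K) / vol(box). To be clear, this naive version buys you nothing over independent uniform probes of the box -- the polynomial-time algorithms are much cleverer, running random walks over a sequence of nested bodies -- but it shows what the expectation means operationally.

```python
import numpy as np

# Hypothetical membership oracle: the unit Euclidean ball in R^d.
def oracle(x):
    return float(np.dot(x, x) <= 1.0)

d = 3
half_width = 1.0                      # assume a bounding box [-1, 1]^d is already known
box_volume = (2.0 * half_width) ** d

rng = np.random.default_rng(0)
x = np.zeros(d)                       # toy version: no burn-in is discarded here
total, n_steps = 0.0, 200_000
for _ in range(n_steps):
    proposal = x + rng.normal(scale=0.25, size=d)
    if np.all(np.abs(proposal) <= half_width):   # Metropolis step for the uniform target:
        x = proposal                             # accept if inside the box, else stay put
    total += oracle(x)

# E[oracle] under uniform-on-box is vol(K) / vol(box).
print("estimated volume:", box_volume * total / n_steps)   # unit ball in R^3: ~4.19
```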
Yes, what you just wrote is evident. I'm saying, in response to your assertion above, that it is not trivial to derive a useful result about approximating the volume of convex sets with MCMC, and not at all obvious that anything smarter than independent probes will be useful. I happened to read the original 1991 paper on this topic during my PhD, so I'm still vaguely familiar with it.
To clarify: It's clear that you can approximate a convex volume with exponentially-many independent (random or deterministic) probes. That's therefore not an interesting problem.
Surprisingly: there is NO deterministic algorithm that can solve this problem in polynomial time. But, surprise #2: a randomized algorithm using MCMC can. Pretty cool.
So, this is the relevance of MCMC to the convex volume problem. Used cleverly, it can take you from exponentially-many probes to polynomial.
And in particular: You talk about "creating a bounding box" for the reference Lebesgue measure -- saying that this first step is easy is basically assuming the whole problem is solved. Because: How do you create a bounding box around a convex set in R^d, with polynomially-many probes, so that the error is bounded? You can do it with exponentially-many probes (just go out along each orthogonal axis). But that's not interesting -- if you have that many probes, forget about MCMC, just use a deterministic algorithm.
I'm a scientist who uses MCMC to sample the parameter space of a model and dataset, to get some idea of the uncertainties on the model's parameters. These parameters are all continuous (albeit at the finite resolution of floating-point numbers). Can someone from a CS background explain how this set approach connects to my idea of MCMC?
For any scientists, I'd heartily recommend emcee [1] as a sampler. It uses the Goodman-Weare affine-invariant ensemble sampler instead of plain Metropolis-Hastings, which works a lot better in practice, without having to figure out proposal distributions and so on.
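Basic usage looks roughly like this (a sketch from memory of the emcee 3 API, with a toy Gaussian log-probability standing in for your model's log-posterior):

```python
import numpy as np
import emcee

# Stand-in for log(likelihood * prior) of a real model.
def log_prob(theta):
    return -0.5 * np.sum(theta ** 2)

ndim, nwalkers = 3, 32
p0 = np.random.randn(nwalkers, ndim)            # initial positions of the walkers

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob)
sampler.run_mcmc(p0, 5_000)

# Discard burn-in and flatten the walkers into one set of posterior samples.
samples = sampler.get_chain(discard=1_000, flat=True)
print(samples.shape)                            # (nwalkers * 4000, ndim)
```

Aside from the number of walkers and their starting positions, there isn't much to tune, which is the point being made above.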
This is something addressed by probability theory, not CS. While MCMC is easier to explain and understand for discrete spaces, it can also be used for continuous spaces. As a non-mathematician I just think about this as "discretizing at an infinitely high resolution", but of course to an actual mathematician/statistician this is nonsense.
Let's plug a bunch of numbers into a function and sum them all up. That sum, suitably scaled, approximates the integral of the function we plugged all the numbers into. That's the Monte Carlo method part.
To get the Markov chain part, we randomly propose each new number based on the one we just used, and plug that back into the function.
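In code, that two-part description looks roughly like this (my own toy example; the integral of sin(x)^2 over [0, pi] stands in for whatever you actually care about):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo part: average function values at random points to estimate an integral.
# Example: integral of sin(x)^2 over [0, pi], whose true value is pi/2 ~ 1.5708.
n = 100_000
xs = rng.uniform(0.0, np.pi, size=n)
print(np.pi * np.mean(np.sin(xs) ** 2))

# Markov chain part: instead of drawing each x independently as above, each new x
# is proposed from the previous one (and accepted or rejected), so the points form
# a chain whose long-run distribution is the one you want to integrate against.
```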
Ok, instead of reading a Wikipedia page or two, or the most elementary chapter in a probability/statistics textbook and gaining a bunch more skills (among them the tools needed to reason about whether your MCMC is doing anything close to reproducing the original distribution), I can peruse a huge blog post filled with patronizing language that doesn't even have the Metropolis-Hastings algorithm written out in pseudocode. Got it, this is what progress looks like.
Clearly, a blog-post primer is not an alternative to an encyclopedia article or a textbook chapter. Rather, it provides something neither of those fulfills: a brief introduction to a technical subject that relies on very little prior knowledge.
I am sure that Jeremy is well aware of the challenges involved in writing in a style that is straightforward without being "patronizing". Perhaps you don't like the balance that he has found, but is that a good enough reason to outright dismiss his efforts?
Lastly, much of the post is in fact a plain-language explanation of the Metropolis-Hastings algorithm, and he adds: "... if demand is popular enough, I could implement the Metropolis-Hastings algorithm in code"
I was really surprised to view the comments, and find the attitude expressed in the GP's post #1.
If anything, I read this and thought something along the lines of, "I wish some of the subjects I learned in college were explained this way, at least to start."
Now, I fully appreciate understanding fundamentals and not black-boxing things (see the ongoing framework debate posts); however, I didn't read that article looking to be spoon-fed a black box for ML or anything... I was just bored and curious. So it made for good reading.
Perhaps I'm generalizing too far, but it reminded me of when I was presented with the classic Lamport paper on "Time, Clocks, and the Ordering of Events in a Distributed System" in undergrad. This would have been around 2004. I had been working (i.e. coding for money) for some time by then, but back then, you could still build a lot with a relatively stock LAMP stack (or similar.) MongoDB, Riak, Redis, Cassandra, DynamoDB, etc. were not all commonplace (much less things you could have a cluster ready to play with by means of a simple 'vagrant up'.) So when I was given the paper, without much context, I had a hard time really "getting it", or how it applied to me. I'm pretty sure I eventually got some commentary on it from the professor in class, while hurriedly going through powerpoint slides, probably anxious to get back to his research. Then a test question or two about it later.
Years later, after working more and after "distributed systems" had become more of a tangible thing, I revisited it, and it made more sense. Anyway, my point is that had I gotten some sort of a "and here's what the paper is talking about, and how it applies to something tangible, in plainspeak", it would have been useful - so we shouldn't groan about an article like this. Someone out there is going to understand the subject much better thanks to that post.
[Note to the reader that's very familiar with the vector clocks paper - I realize it's not written with that much jargon, and it DOES do a decent job in the introduction section of using plain language, but the purpose was just lost on me. I mean, it makes reference to ARPA Net, and then goes on to determining event ordering. It just didn't equate to anything in my worldview, and I had done a decent bit of multi-threaded/process programming by then too.]
It would get a better reception without the intro paragraph, and if it were titled "An Introduction to Metropolis-Hastings." The post really doesn't even give you a taste of the most common MCMC algorithms, and what they are useful for.
For instance, sampling from complicated Bayesian models and Gibbs sampling are the two most important things to introduce to roughly cover MCMC, and those are in the paragraph he calls "bullshit" because he doesn't know the words.
Sometimes I get the idea that people in CS want to "get all this machine learning/data science stuff" without understanding the principles.
Sure, it's trivially easy to build a classifier in R/Scikit (or even code one up from scratch), so it may seem like it's no big deal. Similarly, it's trivially easy to hack up an inefficient SQL query without knowledge of relational algebra or normal forms. But I would argue that some basic knowledge is essential in applying these things.
Not at all. People want easy things. Just look at all the companies that provide "Easy A/B testing" or "Easy lead scoring" via super-complicated-data-science-stuff-we-will-explain-in-1000-words-or-less-blogposts.
It doesn't matter that the results are absolute junk. Customers pay money so they can feel good. And it's easy.