Darts, Dice, and Coins: Sampling from a Discrete Distribution (2011)

benfrederickson · on March 13, 2018

Alias tables are pretty cool, I wrote an interactive visualization of how they get built as part of this post a while ago: http://engineering.flipboard.com/2017/02/storyclustering . We used alias tables with MCMC and the hastings metropolis test to build a super fast LDA.

Also worth reading up on are sum-heaps. Alias tables are O(1) to sample from but O(n) to build/modify. Sum-heaps let you modify in O(log(n)) at the cost of sampling in O(log(n)) as well. A good writeup is here: https://timvieira.github.io/blog/post/2016/11/21/heaps-for-i...

jasonl99 · on March 13, 2018

There's a really cool way to do this that I learned about with ruby here https://gist.github.com/O-I/3e0654509dd8057b539a

Here's a quick demo class that shows the technique. It's amazingly simple.

class Demo EXAMPLE = { "75%" => 0.75, "15%" => 0.15, "9%" => 0.09, "1%" => 0.01 }

  def self.sample(choices = EXAMPLE)
    choices.max_by { |_, weight| rand ** (1.0 / weight) }.first
  end

  def self.show(samples: 10000)
    items = samples.times.map {sample}
    items.each_with_object(Hash.new(0)) do |item, counts|
      counts[item]+=1
    end.sort_by(&:last).reverse.to_ h
  end

end

# Demo.show # => {"75%"=>7480, "15%"=>1525, "9%"=>901, "1%"=>94}

Edit: Sorry for the bad formatting. I added a gist:

https://gist.github.com/jasonl99/25d3c922d73f10a75fe228c2de3...

gjm11 · on March 13, 2018

The code's nice and short, but note that if you are choosing from a large collection it's really inefficient: in order to generate a single random element it needs to do some computation for every element in the collection. That is, its runtime is O(n).

(For choosing from a small number of things this code has the advantage that it's not doing a lot of complicated stuff in Ruby, and in any case the speed probably doesn't matter too much. In fact, even when choosing from a large enough number of things to make this much slower than it need be, the speed may well not matter at all. But it's worth being aware.)

nayuki · on March 12, 2018

The random sampling algorithms described on the page are very closely related to the technique of https://en.wikipedia.org/wiki/Arithmetic_coding in data compression.

gmueckl · on March 13, 2018

This page has helped me a lot. I have used that alias table algorithm various times with very good results.