Hacker News new | past | comments | ask | show | jobs | submit login
Darts, Dice, and Coins: Sampling from a Discrete Distribution (2011) (keithschwarz.com)
67 points by elasticdog on March 12, 2018 | hide | past | favorite | 5 comments



Alias tables are pretty cool, I wrote an interactive visualization of how they get built as part of this post a while ago: http://engineering.flipboard.com/2017/02/storyclustering . We used alias tables with MCMC and the hastings metropolis test to build a super fast LDA.

Also worth reading up on are sum-heaps. Alias tables are O(1) to sample from but O(n) to build/modify. Sum-heaps let you modify in O(log(n)) at the cost of sampling in O(log(n)) as well. A good writeup is here: https://timvieira.github.io/blog/post/2016/11/21/heaps-for-i...


There's a really cool way to do this that I learned about with ruby here https://gist.github.com/O-I/3e0654509dd8057b539a

Here's a quick demo class that shows the technique. It's amazingly simple.

class Demo EXAMPLE = { "75%" => 0.75, "15%" => 0.15, "9%" => 0.09, "1%" => 0.01 }

  def self.sample(choices = EXAMPLE)
    choices.max_by { |_, weight| rand ** (1.0 / weight) }.first
  end

  def self.show(samples: 10000)
    items = samples.times.map {sample}
    items.each_with_object(Hash.new(0)) do |item, counts|
      counts[item]+=1
    end.sort_by(&:last).reverse.to_ h
  end
end

# Demo.show # => {"75%"=>7480, "15%"=>1525, "9%"=>901, "1%"=>94}

Edit: Sorry for the bad formatting. I added a gist:

https://gist.github.com/jasonl99/25d3c922d73f10a75fe228c2de3...


The code's nice and short, but note that if you are choosing from a large collection it's really inefficient: in order to generate a single random element it needs to do some computation for every element in the collection. That is, its runtime is O(n).

(For choosing from a small number of things this code has the advantage that it's not doing a lot of complicated stuff in Ruby, and in any case the speed probably doesn't matter too much. In fact, even when choosing from a large enough number of things to make this much slower than it need be, the speed may well not matter at all. But it's worth being aware.)


The random sampling algorithms described on the page are very closely related to the technique of https://en.wikipedia.org/wiki/Arithmetic_coding in data compression.


This page has helped me a lot. I have used that alias table algorithm various times with very good results.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: