
Doesn't your implementation require you to fetch all the data into the client process for Ruby to iterate over?

Isn't moving that much data, when you have billions of rows, simply impossible in a reasonable amount of time?




This implementation is useful at some scales, but not all: it uses find_each to iterate over the ActiveRecord scope, which is slower than doing the sampling entirely in the database (when that's possible), but still far better than instantiating every record in a large table at once.

(find_each instantiates records in small batches, then discards them)
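For the curious, here's a minimal sketch of what that looks like as reservoir sampling (Algorithm R) over an ActiveRecord scope. The User model and sample size are illustrative, not the post's actual code:

    def reservoir_sample(scope, k)
      sample = []
      # find_each without a block returns an Enumerator that loads
      # records in batches (1000 by default), keeping memory bounded.
      scope.find_each.with_index do |record, i|
        if i < k
          sample << record              # fill the reservoir first
        else
          j = rand(i + 1)               # uniform integer in 0..i
          sample[j] = record if j < k   # replace with probability k/(i+1)
        end
      end
      sample
    end

    picks = reservoir_sample(User.all, 10)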


it will, in general, still instantiate all N records and keep only n of them (N = total records, n = sample size, so a fraction (N - n)/N is instantiated and thrown away). That is still too much data to fetch to the client for any large data set.


You don't have to instantiate the records; that's just the way I did it here. For example, you could do two passes: one to randomly select IDs, and one to fetch the small number of records for the sample set. That's two passes over the data, so still O(N), but better than what most databases will do for an ORDER BY RANDOM() query, which typically materializes and sorts every row.
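A hedged sketch of that two-pass idea, assuming a User model and using in_batches so the first pass streams only primary keys (one integer column per row) rather than full records:

    def sample_ids(scope, k)
      reservoir = []
      seen = 0
      # Pass 1: stream only the IDs, reservoir-sampling k of them.
      scope.in_batches(of: 10_000) do |batch|
        batch.pluck(:id).each do |id|
          if seen < k
            reservoir << id
          else
            j = rand(seen + 1)
            reservoir[j] = id if j < k
          end
          seen += 1
        end
      end
      reservoir
    end

    # Pass 2: fetch only the k winning rows.
    sampled = User.where(id: sample_ids(User.all, 10))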

There's also another, lesser-known variant of this algorithm that alleviates the concern; I'll go into it in a future post.



