
Doesn't your implementation require you to fetch all the data into the client process for Ruby to iterate over?

Isn't moving that much data, when you have billions of rows, simply impossible in a reasonable amount of time?




This implementation is useful at some scales, but not all: it uses find_each to iterate over the ActiveRecord scope, which is slower than doing the sampling entirely in the database (when that's possible), but still far better than instantiating every record in a large table at once.

(find_each instantiates records in small batches, then discards them)
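For the curious, here's a minimal sketch of what that looks like as reservoir sampling (Algorithm R) over an ActiveRecord scope. The User model and sample size are illustrative, not the post's actual code:

    def reservoir_sample(scope, k)
      sample = []
      # find_each without a block returns an Enumerator that loads
      # records in batches (1000 by default), keeping memory bounded.
      scope.find_each.with_index do |record, i|
        if i < k
          sample << record              # fill the reservoir first
        else
          j = rand(i + 1)               # uniform integer in 0..i
          sample[j] = record if j < k   # replace with probability k/(i+1)
        end
      end
      sample
    end

    picks = reservoir_sample(User.all, 10)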


it will, in general, still instantiate all N records and keep only n of them (N = total records, n = sample size, so a fraction (N - n)/N is instantiated and thrown away). That is still too much data to fetch to the client for any large data set.


You don't have to instantiate the records; that's just the way I did it here. For example, you could do two passes: one to randomly select IDs, and one to fetch the small number of records for the sample set. That's two passes over the data, so still O(N), but better than what most databases will do for an ORDER BY RANDOM() query, which typically materializes and sorts every row.
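A hedged sketch of that two-pass idea, assuming a User model and using in_batches so the first pass streams only primary keys (one integer column per row) rather than full records:

    def sample_ids(scope, k)
      reservoir = []
      seen = 0
      # Pass 1: stream only the IDs, reservoir-sampling k of them.
      scope.in_batches(of: 10_000) do |batch|
        batch.pluck(:id).each do |id|
          if seen < k
            reservoir << id
          else
            j = rand(seen + 1)
            reservoir[j] = id if j < k
          end
          seen += 1
        end
      end
      reservoir
    end

    # Pass 2: fetch only the k winning rows.
    sampled = User.where(id: sample_ids(User.all, 10))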

There's also another, lesser-known variant of this algorithm that alleviates the concern; I'll go into it in a future post.



