> “Which pages are getting an unusual hit in the last 30 minutes?”
A good statistic for this is an exponentially weighted moving average, which can be computed online by a leaky integrator using a single accumulator register per counted entity.
It's not so much counting as estimating a rate.
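A minimal sketch of that per-entity leaky integrator, assuming Python and a 30-minute half-life (the class and method names here are made up for illustration):

```python
import time
from collections import defaultdict

class LeakyRateEstimator:
    """One decaying accumulator per key: an exponentially weighted hit rate."""

    def __init__(self, half_life_seconds=1800.0):  # assumed 30-minute half-life
        self.half_life = half_life_seconds
        # key -> (accumulated value, timestamp of last update)
        self.acc = defaultdict(lambda: (0.0, time.time()))

    def hit(self, key, now=None):
        now = time.time() if now is None else now
        value, last = self.acc[key]
        # Decay the old value for the elapsed time, then add the new hit.
        decay = 0.5 ** ((now - last) / self.half_life)
        self.acc[key] = (value * decay + 1.0, now)

    def score(self, key, now=None):
        now = time.time() if now is None else now
        value, last = self.acc[key]
        return value * 0.5 ** ((now - last) / self.half_life)
```

Calling hit(page) on every request and reading score(page) gives exactly one decayed counter per page.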
> a single accumulator register per counted entity.
That requires O(n) memory space, where n is the number of counted entities. If you're willing to spend this much memory, the problem is easy. Algorithms like the one in the article (or the paper I mentioned) are for when you want to use far less memory than that.
Count-min sketches can be adapted to use exponential decay. Presumably any sort of recurrent decay could be used to get approximately rectangular windows as well.
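One way that adaptation could look, with the decay applied lazily to each touched cell (the class name, default parameters, and tuple-hashing trick are my own choices, not from any particular implementation):

```python
import random

class DecayingCountMinSketch:
    """Count-min sketch whose counters decay with a fixed half-life."""

    def __init__(self, width=2048, depth=4, half_life=1800.0):
        self.width, self.depth, self.half_life = width, depth, half_life
        self.counts = [[0.0] * width for _ in range(depth)]
        self.stamps = [[0.0] * width for _ in range(depth)]  # last update time per cell
        self.seeds = [random.randrange(1 << 31) for _ in range(depth)]

    def _cells(self, item):
        for row, seed in enumerate(self.seeds):
            yield row, hash((seed, item)) % self.width

    def _decayed(self, row, col, now):
        dt = now - self.stamps[row][col]
        return self.counts[row][col] * 0.5 ** (dt / self.half_life)

    def add(self, item, now, weight=1.0):
        for row, col in self._cells(item):
            # Bring the cell forward to "now", then add the new observation.
            self.counts[row][col] = self._decayed(row, col, now) + weight
            self.stamps[row][col] = now

    def estimate(self, item, now):
        # Standard count-min query: take the minimum over the rows,
        # which over-counts only by hash-collision noise.
        return min(self._decayed(row, col, now) for row, col in self._cells(item))
```

Memory stays fixed at width × depth cells no matter how many distinct items appear, which is the point of using a sketch here.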
Why not store the items in a random order? This can be achieved by performing a sort using a random key. Sorting can be done as a background job, so that the list is always "reasonably" random.
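For illustration, a "sort using a random key" is a one-liner in Python (the counts here are placeholder data):

```python
import random

counts = {"Red": 45, "Blue": 23, "Green": 7}

# A background job could periodically recompute a shuffled display order:
display_order = sorted(counts, key=lambda _: random.random())
```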
The items are stored in a map in random order (Red -> 45, Blue -> 23, etc.). The picture makes it easier to read by showing many red boxes, but the actual implementation just stores a single number for each item label.
My understanding is that HyperLogLog will estimate the cardinality, i.e., the number of distinct types of events in the stream. The frequency counting algorithms in the article can be used to estimate the number of occurrences of each of the most frequently occurring types.
For example, where the frequency counting algorithms might estimate the following counts for a given stream:
A: 2, B: 5, C: 1
...HyperLogLog would estimate that the stream contains three distinct types of events (A, B, and C, but you only get the count: 3). Presumably, HyperLogLog requires less time and space to achieve this more summarized perspective.
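To make the distinction concrete, here are exact versions of both answers computed naively; HyperLogLog and the frequency-counting sketches approximate these with far less memory:

```python
from collections import Counter

stream = ["B", "A", "B", "C", "B", "A", "B", "B"]

# Per-item frequencies (what the article's algorithms approximate): B: 5, A: 2, C: 1
print(Counter(stream))

# Cardinality (what HyperLogLog approximates): just the single number 3
print(len(set(stream)))
```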
This 2002 paper describes a data structure for this that I suspect is superior to what is described in the article: https://www.cs.rutgers.edu/~farach/pubs/FrequentStream.pdf
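Assuming that link is the Charikar/Chen/Farach-Colton "Finding Frequent Items in Data Streams" paper, its central structure is the count sketch. A minimal illustrative version (hash each item to one cell per row with a random +/-1 sign, and estimate with the median of the signed cells):

```python
import random
import statistics

class CountSketch:
    """Each item hits one cell per row with a random +/-1 sign; the median of
    the signed cell values estimates the item's frequency."""

    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.seeds = [(random.randrange(1 << 31), random.randrange(1 << 31))
                      for _ in range(depth)]

    def _cells(self, item):
        for row, (bucket_seed, sign_seed) in enumerate(self.seeds):
            col = hash((bucket_seed, item)) % self.width
            sign = 1 if hash((sign_seed, item)) % 2 else -1
            yield row, col, sign

    def add(self, item, count=1):
        for row, col, sign in self._cells(item):
            self.table[row][col] += sign * count

    def estimate(self, item):
        return statistics.median(sign * self.table[row][col]
                                 for row, col, sign in self._cells(item))
```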