
That's a horrible pattern. As soon as your working set is one entry larger than your cache size in a hot loop, you'll have a 100% miss rate (assuming each element is loaded once per pass through the loop).



You don't use this pattern in a hot loop. You use it when memoizing results that you don't want to compute again.

Here's an example, in the Python package wordfreq [1]. When you look up word frequencies, there's some normalization it has to do to your text. Some words are very common, and it would be silly to normalize them repeatedly. So frequency lookups are cached.

The cache dictionary has a maximum size of 100,000, so that exceptional or malicious cases don't use up all your memory. It is extremely rare for text to contain 100,000 distinct words, but if that happens, the cache gets full and the entire thing is dropped. It will quickly be re-populated with common words, of course.
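Roughly, the pattern looks like this as a decorator (a sketch, not wordfreq's actual code; `normalized_frequency` here is just a placeholder for the real normalize-then-look-up step):

    from functools import wraps

    def clear_when_full(max_size=100_000):
        """Memoize with a plain dict; when it reaches max_size, drop the
        whole thing rather than evicting entries one at a time."""
        def decorator(func):
            cache = {}

            @wraps(func)
            def wrapper(*args):
                if args in cache:
                    return cache[args]
                if len(cache) >= max_size:
                    cache.clear()  # rare in practice; common keys repopulate quickly
                result = func(*args)
                cache[args] = result
                return result

            return wrapper
        return decorator

    @clear_when_full(max_size=100_000)
    def normalized_frequency(word, lang="en"):
        # placeholder for the real normalize-then-look-up work
        return len(word.casefold()) / 100.0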

Yes, you can make this perform horribly by looking up frequencies of ["zero", "one", "two", "three", "four", ..., "one hundred thousand"] in a loop. That's not realistic data.

Do you actually have a suggestion for how to do this faster? We benchmarked this against other cache strategies (with smaller cache sizes so it would drop sometimes) on realistic data. It's much faster than Python's LRU decorator, for example.
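If you want to compare the approaches yourself, something like this works, reusing `normalized_frequency` from the sketch above (a rough illustration, not the benchmark we actually ran):

    import timeit
    from functools import lru_cache

    @lru_cache(maxsize=100_000)
    def lru_frequency(word, lang="en"):
        # same placeholder work, cached with strict LRU instead
        return len(word.casefold()) / 100.0

    # realistic text is mostly repeats of a small set of common words
    words = ["the", "of", "and", "a", "to", "in", "is", "you", "that", "it"] * 100_000

    print(timeit.timeit(lambda: [normalized_frequency(w) for w in words], number=1))
    print(timeit.timeit(lambda: [lru_frequency(w) for w in words], number=1))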

[1] https://github.com/LuminosoInsight/wordfreq/blob/master/word...


Strict LRU has the exact same problem.

That's one reason the article was advocating for 2-Random, since it has more gradual performance degradation when the data set gets too large.
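For the curious, 2-random eviction is easy to sketch (an illustrative implementation, not the article's code): when the cache overflows, pick two entries at random and evict whichever was used less recently.

    import random

    class TwoRandomCache:
        """Evict by sampling two random keys and dropping the less
        recently used one. Assumes max_size >= 2."""

        def __init__(self, max_size):
            self.max_size = max_size
            self.data = {}       # key -> value
            self.last_used = {}  # key -> logical timestamp
            self.clock = 0

        def get(self, key, default=None):
            self.clock += 1
            if key in self.data:
                self.last_used[key] = self.clock
                return self.data[key]
            return default

        def put(self, key, value):
            self.clock += 1
            if key not in self.data and len(self.data) >= self.max_size:
                a, b = random.sample(list(self.data), 2)
                victim = a if self.last_used[a] <= self.last_used[b] else b
                del self.data[victim], self.last_used[victim]
            self.data[key] = value
            self.last_used[key] = self.clock

Because eviction is partly random, a working set slightly larger than the cache degrades gradually instead of hitting a 100% miss cliff.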



