When he talks about performance, he mentions that memoization keeps performance from suffering very much, but what he doesn't mention is that almost the entire performance gain comes from memoizing frequencies... which stores the hashtables anyway; it just does it behind the scenes. That's fine for CPU usage, but it carries a significant storage penalty depending on the characteristics of the input.
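To make that concrete, here's a minimal sketch of the hidden cost, assuming the post memoizes `frequencies` directly (the exact names in the original may differ). The memo cache maps argument to result, so every distinct word list gets retained as a cache key right next to the frequency hashtable built from it:

```clojure
;; Assumed setup: frequencies is memoized directly, as the post seems to do.
;; The memo cache keeps argument -> result pairs, so each distinct word
;; list stays alive as a key alongside its frequency map.
(def freqs (memoize frequencies))

(freqs ["the" "cat" "sat" "on" "the" "mat"])
;;=> {"the" 2, "cat" 1, "sat" 1, "on" 1, "mat" 1}
;; The input vector now lives in the cache too, not just the map.
```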
There's some low-hanging fruit to cut down on the required storage, namely wrapping to-words and frequencies together, and memoizing that. This doesn't reduce the number of hashtables, but at least there's no need to store the entire word lists. As it stands, this will simply fail if your corpus is too large to fit in your RAM all at once.
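Something like this rough sketch, where the `to-words` stand-in and the `doc->freqs` name are mine rather than from the post:

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the post's to-words: split a document
;; into lowercase words. The real definition may differ.
(defn to-words [doc]
  (re-seq #"\w+" (str/lower-case doc)))

;; Wrap the two steps and memoize the composition. The cache now holds
;; one frequency map per document, and the intermediate word list can be
;; garbage-collected as soon as the map is built instead of sitting in
;; the cache as a key.
(def doc->freqs
  (memoize (comp frequencies to-words)))
```

Note the cache is still keyed by whatever you pass in (here the raw document), so if even the documents are too big to keep around you'd want to key on a document id instead.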
Right, I realized after writing that comment that I was forgetting about that one. Still, memoizing frequencies takes it from O(n^2) to O(n) as it stands, so it's easily the second-largest improvement.