Unfortunately, early design phases made the assumption that geekiness would max out at 100. Erlang violated that assumption, being 117% geeky, causing the score calculation to dump core.
It was deemed safest simply to drop Erlang from the corpus.
Good point. In the case of Erlang, it wasn't a matter of picking favorites so much as that, for whatever reason, Erlang was not particularly well represented in the corpus of 25k comments that we grabbed from the site.
Well, if I were doing it, I would search for words and short phrases that have their own Wikipedia pages. Give more weight to those whose pages contain a lot of text, have inline images, or have particularly contentious editing/reverting/meta-talk.
Likewise, search for those same words and phrases on Google. Weight terms with fewer results more heavily.
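To make that concrete, here is a rough sketch of how the weighting could look, not how the actual tool works. It assumes the per-term numbers (page length, image count, revert count, search hit count) have already been collected somehow; `TermStats`, `term_weight`, `geekiness`, and the particular log scaling and 0.5 factor are all made up for illustration.

```python
import math
from dataclasses import dataclass
from typing import Mapping


@dataclass
class TermStats:
    # Hypothetical, pre-collected numbers for one word or short phrase.
    page_chars: int    # amount of text on the term's Wikipedia page
    image_count: int   # inline images on that page
    revert_count: int  # reverts in the page history (proxy for contentious editing)
    search_hits: int   # number of web search results for the term


def term_weight(s: TermStats) -> float:
    # Longer, more illustrated, more fought-over pages score higher.
    page_score = (
        math.log1p(s.page_chars)
        + 0.5 * s.image_count
        + math.log1p(s.revert_count)
    )
    # Fewer search results means a rarer term, so a larger multiplier.
    rarity = 1.0 / math.log(s.search_hits + math.e)
    return page_score * rarity


def geekiness(text: str, stats: Mapping[str, TermStats]) -> float:
    # Sum the weights of every known word or phrase of up to three tokens.
    # Tokens are split on whitespace only; punctuation is not stripped.
    tokens = text.lower().split()
    total = 0.0
    for n in (1, 2, 3):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in stats:
                total += term_weight(stats[phrase])
    return total
```

The logs are only there so that one enormous page or one very common term does not swamp everything else; any monotonic scaling would do for a first pass.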
Thanks for the Wikipedia suggestion; will try that one out. I am unclear about the terms of use for Google search and whether we can make use of search results in our tool.
And as for the "non-technical" words, was anyone else momentarily confused about the "less technical word like 'war'"? Or maybe I've been in Java land for too long...
CS, Clojure, Debian, Haskell, JavaScript, Python, Rails, Scala, algorithm, compiler, engineer, frameworks, jQuery, macros, open-source, process, servers, stack