Geekiest Hacker News comments from the last month

riffer · on Aug 23, 2010

TLDR version: We took a corpus of 25k comments from HN and analyzed them to infer semantic similarity. Next we picked a seed of 18 words out of the 40k words in our corpus and scored all 40k words by their similarity to those 18. We then scored each of the 25k comments based on the scores assigned to the words it contains. Major weaknesses: [1] words not in the corpus are dropped, so technical text with (relatively) obscure words may not score well, and [2] our seed of 18 words was highly subjective and put together in <5 mins.
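
A rough sketch of that seed-then-score pipeline, for anyone who wants to see the shape of it. Everything concrete here is an assumption for illustration: the toy two-dimensional vectors, the two-word seed, cosine similarity as the similarity measure, and plain averaging for the comment score. The thread does not say which of these the real system uses.

```python
import math

# Toy word vectors standing in for whatever similarity model is really used (assumed).
VECTORS = {
    "compiler": (0.9, 0.1),
    "Haskell": (0.8, 0.2),
    "macros": (0.85, 0.15),
    "war": (0.1, 0.9),
    "movie": (0.2, 0.8),
}
SEED = ["compiler", "Haskell"]  # hypothetical seed; the real one had 18 words


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def word_score(word):
    """Mean similarity of a word to the seed words; None if the word is unknown."""
    if word not in VECTORS:
        return None
    return sum(cosine(VECTORS[word], VECTORS[s]) for s in SEED) / len(SEED)


def comment_score(text):
    """Average score of the known words; unknown words are dropped (weakness [1])."""
    scores = [s for s in (word_score(w) for w in text.split()) if s is not None]
    return sum(scores) / len(scores) if scores else 0.0
```

Because unknown words are simply skipped, `comment_score("compiler xyzzy")` equals `comment_score("compiler")`, which is the first weakness listed above in miniature.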

Also, there are a couple of dynamic links at the bottom of the post, for those who want to play with the mechanics, search for themselves and others, etc.

eitally · on Aug 23, 2010

As an addendum, since none of the results seemed overly technical to me, here are the "technical" words they used to gauge geekiness:

CS, Clojure, Debian, Haskell, JavaScript, Python, Rails, Scala, algorithm, compiler, engineer, frameworks, jQuery, macros, open-source, process, servers, stack

dantheman · on Aug 23, 2010

Ahh, I see they left out Erlang.

Vivtek · on Aug 24, 2010

Unfortunately, early design phases made the assumption that geekiness would max at 100. Erlang violated that assumption, being 117% geeky, causing the score calculation to dump core.

It was deemed safest simply to drop Erlang from the corpus.

riffer · on Aug 23, 2010

Good point, in the case of Erlang it wasn't a picking favorites thing, so much as for whatever reason Erlang was not particularly well represented in the corpus of 25k comments that we grabbed from the site.

Vivtek · on Aug 24, 2010

Honestly - if a particular technical topic is mentioned less that arguably makes it more geeky.

dman · on Aug 23, 2010

On the topic of things being left out: does anyone have ideas on how to automate the process of finding seed terms for a given topic?

Sukotto · on Aug 24, 2010

Well, if I were doing it, I would search for words and short phrases that have their own Wikipedia pages. Give more weight to those whose pages contain a lot of text, have inline images, or have particularly contentious editing/reverting/meta-talk.

Likewise search for those same words and phrases in Google. Weight searches with fewer results more heavily.
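
One way the heuristic above could be turned into a ranking, as a sketch only. The `Candidate` fields, the 0.5 coefficients, and the log scaling are all made up for illustration, and the counts are assumed to be gathered by hand or by some other permitted means; nothing here calls Wikipedia or a search engine.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    term: str
    page_chars: int    # length of the term's Wikipedia article text
    image_count: int   # inline images on that page
    revert_count: int  # rough proxy for contentious editing
    search_hits: int   # reported number of web search results


def seed_weight(c):
    """Higher for rich, contested Wikipedia pages and for rarer search terms."""
    wiki = (math.log1p(c.page_chars)
            + 0.5 * c.image_count
            + 0.5 * math.log1p(c.revert_count))
    rarity = 1.0 / math.log(c.search_hits + 2)  # fewer hits -> larger weight
    return wiki * rarity


def rank(candidates):
    """Candidates sorted from most to least promising seed term."""
    return sorted(candidates, key=seed_weight, reverse=True)
```

The log scaling is there so that one enormous article or one very common term does not swamp every other signal; the exact coefficients would need tuning against a hand-picked seed list.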

dman · on Aug 24, 2010

Thanks for the Wikipedia suggestion - will try that one out. I am unclear about the terms of use for Google search and whether we can make use of search results in our tool.

_b8r0 · on Aug 24, 2010

And ASM. All the cool kids write web apps in x86, except for the really cool 6502 kids in the corner.

tkhoven · on Aug 24, 2010

And as for the "non-technical" words, was anyone else momentarily confused about the "less technical word like 'war'"? Or maybe I've been in java land for too long...

davidw · on Aug 24, 2010

Definitely missing 'monad'. I completely understood the first comment in your list, but the monad stuff always loses me quickly.

Periodic · on Aug 23, 2010

I also just noticed that they score capitalized and uncapitalized versions of their words differently.

For example, "scala" is 67.03/100, while "Scala" is 90.92/100.

riffer · on Aug 23, 2010

Nice catch.

We generally keep the set of words case-sensitive because that can be helpful for disambiguation in broader corpora ("Python" is often something different than "python"), and also because POS taggers tend to have trouble identifying proper nouns if case is not preserved.

But in this situation "scala" and "Scala" definitely should score similarly.
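
A minimal sketch of one possible fix, assuming word scores live in a plain dictionary: fold case and keep the best-scoring variant. The two Scala scores are the ones quoted above; the dictionary layout and the function names are hypothetical.

```python
WORD_SCORES = {"Scala": 90.92, "scala": 67.03}  # scores quoted earlier in the thread


def build_folded_index(scores):
    """Map each lower-cased word to the highest score among its case variants."""
    folded = {}
    for word, score in scores.items():
        key = word.lower()
        folded[key] = max(score, folded.get(key, score))
    return folded


def case_tolerant_score(word, folded):
    """Score for a word regardless of capitalization; None if it is unknown."""
    return folded.get(word.lower())


FOLDED = build_folded_index(WORD_SCORES)
```

This gives up the "Python" versus "python" disambiguation mentioned above, so it only makes sense for a single-topic corpus where the lower-case form almost always means the same thing.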

whimsy · on Aug 24, 2010

I've never seen anyone mention a real python or ruby here. That doesn't mean it doesn't happen, but I think it's a safe assumption just to assume that the typist was lazy.

riffer · on Aug 24, 2010

I didn't think it would ever happen either, and then I saw this: http://news.ycombinator.com/item?id=748632

On the essence, though, you're 100% right.

jurjenh · on Aug 24, 2010

I would go as far as to suggest that mentioning python in this manner (in this community) would in fact be geekier, thus merit a higher score.

Not sure about ruby, but would venture that it wouldn't make much of a difference on the scoring...

JacobAldridge · on Aug 23, 2010

"But in this situation "scala" and "Scala" definitely should score similarly."

Not necessarily. I don't think it's a stretch to assume that those who capitalise Scala might be somewhat more geeky?

nl · on Aug 24, 2010

"I don't think it's a stretch to assume that those who capitalise Scala might be somewhat more geeky?"

Scala should always start with a capital letter (it's a proper noun). People who use the lower case version either don't know, don't care, or are typing on a mobile keyboard.

JacobAldridge · on Aug 24, 2010

"either don't know, don't care, or are typing on a mobile keyboard" and are therefore, I would contend, less 'geeky' than those of us who consistently apply the capital.

I mean, my username isn't particularly geeky, but enforcing the capitalisation must certainly be somewhere down that end of the spectrum!

Zak · on Aug 23, 2010

I think it needs a bit of work. Here's pg's highest-scoring comment from the past month: http://news.ycombinator.com/item?id=1606788 (score: 81.91). It's entirely non-technical, but outranks many highly technical comments, like this one from jacquesm: http://news.ycombinator.com/item?id=1574015

dman · on Aug 23, 2010

Totally agree with it needing more work. This is about a week old at this point and we're working hard to make it better.

pyre · on Aug 23, 2010

Apparently:

  > I think the 'pain' comes when there are issues with the c
  > library that lxml binds to.

Scores 80.99/100

I think that the algorithm needs more work....

Also:

  > I count 8:
  >     Avatar (2009) - Yep
  >     Titanic (1997) - Yep
  >     The Dark Knight (2008) - Yep
  >     Star Wars: Episode IV - A New Hope (1977) - Yep
  >     Shrek 2 (2004) - Nope
  >     E.T.: The Extra-Terrestrial (1982) - Nope
  >     Star Wars: Episode I - The Phantom Menace (1999) - Yep
  >     Pirates of the Caribbean: Dead Man's Chest (2006) - Yep
  >     Spider-Man (2002) - Yep
  >     Transformers: Revenge of the Fallen (2009) - Yep

Scored 74.45/100

docgnome · on Aug 23, 2010

Hrm... It says my geekiest comment was "Any examples?" Doesn't seem very geeky... Maybe I'm not a geek... *has an existential crisis*

jfraser · on Aug 24, 2010

If you just hit 'Calculate Score' on the default 'Enter text ...' phrase, you score 65.08 / 100.

samg_ · on Aug 24, 2010

What algorithm is used to get the centroid clusters? Do you need to know the number of clusters in advance? I am familiar with max-link/min-link/avg-link hierarchical algorithms, but not centroid related ones.

riffer · on Aug 24, 2010

We represent clusters as a type of node. So if your other nodes have coordinates in a space, there will be distances between your clusters. Or they can be part of a graph traversal, etc.

samg_ · on Aug 24, 2010

Can you be more specific about the cluster membership testing? Sure it is based on some "distance" calculation, but how do you avoid long chain problems inherent to clustering algorithms? And to reiterate, do you need to know the number of clusters ahead of time?

riffer · on Aug 25, 2010

send me an email, it's in my profile

thristian · on Aug 24, 2010

So, this is some joke site where it sees you're logged into Hacker News, finds your userID, grabs a random one of your comments and fingers you as being the geekiest geek in geekdom, right?

...right?

dman · on Aug 24, 2010

You just happened to be one of the lucky ones!

dstein · on Aug 24, 2010

These aren't very geeky. I understood almost everything.

I would rather see the complete back-and-forth of the "most geekiest argument on HN".

makmanalp · on Aug 23, 2010

These look just as geeky as any other to me.

Natsu · on Aug 23, 2010

Just out of curiosity, I compared myself to the most notable HN names I could think of.

My highest is 83.79. Compare that to patio11 at 77.89 or Chromatic at 80.90, or pg at 81.91. RiderOfGiraffes manages to top us all, though, with a comment saying "This is a thin veneer on the slide-show already discussed here. [link]" which scores 91.44. The next one after that falls clear down to 76.97, which is itself rather typical (it seems that many of us have a steep drop after one "geekiest" comment).

I'm not convinced that this is a useful metric. It's an interesting experiment, perhaps, but one problem is that "geekiness" isn't necessarily what we want in comments (or perhaps it is necessary, but not sufficient).

I can't believe that short link was RoG's best comment this month, for example. And the top few comments from pg are certainly not his best (the more interesting ones seem to start a bit below the 50s). YMMV.

silentbicycle · on Aug 23, 2010

I got a 98.77 for this one: http://news.ycombinator.com/item?id=1604552

It looks like jibber-jabbering about continuations and call/cc rates highly. I agree that it's probably not a useful metric, but my inner stats nerd is fascinated all the same.

wwortiz · on Aug 23, 2010

My comment, "They took it down because people were defacing it", actually scored 100 without being more than slightly technical.

riffer · on Aug 24, 2010

Yeah, this is an interesting case.

It's all coming from 'defacing', your comment is actually the only one in this corpus of 25k comments that uses that term; probably not a coincidence. Taking a deeper look now, thanks for your help.

dman · on Aug 23, 2010

The current code definitely exhibits a phenomenon where non-technical comments can routinely get a score of ~50. This was cooked up over a couple of days, so we haven't spent much time scaling the final scores so that 0 corresponds to horridly non-technical and 100 to completely technical.

phoenix24 · on Aug 24, 2010

I am just curious: are you planning to release implementation details, or are they already available somewhere?

dman · on Aug 24, 2010

What would you like to know? We could perhaps write a follow-up blog post with additional details.

phoenix24 · on Aug 26, 2010

I really wouldn't know where to start if I ever needed to make a similar application, so I could use it as a starting point.

Maybe if you suggested a general outline of the steps taken and the tools used, that would be a head start.

Thanks a lot.

ritonlajoie · on Aug 24, 2010

How did you gather all the comments?

powrtoch · on Aug 23, 2010

Hopefully no one will take this as a challenge...

sp332 · on Aug 23, 2010

Not too hard. "Scala Scala Scala" is enough to round up to 100.0%.

It did say this (legitimate) comment of mine was 99.57%. http://news.ycombinator.com/item?id=1599910

jerf · on Aug 23, 2010

I'm a little confused by the scoring; I have two comments at "100.00", and one of them is "http://www.retrologic.com/jargon/H/hacker-humor.html Not that I want HN to become a joke-a-day site but accidental-hacker-humor like this is an unusual enough find that I can deal with the rare exception."

The other one is more legitimately geeky, but I'm lost on what 100.00 means exactly.

(Edit: Must be "hacker", but I'm still curious what the 100.00 means.)

riffer · on Aug 23, 2010

"hacker" and also "exception" score really well.

There aren't any words like "war" or "contrary" that are far away from the seed and would indicate that some non-technical meaning should be inferred for "exception".

On the "100" ... that is just a z-score scaled to a 0-100 range, with capping at 100 rather than making it approach asymptotically.
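
To make the capping point concrete, here is a sketch of both options. The `center` and `spread` constants and the choice of a logistic curve for the asymptotic variant are assumptions for illustration; the thread does not give the actual scaling.

```python
import math
import statistics


def z_scores(raw):
    """Standardize raw scores (assumes they are not all identical)."""
    mean = statistics.mean(raw)
    sd = statistics.pstdev(raw)
    return [(x - mean) / sd for x in raw]


def capped_score(z, center=50.0, spread=25.0):
    """Linear map of a z-score onto 0-100, clipped hard at both ends."""
    return max(0.0, min(100.0, center + spread * z))


def asymptotic_score(z):
    """Logistic alternative: approaches 100 asymptotically instead of clipping."""
    return 100.0 / (1.0 + math.exp(-z))
```

With hard capping, every comment whose z-score is past the cutoff lands on exactly 100.00, which would explain several unrelated comments tying at the top; the logistic version keeps them distinguishable at the cost of compressing the high end.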

riffer · on Aug 23, 2010

Funny that you should try gaming it. This week we're using the platform to do blog comment spam filtering. Is the comment off-topic (or just not on-topic) for the post?

Edit: To clarify, I was just saying that we have spent a lot of time today thinking about how these things can be gamed by evil spammers, not calling anybody or anything off-topic; apologies for any confusion

sp332 · on Aug 23, 2010

Well, it was closely related to the comment I replied to, but that comment was probably off-topic for the thread.

Generally, if you put "sp332" into http://www.swimwithoutgettingwet.com/discourse/hn_comments/ the comments around 75% are all on-topic except #5. I don't understand why #14, "Update: nope." gets a rank of 62.09?

paulgb · on Aug 23, 2010

The challenge is to engineer a non-geeky comment that can stack up against the others.

powrtoch · on Aug 23, 2010

Scala, my pet python, bit the rails of his cage's framework wide open-sources say the process takes stacks of CS (chewing sounds).