Geekiest Hacker News comments from the last month (swimwithoutgettingwet.com)
79 points by riffer on Aug 23, 2010 | 50 comments



TLDR version: We took a corpus of 25k comments from HN, analyzed them to infer semantic similarity. Next we came up with a seed of 18 words out of the 40k words in our corpus, scored the corpus of 40k words based on similarity with those 18 words. We then analyzed the 25k comments to score them based on the scores assigned to the 40k words. Major weaknesses: [1] words not in the corpus are dropped, so technical text with (relatively) obscure words may not score well, and [2] our seed of 18 words was highly subjective and put together in <5 mins.

Also, there are a couple of dynamic links at the bottom of the post, for those who want to play with the mechanics, search for themselves and others, etc.
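
A rough sketch of the pipeline described in the TLDR, not the authors' actual code. The 2-d vectors are invented toy values, and both the cosine-similarity measure and the plain averaging are assumptions made for illustration; the thread only says that words are scored by similarity to the seed and that out-of-corpus words are dropped.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score_words(vectors, seed):
    """Score every word by its mean similarity to the seed words (assumed rule)."""
    seed_vecs = [vectors[w] for w in seed if w in vectors]
    if not seed_vecs:
        raise ValueError("none of the seed words are in the corpus")
    return {
        word: sum(cosine(vec, s) for s in seed_vecs) / len(seed_vecs)
        for word, vec in vectors.items()
    }

def score_comment(text, word_scores):
    """Average the scores of known words; words outside the corpus are dropped."""
    known = [word_scores[t] for t in text.split() if t in word_scores]
    return sum(known) / len(known) if known else None

# Toy vectors, made up for this example only.
vectors = {
    "compiler": (1.0, 0.1),
    "Haskell": (0.9, 0.2),
    "macros": (0.8, 0.3),
    "war": (0.1, 1.0),
}
word_scores = score_words(vectors, seed=["compiler", "Haskell"])
```

With this setup, `score_comment("Haskell macros xyzzy", word_scores)` ignores "xyzzy" entirely, which is weakness [1] above: an obscure technical term contributes nothing to the score.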


as an addendum, since none of the results seemed overly technical to me, here are the "technical" words they used to gauge geekiness:

CS, Clojure, Debian, Haskell, JavaScript, Python, Rails, Scala, algorithm, compiler, engineer, frameworks, jQuery, macros, open-source, process, servers, stack


Ahh, I see they left out Erlang.


Unfortunately, early design phases made the assumption that geekiness would max at 100. Erlang violated that assumption, being 117% geeky, causing the score calculation to dump core.

It was deemed safest simply to drop Erlang from the corpus.


Good point. In the case of Erlang it wasn't a matter of picking favorites so much as that, for whatever reason, Erlang was not particularly well represented in the corpus of 25k comments that we grabbed from the site.


Honestly, if a particular technical topic is mentioned less, that arguably makes it more geeky.


On the topic of things being left out: does anyone have ideas on how to automate the process of finding seed terms for a given topic?


Well, if I were doing it, I would search for words and short phrases that have their own Wikipedia pages. Give more weight to those whose pages contain a lot of text, have inline images, or have particularly contentious editing/reverting/meta-talk.

Likewise search for those same words and phrases in Google. Weight searches with fewer results more heavily.
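
One way this heuristic could be sketched, assuming the page length and search hit counts have already been collected by hand. The function names, the log-based formula, and the numbers are all hypothetical; no Wikipedia or Google API is called here.

```python
import math

def seed_weight(page_chars, search_hits):
    """Longer pages raise the weight; more search hits lower it (assumed formula)."""
    length_bonus = math.log1p(page_chars)
    rarity_penalty = math.log(2 + search_hits)  # always positive
    return length_bonus / rarity_penalty

def rank_candidates(stats):
    """Order candidate terms from most to least promising as seed words."""
    return sorted(stats, key=lambda term: seed_weight(*stats[term]), reverse=True)

# (page length in characters, search result count); invented numbers.
candidates = {
    "monad": (40_000, 2_000_000),
    "process": (40_000, 900_000_000),
    "stack": (5_000, 900_000_000),
}
ranking = rank_candidates(candidates)
```

The logs keep a single huge hit count from swamping everything else; inline images or edit-war activity could be folded in as extra bonus terms in the same way.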


Thanks for the Wikipedia suggestion, will try that one out. I am unclear about the terms of use for Google search and whether we can make use of search results in our tool.


And ASM. All the cool kids write web apps in x86, except for the really cool 6502 kids in the corner.


And as for the "non-technical" words, was anyone else momentarily confused about the "less technical word like 'war'"? Or maybe I've been in java land for too long...


Definitely missing 'monad'. I completely understood the first comment in your list, but the monad stuff always loses me quickly.


I also just noticed that they score capitalized and uncapitalized versions of their words differently.

For example, "scala" is 67.03/100, while "Scala" is 90.92/100.


Nice catch.

We generally keep the set of words case-sensitive because that can be helpful for disambiguation in broader corpora ("Python" is often something different than "python"), and also because POS taggers tend to have trouble identifying proper nouns if case is not preserved.

But in this situation "scala" and "Scala" definitely should score similarly.
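
A minimal sketch of one possible fix, assuming a hand-maintained set of words that are safe to case-fold. The `lookup` helper and the `fold_ok` set are hypothetical, and the two "Python"/"python" scores are invented; only the two Scala scores come from this thread.

```python
def lookup(token, scores, fold_ok):
    """Exact-case lookup, except for words listed in fold_ok, which share
    the best score found among their case variants."""
    key = token.lower()
    if key in fold_ok:
        variants = [s for w, s in scores.items() if w.lower() == key]
        if variants:
            return max(variants)
    return scores.get(token)  # None for words not in the corpus

scores = {
    "scala": 67.03,   # from the thread
    "Scala": 90.92,   # from the thread
    "Python": 88.0,   # invented
    "python": 40.0,   # invented
}
fold_ok = {"scala"}  # unambiguous here, so case should not matter

scala_lower = lookup("scala", scores, fold_ok)
python_lower = lookup("python", scores, fold_ok)
```

Here "scala" and "Scala" both resolve to 90.92, while "python" keeps its own lower score because it was left out of `fold_ok`.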


I've never seen anyone mention a real python or ruby here. That doesn't mean it doesn't happen, but I think it's a safe assumption just to assume that the typist was lazy.


I didn't think it would ever happen either, and then I saw this: http://news.ycombinator.com/item?id=748632

In essence, though, you're 100% right.


I would go as far as to suggest that mentioning python in this manner (in this community) would in fact be geekier, thus merit a higher score.

Not sure about ruby, but would venture that it wouldn't make much of a difference on the scoring...


"But in this situation "scala" and "Scala" definitely should score similarly."

Not necessarily. I don't think it's a stretch to assume that those who capitalise Scala might be somewhat more geeky?


> I don't think it's a stretch to assume that those who capitalise Scala might be somewhat more geeky?

Scala should always start with a capital letter (it's a proper noun). People who use the lower case version either don't know, don't care, or are typing on a mobile keyboard.


"either don't know, don't care, or are typing on a mobile keyboard" and are therefore, I would contend, less 'geeky' than those of us who consistently apply the capital.

I mean, my username isn't particularly geeky, but enforcing the capitalisation must certainly be somewhere down that end of the spectrum!


I think it needs a bit of work. Here's pg's highest-scoring comment from the past month: http://news.ycombinator.com/item?id=1606788 (score: 81.91). It's entirely non-technical, but outranks many highly technical comments, like this one from jacquesm: http://news.ycombinator.com/item?id=1574015


Totally agree with it needing more work. This is about a week old at this point and we're working hard to make it better.


Apparently:

  > I think the 'pain' comes when there are issues with the c
  > library that lxml binds to.
Scores 80.99/100

I think that the algorithm needs more work....

Also:

   > I count 8:
  >     Avatar (2009) - Yep
  >     Titanic (1997) - Yep
  >     The Dark Knight (2008) - Yep
  >     Star Wars: Episode IV - A New Hope (1977) - Yep
  >     Shrek 2 (2004) - Nope
  >     E.T.: The Extra-Terrestrial (1982) - Nope
  >     Star Wars: Episode I - The Phantom Menace (1999) - Yep
  >     Pirates of the Caribbean: Dead Man's Chest (2006) - Yep
  >     Spider-Man (2002) - Yep
  >     Transformers: Revenge of the Fallen (2009) - Yep
Scored 74.45/100


Hrm... It says my geekiest comment was "Any examples?" Doesn't seem very geeky... Maybe I'm not a geek... *has an existential crisis*


If you just hit 'Calculate Score' on the default 'Enter text ...' phrase, you score 65.08 / 100.


What algorithm is used to get the centroid clusters? Do you need to know the number of clusters in advance? I am familiar with max-link/min-link/avg-link hierarchical algorithms, but not centroid related ones.


We represent clusters as a type of node. So if your other nodes have coordinates in a space, there will be distances between your clusters. Or they can be part of a graph traversal, etc.


Can you be more specific about the cluster membership testing? Sure it is based on some "distance" calculation, but how do you avoid long chain problems inherent to clustering algorithms? And to reiterate, do you need to know the number of clusters ahead of time?


send me an email, it's in my profile


So, this is some joke site where it sees you're logged into Hacker News, finds your userID, grabs a random one of your comments and fingers you as being the geekiest geek in geekdom, right?

...right?


You just happened to be one of the lucky ones!


These aren't very geeky. I understood almost everything.

I would rather see the complete back-and-forth of the "most geekiest argument on HN".


These look just as geeky as any other to me.


Just out of curiosity, I compared myself to the most notable HN names I could think of.

My highest is 83.79. Compare that to patio11 at 77.89 or Chromatic at 80.90, or pg at 81.91. RiderOfGiraffes manages to top us all, though, with a comment saying "This is a thin veneer on the slide-show already discussed here. [link]" which scores 91.44. The next one after that falls clear down to 76.97, which is itself rather typical (it seems that many of us have a steep drop after one "geekiest" comment).

I'm not convinced that this is a useful metric. It's an interesting experiment, perhaps, but one problem is that "geekiness" isn't necessarily what we want in comments (or perhaps it is necessary, but not sufficient).

I can't believe that short link was RoG's best comment this month, for example. And the top few comments from pg are certainly not his best (the more interesting ones seem to start a bit below the 50s). YMMV.


I got a 98.77 for this one: http://news.ycombinator.com/item?id=1604552

It looks like jibber-jabbering about continuations and call/cc rates highly. I agree that it's probably not a useful metric, but my inner stats nerd is fascinated all the same.


My comment "They took it down because people were defacing it" actually scored 100 without being more than slightly technical.


Yeah, this is an interesting case.

It's all coming from 'defacing': your comment is actually the only one in this corpus of 25k comments that uses that term, which is probably not a coincidence. Taking a deeper look now, thanks for your help.


The current code definitely exhibits a phenomenon where non-technical comments can routinely get a score of ~50. This was cooked up over a couple of days, so we haven't spent much time scaling the final scores so that 0 corresponds to horridly non-technical and 100 to completely technical.


I am just curious: are you planning to release implementation details, or are they already available somewhere?


What would you like to know? We could perhaps write a follow-on blog post with additional details.


I really wouldn't know where to start if I ever needed to make a similar application, so I could use it as a starting point.

Maybe if you suggest a general outline of the steps taken and the tools used, that would be a head start.

Thanks a lot.


How did you gather all the comments?


Hopefully no one will take this as a challenge...


Not too hard. "Scala Scala Scala" is enough to round up to 100.0%.

It did say this (legitimate) comment of mine was 99.57%. http://news.ycombinator.com/item?id=1599910


I'm a little confused by the scoring; I have two comments at "100.00", and one of them is "http://www.retrologic.com/jargon/H/hacker-humor.html Not that I want HN to become a joke-a-day site but accidental-hacker-humor like this is a unusual enough find that I can deal with the rare exception."

The other one is more legitimately geeky, but I'm lost on what 100.00 means exactly.

(Edit: Must be "hacker", but I'm still curious what the 100.00 means.)


"hacker" and also "exception" scores really well

There aren't any words like "war" or "contrary" that are far away from the seed and would indicate that some non-technical meaning should be inferred for "exception".

On the "100" ... that is just a z-score scaled to a 0-100 range, with capping at 100 rather than making it approach asymptotically.
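
To make the capping concrete, here is a sketch of a z-score mapped linearly onto 0-100 and clipped at both ends. The parameter `z_at_100` (the z-score that maps to 100) and the choice of centering z = 0 on 50 are assumptions, not the site's actual constants. A logistic squash is shown for contrast as one possible asymptotic alternative.

```python
import math
import statistics

def scale_capped(raw, z_at_100=2.0):
    """Linear z-score scaling: z = 0 maps to 50, z = z_at_100 maps to 100,
    and anything outside 0..100 is clipped."""
    mean = statistics.fmean(raw)
    stdev = statistics.pstdev(raw)
    if stdev == 0:
        return [50.0 for _ in raw]  # all scores identical
    scaled = []
    for x in raw:
        z = (x - mean) / stdev
        value = 50.0 + 50.0 * z / z_at_100
        scaled.append(min(100.0, max(0.0, value)))
    return scaled

def squash(z):
    """Asymptotic alternative: approaches 100 as z grows instead of capping."""
    return 100.0 / (1.0 + math.exp(-z))

# One outlier among nine zeros has z = 3, which would scale to 125 uncapped.
capped = scale_capped([0.0] * 9 + [100.0])
```

With capping, every comment past the cutoff ties at exactly 100.00, which would explain how two quite different comments can both show "100.00"; the squashed version would keep them distinguishable.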


Funny that you should try gaming it. This week we're using the platform to do blog comment spam filtering. Is the comment off-topic (or just not on-topic) for the post?

Edit: To clarify, I was just saying that we have spent a lot of time today thinking about how these things can be gamed by evil spammers, not calling anybody or anything off-topic; apologies for any confusion


Well, it was closely related to the comment I replied to, but that comment was probably off-topic for the thread.

Generally, if you put "sp332" into http://www.swimwithoutgettingwet.com/discourse/hn_comments/ the comments around 75% are all on-topic except #5. I don't understand why #14, "Update: nope." gets a rank of 62.09?


The challenge is to engineer a non-geeky comment that can stack up against the others.


Scala, my pet python, bit the rails of his cage's framework wide open-sources say the process takes stacks of CS (chewing sounds).



