Hacker News

Why doesn't "Why I Hate Frameworks" appear in your list? From dang's post it looks like that exact title has appeared 8 times?

Possibly related: have you considered making your aggregator ignore punctuation and case in addition to the year indicators?



Not all appearances are front-page appearances. And my very quick & dirty shell one-liner requires precise text matches.

"Why I Hate Frameworks" does appear six times on the front page, though with case and phrasing variants that an exact text match won't pick up:

  Why I Hate Frameworks
  Why I Hate Frameworks
  Why I hate frameworks (2005)
  Why I Hate Frameworks
  [dupe] Why I Hate Frameworks (2005)
  Why I Hate Frameworks (2005)
By my shell one-liner, it would have been counted as appearing four times (instances 1, 2, 4, and 6 above).

I could tweak the script by lowercasing (or title-casing) titles (I've a script that does the latter) and eliminating any instances of "[dupe]". While I'm at it, "[flagged]" and "[dead]" possibly as well, though I think the latter won't appear in the archive. And let's convert all apostrophe variants to a single apostrophe character (', ASCII 0x27).

With those adjustments, I see "Why I Hate Frameworks" appearing six times, as expected, and there are 57 titles appearing 5+ times, rather more than the 39 initially posted above. There are 3,077 repeated front-page titles in total.

The one-liner, FWIW (rearranged to multiple lines):

  egrep '^  Title:' parse.log |    # pull out the tagged title lines
    sed 's/^  Title: //' |
    sed 's/([0-9][0-9][0-9][0-9]) *$//; s/\[dupe\]//; s/\[flagged\]//; s/^  *//; s/  *$//; s/  */ /g' |
    sed "s/’/'/g" |                # normalise curly apostrophes
    titlecase |
    sort |
    uniq -c |                      # count identical titles
    sort -k1nr |                   # most-repeated first
    gawk '$1 > 1' |                # keep only repeats
    cat -n
"titlecase" itself is a sed script that uppercases words, with specific exceptions. I've just augmented it with a few more technology-related terms.
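For the curious, the general shape is something like the following. This is a minimal sketch, not the actual script: the stop-word and exception lists here are purely illustrative, and it assumes GNU sed for the \U/\L case conversions.

```shell
# Sketch of a titlecase filter: capitalize every word, lowercase a short
# stop-word list (never the first word), then restore known all-caps or
# mixed-case terms.  The real script has many more exceptions.
titlecase_sketch() {
  sed -E 's/\b(\w)/\U\1/g;
          s/ (A|An|And|As|At|By|For|In|Is|Of|On|Or|The|To)\b/\L&/g;
          s/\bDna\b/DNA/g; s/\bIphone\b/iPhone/g'
}
printf '%s\n' 'dna seen through the eyes of a coder' | titlecase_sketch
# → DNA Seen Through the Eyes of a Coder
```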

"parse.log" is the output of my first stage of parsing of the HN front page archive, which lists specific story properties in a tagged format, e.g., "Title:" "Date:" "Site:", etc.


And the updated list, 5+ appearances:

       1   14 I'm Peter Roberts, Immigration Attorney Who Does Work for YC and Startups. AMA
       2   11 Richard Feynman and the Connection Machine
       3   10 Openssl Security Advisory
       4   10 The TTY Demystified
       5    8 Why GNU Grep is Fast
       6    7 Cool URIs Don't Change
       7    7 How to Read Mathematics
       8    7 The Architecture of Open Source Applications
       9    7 You and Your Research
      10    6 -2000 Lines of Code
      11    6 A Primer On Bézier Curves
      12    6 Advanced Programming Languages
      13    6 Bit Twiddling Hacks
      14    6 Dictionary of Algorithms and Data Structures
      15    6 How Not to Sort by Average Rating
      16    6 How to Be a Programmer: a Short, Comprehensive, and Personal Summary
      17    6 How to Write a Spelling Corrector
      18    6 Keep Your Identity Small
      19    6 Ten Rules for Web Startups
      20    6 The Bipolar Lisp Programmer
      21    6 The Tao of Programming
      22    6 Why I Hate Frameworks
      23    6 Why Lisp?
      24    5 A Regular Expression to Check for Prime Numbers
      25    5 A Spellchecker Used to Be a Major Feat of Software Engineering
      26    5 Akin's Laws of Spacecraft Design
      27    5 Ask HN: Idea Sunday
      28    5 Ask HN: What Are You Working On?
      29    5 Ask HN: Who's Hiring?
      30    5 Beej's Guide to Network Programming
      31    5 Data Structure Visualizations
      32    5 Dna Seen Through the Eyes of a Coder
      33    5 Game Programming Patterns
      34    5 How Software Companies Die
      35    5 How to Become a Good Theoretical Physicist
      36    5 How to Design Programs, Second Edition
      37    5 Latency Numbers Every Programmer Should Know
      38    5 Learn C and Build Your Own Lisp
      39    5 Learn You a Haskell for Great Good
      40    5 Learning Advanced Javascript
      41    5 Let's Build a Compiler
      42    5 Notation As a Tool of Thought
      43    5 Statistical Data Mining Tutorials
      44    5 Structure and Interpretation of Classical Mechanics
      45    5 Teach Yourself Programming In Ten Years
      46    5 Terms of Service; Didn't Read
      47    5 The Book of Shaders
      48    5 The Case of the 500-Mile Email
      49    5 The Early History of Smalltalk
      50    5 The Scientist and Engineer's Guide to Digital Signal Processing
      51    5 What ORMs Have Taught Me: Just Learn SQL
      52    5 Who Can Name the Bigger Number?
      53    5 Why I Left Google
      54    5 Why Lisp Macros Are Cool, a Perl Perspective
      55    5 Why Open Source Misses the Point of Free Software
      56    5 Why to Start a Startup In a Bad Economy
      57    5 You Can't Tell People Anything


NB: I've further tweaked my scripts (and titlecasing), which results in a few minor changes to the results, though the above list remains highly illustrative.

I think I've spammed this thread enough for now ;-)


<Ahem> A list with a link to the most-commented post would be really cool for, uh..., other people who might not have seen them all.


That would involve Real Work.

The simplest route to that for now would be to search the titles via Algolia, sorting by popularity. I could, I suppose, gin up the URLs for that, though as I've already noted, I think I've spammed this particular thread enough with long-list comments. (HN prizes intellectual curiosity, and whilst a few tables might meet that criterion, I think I'm pushing the limits.)

The difference between my data & analysis and Algolia is that Algolia doesn't itself report on either front-page-specific items or on stories which have been repeated. But given a list of front-page stories, or repeated front-page stories, you can search Algolia ... to surface all instances of those stories. The front page will in general have shown only a subset of those instances.
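For anyone who wants to gin those up themselves, the Algolia HN Search API makes it straightforward. A rough sketch (the helper name is mine, and it leans on jq for the URL-encoding):

```shell
# Build an Algolia HN Search API URL surfacing every story submission
# matching a title.  Results come back ranked by relevance and points.
algolia_url() {
  q=$(jq -rn --arg t "$1" '$t|@uri')   # percent-encode the title
  printf 'https://hn.algolia.com/api/v1/search?query=%s&tags=story\n' "$q"
}
algolia_url 'Why I Hate Frameworks'
# → https://hn.algolia.com/api/v1/search?query=Why%20I%20Hate%20Frameworks&tags=story
```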

If you're suggesting I list the URLs themselves directly ... as I'm working with the archive data, I don't have that readily available.

My current workflow is, roughly:

- crawl (or update) the front-page archive

- rerender the captured HTML as plain text, using w3m's `-dump` flag

- parse that text into a tagged multi-line-per-record format with the raw title line, parsed title, date (and several sub-element parsings of that), site (as reported by HN), points, submitter, comments, and (an artefact of the original question I'd sought to answer) any US cities or states mentioned

- create various reports and abstracts based on that: "hn-titles", "date summary" (mostly the parsed data arranged on one line for easier awk processing), cities (US and "globally significant") and US states reports, etc.
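The "date summary" flattening step, for instance, is the sort of thing a few lines of awk handle well. A sketch under assumed tag names — the real tag set and layout are my own and differ in detail:

```shell
# Flatten tagged multi-line records into one tab-separated line per story.
# Using sub() as the pattern means titles containing ": " survive intact.
flatten() {
  awk '
    sub(/^ *Title: /, "")  { title = $0; next }
    sub(/^ *Date: /, "")   { date  = $0; next }
    sub(/^ *Points: /, "") { print date "\t" $0 "\t" title }
  '
}
printf '  Title: Why GNU Grep is Fast\n  Date: 2010-08-22\n  Points: 558\n' | flatten
# → 2010-08-22	558	Why GNU Grep is Fast
```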

Conspicuously absent in the parsing (3rd step) are both the full article URL, and the HN post URL.

I've got those in the raw HTML, but I'd need to go through and parse the original, which up until now has been Too Much Work given what I can do with what I have now.

And if you're wondering how many votes are required to make the front page, here's a summary, by year, of the univariate stats for votes for the 30th story per page, that is, the lowest-ranked:

  Year: 2007 (days: 300)
  n: 172, sum: 427, min: 1, max: 6, mean: 2.482558, median: 2, sd: 1.131546
  
  Year: 2008 (days: 366)
  n: 172, sum: 1442, min: 2, max: 16, mean: 8.383721, median: 8, sd: 3.181337
  
  Year: 2009 (days: 365)
  n: 172, sum: 3493, min: 7, max: 55, mean: 20.308140, median: 20, sd: 6.398405
  
  Year: 2010 (days: 365)
  n: 172, sum: 6582, min: 20, max: 59, mean: 38.267442, median: 39, sd: 8.689477
  
  Year: 2011 (days: 364)
  n: 172, sum: 10312, min: 28, max: 89, mean: 59.953488, median: 61, sd: 13.266858
  
  Year: 2012 (days: 366)
  n: 172, sum: 12492, min: 31, max: 150, mean: 72.627907, median: 74, sd: 17.430260
  
  Year: 2013 (days: 365)
  n: 172, sum: 14354, min: 44, max: 184, mean: 83.453488, median: 82, sd: 21.248547
  
  Year: 2014 (days: 363)
  n: 172, sum: 14513, min: 5, max: 131, mean: 84.377907, median: 85, sd: 19.878349
  
  Year: 2015 (days: 365)
  n: 172, sum: 14770, min: 19, max: 332, mean: 85.872093, median: 70, sd: 47.614140
  
  Year: 2016 (days: 365)
  n: 172, sum: 19451, min: 29, max: 352, mean: 113.087209, median: 97, sd: 61.186786
  
  Year: 2017 (days: 365)
  n: 172, sum: 21843, min: 36, max: 588, mean: 126.994186, median: 103, sd: 76.123514
  
  Year: 2018 (days: 365)
  n: 172, sum: 23678, min: 27, max: 430, mean: 137.662791, median: 111, sd: 80.183359
  
  Year: 2019 (days: 365)
  n: 172, sum: 23138, min: 31, max: 491, mean: 134.523256, median: 116.5, sd: 76.338079
  
  Year: 2020 (days: 366)
  n: 172, sum: 25700, min: 23, max: 551, mean: 149.418605, median: 127.5, sd: 89.049755
  
  Year: 2021 (days: 365)
  n: 172, sum: 28075, min: 50, max: 507, mean: 163.226744, median: 134.5, sd: 90.869361
  
  Year: 2022 (days: 365)
  n: 172, sum: 27565, min: 40, max: 409, mean: 160.261628, median: 139.5, sd: 76.489698
  
  Year: 2023 (days: 172)
  n: 172, sum: 27805, min: 43, max: 616, mean: 161.656977, median: 129.5, sd: 97.531724

Note that the first three years were pretty low (min = 1, mean = 2.48, for 2007), but going back 5 years a story with > 23 points could have made the front page. There are also a few days with < 30 stories, all occurring in 2007 if memory serves.

(This is my first time seeing these particular stats, another analysis I'd been thinking of doing for a while. I had calculated the delta between 1st and 30th ranked stories going back some time. Also the variance by day of week, which is also fairly significant.)
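Univariate stats of that shape are cheap to reproduce in the same toolbox. A generic sketch, one numeric value per input line — not the script actually used here:

```shell
# Report n, sum, min, max, mean, median, and sample sd (needs n > 1)
# for a column of numbers on stdin.  sort -n gives min/max/median for free.
stats() {
  sort -n | awk '
    { v[NR] = $1; sum += $1; sumsq += $1 * $1 }
    END {
      n = NR; mean = sum / n
      median = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
      sd = sqrt((sumsq - sum * sum / n) / (n - 1))
      printf "n: %d, sum: %d, min: %s, max: %s, mean: %f, median: %s, sd: %f\n", n, sum, v[1], v[n], mean, median, sd
    }'
}
printf '1\n2\n2\n3\n4\n' | stats
# → n: 5, sum: 12, min: 1, max: 4, mean: 2.400000, median: 2, sd: 1.140175
```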


Nice work! Thank you for taking my feature requests to heart :)


This is the fun of posting my work publicly. Someone will pick it apart, and then it's a challenge to see if I can address the concerns.

I knew my first attempt was quick-and-dirty. It took a few minutes to improve that a lot, and a few hours to track down numerous other issues, mostly involving title-casing exceptions: adding more terms to that script which should either not be titlecased (e.g., "DNA"), are mixed-case (e.g., "iPhone"), or are ambiguous and should be treated differently in different cases (e.g., "us", which might be a first-person plural pronoun or an abbreviation for "United States").

For the latter, I determined that "US" appearing either immediately after "the" or at the start of a title was virtually always the United States sense of the string. I've enshrined that in `titlecase`, though there are probably some other terms that tend to occur before or after the term which could be used to further disambiguate, say, "US Congress", "US Senate", or "US Law", for example. The additional gain from those is small.
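In sed terms the rule is just a couple of positional substitutions. Roughly (illustrative only; the real titlecase script differs):

```shell
# Treat "Us" as "US" when it starts the title or follows "the"/"The";
# anywhere else it stays the pronoun.  Assumes already-titlecased input.
fix_us() {
  sed -E 's/^Us\b/US/; s/\b([Tt]he )Us\b/\1US/g'
}
printf '%s\n' 'Us Senate Passes the Us Budget' | fix_us
# → US Senate Passes the US Budget
```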

If I were writing an AI then it might incorporate those weights, but this is just a simple sed script...


s/uppercases words/titlecases phrases/

That is, it takes text in some aRbiTRArY casing and AppLIEs pAttErNs to crEatE stanDArD TiTle casE preSenTation.

Above line run through the script:

That is, It Takes Text In Some Arbitrary Casing and Applies Patterns to Create Standard Title Case Presentation

(It's not perfect, but works remarkably well. Occasional hand-tweaks are required, and in bulk operations the results are usually appropriate.)



