Not all appearances are front-page appearances. And my very quick & dirty shell one-liner requires precise text matches.
"Why I Hate Frameworks" does appear six times on the front page, though with case and phrasing variants which mean that an exact text match won't pick it up:
Why I Hate Frameworks
Why I Hate Frameworks
Why I hate frameworks (2005)
Why I Hate Frameworks
[dupe] Why I Hate Frameworks (2005)
Why I Hate Frameworks (2005)
By my shell one-liner, it would have been counted as appearing four times (instances 1, 2, 4, and 6 above).
I could tweak the script by lowercasing (or title-casing) titles (I've a script that does the latter), and eliminating any instances of "[dupe]". While I'm at it, "[flagged]" and "[dead]" possibly as well, though I think the latter won't appear in the archive. And let's convert all apostrophe variants to a single apostrophe character (') (ASCII hex 0x27).
With those adjustments, I see "Why I Hate Frameworks" appearing six times, as expected, and there are 57 titles appearing 5+ times, rather more than the 39 initially posted above. There are 3,077 repeated front-page titles in total.
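For anyone wanting to reproduce those adjustments without my scripts, here is a minimal Python sketch of the same normalisation. It is not the shell one-liner itself: the function names are made up for illustration, it lowercases rather than title-cases, and it assumes a trailing year indicator such as "(2005)" should also be dropped before counting.

```python
import re
from collections import Counter

# Apostrophe variants folded to the ASCII apostrophe (0x27).
# The exact set of variants handled here is an assumption.
APOSTROPHES = "\u2018\u2019\u02bc\u0060\u00b4"
APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHES})

# Status markers that can be prepended to a title.
MARKER_RE = re.compile(r"\[(?:dupe|flagged|dead)\]\s*", re.IGNORECASE)

# Trailing year indicator such as "(2005)" (assumed to be ignored).
YEAR_RE = re.compile(r"\s*\(\d{4}\)\s*$")


def normalize_title(title: str) -> str:
    """Return a case-insensitive, marker-free key for a title."""
    title = title.translate(APOSTROPHE_TABLE)
    title = MARKER_RE.sub("", title)
    title = YEAR_RE.sub("", title)
    # Collapse runs of whitespace, then lowercase.
    return " ".join(title.split()).lower()


def count_titles(titles):
    """Count appearances per normalised title."""
    return Counter(normalize_title(t) for t in titles)


if __name__ == "__main__":
    variants = [
        "Why I Hate Frameworks",
        "Why I Hate Frameworks",
        "Why I hate frameworks (2005)",
        "Why I Hate Frameworks",
        "[dupe] Why I Hate Frameworks (2005)",
        "Why I Hate Frameworks (2005)",
    ]
    for title, n in count_titles(variants).most_common():
        print(n, title)
```

Lowercasing is the cheaper option; title-casing gives nicer output but needs an exceptions list.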
The one-liner, FWIW (rearranged to multiple lines):
"titlecase" itself is a sed script that uppercases words, with specific exceptions. I've just augmented it with a few more technology-related terms.
"parse.log" is the output of my first stage of parsing of the HN front page archive, which lists specific story properties in a tagged format, e.g., "Title:" "Date:" "Site:", etc.
1 14 I'm Peter Roberts, Immigration Attorney Who Does Work for YC and Startups. AMA
2 11 Richard Feynman and the Connection Machine
3 10 Openssl Security Advisory
4 10 The TTY Demystified
5 8 Why GNU Grep is Fast
6 7 Cool URIs Don't Change
7 7 How to Read Mathematics
8 7 The Architecture of Open Source Applications
9 7 You and Your Research
10 6 -2000 Lines of Code
11 6 A Primer On Bézier Curves
12 6 Advanced Programming Languages
13 6 Bit Twiddling Hacks
14 6 Dictionary of Algorithms and Data Structures
15 6 How Not to Sort by Average Rating
16 6 How to Be a Programmer: a Short, Comprehensive, and Personal Summary
17 6 How to Write a Spelling Corrector
18 6 Keep Your Identity Small
19 6 Ten Rules for Web Startups
20 6 The Bipolar Lisp Programmer
21 6 The Tao of Programming
22 6 Why I Hate Frameworks
23 6 Why Lisp?
24 5 A Regular Expression to Check for Prime Numbers
25 5 A Spellchecker Used to Be a Major Feat of Software Engineering
26 5 Akin's Laws of Spacecraft Design
27 5 Ask HN: Idea Sunday
28 5 Ask HN: What Are You Working On?
29 5 Ask HN: Who's Hiring?
30 5 Beej's Guide to Network Programming
31 5 Data Structure Visualizations
32 5 Dna Seen Through the Eyes of a Coder
33 5 Game Programming Patterns
34 5 How Software Companies Die
35 5 How to Become a Good Theoretical Physicist
36 5 How to Design Programs, Second Edition
37 5 Latency Numbers Every Programmer Should Know
38 5 Learn C and Build Your Own Lisp
39 5 Learn You a Haskell for Great Good
40 5 Learning Advanced Javascript
41 5 Let's Build a Compiler
42 5 Notation As a Tool of Thought
43 5 Statistical Data Mining Tutorials
44 5 Structure and Interpretation of Classical Mechanics
45 5 Teach Yourself Programming In Ten Years
46 5 Terms of Service; Didn't Read
47 5 The Book of Shaders
48 5 The Case of the 500-Mile Email
49 5 The Early History of Smalltalk
50 5 The Scientist and Engineer's Guide to Digital Signal Processing
51 5 What ORMs Have Taught Me: Just Learn SQL
52 5 Who Can Name the Bigger Number?
53 5 Why I Left Google
54 5 Why Lisp Macros Are Cool, a Perl Perspective
55 5 Why Open Source Misses the Point of Free Software
56 5 Why to Start a Startup In a Bad Economy
57 5 You Can't Tell People Anything
NB: I've further tweaked my scripts (and titlecasing) which results in a few minor changes to the results, though the above list remains highly illustrative.
I think I've spammed this thread enough for now ;-)
The simplest route to that for now would be to search the titles via Algolia, sorting by popularity. I could, I suppose, gin up the URLs for that, though as I've already noted, I think I've spammed this particular thread enough with long-list comments. (HN prizes intellectual curiosity, and whilst a few tables might meet that criterion, I think I'm pushing the limits.)
The difference between my data & analysis and Algolia is that Algolia doesn't itself report on either front-page-specific items, or on stories which have been repeated. But given a list of front-page stories, or repeated-front-page stories, you can search Algolia ... to surface all instances of those stories. The front page will in general have
If you're suggesting I list the URLs themselves directly ... as I'm working with the archive data, I don't have that readily available.
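Generating Algolia search links from a list of titles would look roughly like the Python sketch below. The query-string parameter names (`query`, `sort`, `type`, `dateRange`) are my recollection of what the hn.algolia.com search page puts in its own URLs, so treat them as assumptions and check them against a live search before relying on this. `algolia_search_url` is a made-up helper name.

```python
from urllib.parse import urlencode

# Base URL of the Algolia-powered HN search page.
ALGOLIA_UI = "https://hn.algolia.com/"


def algolia_search_url(title: str) -> str:
    """Build a search-page URL for a title, sorted by popularity.

    Parameter names are assumed, not taken from any documentation.
    """
    params = {
        "query": title,
        "sort": "byPopularity",
        "type": "story",
        "dateRange": "all",
    }
    return ALGOLIA_UI + "?" + urlencode(params)


if __name__ == "__main__":
    for t in ["Why I Hate Frameworks", "The TTY Demystified"]:
        print(algolia_search_url(t))
```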
My current workflow is, roughly:
- crawl (or update) the front-page archive
- rerender the captured HTML as plain text, using w3m's `-dump` flag
- parse that text into a tagged multi-line-per-record format with the raw title line, parsed title, date (and several sub-element parsings of that), site (as reported by HN), points, submitter, comments, and (artefact of the original question I'd sought to answer) any US cities or states mentioned
- create various reports and abstracts based off of that: "hn-titles", "date summary" (mostly the parsed data arranged on one line for easier awk processing), cities (US and "globally significant") and US states reports, etc.
Conspicuously absent in the parsing (3rd step) are both the full article URL, and the HN post URL.
I've got those, in the raw HTML, but I'd need to go through that and parse the original which up until now has been Too Much Work given what I can do with what I have now.
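To give a sense of what working with the tagged format from the third step involves, here is a small Python sketch that reads such records back into dictionaries. The exact layout is an assumption on my part for illustration: one "Tag: value" pair per line and a blank line between records. `parse_records` is a hypothetical name, not one of my scripts.

```python
def parse_records(text: str):
    """Split tagged text into one dict per record.

    Assumes 'Tag: value' lines, with a blank line ending each record.
    Lines without a colon are ignored.
    """
    records = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record, if there is one.
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so titles may contain colons.
        tag, sep, value = line.partition(":")
        if sep:
            current[tag.strip()] = value.strip()
    if current:
        records.append(current)
    return records


if __name__ == "__main__":
    sample = (
        "Title: Ask HN: Who's Hiring?\n"
        "Site: news.ycombinator.com\n"
        "\n"
        "Title: The TTY Demystified\n"
    )
    for rec in parse_records(sample):
        print(rec)
```

Adding the article and HN post URLs would mean emitting two more tags per record at the parsing stage; reading them back would need no change here.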
And if you're wondering how many votes are required to make the front page, here's a summary, by year, of the univariate stats for votes for the 30th story per page, that is, the lowest-ranked:
Note that the first three years were pretty low (min = 1, mean = 2.48, for 2007), but going back 5 years a story with > 23 points could have made the front page. There are also a few days with < 30 stories, all occurring in 2007 if memory serves.
(This is my first time seeing these particular stats, another analysis I'd been thinking of doing for a while. I had calculated the delta between 1st and 30th ranked stories going back some time. Also the variance by day of week, which is also fairly significant.)
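The per-year summary is nothing exotic. A Python sketch of that kind of calculation follows, assuming the input has already been boiled down to (year, points) pairs for the lowest-ranked story of each day; `yearly_stats` is a made-up name and the numbers in the usage example are invented, not taken from the archive.

```python
from collections import defaultdict
from statistics import mean, median


def yearly_stats(rows):
    """Summarise points by year.

    rows: iterable of (year, points) pairs, one per front page,
    where points belong to the lowest-ranked story on that page.
    """
    by_year = defaultdict(list)
    for year, points in rows:
        by_year[year].append(points)
    return {
        year: {
            "n": len(pts),
            "min": min(pts),
            "max": max(pts),
            "mean": mean(pts),
            "median": median(pts),
        }
        for year, pts in sorted(by_year.items())
    }


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    rows = [(2007, 1), (2007, 2), (2007, 4), (2008, 3)]
    for year, stats in yearly_stats(rows).items():
        print(year, stats)
```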
This is the fun of posting my work publicly. Someone will pick it apart, and then it's a challenge to see if I can address the concerns.
I knew my first attempt was quick-and-dirty. It took a few minutes to improve that a lot, and a few hours to track down numerous other issues. Those mostly involved title-casing exceptions, and adding more terms to that script which should either not be titlecased (e.g., "DNA"), or which are mixed-case (e.g., "iPhone"), or which are ambiguous and should be treated differently in different cases (e.g., "us", which might be a first-person plural pronoun, or an abbreviation for "United States").
For the latter, I determined that "US" appearing either immediately after "the" or at the start of a title was virtually always the United States sense of the string. I've enshrined that in `titlecase`, though there are probably some other terms that tend to occur before or after the term which could be used to further disambiguate, say, "US Congress", "US Senate", or "US Law", for example. The additional gain from those is small.
If I were writing an AI then it might incorporate those weights, but this is just a simple sed script...
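The "US" rule itself fits in one regular expression. Here it is as a Python sketch rather than the sed I actually use; the pattern and the name `fix_us` are illustrative and not lifted from `titlecase`.

```python
import re

# Match "us" (any case) either at the very start of the title
# or immediately after the word "the".
US_RE = re.compile(r"(^|\bthe\s+)us\b", re.IGNORECASE)


def fix_us(title: str) -> str:
    """Uppercase 'us' where it almost certainly means United States."""
    # Keep whatever preceded it (nothing, or "the ") and emit "US".
    return US_RE.sub(lambda m: m.group(1) + "US", title)


if __name__ == "__main__":
    for t in [
        "Us Senate Passes Bill",
        "Inside the Us Economy",
        "Tell Us About It",
        "Use the Force",
    ]:
        print(fix_us(t))
```

The pronoun case ("Tell Us About It") is left alone because it neither starts the title nor follows "the".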
Possibly related: have you considered making your aggregator ignore punctuation and case in addition to the year indicators?