One of my biggest gripes against HN's front page is that there's so little context to go on --- just a headline of 80 characters or fewer, often not especially informative.
My own news-page rewrite includes several paragraphs of lede context, which is probably a bit on the overkill side. But a hundred characters or so should help.
I'm also wrestling with the sort-order aspect. Current cut is time-ordered within sections (another thing I wish HN had), but I'm going to be extending the article count in the next iteration.
That said, your design is clean and light, I like it.
> My own news-page rewrite includes several paragraphs of lede context, which is probably a bit on the overkill side. But a hundred characters or so should help.
Stay tuned; I've been thinking about the right way to do something like this too.
> I'm also wrestling with the sort-order aspect. Current cut is time-ordered within sections (another thing I wish HN had), but I'm going to be extending the article count in the next iteration.
Hope you don't mind if I email you later for new feature feedback.
> That said, your design is clean and light, I like it.
A key problem with extracting article context is that there are so many distinct sources.
That said, power laws and Zipf distributions apply, and a large fraction of HN front-page articles come from a relatively small set of domains. There's further aggregation possible when underlying publishing engines can be identified, e.g., WordPress, CMSes used by a large number of news organisations, Medium, Substack, GitHub, GitLab, Fediverse servers, and a number of static site generators (Hugo, Jekyll, Pelican, Gatsby, etc.).
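Roughly, for the engine-identification bit: many of these platforms announce themselves through a <meta name="generator"> tag or a recognisable hosting domain, so a first pass can be fairly dumb. A minimal sketch in Python (requests + BeautifulSoup rather than my actual tooling; the domain patterns and labels are placeholders):

    # Sketch only: classify a source by publishing engine via the
    # <meta name="generator"> tag plus a few domain heuristics.
    # Patterns and labels below are illustrative, not exhaustive.
    import re
    from urllib.parse import urlparse

    import requests
    from bs4 import BeautifulSoup

    DOMAIN_HINTS = {
        r"(^|\.)medium\.com$": "medium",
        r"\.substack\.com$": "substack",
        r"\.github\.io$": "github-pages",
        r"\.gitlab\.io$": "gitlab-pages",
    }

    GENERATOR_HINTS = ("wordpress", "hugo", "jekyll", "pelican", "gatsby")

    def classify_engine(url, timeout=10):
        host = urlparse(url).netloc.lower()
        for pattern, label in DOMAIN_HINTS.items():
            if re.search(pattern, host):
                return label
        try:
            html = requests.get(url, timeout=timeout).text
        except requests.RequestException:
            return "unknown"
        meta = BeautifulSoup(html, "html.parser").find(
            "meta", attrs={"name": re.compile("generator", re.I)})
        if meta and meta.get("content"):
            content = meta["content"].lower()
            for hint in GENERATOR_HINTS:
                if hint in content:
                    return hint
        return "unknown"

The generator tag does get stripped on hardened sites, so the domain and URL patterns probably end up doing a good share of the work, but it catches a lot of the WordPress and static-site cases.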
I suspect you're aware of most of this.
I have a set of front-page sites from an earlier scraping project:
(For the life of me I cannot remember what the 3rd column represents, though it may be a miscalculated cumulative percentage. The "category" field was manually supplied by me; every site with > 17 appearances has one, as do several below that threshold which could be identified by other means, e.g., regexes on blogging engines, GitHub pages, etc.)
Further thoughts on article extraction: one idea that comes to mind is including extraction rules in the source selection metadata.
I'm using something along these lines right now to process sections within a given source: I define the section-distinguishing element from a headline URL, along with the plaintext label, the position (within my generated page), the lines of context, and the maximum age (in days) I'm interested in.
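To make that concrete, a minimal sketch of what one of those per-section rules looks like; the field names are placeholders rather than my actual schema, and the two rules are invented:

    # Sketch of a per-section rule, with roughly the fields described above.
    # Field names and the example rules are placeholders.
    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone

    @dataclass
    class SectionRule:
        url_fragment: str    # section-distinguishing element of the headline URL
        label: str           # plaintext section name on the generated page
        position: int        # where the section lands on the generated page
        context_lines: int   # lines of lede/context to keep
        max_age_days: int    # ignore items older than this

    RULES = [
        SectionRule("/science/", "Science", 1, 3, 2),
        SectionRule("/opinion/", "Opinion", 4, 1, 7),
    ]

    def match_section(headline_url, published):
        """Return the first rule whose URL fragment matches and whose age cutoff is met.

        `published` is assumed to be a timezone-aware datetime (UTC).
        """
        age = datetime.now(timezone.utc) - published
        for rule in RULES:
            if rule.url_fragment in headline_url and age <= timedelta(days=rule.max_age_days):
                return rule
        return None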
That could be extended or paired with a per-source rule that identifies the htmlq specifiers which pull out title, dateline, and byline elements from the source.
A further challenge is that such specifiers have a tendency to change as the publisher's back-end CMS varies, and finding ways to identify those is ... difficult.
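For the per-source half, something like this is what I have in mind: plain CSS selectors per domain (the same sort of strings you'd hand to htmlq), applied here with BeautifulSoup only to keep the sketch self-contained, plus a crude check that flags a source when its selectors stop matching. The domain and selectors are invented:

    # Sketch: per-source extraction rules as CSS selectors, plus a crude
    # drift check. The domain and selectors below are invented examples.
    from bs4 import BeautifulSoup

    SOURCE_RULES = {
        "example-news.com": {
            "title": "h1.article-title",
            "dateline": "time.published",
            "byline": "span.author-name",
        },
    }

    def extract_fields(domain, html):
        """Pull title/dateline/byline text using that domain's selectors."""
        soup = BeautifulSoup(html, "html.parser")
        out = {}
        for field, selector in SOURCE_RULES[domain].items():
            node = soup.select_one(selector)
            out[field] = node.get_text(strip=True) if node else None
        return out

    def selectors_look_broken(fields):
        """If any field came back empty, the source probably needs a selector review."""
        return any(value in (None, "") for value in fields.values())

    # e.g.:
    # fields = extract_fields("example-news.com", fetched_html)
    # if selectors_look_broken(fields):
    #     ...queue the source for manual review

That won't tell you what the new selectors should be after a redesign, but at least the breakage surfaces as a review queue rather than as silently empty context.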