One of my biggest gripes against HN's front page is that there's so little context to go on --- just a headline of 80 characters or fewer, often not especially informative.
My own news-page rewrite includes several paragraphs of lede context, which is probably a bit on the overkill side. But a hundred characters or so should help.
I'm also wrestling with the sort-order aspect. Current cut is time-ordered within sections (another thing I wish HN had), but I'm going to be extending the article count in the next iteration.
That said, your design is clean and light, I like it.
> My own news-page rewrite includes several paragraphs of lede context, which is probably a bit on the overkill side. But a hundred characters or so should help.
Stay tuned; I've been thinking about the right way to do something like this too.
> I'm also wrestling with the sort-order aspect. Current cut is time-ordered within sections (another thing I wish HN had), but I'm going to be extending the article count in the next iteration.
Hope you don't mind if I email you later for new feature feedback.
> That said, your design is clean and light, I like it.
A key problem with extracting article context is that there are so many distinct sources.
That said, power laws and Zipf distributions apply, and a large fraction of HN front-page articles come from a relatively small set of domains. There's further aggregation possible when underlying publishing engines can be identified, e.g., WordPress, CMSes used by a large number of news organisations, Medium, Substack, GitHub, GitLab, Fediverse servers, and a number of static site generators (Hugo, Jekyll, Pelican, Gatsby, etc.).
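Roughly, for the engine-identification bit: many of these platforms announce themselves through a <meta name="generator"> tag or a recognisable hosting domain, so a first pass can be fairly dumb. A minimal sketch in Python (requests + BeautifulSoup rather than my actual tooling; the domain patterns and labels are placeholders):

    # Sketch only: classify a source by publishing engine via the
    # <meta name="generator"> tag plus a few domain heuristics.
    # Patterns and labels below are illustrative, not exhaustive.
    import re
    from urllib.parse import urlparse

    import requests
    from bs4 import BeautifulSoup

    DOMAIN_HINTS = {
        r"(^|\.)medium\.com$": "medium",
        r"\.substack\.com$": "substack",
        r"\.github\.io$": "github-pages",
        r"\.gitlab\.io$": "gitlab-pages",
    }

    GENERATOR_HINTS = ("wordpress", "hugo", "jekyll", "pelican", "gatsby")

    def classify_engine(url, timeout=10):
        host = urlparse(url).netloc.lower()
        for pattern, label in DOMAIN_HINTS.items():
            if re.search(pattern, host):
                return label
        try:
            html = requests.get(url, timeout=timeout).text
        except requests.RequestException:
            return "unknown"
        meta = BeautifulSoup(html, "html.parser").find(
            "meta", attrs={"name": re.compile("generator", re.I)})
        if meta and meta.get("content"):
            content = meta["content"].lower()
            for hint in GENERATOR_HINTS:
                if hint in content:
                    return hint
        return "unknown"

The generator tag does get stripped on hardened sites, so the domain and URL patterns probably end up doing a good share of the work, but it catches a lot of the WordPress and static-site cases.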
I suspect you're aware of most of this.
I have a set of front-page sites from an earlier scraping project:
(For the life of me I cannot remember what the 3rd column represents, though it may be a miscalculated cumulative percentage. The "category" field was manually supplied by me; every site with > 17 appearances has one, as do several below that threshold which could be identified by other means, e.g., regexes on blogging engines, GitHub pages, etc.)
Further thoughts on article extraction: one idea that comes to mind is including extraction rules in the source selection metadata.
I'm using something along these lines right now to process sections within a given source: I define the section-distinguishing element from a headline URL, along with the plaintext label, the position (within my generated page), the lines of context, and the maximum age (in days) I'm interested in.
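To make that concrete, a minimal sketch of what one of those per-section rules looks like; the field names are placeholders rather than my actual schema, and the two rules are invented:

    # Sketch of a per-section rule, with roughly the fields described above.
    # Field names and the example rules are placeholders.
    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone

    @dataclass
    class SectionRule:
        url_fragment: str    # section-distinguishing element of the headline URL
        label: str           # plaintext section name on the generated page
        position: int        # where the section lands on the generated page
        context_lines: int   # lines of lede/context to keep
        max_age_days: int    # ignore items older than this

    RULES = [
        SectionRule("/science/", "Science", 1, 3, 2),
        SectionRule("/opinion/", "Opinion", 4, 1, 7),
    ]

    def match_section(headline_url, published):
        """Return the first rule whose URL fragment matches and whose age cutoff is met.

        `published` is assumed to be a timezone-aware datetime (UTC).
        """
        age = datetime.now(timezone.utc) - published
        for rule in RULES:
            if rule.url_fragment in headline_url and age <= timedelta(days=rule.max_age_days):
                return rule
        return None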
That could be extended or paired with a per-source rule that identifies the htmlq specifiers which pull out title, dateline, and byline elements from the source.
A further challenge is that such specifiers have a tendency to change as the publisher's back-end CMS varies, and finding ways to identify those is ... difficult.
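For the per-source half, something like this is what I have in mind: plain CSS selectors per domain (the same sort of strings you'd hand to htmlq), applied here with BeautifulSoup only to keep the sketch self-contained, plus a crude check that flags a source when its selectors stop matching. The domain and selectors are invented:

    # Sketch: per-source extraction rules as CSS selectors, plus a crude
    # drift check. The domain and selectors below are invented examples.
    from bs4 import BeautifulSoup

    SOURCE_RULES = {
        "example-news.com": {
            "title": "h1.article-title",
            "dateline": "time.published",
            "byline": "span.author-name",
        },
    }

    def extract_fields(domain, html):
        """Pull title/dateline/byline text using that domain's selectors."""
        soup = BeautifulSoup(html, "html.parser")
        out = {}
        for field, selector in SOURCE_RULES[domain].items():
            node = soup.select_one(selector)
            out[field] = node.get_text(strip=True) if node else None
        return out

    def selectors_look_broken(fields):
        """If any field came back empty, the source probably needs a selector review."""
        return any(value in (None, "") for value in fields.values())

    # e.g.:
    # fields = extract_fields("example-news.com", fetched_html)
    # if selectors_look_broken(fields):
    #     ...queue the source for manual review

That won't tell you what the new selectors should be after a redesign, but at least the breakage surfaces as a review queue rather than as silently empty context.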