Returning the “killed” RSS of Reuters from the dead (codarium.substack.com)
96 points by artembugara on June 21, 2020 | 37 comments



I work on a project which helps you produce RSS feeds from web pages that don't offer their own using the page URL as input and simple selectors to identify the web page elements to be used in the feed.

Here's what it can produce for the Reuters World News mobile page:

https://createfeed.fivefilters.org/index.php?url=https%3A%2F...

Here's what it can do for the main site:

http://createfeed.fivefilters.org/index.php?url=https%3A%2F%...

The downside is that certain changes to the HTML structure (e.g. a site renaming or removing the class attribute values used as selectors) could cause the feeds to break.
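
For readers curious what selector-based feed generation looks like in general, here is a minimal sketch. It is not how the project above is implemented; it only illustrates the idea using the Python standard library. The names `LinkCollector` and `page_to_rss`, and the choice to match on a single class name rather than a full CSS selector, are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect (href, text) for every <a> whose class list contains css_class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []
        self._href = None   # href of the matching <a> we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def page_to_rss(html_text, page_url, css_class, feed_title):
    """Turn the matching links of one page into a bare-bones RSS 2.0 document."""
    parser = LinkCollector(css_class)
    parser.feed(html_text)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = page_url
    ET.SubElement(channel, "description").text = f"Generated from {page_url}"
    for href, text in parser.links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = text
        # Resolve relative links against the page URL.
        ET.SubElement(item, "link").text = urljoin(page_url, href)
    return ET.tostring(rss, encoding="unicode")
```

If the site renames the class passed as `css_class`, `parser.links` comes back empty and the feed has no items, which is exactly the failure mode described above.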


As a dev I really like your website. Ultimate no bullshit product and I got all the information I need in a single page. Congrats.


Thank you! Glad you liked it. :)


+1 to GP. Seeing this kind of product makes me happy in general. (Or at least it appeals to my confirmation bias about how products should be built.)

FYI, you _probably_ have a typo on pricing page "Term Extraction" should be "Term Extension"?


A major problem with Reuters' RSS feeds while they lasted is that Reuters pushes new URLs with updates to existing stories and kills or redirects the previous URLs. So for a major developing story you'd see the same article in your feed 5+ times, since the feed was just a dumb push of every URL added to whatever category you were subscribed to. Still better than nothing, I guess.

The issue with this solution is that there doesn't seem to be any way to specialize it at all. I was subscribed to Reuters' politics feed [1] specifically, and I got other types of news from other sources. But I don't see any way to do that with this method. The articles unfortunately do not have the category in the URL.

[1] http://feeds.reuters.com/Reuters/PoliticsNews


Printing each story exactly once is tough. The RSS feed serial number thing never worked. Sites with multiple RSS servers and a load balancer would return a different serial number. I ended up taking the MD5 of the title and description fields, with HTML markup deleted, and discarding new feed items with a duplicated MD5.
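
A minimal sketch of that fingerprinting approach, assuming feed items arrive as dicts with `title` and `description` keys. The regex-based tag stripping and the whitespace normalisation are assumptions added for the example, not details from the original setup.

```python
import hashlib
import html
import re

TAG_RE = re.compile(r"<[^>]+>")


def item_fingerprint(title, description):
    """MD5 of title + description with HTML markup removed."""
    text = f"{title}\n{description}"
    text = TAG_RE.sub("", text)       # drop tags
    text = html.unescape(text)        # decode entities such as &amp;
    text = " ".join(text.split())     # collapse runs of whitespace
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def drop_duplicates(items, seen=None):
    """Return only items whose fingerprint has not been seen before.

    Pass the same `seen` set on every poll to remember items across fetches.
    """
    seen = set() if seen is None else seen
    fresh = []
    for item in items:
        fp = item_fingerprint(item["title"], item["description"])
        if fp in seen:
            continue
        seen.add(fp)
        fresh.append(item)
    return fresh
```

Two items that differ only in markup or spacing hash to the same value, so the second one is discarded. An item whose title or description text changes gets a new hash and is treated as new.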


That would be a start, but unfortunately Reuters frequently changes their article titles too.


Artem, do you know if there is a parameter for ordering the RSS results by time for the Google News RSS results? This is very helpful.

Finally, I pay $2000+ annually for a competitor to NewsCatcherAPI. It may be worth connecting. I signed up for a trial earlier and the two issues for me would be (i) the range and depth of publications and (ii) not being able to track mentions / references in the body of the article.

I may not be your target audience but one of your competitors is pulling in 10m articles with the full article content per day. I use the API for timely alerts for PR monitoring -- my priority is that I pick up mentions of companies / individuals wherever they happen quickly.

Your pricing is a lot more competitive. I just wonder whether you are looking to move in the direction of range and depth in the future, or whether you're targeting a different market segment.


What competitor is this, if you don't mind me asking? Could use a service like that for my own project.


Could you connect with me at artem@newscatcherapi.com?


Hey, thanks for the reply. I’ll email you later today.

Regarding the sort, unfortunately, no. I am pretty sure this feature is disabled.


Just curious, what are reasons businesses would pay for a NewsCatcher API subscription? What are they using it for?

Fun API, though.


Integrate some kind of news feed to their platforms.

Analyzing PR campaigns and the market.

Building their own news aggregators (usually theme specific).


Would I be able to build a niche topic focused site (e.g. video gaming news) based on the NewsCatcher API?


Yes, that is exactly how our service works.

Ping me at artem@newscatcherapi.com, or go to a live chat on our website


Obviously you could build an aggregator with an API that gives you a stream of news. People have been building niche news aggregators on WordPress for almost two decades by plugging in some RSS feed URLs.

It's just the kind of obvious lame idea that stops me from seeing more compelling usages of this sort of cool API.


I have a question about NewsCatcherAPI: what is the legal framework around indexing copyrighted articles and providing a paid API to search them?

Edit: particularly when Reuters' business is selling its newswire feed


Short answer: it seems to be OK as long as you do not resell the full body text.

Long answer: Google does the same thing every day. Each country has its own laws. Usually, news is in a special category that is less protected by copyright.


Yeah, and Google was sued and settled by agreeing to pay a fee to the news wire:

https://www.reuters.com/article/us-google-afp/afp-google-new...


To an organization in France. That doesn't sound like a concerning precedent for a startup that doesn't have much in common with Google.


I worked for a company doing this in the early 2000s. Perfectly legal to save website text for private use (a search for example). It doesn't need to be news either.


I was worried they had removed it completely. I'm using this for a simple web page which shows RSS feeds from multiple news sources across the political spectrum, and the Reuters feeds have been the essential pivot for me. It is the only source I significantly trust to be neutral in these trying times.

Thank you so incredibly much for this very simple solution. I was worried I'd have to spend many hours on some complicated fix. I'm glad I no longer have to (unless Google kills off their news RSS).


I have to agree on Reuters... They seem to make the effort to keep the context with unbiased framing.


I have an antique Teletype set up to print news from the Reuters news feed, and it's stopped working because of this. So I tried this new approach via Google. All you get is the RSS feed titles, not the content. The "description" is just a link to content elsewhere, with the title as link text. The real Reuters RSS feed had a few sentences of copy for each story, roughly what radio stations would read.

Associated Press seems to have dropped their RSS feeds too, or hidden them well.

The New York Times still has a usable RSS feed at: https://rss.nytimes.com/services/xml/rss/nyt/World.xml

This gets me, in Teletype format:

    CHINA SLAMS TRUMP OVER UIGHUR LAW AMID BOLTON ACCUSATIONS
    (JUNE 18TH, 7:16 PM)
    A NEW LAW AIMED AT PUNISHING CHINESE OFFICIALS INVOLVED IN MASS
    INTERNMENTS OF UIGHURS AND OTHER MINORITIES IN XINJIANG CAME AS
    JOHN BOLTON ACCUSED PRESIDENT TRUMP OF SUPPORTING BEIJING?S
    CRACKDOWN.
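
A rough sketch of how an RSS item could be turned into that kind of all-caps, fixed-width output. This is not the program used above. The function name `teletype_format`, the 68-column width, and the punctuation fallback table are all assumptions made for illustration.

```python
import textwrap

# Assumed fallbacks for typographic punctuation that an upper-case-only
# printer is unlikely to have; anything else outside ASCII becomes "?".
ASCII_FALLBACKS = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
    "\u2013": "-", "\u2014": "-",
})


def teletype_format(title, timestamp, summary, width=68):
    """Upper-case an item and wrap it to a fixed column width."""

    def prep(text):
        text = text.translate(ASCII_FALLBACKS).upper()
        return text.encode("ascii", "replace").decode("ascii")

    lines = textwrap.wrap(prep(title), width)
    lines.append(f"({prep(timestamp)})")
    lines.extend(textwrap.wrap(prep(summary), width))
    return "\n".join(lines)
```

Without a fallback table like the one above, a curly apostrophe in the feed text would come out as a replacement character rather than a plain `'`.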


NYT's RSS feed may not be a good source if you want updates through the day. I left the program running and the RSS feed hasn't changed in hours. Maybe it changes all at once when the next edition comes out.

Alternatives:

CNN feed: http://rss.cnn.com/rss/cnn_topstories.rss

The Guardian: https://www.theguardian.com/world/rss

Washington Post: http://feeds.washingtonpost.com/rss/world


There's a range of BBC News feeds available too: https://www.bbc.co.uk/news/10628494


Do you ever tear off the latest breaking headline, rush into another room, slam it on a desk and say "you're gonna want to see this!"?


It's at home.[1] This was from my steampunk period.

[1] http://www.aetherltd.com/images/tty14ro/printerathome1.jpg


I read RSS feeds with an app on Android called Aggregator. Reuters feeds included a short description of the story inside the feed, but Google News doesn't have them. The descriptions in the feed allowed me to precisely filter and label the entries based on keywords.

Anyway, I also use the Google News RSS trick described in the article, as replacement for now. Not sure how long it will last, however.


I've always wondered how the copyright for the Google RSS works.

As far as I am aware it is an undocumented API that can be called without authentication or acceptance of any EULA.

The response includes an explicit copyright field.

Does that mean the feed can not be reproduced? Can a derivative work be created from it?


"Can a derivative work be created from it?"

Grey area, lived in it for a while and made some money there, but at the end of the day it depends on how the original creator feels about your derivative work - or you personally :)


Does this trick work for apnews.com as well?


Yep. It works for most news sites, it's just filtering the recent news from Google News based off the URL. Oddly, it doesn't work for CNN. Never understood why. Maybe "cnn" is a stop word to this search engine and ignored. Dunno.

And, of course, it works today. It's a Google product, so enjoy it while it lasts.
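
For anyone who wants to script this, here is a small sketch that builds such a feed URL for an arbitrary domain. The thread does not spell out the query format, so the endpoint (`https://news.google.com/rss/search`) and the `when:` and `allinurl:` search operators are assumptions based on how the trick is commonly described. They are undocumented and may stop working at any time. The helper name `google_news_rss_url` is made up for the example.

```python
from urllib.parse import urlencode


def google_news_rss_url(domain, window="24h", lang="en-US", country="US"):
    """Build a Google News RSS search URL restricted to one site (assumed format)."""
    # Assumed operators: when:<window> limits recency, allinurl:<domain> filters by URL.
    query = f"when:{window} allinurl:{domain}"
    params = {
        "q": query,
        "hl": lang,
        "gl": country,
        "ceid": f"{country}:{lang.split('-')[0]}",
    }
    return "https://news.google.com/rss/search?" + urlencode(params)
```

Calling `google_news_rss_url("apnews.com")` gives a URL you can paste into a feed reader to check whether the filter still works for that site.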


I had another look at what I could find on apnews.com itself. Apparently there's a big ugly JSON blurb buried in the home page, ready for the taking: https://github.com/stijnsanders/feeder/blob/master/eater/eat...


Brilliant. Thank you!


Probably someone there reads HN?


> Did it work? Consider subscribing to my newsletter to get more useful content like that. It’s free: (...) I am a co-founder of NewsCatcherAPI — ultra-fast API to find news articles by any topic, country, language, website, or keyword. ...

It looks like an unfortunately automatically placed ad, making me look for a continuation of the actual content (a direct answer to the question) that never came.

Weird writing style.



