RSS feeds are a bit of a mess, but only due to each publisher's implementation b...

k1m · on April 14, 2020

You might have come across this already, but we maintain a collection of article extraction rules for various sites here: https://github.com/fivefilters/ftr-site-config. It was adapted from a database maintained by Instapaper in its early days, and today most contributions come from users and developers of Wallabag, an open source Instapaper/Pocket alternative: https://github.com/wallabag/wallabag

The rules are also usable with a free version of Full-Text RSS, available here: https://bitbucket.org/fivefilters/full-text-rss/src/master/
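
For anyone wanting to reuse the rules in their own reader, here is a rough sketch of reading one rule file. It assumes the files are plain text with one `directive: value` pair per line (XPath expressions for things like `title` and `body`) and `#` for comments. The sample rules and the `parse_site_config` helper are made up for illustration and are not taken from the repository or from Full-Text RSS.

```python
from collections import defaultdict

# Hypothetical rules for an imaginary site, written in the assumed
# "directive: value" style. Not copied from the ftr-site-config repository.
SAMPLE_CONFIG = """\
# example.com.txt (illustrative only)
title: //h1[@class='headline']
body: //div[@id='article-body']
strip: //div[@class='related-links']
strip: //aside
test_url: https://example.com/news/some-article
"""


def parse_site_config(text):
    """Collect 'directive: value' lines into a dict mapping directive -> list of values."""
    rules = defaultdict(list)
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so URLs and XPath axes in the value survive.
        directive, sep, value = line.partition(":")
        if not sep:
            continue  # not a directive line; ignore it
        rules[directive.strip()].append(value.strip())
    return dict(rules)


rules = parse_site_config(SAMPLE_CONFIG)
for directive, values in rules.items():
    print(directive, values)
```

A directive can appear more than once (several `strip` lines, for instance), which is why each one maps to a list. Actually applying the XPath expressions to a fetched page would need an HTML parser with XPath support, which is left out here.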

jkeuhlen · on April 14, 2020

This looks really interesting! I've been building an RSS reader for myself in my free time, and this will be really useful. I was wondering if you know anything about the legal implications of scraping full content like this and packaging it up? I was planning to do it with some fun added-on features, but was worried it would be considered copyright infringement (since I would basically be re-hosting other sites' content without permission). And some websites outright ban this kind of usage in the TOS for their RSS feeds. For example, from the Washington Post[1]:

> a. For any article, you may not display more text than we provide in the RSS feed.

[1]: https://www.washingtonpost.com/rss-terms-of-service/2012/01/...

chris_st · on April 14, 2020

AWESOME! Thanks so much, I'll look into that. I think it was the Instapaper service I was using back in the day.

chris_st · on April 14, 2020

Cool, thanks! I'm doing a lot with AWS serverless, it's a great way to go (and similarly cheap). Any chance you could open-source your scraping code?