RSS feeds are a bit of a mess, but only because each publisher's implementation is slightly different. The vast majority only send a headline and summary in the RSS feed, so I then have to scrape and extract the article on the backend to populate the content, which is its own challenge, and I've not gotten it to work for a few sites yet.
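For what it's worth, here is a rough sketch of how the "summary only vs. full text" check could look, assuming an RSS 2.0 feed where full text, when present, arrives in `<content:encoded>`. The function name `parse_items`, the dict keys, and the sample feed are made up for illustration; real feeds vary a lot more than this.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, which defines <content:encoded>.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def parse_items(rss_xml: str) -> list[dict]:
    """Return one dict per <item>, flagging items that still need a scrape."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        full = item.findtext(CONTENT_NS + "encoded")
        has_full = bool(full and full.strip())
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
            "content": full.strip() if has_full else None,
            # No full text in the feed: the backend has to fetch the page.
            "needs_scrape": not has_full,
        })
    return items


# Hypothetical feed with one summary-only item and one full-text item.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example</title>
    <item>
      <title>Summary only</title>
      <link>https://example.com/a</link>
      <description>Short teaser.</description>
    </item>
    <item>
      <title>Full text</title>
      <link>https://example.com/b</link>
      <description>Short teaser.</description>
      <content:encoded><![CDATA[<p>Whole article.</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for entry in parse_items(SAMPLE_FEED):
        print(entry["title"], "-> needs scrape:", entry["needs_scrape"])
```

Only the items flagged `needs_scrape` would go on to the extraction step, which is where the per-site differences really bite.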
This is also running in a semi-serverless container in Google Cloud Run (only costs me £1 a month!), so fetching and re-caching all of that when a new container is scheduled is painful. However, it seems like state in the container is persisted longer than I initially thought, so it's good enough for now.
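To make the caching part concrete, a minimal sketch of an in-memory cache with a time-to-live, assuming (as described above) that state only lives as long as the container instance does. `TTLCache`, `get_or_fetch`, and the `fetch` callback are hypothetical names, not from any particular library, and nothing here survives a new container being scheduled.

```python
import time
from typing import Callable


class TTLCache:
    """In-memory cache: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[str, tuple[float, str]] = {}

    def get_or_fetch(self, url: str, fetch: Callable[[str], str]) -> str:
        """Return the cached body for url, calling fetch(url) on a miss or expiry."""
        now = self.clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        body = fetch(url)
        self._store[url] = (now, body)
        return body


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=900)
    # Stand-in for the real scrape-and-extract call.
    fake_fetch = lambda url: f"<article from {url}>"
    print(cache.get_or_fetch("https://example.com/a", fake_fetch))
    print(cache.get_or_fetch("https://example.com/a", fake_fetch))  # served from memory
```

While the instance stays warm, repeat requests skip the scrape entirely; a cold start just begins with an empty cache.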
You might have come across this already, but we maintain a collection of article extraction rules for various sites here: https://github.com/fivefilters/ftr-site-config. It was adapted from a database maintained by Instapaper in its early days, and today it has contributions mainly from users and developers of an open source Instapaper/Pocket alternative called Wallabag: https://github.com/wallabag/wallabag
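As a rough idea of how such rules could be consumed: the files in that repository are plain text, with lines of the form `directive: value` (for example `title:`, `body:`, and `strip:` followed by an XPath expression) and `#` for comments. The sketch below only does that basic split; it assumes one directive per line, does not handle parameterised directives, and does not apply the XPath to a page. The function name `parse_site_config` and the sample rules for example.com are invented for illustration.

```python
from collections import defaultdict


def parse_site_config(text: str) -> dict[str, list[str]]:
    """Group 'directive: value' lines by directive, keeping file order."""
    rules: dict[str, list[str]] = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first colon only, so URLs and XPath axes stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a directive line
        rules[key.strip()].append(value.strip())
    return dict(rules)


# Hypothetical rules, written in the style of the repository's files.
SAMPLE_CONFIG = """\
# made-up rules for example.com
title: //h1[@class='headline']
body: //div[@id='article-body']
strip: //div[@class='related']
strip: //aside
test_url: https://example.com/news/some-article
"""

if __name__ == "__main__":
    for directive, values in parse_site_config(SAMPLE_CONFIG).items():
        print(directive, values)
```

The resulting `body` and `strip` expressions could then be handed to whatever XPath-capable HTML parser the backend already uses.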
This looks really interesting! I've been building an RSS reader for myself in my free time, and this will be really useful. I was wondering if you know anything about the legal implications of scraping full content like this and packaging it up? I was planning to do it with some fun added-on features, but was worried it would be considered copyright infringement (since I would basically be re-hosting other sites' content without permission). And some websites outright ban this kind of usage in the TOS for their RSS feeds. For example, from the Washington Post[1]:
> a. For any article, you may not display more text than we provide in the RSS feed.