
RSS feeds are a bit of a mess, but only because each publisher's implementation is slightly different. The vast majority only send a headline and summary in the RSS feed, so I then have to scrape and extract the article on the backend to populate the content, which is its own challenge, and I've not gotten it to work for a few sites yet.
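To make the headline-and-summary problem concrete, here is a minimal sketch (not my actual code) of reading an RSS 2.0 feed and flagging the items that still need a scrape. It assumes that full text, when a publisher sends it, arrives in the `content:encoded` element. The names `parse_feed`, `needs_scrape` and `SAMPLE` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespaced tag of the RSS "content" module, which some publishers
# use to carry the full article body.
CONTENT_ENCODED = "{http://purl.org/rss/1.0/modules/content/}encoded"


def parse_feed(xml_text):
    """Return one dict per <item>, flagging items that lack full content."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        full = (item.findtext(CONTENT_ENCODED) or "").strip()
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
            "content": full,
            # No full body in the feed: the article page has to be scraped.
            "needs_scrape": not full,
        })
    return items


# Hypothetical feed: the first item is summary-only, the second has full text.
SAMPLE = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<item><title>A</title><link>https://example.com/a</link><description>Short</description></item>
<item><title>B</title><link>https://example.com/b</link><description>Short</description><content:encoded><![CDATA[<p>Full text</p>]]></content:encoded></item>
</channel></rss>"""

if __name__ == "__main__":
    for entry in parse_feed(SAMPLE):
        print(entry["title"], "needs scrape:", entry["needs_scrape"])
```

Real feeds vary a lot more than this (Atom, missing links, HTML inside the description), which is exactly where the mess comes from.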

This is also running in a semi-serverless container on Google Cloud Run (only costs me £1 a month!), so fetching and re-caching all of that when a new container is scheduled is painful. However, it seems like state in the container is persisted longer than I initially thought, so it's good enough for now.




You might have come across this already, but we maintain a collection of article extraction rules for various sites here: https://github.com/fivefilters/ftr-site-config - it was adapted from a database maintained by Instapaper in its early days and today has contributions mainly from users and developers of an open-source Instapaper/Pocket alternative called Wallabag: https://github.com/wallabag/wallabag

Also usable with a free version of Full-Text RSS available here: https://bitbucket.org/fivefilters/full-text-rss/src/master/
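For a rough idea of how such rule files could be consumed, here is a small sketch that reads `directive: value` lines into a dictionary. It assumes the plain-text layout of one directive per line with `#` comments. The sample rules, the domain and the function name `parse_site_config` are invented for illustration, so check the repository's README for the authoritative list of directives and their meaning.

```python
from collections import defaultdict


def parse_site_config(text):
    """Parse 'directive: value' lines into a dict of directive -> list of values."""
    rules = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first colon only, so XPath values and URLs stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'directive: value' line
        rules[key.strip()].append(value.strip())
    return dict(rules)


# Hypothetical rules for a made-up site; a directive may appear more than once.
SAMPLE_CONFIG = """\
# example.com
title: //h1[@class='headline']
body: //div[@id='article-body']
strip: //div[@class='related']
strip: //aside
test_url: https://example.com/news/story
"""

if __name__ == "__main__":
    for directive, values in parse_site_config(SAMPLE_CONFIG).items():
        print(directive, values)
```

The XPath values would then go to whatever HTML library the scraper already uses. This sketch only covers reading the file, not applying the rules.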


This looks really interesting! I've been building an RSS reader for myself in my free time, and this will be really useful. I was wondering if you know anything about the legal implications of scraping full content like this and packaging it up? I was planning to do it with some fun added-on features, but was worried it would be considered copyright infringement (since I would basically be re-hosting other sites' content without permission). And some websites outright ban this kind of usage in the TOS for their RSS feeds. For example, from the Washington Post [1]:

> a. For any article, you may not display more text than we provide in the RSS feed.

[1]: https://www.washingtonpost.com/rss-terms-of-service/2012/01/...


AWESOME! Thanks so much, I'll look into that. I think it was the Instapaper service I was using back in the day.


Cool, thanks! I'm doing a lot with AWS serverless, it's a great way to go (and similarly cheap). Any chance you could open-source your scraping code?



