XPath Scraping with FreshRSS (danq.me)
79 points by ulrischa on Dec 26, 2023 | 14 comments


I wonder how this would work with more and more sites behind Cloudflare and the like. Websites really don't want this, since in today's economy wasting a human's time is paramount (they call it "engagement"). A computer, even one working on behalf of a human, is not enough.


Fortunately, I find the value of a website is often inversely related to how much it cares about engagement.

(fwiw everyone I've wanted to follow has already had a feed; it's just that sometimes I've had to grovel around with View Source to find it)


curl-impersonate gets right through Cloudflare. ;)
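
For anyone curious, here's a minimal sketch using the curl_cffi Python bindings for curl-impersonate (the URL is a placeholder, and the exact impersonation target string depends on your curl_cffi version):

    # pip install curl_cffi  (Python bindings for curl-impersonate)
    from curl_cffi import requests

    # Impersonate a real browser's TLS/HTTP fingerprint; "chrome" works on
    # recent curl_cffi versions, older ones want a versioned target string.
    resp = requests.get("https://example.com/blog", impersonate="chrome")
    print(resp.status_code, len(resp.text))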


Note that this is only needed if the website has no RSS feed whatsoever. If the website has a partial/truncated RSS feed that only contains headlines/partial text, you can use the "Article CSS selector on original website" feature.

That will retrieve the list of items using RSS but will fetch the article content by getting the URL and grabbing the HTML element you specified (like "article.post").
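For reference, the general pattern (fetch the feed for links, then pull each article body with a CSS selector) looks roughly like this in Python. This is just an illustrative sketch with placeholder URLs and selectors, not FreshRSS's actual implementation (which is PHP):

    # pip install feedparser requests lxml cssselect
    import feedparser
    import requests
    from lxml import html

    FEED_URL = "https://example.com/feed.xml"   # placeholder truncated feed
    SELECTOR = "article.post"                   # placeholder CSS selector

    for entry in feedparser.parse(FEED_URL).entries:
        page = html.fromstring(requests.get(entry.link, timeout=10).content)
        matches = page.cssselect(SELECTOR)
        # Use the selected element's HTML as the full article content.
        body = html.tostring(matches[0], encoding="unicode") if matches else ""
        print(entry.title, len(body))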


As my use case for RSS is "seeing a blurb for the 99% of things I don't wish to read, once and only once", I'd call a "partial/truncated RSS feed" that only contains headlines/partial text an "unbloated feed".


I tried to dig it out of the PHP source, but without a local checkout it was non-obvious. I wonder whether any such "synthetic RSS feed" system honors the ETag and/or Last-Modified and/or cache headers of the target page, or whether every feed refresh unconditionally loads the upstream page only to throw away 90% of the generated HTML.
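For what it's worth, honoring those headers is just a conditional GET. A rough sketch of what that would look like (illustrative only, not how FreshRSS actually does it):

    # Conditional GET: only re-download the page when it has changed.
    import requests

    cached = {"etag": None, "last_modified": None, "body": None}  # hypothetical cache

    headers = {}
    if cached["etag"]:
        headers["If-None-Match"] = cached["etag"]
    if cached["last_modified"]:
        headers["If-Modified-Since"] = cached["last_modified"]

    resp = requests.get("https://example.com/blog", headers=headers, timeout=10)
    if resp.status_code == 304:
        body = cached["body"]   # upstream unchanged, reuse the cached copy
    else:
        body = resp.text
        cached.update(etag=resp.headers.get("ETag"),
                      last_modified=resp.headers.get("Last-Modified"),
                      body=body)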

And that's not even getting into the raging tire fire that is Akamai / Cloudflare / whatever anti-bot technologies. I did see support for HTTP proxies, but it wasn't clear whether that was something one could set on a per-XPath-feed basis or whether the whole system had to run through one proxy (potentially $$$).


Proxies won't help - most of the aforementioned providers now do TCP, TLS and browser fingerprinting and use a heuristic approach. You need to be able to provide a consistent fingerprint to all of those to pass.

Proxies are actually pretty useless; to fake those fingerprints you need to go one layer lower. What you need is a VPN instead (by VPN I mean an IP-level tunnel, not a public VPN provider: those IPs are already blacklisted, and often for good reason).


XPath still rules at web scraping, though CSS selectors are often the better choice. With CSS selectors it's much harder to shoot yourself in the foot with selector design, and most web devs already know them. So really it's best to mix both: CSS for most cases, falling back to XPath for more complex operations.

For example, the XPath selector used in the article `//li[@class="blog__post-preview"]` would break if one more class were added, which happens very often in the real world, while the CSS selector `li.blog__post-preview` wouldn't. (The correct XPath here would be `//li[contains(@class,"blog__post-preview")]`, or even more accurately `//li[contains(concat(" ", normalize-space(@class), " "), " blog__post-preview ")]` - yeah, it's really ugly.)
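
To make the difference concrete, here's a quick check with lxml (which supports both selector flavors); the toy HTML is mine, not from the article:

    # pip install lxml cssselect
    from lxml import html

    doc = html.fromstring('<ul><li class="blog__post-preview featured">Post</li></ul>')

    # Exact-match XPath misses the element once a second class is present:
    print(doc.xpath('//li[@class="blog__post-preview"]'))        # []

    # Token-aware XPath still matches:
    print(doc.xpath('//li[contains(concat(" ", normalize-space(@class), " "), '
                    '" blog__post-preview ")]'))                 # [<Element li>]

    # The equivalent CSS selector matches regardless of extra classes:
    print(doc.cssselect('li.blog__post-preview'))                # [<Element li>]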

Either way, both CSS and XPath selectors are really easy to learn, and it would be great if more tools adopted them the way FreshRSS did! I made some interactive cheatsheets covering the edge cases if anyone is interested in web scraping parsing weirdness :) https://scrapfly.io/blog/css-selector-cheatsheet/ and https://scrapfly.io/blog/xpath-cheatsheet/


Great post. You could also use Automize.dev to generate those XPaths more easily ;) Disclaimer: I am the author of Automize.


I would be interested in a full integrated workflow demo.


Nice one. Former feedity.com now charges a tiny fee of USD 41 per month for this feature :-(


I've been using Feedbro, but I'm interested in a (self-)hosted solution, so I'll probably check out FreshRSS! Thanks!


I've been using "Feed me up Scotty"[1] in combination with GitHub Actions (free) to periodically fetch and build my RSS feeds.

The only part that is still inconvenient is creating these XPaths, which requires some trial and error (see the sketch below).

1. https://gitlab.com/vincenttunru/feeds
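
In case it helps, one way to take some of the trial and error out of it is to test a candidate XPath locally before putting it in the config. A throwaway sketch (URL and expression are placeholders):

    # pip install requests lxml
    import requests
    from lxml import html

    URL = "https://example.com/blog"                           # placeholder
    XPATH = '//li[contains(@class, "blog__post-preview")]'     # candidate expression

    doc = html.fromstring(requests.get(URL, timeout=10).content)
    for node in doc.xpath(XPATH):
        # Print a snippet of each match to judge whether the selector is right.
        print(node.tag, node.text_content().strip()[:80])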


RSS-Bridge [1] seems to do the same, but it's not coupled to any RSS reader.

[1] https://github.com/RSS-Bridge/rss-bridge



