XPath Scraping with FreshRSS (danq.me)
79 points by ulrischa on Dec 26, 2023 | 14 comments


I wonder how this would work with more and more sites behind Cloudflare and the like. Websites really don't want this, since in today's economy wasting a human's time is paramount (they call it "engagement"). A computer, even one working on behalf of a human, is not enough.


Fortunately, I find the value of a website is often inversely related to how much it cares about engagement.

(fwiw everyone I've wanted to follow has already had a feed; it's just that sometimes I've had to grovel around with View Source to find it)


curl-impersonate gets right through Cloudflare. ;)
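
For anyone curious, here's a minimal sketch using the curl_cffi Python bindings for curl-impersonate (the URL is a placeholder, and the exact impersonation target string depends on your curl_cffi version):

    # pip install curl_cffi  (Python bindings for curl-impersonate)
    from curl_cffi import requests

    # Impersonate a real browser's TLS/HTTP fingerprint; "chrome" works on
    # recent curl_cffi versions, older ones want a versioned target string.
    resp = requests.get("https://example.com/blog", impersonate="chrome")
    print(resp.status_code, len(resp.text))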


Note that this is only needed if the website has no RSS feed whatsoever. If the website has a partial/truncated RSS feed that only contains headlines/partial text, you can use the "Article CSS selector on original website" feature.

That will retrieve the list of items using RSS but will fetch the article content by getting the URL and grabbing the HTML element you specified (like "article.post").
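For reference, the general pattern (fetch the feed for links, then pull each article body with a CSS selector) looks roughly like this in Python. This is just an illustrative sketch with placeholder URLs and selectors, not FreshRSS's actual implementation (which is PHP):

    # pip install feedparser requests lxml cssselect
    import feedparser
    import requests
    from lxml import html

    FEED_URL = "https://example.com/feed.xml"   # placeholder truncated feed
    SELECTOR = "article.post"                   # placeholder CSS selector

    for entry in feedparser.parse(FEED_URL).entries:
        page = html.fromstring(requests.get(entry.link, timeout=10).content)
        matches = page.cssselect(SELECTOR)
        # Use the selected element's HTML as the full article content.
        body = html.tostring(matches[0], encoding="unicode") if matches else ""
        print(entry.title, len(body))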


As my use case for RSS is "seeing a blurb for the 99% of things I don't wish to read, once and only once", I'd call a "partial/truncated RSS feed" that only contains headlines/partial text an "unbloated feed".


I tried to dig it out of the PHP source, but without a local checkout it was non-obvious. I wonder whether any such "synthetic RSS feed" system honors the ETag and/or Last-Modified and/or cache headers of the target page, or whether every feed refresh unconditionally loads the upstream page only to throw away 90% of the generated HTML.
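For what it's worth, honoring those headers is just a conditional GET. A rough sketch of what that would look like (illustrative only, not how FreshRSS actually does it):

    # Conditional GET: only re-download the page when it has changed.
    import requests

    cached = {"etag": None, "last_modified": None, "body": None}  # hypothetical cache

    headers = {}
    if cached["etag"]:
        headers["If-None-Match"] = cached["etag"]
    if cached["last_modified"]:
        headers["If-Modified-Since"] = cached["last_modified"]

    resp = requests.get("https://example.com/blog", headers=headers, timeout=10)
    if resp.status_code == 304:
        body = cached["body"]   # upstream unchanged, reuse the cached copy
    else:
        body = resp.text
        cached.update(etag=resp.headers.get("ETag"),
                      last_modified=resp.headers.get("Last-Modified"),
                      body=body)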

And that's not even getting into the raging tire fire that is Akamai / Cloudflare / whatever anti-bot technologies. I did see support for HTTP proxies, but it wasn't clear whether that was something one could set on a per-XPath-feed basis or whether the whole system had to run through one proxy (potentially $$$).


Proxies won't help - most of the aforementioned providers now do TCP, TLS and browser fingerprinting and use a heuristic approach. You need to be able to provide a consistent fingerprint to all of those to pass.

Proxies are actually pretty useless; to fake those fingerprints you need to go one layer lower. What you need is a VPN instead (by VPN I mean an IP-level tunnel, not a public VPN provider: those IPs are already blacklisted, and often for good reason).


XPath still rules at web scraping, though CSS selectors are often the better choice. With CSS selectors it's much harder to shoot yourself in the foot with selector design, and most web devs already know them. So really it's best to mix both: CSS for most cases, falling back to XPath for more complex operations.

For example, the XPath selector used in the article `//li[@class="blog__post-preview"]` would break if one more class were added, which happens very often in the real world, while the CSS selector `li.blog__post-preview` wouldn't. (The correct XPath here would be `//li[contains(@class,"blog__post-preview")]`, or even more accurately `//li[contains(concat(" ", normalize-space(@class), " "), " blog__post-preview ")]` - yeah, it's really ugly.)
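
To make the difference concrete, here's a quick check with lxml (which supports both selector flavors); the toy HTML is mine, not from the article:

    # pip install lxml cssselect
    from lxml import html

    doc = html.fromstring('<ul><li class="blog__post-preview featured">Post</li></ul>')

    # Exact-match XPath misses the element once a second class is present:
    print(doc.xpath('//li[@class="blog__post-preview"]'))        # []

    # Token-aware XPath still matches:
    print(doc.xpath('//li[contains(concat(" ", normalize-space(@class), " "), '
                    '" blog__post-preview ")]'))                 # [<Element li>]

    # The equivalent CSS selector matches regardless of extra classes:
    print(doc.cssselect('li.blog__post-preview'))                # [<Element li>]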

Either way, both CSS and XPath selectors are really easy to learn, and it would be great if more tools adopted them the way FreshRSS did! I made some interactive cheatsheets covering the edge cases if anyone is interested in web scraping parsing weirdness :) https://scrapfly.io/blog/css-selector-cheatsheet/ and https://scrapfly.io/blog/xpath-cheatsheet/


Great post. You could also use Automize.dev to generate those XPaths more easily ;) Disclaimer: I am the author of Automize.


I would be interested in a full integrated workflow demo.


Nice one. Former feedity.com now charges a tiny fee of USD 41 per month for this feature :-(


I've been using Feedbro, but I'm interested in a (self-)hosted solution, so I'll probably check out FreshRSS! Thanks!


I've been using "Feed me up Scotty"[1] in combination with GitHub Actions (free) to periodically fetch and build my RSS feeds.

The only part that is still inconvenient is creating these XPaths, which requires some trial and error (see the sketch below).

1. https://gitlab.com/vincenttunru/feeds
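
In case it helps, one way to take some of the trial and error out of it is to test a candidate XPath locally before putting it in the config. A throwaway sketch (URL and expression are placeholders):

    # pip install requests lxml
    import requests
    from lxml import html

    URL = "https://example.com/blog"                           # placeholder
    XPATH = '//li[contains(@class, "blog__post-preview")]'     # candidate expression

    doc = html.fromstring(requests.get(URL, timeout=10).content)
    for node in doc.xpath(XPATH):
        # Print a snippet of each match to judge whether the selector is right.
        print(node.tag, node.text_content().strip()[:80])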


RSS-Bridge [1] seems to do the same, but it's not coupled to any RSS reader.

[1] https://github.com/RSS-Bridge/rss-bridge



