So many feed readers, so many behaviors

vallode · on May 28, 2024

I'm not sure how I feel about the content of this entire post. RSS feeds are a more or less stagnant technology, mostly adopted "for fun" or by a niche of people who find them useful. The only way I can see to move forward is to make them as easy and painless to use as possible, the onus falls on both the creator and consumer of feeds (the websites and the clients... the user is completely out of the equation here in my opinion).

This kind of attitude reeks of the "you're using it wrong!" of the Linux world. Are you really telling me you are getting enough RSS feed requests to put a dent on your tech stack? Is your bandwidth overhead suffering that much (are you not caching?)? Make it painless and let's be thankful it isn't all web scrapers masquerading as users.

Mind-boggling problem to be angry about.

boricj · on May 28, 2024

Some of us are not serving our blogs through Cloudfare. In fact, I'm using a 5 year old entry-level Synology NAS located in my apartment to serve mine. I do that because it's already always-on by policy and therefore doesn't cost me anything extra to serve my blog from there, besides the DNS domain name.

Badly behaved RSS readers that download feeds uncached are wasting orders of magnitude more of bandwidth and CPU (gotta encrypt it for HTTPS) on my end than well-behaved clients that get served 304s. Some of them don't even set "Accept-Encoding: gzip" and download it uncompressed. That's hundreds of kilobytes per request wasted for nothing.

My blog doesn't see enough traffic to make this an issue for me, but I can see why this could be a real problem for popular blogs with loads of traffic.

madeofpalk · on May 28, 2024

I think it's generally a poor idea to serve a blog, especially a popular one with loads of traffic, from an entry-level Synology NAS.

rakoo · on May 28, 2024

I think it's generally a poor idea to require spending multiple hundreds of dollars to have your website on something you control.

boricj · on May 28, 2024

It's kept up-to-date automatically. It is appropriately firewalled to only expose ports 80 and 443, the administration panel and the rest of the services hosted on it are only reachable from my LAN. It only serves static content, no scripting language is enabled.

A determined attacker might be able to get in despite all of these precautions, but it's at least administered in a somewhat responsible manner. I highly doubt most servers exposed on the internet are.

oliwarner · on May 28, 2024

Right, and if you stuck that behind (eg) Cloudflare via cloudflared, it'd be faster for your readers, more secure for you (no direct access) and have no impact on your network or NAS's resources.

Right tools for the job. It's not a failing to use a cache.

madeofpalk · on May 28, 2024

You don't even need cloudflare! If your blog is only updated infrequently (>1/day), serving what is essentially static text should not be difficult.

muppetman · on May 28, 2024

I can't wait for the day clownfare suddenly put a price on this stuff they've been giving everyone for free. I find it hilarious people who get 200 hits a day on their blog think they need it.

oliwarner · on May 28, 2024

That's right but I'm not making an argument of necessity.

I'm saying it's better for both you and your users to keep network traffic at a proper CDN than a home ISP network. It'll be faster for everyone.

1oooqooq · on May 28, 2024

except cache is broken on most browsers thanks to https

still better than giving up to cloudflare

madeofpalk · on May 28, 2024

This is the second time today I've read a HN comment saying that HTTPS and caching are incompatible.

Can you explain what you mean by this? I don't understand what you're saying (it seems obviously incorrect?)

1oooqooq · on May 28, 2024

visit any page which should be cached. click any link. unplug. click back. dead end instead of cached page.

cache only works on https for extra assets. which is useless if you have a simple static site anyway

duckmysick · on May 28, 2024

I just tested it. Can't replicate.

Went to https://example.com, checked that it's cached in the inspector, clicked on the More information link which leads to https://www.iana.org/domains/example, unplugged my connection (went offline), clicked back. It showed the cached https://example.com. I clicked forward. It showed the cached https://www.iana.org/domains/example page. Clicked back/forward like a maniac, the pages switched seamlessly.

Repeated the same process on a non-cacheable page, it showed a dud when disconnected as expected.

Tested on Firefox and Chromium.

cqqxo4zV46cp · on May 28, 2024

What? No it isn’t.

muppetman · on May 28, 2024

I think it's a shame that the web has become so bloated with frameworks and images and video and ads that you _can't_ easily serve a highly trafficked website from a ~20Mbps connection. It shouldn't need Clownfare etc.

mdp2021 · on May 28, 2024

> RSS feeds [would be] a more or less stagnant technology, mostly adopted "for fun" or by a niche of people who find them useful

Pray tell, what would be the good alternative, for this «niche of people» who collect the news? We use RSS because news are collected through RSS.

And clients have to be properly, wisely configured: some publish monthly some much more than hourly; some are a feed per domain, others are thousands of feeds per domain... This constitutes a set-up problem.

8organicbits · on May 28, 2024

RSS is the best option if you are looking to avoid walled gardens. It's a great way to find content without search ads or social media ads. RSS readers put you in control, instead of algorithmic social network feeds that manipulate you into doom scrolling. I think RSS has been growing as the social networks enshittify.

cqqxo4zV46cp · on May 28, 2024

This is delusional. Are you not able to identify when a thing that you do isn’t very widely done?

mdp2021 · on May 28, 2024

> An odd interpretation of "democracy" is lurking according to which "my ignorance is worth as much as your knowledge"

~~ Isaac Asimov

> An odd interpretation of "democracy" is lurking according to which we should look at masses to take example, instead of warning

~~ mdp2021

mdp2021 · on May 28, 2024

Delusional over what?

Why should "wide adoption" be relevant? We are already very well informed (much too well informed) that "people" are """odd""". What are your assumptions?

If you need to drive a screw, and few people used screwdrivers - who cares? You'd still use screwdrivers even when people tried to use cakes or monkeys or simply stopped driving screws, would you not?

You already expect "people" to use cakes or monkeys for something when they would normally be expected to drive screws - actually, you expect to be surprised with much worse ideas becoming realities. So?

Screwdrivers remain relevant, and using them properly remains equally relevant. And especially so, when you note that people are there with loose parts stuck together because the cake smudged monkey was not precise!

mariusor · on May 28, 2024

I see very little in the article that's actually targeted at people that hold it wrong. My interpretation is that the rant is targeted at RSS service developers who should know better, and for whom you are inventing excuses to justify laziness or incompetence.

troupo · on May 28, 2024

It's a bigger issue than RSS, really. Fetching RSS is a simple GET request. It requires the most basic understanding of HTTP, and people still can't do it right: they don't know how to deal with standard headers, how to deal with standard response codes, how to send proper requests etc.

Do you think regular REST API calls to any other service are any different?

mimischi · on May 28, 2024

Any pointers for good resources to grok best practices?

troupo · on May 28, 2024

Not sure about best practices, but these two resources are a good reference point:

- Know Your HTTP Well: https://github.com/for-GET/know-your-http-well

- HTTP Decision Diagram: https://github.com/for-GET/http-decision-diagram

vaylian · on May 28, 2024

Cache-Control goes a long way in the right direction: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Ca...

vallode · on May 28, 2024

Very valid point, the frustrations here do share commonalities with the overall HTTP ecosystem.

hennell · on May 28, 2024

So you could be angry at web scrapers, but not at RSS readers making badly formed requests every 5 minutes?

Making it painless is letting readers use the options available, supporting the commonly used standards and maybe fixing small problems (I'd probably space trim that URL for example). It doesn't mean supporting systems that are sending poorly formed or badly controlled requests, regardless of the impact it has on your tech stack.

geor9e · on May 28, 2024

Conspiracy theory (that I genuinely believe): Consumers don't know what they want, and in reality they would love RSS if it was allowed to blossom. But, RSS readers make the web unmonetizable, so they've been actively destroyed from every angle, in favor of the enshittified feeds the world is now addicted to. There are so many ways to quietly bury a certain technology if the market incentives are stong enough. You can't algorithmically manipulate and addict someone who just follows accounts chronologically without ads. So silicon valley market forces tend to discourage RSS. Google killed their RSS reader once they realized browser-based feed browsing generated way more adsense profits. Other readers get VC investment and mysteriously their free version becomes unusable garbage and adoption plateaus. Not every controlling interest in a VCs portfolio is benevolent. Facebook, twitter, etc all make it against their terms of service to "scrape" your own friends privately shared posts - you can only see them via the walled garden of ads. Apples app store fees would drop if consumers understood the utopia an RSS based web has the potential of being, instead of a dozen addictive apps. But it was too free. RSS derails trillion dollar roadmaps. The incentives are clear, and silicon valley knows it.

cess11 · on May 28, 2024

It's like pissing in the street. No one gets hurt and the street isn't going to break, but there will be a smell and it's perceived as a rather rude behavior except in the case of animals and small children.

cess11 · on May 28, 2024

The downvotes remind me of a thing at a job. We had an API for programmatic access to our customer's data, and one customer had bought some expensive BI-solution that they wanted to feed with data from the system we provided. For months they came to us and complained that the API was broken and asked us to fix it.

When I looked in the logs I could see that they hit the API with a lot of requests in short succession that generated 403 responses, so we said that they need to look at how they authorise. After a while they returned and claimed our API was broken. Eventually I offered to look at their code and they were like 'yeah sure but your silly little PHP dev won't understand our perfect C# application'.

So I looked at it and it was a mess of auto-concurrent, nested, for-loops. If they had gotten any data out it would have been an explosion of tens of thousands of requests within a few seconds. They also didn't understand the Bearer-scheme and just threw the string they got from the auth-endpoint straight into the Authorization-header without any prefix. Maybe we should have answered with 400 instead of 403, but yeah, that would have been a breaking change and we didn't want or have time to do a new API version because of this.

Anyway, their tech-manager got really mad that I found the issue within an hour that they had struggled with for months, and also had mentioned that the API-adapter was designed for DoS rather than a polite API-consumer and maybe they should rewrite it to be less greedy and maybe also use ranges instead of crapping out a new request per row in in a response and stitching it together again on their end.

A few weeks later they got it running and it was brutal, but our machines could take the load so we didn't think more about it. Later I heard they got performance issues on their end and had to do as I had suggested anyway.

Be polite and pay attention to detail when you integrate with protocols and API:s. At best you'll be a nuisance if you don't, but many will just block you permanently.

eviks · on May 28, 2024

Thank you! this is a more interesting comment than the pissing example

cess11 · on May 28, 2024

I'm happy you liked it. Didn't think of it as hypothetical, since it happens a lot.

I'm a simple person, I often prefer the succinct, crude analogy over telling a story until asked or provoked into telling one.

0x445442 · on May 28, 2024

I liked your story AND your analogy.

michaelt · on May 28, 2024

> Are you really telling me you are getting enough RSS feed requests to put a dent on your tech stack?

Both RSS feed providers and reader maintainers believe that RSS isn't dead, it's just pining for the fjords.

They've got to keep their readers efficient ready for RSS to rise again - much like Christians have to resist the temptation of sin ready for the second coming of Christ.

/s

oliwarner · on May 28, 2024

I love RSS but this approach worries too much.

It's static content, at least as static as the rest of your content, so write once and let one of a dozen CDNs cache it for you for free. They will do a good job of setting cache headers for you. If people want to ignore them and honk on it from their crappy readers, that's on them.

Even if you're personally serving it from your kitchen toaster, it's static content. You need a significant number of bad actors for this to be a problem.

Moreover, if this post is anything to go by, treating a bad reader with a logical hurdle and expecting a sane result is baffling.

It's not worth this much headspace.

rakoo · on May 28, 2024

I don't like this approach: it means that instead of fixing the problem, we just put the problem under the rug and let a third-party centralize always more of our communications, giving them absurd power over our infrastructure. That's not an engineering solution, that's a politician solution.

oliwarner · on May 28, 2024

The reason I tolerate the power over our infrastructure is that it's completely escapable.

Building and caching my blog can be done well and freely by a dozen services, and it can easily fall back to me on my server if all else fails.

If we were talking about edge functions and proprietary integrations, that might be a more significant issue, but static websites are insanely easy to push off to CDN services.

rakoo · on May 28, 2024

When so many replies say "just use cloudflare" I disagree: it means our way of resolving this problem is to rely on machinery that costs millions if not tens of millions to build. That's like saying it's ok to build cities and our lives around cars because you can totally change your car provider at any time and they all have your best interest in mind.

oliwarner · on May 28, 2024

Depends on the layer we're talking about. You can replicate CI/CD with simple scripts and git hooks, but hosting your site from ~300 locations is going to take some money.

I'm not advocating Cloudflare because they're doing something special, it's because they're doing something utterly menial, for free and better than any one person or small company could.

It's absolutely right be be cognisant of service dependencies but none exist here. They're just making a dull job easier.

cqqxo4zV46cp · on May 28, 2024

How about arguing against the point being made instead of your analogy, which presents a scenario that does not at all match the original one. Who said anything about having your best interest in mind? Cosplay as a lone wolf all you want, but the reality is that people, especially the sort of people that know how to configure a CDN, wouldn’t last a minute without the support of others, delivered through capitalism or otherwise.

rakoo · on May 28, 2024

Yes but talking about not solving the underlying problem, just putting it all behind a third-party that is known to have issues , which is not just malevolence because it's bound to happen at its size, is the point: you want to solve the issue of how to be efficient, how to make it hostable on consumer-grade machines and connection, you can't involve the magic billion-dollar intermediary. It just moves the problem to another place and it makes you depend on something you can never control. That's not really interesting for hosting your own website.

1oooqooq · on May 28, 2024

Wow! you can get the exact same innovative service as geocities! that's so cool and proves we are not stuck or anything. yeah edge functions or god fordid hosting something on your network to the world is too professional.

kkfx · on May 28, 2024

Are you sure it's static? How many regularly update the same article after having published it, just for instance?

kevincox · on May 28, 2024

It's "mostly static". I just tell my CDN to cache HTML for 10min. This makes the origin server load trivial but still makes updates go live fairly quickly.

soapdog · on May 28, 2024

Exactly. Mine is not static for example, but I’m a low traffic blog and it is not a problem yet.

oliwarner · on May 28, 2024

I do. But that still falls well into the umbrella of WORM for me.

In practice, my SSG generates my RSS at the same time as my website and it's all running under Cloudflare Pages so it's just deployed and I don't have to think about any of it past the git commit.

I think that's what I'm trying to get at here. If you're dynamically generating RSS for a dynamic blog today, you've made some significant design errors.

cqqxo4zV46cp · on May 28, 2024

No they haven’t. They just may have not yet optimised for a problem that they’re not facing. Most websites are very very low-traffic. A world exists outside of web scale. Sheesh.

madeofpalk · on May 28, 2024

This article is clearly about clients requesting a URL every 5 minutes is too much for them. They have this problem.

ghusbands · on May 28, 2024

> There's another one which hits every 2 minutes without fail [...] it's the one feed user-agent I've had to block outright. It doesn't stop requesting things [...] Clearly, I need to use a bigger hammer.

I'd be tempted to go the opposite direction and give it articles that seem new, every time, called "Your feed reader is very broken and you should use a different one".

boricj · on May 28, 2024

Just give it a custom feed with only one article titled "Your feed reader sucks" explaining why they get this. Maybe even generate the contents of this article dynamically to provide diagnostic messages.

That way, badly-behaved clients can't use their broken RSS to receive actual articles until they fix it and they get to know why exactly.

kevincox · on May 28, 2024

I checked what my feed reader does. It seems that it typically gets 304s (I send both If-None-Match and If-Modified-Since) but still about half of the requests get 429s.

So the request pattern basically looks like:

1. Get 304.

2. Wait 5min.

3. Get 429.

4. Error backoff evenly distributed between 30-60min

5. Go back to step 1.

So I guess 5min is "too fast" even for conditional requests. However the reason my reader picks 5min is:

1. There is no caching headers to suggest the author's preference.

2. Conditional requests are supported.

3. The feed is fast.

4. The feed is popular.

5. The feed is active (posts every week or so)

Sure, for her feed it could probably be lower but how is a robot supposed to know that? I would highly recommend setting a cache header. That gives an automatic signal of when the last fetch should be considered stale. I'll admit that many feed readers just use a fixed schedule, but many will still use a library that accidentally caches requests (maybe the browser extension readers that she was complaining about would?). For my feed reader we won't poll more often than your cache header (Cache-Control or Expires will both work) unless it is more than 24h in which case we will poll no more frequently than daily. If you just post blog posts about it it is never going to change, it is better to start pushing for broader support of an actual protocol that could be implemented.

Another great option is supporting WebSub. It is easy to get set up with a public hub and then readers that support it will poll very rarely. (Mine will poll weekly.)

NoboruWataya · on May 28, 2024

Granted I do not have any blog feed, much less a reasonably popular one like Rachel by the Bay, but I would have thought that serving simple static content like an RSS feed is quite low-cost such that it almost doesn't matter how much people are querying, no? I'd be quite surprised if the author's server was genuinely struggling under the weight of these requests but happy to be corrected.

boricj · on May 28, 2024

Rachel by the Bay's atom.xml file is about 544 KiB (which for some reason seems to be always sent uncompressed regardless of the "Accept-Encoding" header, but I digress). It's by far the most requested file, with every subscriber's feed reader polling it periodically. For example, Feeder's default polling frequency is ten minutes, but some might poll more frequently.

That file is served over HTTPS, so every response must be encrypted individually. If it's cached by the client, that's a 304 and a couple hundred bytes of response headers to encrypt. If it's not, that's a 200 and a couple hundred thousand bytes to encrypt.

Now imagine if a garbage reader doesn't cache the feed and requests the full file every time, like Tiny Tiny RSS tends to do in my access logs. Multiply that by the number of subscribers using these busted readers. Multiply that by the number of requests they do each day.

That's a whole lot of load on the server's CPU wasted for nothing.

rfl890 · on May 28, 2024

Serving a kilobytes static file over HTTPS at around ~1k rps shouldn't be too hard for a modern server in the year of our lord 2024 right?

boricj · on May 28, 2024

These readers download 544 KiB per request. At 1k rps from badly behaved readers, that's 544 MiB of uncached responses to encrypt and upload per second. That's enough to saturate a gigabit Ethernet link five times over.

The server also has to process the useful requests on top of that.

wolpoli · on May 28, 2024

I was surprised to hear that they got authors of RSS readers to get in touch and fix issues. Last time I looked for an RSS reader for Windows, it was a graveyard of abandoned software.

manuelmoreale · on May 28, 2024

Plenty of very actively maintained projects at least on the Apple side of things.

People are absolutely still using RSS, the tech is very much not dead.

Zecc · on May 28, 2024

The irony of the situation is this post might incentivize new people to subscribe to her feed. I know I'm one of them.

codetrotter · on May 28, 2024

Hopefully taking the time to investigate how the reader works and pick one that behaves nicely.

FinnKuhn · on May 28, 2024

Is there a easy way to see how your reader works?

NoboruWataya · on May 28, 2024

Any reader worth using should have a settings or preferences section where poll interval can be configured, which should address the main issue in the article. Other problems like incorrect If-Modified-Since headers are probably not easily discernable without looking at source code or pointing the reader to your own page and checking the logs.

FinnKuhn · on May 28, 2024

I am using inoreader (cloudbased), which seems to adjust its polling rate depending on how frequently the feed was updated in the past and how many users are following it. This seems to be between 10 minutes for things like the verge and 60+ minutes for smaller feeds, but I am not sure, if it also takes into account rss header information. The Feed of the blog of this post seems to be refreshed every 6+ hours, but idk why that is.

mrighele · on May 28, 2024

You cannot hope that consumers are well-behaved on the Internet, it has never been the case unfortunately, you have to act defensively.

If the issue is the processing power because the content served is dynamically generated, cache the content. Even with a server running on ESP32, just a few second should be enough (if the rest of the website to be statically generated)

If the issue is the bandwidth you may focus on big content (i.e. media) and return a 429 or use a CDN just for that (not worth for a feed in my opinion)

If the issue is neither, why bother ? Your time is precious, and you should spend it on more enjoyable things (that is, unless you find doing this enjoyable of course).

xyst · on May 28, 2024

On the topic of RSS feeds. I really do miss this way of consuming news.

I tried using a few RSS feeds from major news organizations (WaPo, NYT). All I get are the headlines and a link to the article. I am not sure if it has always been this way or not.

On the other hand, “rachelbythebay” blog posts can be viewed entirely in the RSS reader.

I am using “feedly” on iOS. Downside I see is that it does not properly display all of the formatting elements. Some examples:

- Bullets represented as just *.

- The “code” sections sort of sit awkwardly in the reader. No formatting applied.

- ~~In some of the code sections, it’s just awkwardly replaced by ellipses (…)~~ nvm, this is the authors writing style :)

Maybe that’s just a specific RSS reader issue?

Rudism · on May 28, 2024

I think the answer to your first point is easily explained by the fact that the major news orgs typically have paywalls that check the user's IP, cookies, and so forth to decide whether they're allowed to see the full article for free or if they have to pony up first. There's no way to really incorporate that functionality into RSS feeds so they're just pushing out the bare minimum as a Hail Mary to drive their dwindling RSS user base back to the website. Smaller personal blogs that don't rely on advertising or paid membership revenue to stay afloat are far more likely to provide full article content on their RSS feeds (since there's no reason not to).

Some feed readers have "scraping" support where they request each article's URL internally and figure out how (or can be configured with CSS selectors) to extract the article's text content and display it in the reader, though it can be pretty hit or miss whether that can work for a given site.

As for formatting, there are a couple possible explanations--the reader could be ignoring or mangling HTML tags in the content when it's displaying it, or the site itself may be generating its RSS feeds with mangled or missing formatting elements. In my experience both possibilities are equally likely.

daft_pink · on May 28, 2024

It’s cause Aaron Swartz isn’t around to fix RSS.

rasz · on May 28, 2024

>I did that way back then because browsers used to care about RSS and Atom, and they'd put that little yellow feed icon somewhere in the top bar when they spotted this sort of thing in a page. At least in the case of Firefox, you could click on it, and it would throw the target URL to a helper of your choice.

Still here in Vivaldi.

thaumasiotes · on May 28, 2024

> A fair number of people are sending conditional requests, but are doing it every 5 or 10 minutes. This is ridiculous. I don't write that often, and never have.

This objection might make more sense if the purpose of a feed reader was to do nothing except check rachelbythebay.com for updates.

boricj · on May 28, 2024

On my blog, my feed.xml is currently 62654 bytes compressed, 307726 bytes long uncompressed. The compressed version is served with gzip_static, so if the client sets the HTTP header "Accept-Encoding: gzip" then it's sent at no additional CPU cost to me.

Looking at my access logs, some RSS readers are well-behaved and get served 304s. Some of them however not only request feed.xml without caching, but they don't even set the HTTP header "Accept-Encoding: gzip". They download feed.xml every time and they download it uncompressed.

I'm hosting my blog on a feeble 5-year old Synology NAS located in my apartment. It's not seeing a lot of traffic so it's currently not a problem, but these RSS readers are wasting several orders of magnitude more of network bandwidth and CPU resources (gotta encrypt it for HTTPS) on my end than well-behaved RSS readers.

It's just plain embarrassing.

veeti · on May 28, 2024

I requested the feed in my browser _twice_ and already received a 429 with a Retry-After of 24 hours! Not only is this easy to trip by accident, the problem for a feed reader is that it can't know when the author publishes new posts. If a reader happens to scrape the blog 5 minutes before a new entry is posted, it would then take over 23 hours for it to update.

There are surely a lot of badly behaved clients out there and I sympathize with that, but the author's policies seem rather heavy handed and counterproductive to providing an open feed. God forbid their most dedicated readers receive the most timely updates.

masklinn · on May 28, 2024

> If a reader happens to scrape the blog 5 minutes before a new entry is posted, it would then take over 23 hours for it to update.

Given the update frequency and kind of content it does not seem much of an issue does it?

> the author's policies seem rather heavy handed and counterproductive to providing an open feed

It’s not though? “Open feed” is just that, it’s not a license to be a resource hog because you refuse to be a good citizen.

> God forbid their most dedicated readers receive the most timely updates.

Rachel does not post day to day life updates. “Most timely updates” is not a thing, it’s not a storm reporting feed. Latency is not a concern.

veeti · on May 28, 2024

Whether latency is a concern or not depends entirely on the end user controlling the feed reader and their expectations. For all you know, the developer of the reader may be facing complaints from a disgruntled user upset that the feed isn't updating, even though they can see a new blog post on the website. A lecture on HTTP 429 status codes, Retry-After headers and "being a good citizen" is unlikely to make them any happier.

You are free to not serve such requests, but as a blogger you might want to consider that such bot traffic serves your legitimate readers. It appears to me that the author is also rate limiting well behaved clients that are already trying to be respectful of resources using conditional requests.

simoncion · on May 28, 2024

> A lecture on HTTP 429 status codes, Retry-After headers and "being a good citizen" is unlikely to make them any happier.

Why install a lecture in the feed reader? Just have a banner/toast/status-bar-message/whatever that says "This feed will update in X minutes/hours/days.".

thaumasiotes · on May 28, 2024

Nothing makes the user happier than telling them they can't use the only functionality you provide.

If you're going to show that message, why are you a feed reader at all? Open the webpage and show the content.

simoncion · on June 1, 2024

I agree that not showing any message is generally preferable.

GP seemed fixated on showing a lecture, which I thought was strictly worse than just showing a short message.

> If you're going to show that message, why are you a feed reader at all?

What? The feed readers I used showed status messages. Status messages are nice, they give the operator a fighting chance to understand what's going on and fix things if they've gone wrong.

eviks · on May 28, 2024

> not a license to be a resource hog because you refuse to be a good citizen.

Which resources are you "hogging" by reloading a webpage twice?

quectophoton · on May 28, 2024

> I requested the feed in my browser _twice_ and already received a 429 with a Retry-After of 24 hours!

My guess: No `Cache-Control` response header.

checks

Yup, no `Cache-Control` response header. So she doesn't want more than one request per day, but also doesn't want to add this header to the responses:

    Cache-Control: public, max-age=86400, stale-if-error=86400

thaumasiotes · on May 28, 2024

Well, come on. It's your responsibility as a client to check for fresh content in a manner that takes her posting patterns into account. It's not her responsibility as a server to provide content in a manner that takes her posting patterns into account.

quectophoton · on May 28, 2024

Serious or not (sounded like sarcasm but one can never be too sure!), I'll elaborate on this point.

I think it's both parties' responsibility. Well, "responsibility" may be a strong word, but it depends what you want to achieve.

It's okay to return 429 if you receive more than one request per day. But if the goal is to not get those requests in the first place, you can make it way easier with that Cache-Control header.

A client might not support the Retry-After header; after all, it's not widely supported by web browsers, so I can see how some libraries might not even bother with adding that maintenance burden.

But Cache-Control is more widely supported, so it's more likely to be already implemented in client libraries (or even caching proxies).

If someone doesn't support Retry-After, and also ignores Cache-Control, by all means, return an error. It's probably a misconfiguration of their client library at best, or a broken client at worst.

I'm not saying one should bend backwards and put unrealistic effort into stuff like this; what I'm saying is, if the HTTP 429 thingy was an acceptable amount of effort, then the Cache-Control thing should be similar (actually way less) effort and time and should slightly reduce the amount of requests received.

pavel_lishin · on May 29, 2024

> the problem for a feed reader is that it can't know when the author publishes new posts.

It can certainly guess, based on that retry-after header, that she doesn't post more than once a day, and should respect that.

masklinn · on May 28, 2024

As far as I’m concerned it makes more sense if it’s checking even more stuff.

If I have 500 feeds in my reader, most of whom update weekly at most, I would assume the reader does something smarter that mindlessly launch 500 requests every 5 minutes.

vsnf · on May 28, 2024

> I would assume the reader does something smarter that mindlessly launch 500 requests every 5 minutes.

Aside from respecting any 429 headers, what is a reader supposed to do without custom update heuristics for each feed that would be considered more than mindless?

mariusor · on May 28, 2024

The least a client can do is respect the HTTP caching headers that Rachel mentions multiple times in her post. RSS has an additional ttl property that can be used.

spc476 · on May 28, 2024

Every feed I'm aware of contain multiple entries, so perhaps average the time between posts as a baseline to check? Perhaps add an upper limit to one/hour and a lower limit to one/week? That way, a person who starts posting more frequently will be checked more frequently, and one who slacks off is checked less frequently?

noncoml · on May 28, 2024

In general it’s not a good argument, as the 10 min polling interval technically controls the latency and has very little to do with the posting frequency or interval between posts.

supriyo-biswas · on May 28, 2024

There’s websub/PSHB[1], though I don’t know how widely it’s implemented.

[1] https://en.m.wikipedia.org/wiki/WebSub

onli · on May 28, 2024

I think that's the right answer. At least for server based readers it seemed to be supported last I checked, and it's the most efficient way to solve the feed deliver with minimal resources on the server.

komadori · on May 28, 2024

At least at the point you receive a 429, the Retry-After header provides some guidance as to how frequently the resource should be polled.

kevincox · on May 28, 2024

Exactly, unless she expects every reader to configure the polling interval for her blog this isn't a good solution.

I would highly recommend she sets `Cache-Control: max-age=3600` or something similar. That is at least a method that can automatically be applied. Sure, right now most readers just use fixed intervals, but further support for this is a standard that can be pushed slowly. Plus some readers probably use HTTP libraries that would automatically cache the response so some clients will accidentally have support.

renegat0x0 · on May 28, 2024

The Internet is dying. Everything shifts from standards toward managed corporate walled gardens. There is no place for RSS in the future. This is my personal opinion.

I use only RSS. That is how I obtain new information. I do not know personally anyone else doing that. I ping sources every hour, but I ping at least 400 sources.

From my sources none provided last-modified in headers: reddit, youtube, personal sites (In ff f12, network). My site supports it. It is nice also that you support it, but I doubt there is any impact of that in real world. Most of the attention goes through tiktok,youtube videos, through chrome browser. Either it is supported now, or it doesn't really matter if 40 dudes makes request every minute or so.

We should also provide clean title, description, in open graph protocol meta data, and yet not everybody does that.

We should also return correct HTTP status codes, and yet not everybody does that.

I am disenchanted with current state of the Internet, or maybe it was always a little bit pile of various things/garbage.

0x445442 · on May 28, 2024

+1 for shared sentiments. Do you have a blog post(s) that explains your setup with the processes you’ve described?

renegat0x0 · on May 28, 2024

I do not have any blog entry worth sharing. I am running an "ethic web scraper". I think I cannot speak about processes. I just may be lacking knowledge. I think it is more about "experience" rather than process.

Web crawling core is in file: https://github.com/rumca-js/Django-link-archive/blob/main/rs...

Some things more project specific are in https://github.com/rumca-js/Django-link-archive/blob/main/rs...

I know that there already are spiders, metadata processing packages for python, but I like having control over the process.

Old man yelling at the cloud. I hate also:

- blocking me with 403 because my user agent is not "mainstream". Why do I have to use chrome undetected to read some RSS feeds? Why can't I use third party clients? Contents can have adverts. I just want my own layout, buttons

- RSS feeds protected with cloudflare, so tools cannot read feeds easily

- not using, or outright blocking RSS functionality in wordpress. Some sites could be more open that way, but no. RSS feeds are closed/removed

- some sites have "/blog" location, but the main domain is empty, or nearly empty, or returns 404. Can I trust such location?

- when HTML meta data are not available. I like YouTube. It allows me to scrape metadata, but it protects video contents, and that is good

- weird redirects. Domain does not have any contents. Does not describe what it is. It just have javascript redirects. From main domain to some weird locations within the domain

- url shorteners, vanity links. You do not know where you will be transported. I understand they are counting sheep, but they sacrifice my security

- google returning links with syntax "https://www.google.com/url", not directly. Youtube does the same with syntax "https://www.youtube.com/redirect". For me again this is vulnerability

My ethic web scraper results are placed in: https://github.com/rumca-js/Internet-Places-Database.

0x445442 · on May 29, 2024

Thanks for the info.

orf · on May 28, 2024

So much effort for such specific pedantry.

I just added her feed to a third party proxy service. If the owner of the site doesn’t want to serve a very small and basically static text file in a way people expect then that is their prerogative.

But having to run their feed through a proxy to work around this doesn’t seem to be something they would like to encourage.

simoncion · on May 28, 2024

> If the owner of the site doesn’t want to serve a very small and basically static text file in a way people expect...

I'm an old programmer and sysadmin. Programmers and sysadmins are definitely the "target audience" for this blog.

I expect that if I request a resource on the web, and I get back a 429 with a "Retry-After" then:

- I WILL NOT get a new version of that resource until after that time has elapsed

- I MAY get banned for an unknowable-by-me period of time if I continue to attempt to request that resource, despite getting a 429.

orf · on May 28, 2024

Unfortunately with stuff like this Hyrum’s Law happens at a distance and in aggregate. And 429’s are clearly not part of that living contract.

The reality is that it’s not much effort to serve a small bit of plaintext data, and nobody really cares that much to even notice.

It’s unclear if the effort is worth it, for anything other than what amounts to “well ackchully the spec says I can do it so I’ll do it” pedantry.

simoncion · on May 28, 2024

PLONK

orf · on May 28, 2024

Wake me up when we throw a big wobbly about RSS readers not correctly handling 418 status codes.