It's mind-boggling to me that Google didn't create a spec to work with special partners whereby they could get syndicated data in a format that was easy to ingest and easy for their partners to produce. Lighting dollars on fire to serve up pages to Googlebot, when you could just periodically dump a journal of updates to Google, is crazy imo.
On edit: if only there were some way to do some kind of really simple syndication of your data.
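To make that concrete, here's a minimal sketch of what such a periodic dump could look like, assuming the "journal of updates" is just an RSS 2.0 feed of recently changed URLs built with the Python standard library (the site, page list, and output filename are all made up):

    import xml.etree.ElementTree as ET
    from datetime import datetime, timezone
    from email.utils import format_datetime

    # Hypothetical change log pulled from your CMS since the last dump.
    updated_pages = [
        {"url": "https://example.com/post/123", "title": "Post 123",
         "changed": datetime.now(timezone.utc)},
    ]

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "example.com change journal"
    ET.SubElement(channel, "link").text = "https://example.com/"
    ET.SubElement(channel, "description").text = "Pages updated since the last dump"

    for page in updated_pages:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = page["title"]
        ET.SubElement(item, "link").text = page["url"]
        ET.SubElement(item, "guid").text = page["url"]
        ET.SubElement(item, "pubDate").text = format_datetime(page["changed"])  # RFC 822-style date

    ET.ElementTree(rss).write("updates.xml", encoding="utf-8", xml_declaration=True)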
You’d have to trust that the data being dumped was 100% identical to the actual pages users would eventually see, or you could end up with very weird (including dangerous) behavior
Of course, I know that some version of this can and does occur with classic web scraping too, but that is an arms race that a search engine can win
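Concretely, the engine would have to keep spot-checking the dump against the live site anyway. A rough sketch of that, assuming the dump carries a per-URL content hash (the field names are hypothetical; in practice you'd compare extracted text rather than raw bytes, since dynamic pages rarely byte-match):

    import hashlib
    import random
    import urllib.request

    def live_hash(url: str) -> str:
        # Fetch the page as an ordinary client would and hash the body.
        with urllib.request.urlopen(url, timeout=10) as resp:
            return hashlib.sha256(resp.read()).hexdigest()

    def spot_check(journal_entries: list[dict], sample_size: int = 100) -> list[str]:
        # Compare a random sample of dumped URLs against what the site actually serves.
        sample = random.sample(journal_entries, min(sample_size, len(journal_entries)))
        return [e["url"] for e in sample if live_hash(e["url"]) != e["sha256"]]

    # Any mismatch means the dump can't be trusted blindly.
    # mismatches = spot_check(entries_from_dump)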
> I know that some version of this can and does occur with classic web scraping too, but that is an arms race that a search engine can win
Cloaked links and cloaked ads still happen on direct requests, too. A search engine's crawlers come from a widely known IP range (or, if they start using unknown or new IPs, those become known soon enough), so even spoofing the bot's user agent isn't a reliable workaround.
I'd say the arms race is still escalating. Though I've been out of that game for a little while, I'm still rather sure of that.
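For context on why user-agent spoofing doesn't get you far: sites identify crawlers by IP, not UA string. A stdlib-only sketch of the reverse-plus-forward DNS check that Google documents for verifying Googlebot (ignoring IPv6 and other edge cases):

    import socket

    def is_real_googlebot(ip: str) -> bool:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)          # reverse DNS
        except socket.herror:
            return False
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            _, _, addrs = socket.gethostbyname_ex(hostname)    # forward-confirm
        except socket.gaierror:
            return False
        return ip in addrs

    print(is_real_googlebot("66.249.66.1"))   # an address from Google's published crawler ranges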
In the past Google just requested the HTML and didn't bother with the JavaScript, so it was a simple document request.
People started serving up pages that required JavaScript to show content, so Google had to cope with that. I'm sure it's dramatically more expensive for them as well.
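To illustrate the cost gap (this is not Google's actual pipeline, just a rough comparison, assuming `requests` and Playwright are installed and using a placeholder URL):

    import requests
    from playwright.sync_api import sync_playwright

    URL = "https://example.com/some-page"

    # The old, cheap way: one HTTP request; whatever HTML comes back is the document.
    plain_html = requests.get(URL, timeout=10).text

    # The expensive way: launch a browser, execute the JavaScript, then read the DOM.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(URL, wait_until="networkidle")
        rendered_html = page.content()
        browser.close()

    # JS-heavy sites can differ wildly between the two.
    print(len(plain_html), len(rendered_html))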
I don't think Google uses this in a major way. It also probably isn't the best fit for this use case.
1. You can really only subscribe to particular URLs. So this would require millions of subscriptions. It would only make sense if most of your pages are changing every few days.
2. You also need to subscribe to feeds to find new content. (A sketch of what a single subscription involves follows below.)
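Assuming the "this" here is WebSub (formerly PubSubHubbub), a single subscription looks roughly like the sketch below; the hub is Google's public one and the callback URL is a placeholder you'd have to host yourself:

    import requests

    HUB = "https://pubsubhubbub.appspot.com/"         # Google's public hub
    CALLBACK = "https://crawler.example.net/websub"   # must be publicly reachable

    def subscribe(topic_url: str) -> bool:
        # One of these is needed per topic (feed/URL) you want change pings for.
        resp = requests.post(HUB, data={
            "hub.mode": "subscribe",
            "hub.topic": topic_url,
            "hub.callback": CALLBACK,
        }, timeout=10)
        return resp.status_code == 202   # hub accepted and will verify the callback

    # You still need a feed that announces brand-new URLs; WebSub only tells you
    # that something you already subscribed to has changed.
    subscribe("https://example.com/feed.xml")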
Google is the kingmaker and has basically a monopoly on search.
If you're going to light dollars on fire to serve a bot, let it be with Google, who on a lucky day might decide you are king (because you let them index your site).
I would presume Google obeys robots.txt directives as well.
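Mechanically, "obeying robots.txt" is just a check like this before each fetch; a standard-library sketch (the URL and user agent are only examples):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.reddit.com/robots.txt")
    rp.read()

    # Is this user agent allowed to fetch this path, and how fast may it go?
    print(rp.can_fetch("Googlebot", "https://www.reddit.com/r/programming/"))
    print(rp.crawl_delay("Googlebot"))   # None if no Crawl-delay directive is present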
But I would agree it's a very real and outstanding problem that is YC-worthy: sharing structured webpage data with trusted partners in a generic and efficient way. I've heard of various AI companies that do this kind of scraping and structuring with AI (I forget the name); it's many notches in sophistication above a headless Selenium-type driver. If only HTML were split neatly into model-view-controller and users could bring their own views and controllers.
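The closest thing the web has today to a separable "model" is embedded structured data. A stdlib-only sketch that pulls schema.org JSON-LD blocks out of a page (the URL is a placeholder, and the parsing is deliberately naive):

    import json
    import urllib.request
    from html.parser import HTMLParser

    class JSONLDExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_jsonld = False
            self._buf = []
            self.blocks = []

        def handle_starttag(self, tag, attrs):
            if tag == "script" and ("type", "application/ld+json") in attrs:
                self.in_jsonld = True
                self._buf = []

        def handle_data(self, data):
            if self.in_jsonld:
                self._buf.append(data)

        def handle_endtag(self, tag):
            if tag == "script" and self.in_jsonld:
                self.in_jsonld = False
                try:
                    self.blocks.append(json.loads("".join(self._buf)))
                except json.JSONDecodeError:
                    pass  # plenty of sites ship broken JSON-LD

    with urllib.request.urlopen("https://example.com/article") as resp:
        extractor = JSONLDExtractor()
        extractor.feed(resp.read().decode("utf-8", errors="replace"))

    print(extractor.blocks)   # the "model", if the site bothered to publish one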
Are there specific laws that deal with rate limits? Honest question: I get that something too fast could be considered a DDoS, but as long as it's below a certain threshold, wouldn't it be okay? (Not sure how said threshold would be determined.)
In the US, CFAA prohibits causing "damage", which includes "impairment to the integrity or availability" of data or systems. But as with many other things in law, it boils down to the court trying to assess your intent, whether you could've reasonably anticipated the outcome, and what that outcome ended up being.
There's no law that says "you can't send more than n packets per hour".
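Since there's no statutory requests-per-hour number, scrapers just self-impose a conservative budget. A minimal fixed-interval throttle; the one-request-per-second figure is an arbitrary example, not legal advice:

    import time

    class Throttle:
        def __init__(self, max_per_second: float = 1.0):
            self.min_interval = 1.0 / max_per_second
            self.last_request = 0.0

        def wait(self):
            # Sleep just long enough to keep requests at or below the budget.
            elapsed = time.monotonic() - self.last_request
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
            self.last_request = time.monotonic()

    throttle = Throttle(max_per_second=1.0)
    for url in ["https://example.com/a", "https://example.com/b"]:
        throttle.wait()
        # ... fetch url here ...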
Google implemented AMP (Accelerated Mobile Pages) for their partner sites, I believe, along with the Google AMP Cache. It was a way for Google to consume popular websites and host them optimized for mobile (Android) devices. It was all the rage for a while because, I think, Google greatly favored AMP pages in mobile search results. Eventually they deprioritized it, because, well, Google.
If it's only accessible to special partners, Google could just end the partnership if Reddit started serving them false information, and go back to the old way of scraping.