It's mind-boggling to me that Google didn't create a spec to work with special partners whereby they could get syndicated data in a format that was easy to ingest and easy for their partners to produce. Lighting dollars on fire to serve up pages to Googlebot, when you could just periodically dump a journal of updates to Google, is crazy imo.
On edit: if only there were some way to do some kind of really simple syndication of your data.
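To make that concrete, here's a minimal sketch of what such a periodic dump could look like, assuming the "journal of updates" is just an RSS 2.0 feed of recently changed URLs built with the Python standard library (the site, page list, and output filename are all made up):

    import xml.etree.ElementTree as ET
    from datetime import datetime, timezone
    from email.utils import format_datetime

    # Hypothetical change log pulled from your CMS since the last dump.
    updated_pages = [
        {"url": "https://example.com/post/123", "title": "Post 123",
         "changed": datetime.now(timezone.utc)},
    ]

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "example.com change journal"
    ET.SubElement(channel, "link").text = "https://example.com/"
    ET.SubElement(channel, "description").text = "Pages updated since the last dump"

    for page in updated_pages:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = page["title"]
        ET.SubElement(item, "link").text = page["url"]
        ET.SubElement(item, "guid").text = page["url"]
        ET.SubElement(item, "pubDate").text = format_datetime(page["changed"])  # RFC 822-style date

    ET.ElementTree(rss).write("updates.xml", encoding="utf-8", xml_declaration=True)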
You’d have to trust that the data being dumped was 100% identical to the actual pages users would eventually see, or you could end up with very weird (including dangerous) behavior
Of course, I know that some version of this can and does occur with classic web scraping too, but that is an arms race that a search engine can win
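Concretely, the engine would have to keep spot-checking the dump against the live site anyway. A rough sketch of that, assuming the dump carries a per-URL content hash (the field names are hypothetical; in practice you'd compare extracted text rather than raw bytes, since dynamic pages rarely byte-match):

    import hashlib
    import random
    import urllib.request

    def live_hash(url: str) -> str:
        # Fetch the page as an ordinary client would and hash the body.
        with urllib.request.urlopen(url, timeout=10) as resp:
            return hashlib.sha256(resp.read()).hexdigest()

    def spot_check(journal_entries: list[dict], sample_size: int = 100) -> list[str]:
        # Compare a random sample of dumped URLs against what the site actually serves.
        sample = random.sample(journal_entries, min(sample_size, len(journal_entries)))
        return [e["url"] for e in sample if live_hash(e["url"]) != e["sha256"]]

    # Any mismatch means the dump can't be trusted blindly.
    # mismatches = spot_check(entries_from_dump)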
> I know that some version of this can and does occur with classic web scraping too, but that is an arms race that a search engine can win
Cloaked links and cloaked ads still happen on direct requests, too. A search engine's crawlers come from a widely known IP range (or, if they start using unknown or new IPs, those become known soon enough), so even spoofing the bot's user agent isn't a reliable workaround.
I'd say the arms race is still escalating. Though I've been out of that game for a little while, I'm still rather sure of that.
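For context on why user-agent spoofing doesn't get you far: sites identify crawlers by IP, not UA string. A stdlib-only sketch of the reverse-plus-forward DNS check that Google documents for verifying Googlebot (ignoring IPv6 and other edge cases):

    import socket

    def is_real_googlebot(ip: str) -> bool:
        try:
            hostname, _, _ = socket.gethostbyaddr(ip)          # reverse DNS
        except socket.herror:
            return False
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            _, _, addrs = socket.gethostbyname_ex(hostname)    # forward-confirm
        except socket.gaierror:
            return False
        return ip in addrs

    print(is_real_googlebot("66.249.66.1"))   # an address from Google's published crawler ranges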
In the past Google just requested the HTML and didn't bother with the JavaScript, so it was a simple document request.
People started serving up pages that required JavaScript to show content, so Google had to cope with that. I'm sure it's dramatically more expensive for them as well.
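To illustrate the cost gap (this is not Google's actual pipeline, just a rough comparison, assuming `requests` and Playwright are installed and using a placeholder URL):

    import requests
    from playwright.sync_api import sync_playwright

    URL = "https://example.com/some-page"

    # The old, cheap way: one HTTP request; whatever HTML comes back is the document.
    plain_html = requests.get(URL, timeout=10).text

    # The expensive way: launch a browser, execute the JavaScript, then read the DOM.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(URL, wait_until="networkidle")
        rendered_html = page.content()
        browser.close()

    # JS-heavy sites can differ wildly between the two.
    print(len(plain_html), len(rendered_html))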
I don't think Google uses this in a major way. It also probably isn't the best fit for this use case.
1. You can really only subscribe to particular URLs. So this would require millions of subscriptions. It would only make sense if most of your pages are changing every few days.
2. You also need to subscribe to feeds to find new content. (A sketch of what a single subscription involves follows below.)
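Assuming the "this" here is WebSub (formerly PubSubHubbub), a single subscription looks roughly like the sketch below; the hub is Google's public one and the callback URL is a placeholder you'd have to host yourself:

    import requests

    HUB = "https://pubsubhubbub.appspot.com/"         # Google's public hub
    CALLBACK = "https://crawler.example.net/websub"   # must be publicly reachable

    def subscribe(topic_url: str) -> bool:
        # One of these is needed per topic (feed/URL) you want change pings for.
        resp = requests.post(HUB, data={
            "hub.mode": "subscribe",
            "hub.topic": topic_url,
            "hub.callback": CALLBACK,
        }, timeout=10)
        return resp.status_code == 202   # hub accepted and will verify the callback

    # You still need a feed that announces brand-new URLs; WebSub only tells you
    # that something you already subscribed to has changed.
    subscribe("https://example.com/feed.xml")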
Google is the kingmaker and has basically a monopoly on search.
If you're going to light dollars on fire to serve a bot, let it be with Google, who on a lucky day might decide you are king (because you let them index your site).
I would presume Google obeys robots.txt directives as well.
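Mechanically, "obeying robots.txt" is just a check like this before each fetch; a standard-library sketch (the URL and user agent are only examples):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.reddit.com/robots.txt")
    rp.read()

    # Is this user agent allowed to fetch this path, and how fast may it go?
    print(rp.can_fetch("Googlebot", "https://www.reddit.com/r/programming/"))
    print(rp.crawl_delay("Googlebot"))   # None if no Crawl-delay directive is present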
But I would agree it's a very real and outstanding problem that is YC-worthy: sharing structured webpage data with trusted partners in a generic and efficient way. I've heard of various AI companies that do this kind of scraping and structuring with AI (I forget the name); it's many notches in sophistication above a headless Selenium-type driver. If only HTML were split neatly into model-view-controller and users could bring their own views and controllers.
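The closest thing the web has today to a separable "model" is embedded structured data. A stdlib-only sketch that pulls schema.org JSON-LD blocks out of a page (the URL is a placeholder, and the parsing is deliberately naive):

    import json
    import urllib.request
    from html.parser import HTMLParser

    class JSONLDExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_jsonld = False
            self._buf = []
            self.blocks = []

        def handle_starttag(self, tag, attrs):
            if tag == "script" and ("type", "application/ld+json") in attrs:
                self.in_jsonld = True
                self._buf = []

        def handle_data(self, data):
            if self.in_jsonld:
                self._buf.append(data)

        def handle_endtag(self, tag):
            if tag == "script" and self.in_jsonld:
                self.in_jsonld = False
                try:
                    self.blocks.append(json.loads("".join(self._buf)))
                except json.JSONDecodeError:
                    pass  # plenty of sites ship broken JSON-LD

    with urllib.request.urlopen("https://example.com/article") as resp:
        extractor = JSONLDExtractor()
        extractor.feed(resp.read().decode("utf-8", errors="replace"))

    print(extractor.blocks)   # the "model", if the site bothered to publish one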
Are there specific laws that deal with rate limits? Honest question: I get that something too fast could be considered a DDoS, but as long as it's below a certain threshold, wouldn't it be okay? (Not sure how said threshold would be determined.)
In the US, CFAA prohibits causing "damage", which includes "impairment to the integrity or availability" of data or systems. But as with many other things in law, it boils down to the court trying to assess your intent, whether you could've reasonably anticipated the outcome, and what that outcome ended up being.
There's no law that says "you can't send more than n packets per hour".
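Since there's no statutory requests-per-hour number, scrapers just self-impose a conservative budget. A minimal fixed-interval throttle; the one-request-per-second figure is an arbitrary example, not legal advice:

    import time

    class Throttle:
        def __init__(self, max_per_second: float = 1.0):
            self.min_interval = 1.0 / max_per_second
            self.last_request = 0.0

        def wait(self):
            # Sleep just long enough to keep requests at or below the budget.
            elapsed = time.monotonic() - self.last_request
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)
            self.last_request = time.monotonic()

    throttle = Throttle(max_per_second=1.0)
    for url in ["https://example.com/a", "https://example.com/b"]:
        throttle.wait()
        # ... fetch url here ...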
Google implemented AMP (Accelerated Mobile Pages) for their partner sites, I believe, along with the Google AMP Cache. It was a way for Google to consume popular websites and host them optimized for mobile (Android) devices. It was all the rage for a while because, I think, Google greatly favored AMP pages in mobile search results. Eventually they deprioritized it, because, well, Google.
If it's only accessible to special partners, Google could just end the partnership if Reddit started serving them false information, and go back to the old way of scraping.