
You’d have to trust that the data being dumped was 100% identical to the actual pages users would eventually see, or you could end up with very weird (including dangerous) behavior

Of course, I know that some version of this can and does occur with classic web scraping too, but that is an arms race that a search engine can win




> I know that some version of this can and does occur with classic web scraping too, but that is an arms race that a search engine can win

Cloaked links and cloaked ads still happen on direct requests, too -- a search engine's crawlers come from widely known IP ranges (and if they start using unknown or new IPs, those become known soon enough), so even spoofing the bot's user agent isn't a reliable workaround.

I'd say the arms race is still escalating; though I've been out of that game for a little while, I'm still rather sure of that.
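
A minimal sketch of the naive check this is arguing against: fetch the same URL with a crawler-style User-Agent and a browser-style one and compare. The URL and User-Agent strings are illustrative, not real crawler identifiers. As the comment notes, a site that cloaks based on known crawler IP ranges will serve both of these requests the same version, so a mismatch is only a weak signal and a match proves nothing.

  import hashlib
  import urllib.request

  CRAWLER_UA = "ExampleBot/1.0 (+https://example.com/bot)"   # hypothetical bot UA
  BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"   # hypothetical browser UA

  def fetch(url: str, user_agent: str) -> bytes:
      # Plain HTTP fetch with an overridden User-Agent header.
      req = urllib.request.Request(url, headers={"User-Agent": user_agent})
      with urllib.request.urlopen(req, timeout=10) as resp:
          return resp.read()

  def looks_cloaked(url: str) -> bool:
      # Compare hashes of the two responses. Dynamic pages (timestamps, ads)
      # will differ between any two fetches, so a real check would normalize
      # or extract content first rather than compare raw bytes.
      bot_hash = hashlib.sha256(fetch(url, CRAWLER_UA)).hexdigest()
      browser_hash = hashlib.sha256(fetch(url, BROWSER_UA)).hexdigest()
      return bot_hash != browser_hash

  if __name__ == "__main__":
      print(looks_cloaked("https://example.com/"))
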


You can just spot check a tiny fraction of the data to validate this; if it doesn't match, the site gets blocked.
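
A minimal sketch of that spot-check idea, assuming the dump is a mapping of URL to page text and that some fetch_live() helper exists. The sample rate, mismatch threshold, and exact-equality comparison are all hypothetical choices for illustration; a real pipeline would normalize dynamic content before comparing.

  import random
  from typing import Callable, Dict

  def spot_check(dump: Dict[str, str],
                 fetch_live: Callable[[str], str],
                 sample_rate: float = 0.01,
                 max_mismatch: float = 0.05) -> bool:
      """Return True if the sampled dump entries match the live pages closely
      enough; False means the dump is distrusted and the site gets blocked."""
      if not dump:
          return False
      urls = list(dump)
      sample_size = max(1, int(len(urls) * sample_rate))
      mismatches = 0
      for url in random.sample(urls, sample_size):
          # Exact comparison for simplicity; real checks would strip
          # timestamps, ads, and other expected variation first.
          if fetch_live(url) != dump[url]:
              mismatches += 1
      return (mismatches / sample_size) <= max_mismatch
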



