
This is a poor take. All the major LLM scrapers already execute JavaScript; Googlebot has been doing it for probably a decade.

Simple limits on runtime stop crypto mining from being too big of a problem.




And if bots hit that limit, scrapers don't get access to the protected pages, so the system works.

Bots can either risk being turned into crypto miners, or risk not grabbing free data to train AIs on.


Real users also have a limit where they will close the tab.


"Googlebot has been doing it for probably a decade."

This is why Google developed a browser. It turns out scraping the web pretty much requires building a JS engine like V8, so why not publish it as a browser?


This is so obvious when you say it, but what an awesome insight.


Except it doesn't make sense. Why not just use Firefox? Or improve Firefox's JS engine?

I reckon they made the browser to control the browser market.


> Why not just use Firefox?

The reason Servo existed (when it was still in Mozilla's care) was how deeply spaghettified Gecko's code (sans IonMonkey) was, with the plan of replacing Gecko's components with Servo's.

Firefox's automation systems are now miles better, but that's literally the combination of years of work to modularize Gecko, the partial replacement of Gecko's parts with Servo's (like Stylo: https://hacks.mozilla.org/2017/08/inside-a-super-fast-css-en...), and actively building the APIs despite the still-spaghettified mess.


V8 was dramatically better than Firefox's engine at the time. AFAIK, it was the first JS engine to take the approach of compiling frequently executed JS to native machine code.

If it's true that V8 was used internally for Google's scraper before they even thought about Chrome, then it makes obvious sense why they didn't use Firefox. The other factor is the bureaucracy and difficulty of getting an open-source project to refactor its entire code base around your own engine. Google had the money and resources to pay the best in the business to work on Chrome.


their browser is their scraper. what you see is what the scraper sees is what the ads look like.


"Why develop in-house software for the core application of the biggest company in the world at the time, worth more than $100B? Why not just repurpose a rinky-dink open-source browser as some kind of parser, and bank our $100B business on some volunteers and a 501(c)(3) NFP? That will play out well in a shareholder meeting, and in trials when they ask us how we safeguard our software."


Why didn't they do that instead of just forking WebKit?


It's not quite that simple. I think that having that skillset and knowledge in house already probably led to it being feasible, but that's not why they did it. They created Chrome because it was in their best interests for rich web applications to run well.


You don't work anywhere near the ads industry then; people have been grumbling about this for the whole 10 years now.


... and the fact that even with a browser, content gated behind Macromedia Flash or ActiveX applets was / is not indexable is why Google pushed so hard to expand HTML5 capabilities.


Was it really a success though in that regard? HTML5 was great and all, but it never did replace Flash. Websites mainly just became more static. I suspect the lack of mobile integration had more to do with Flash dying than HTML5 getting better. It's a shame in some sense, because Flash was a lot of fun.


But that is the whole point of the article? Big scrapers can hardly tell whether the JS eating their runtime is a crypto miner or an anti-scraping system, so they will have to give up "useful" scraping; PoW might just work.


No, the point is there are really advanced PoW challenges out there to prove you're not a bot (those websites that take >3s to fingerprint you are doing this!)

The idea is to abuse the abusers: if you suspect it's a bot, change the PoW from a GPU/machine/die fingerprint computation to something like a few ticks of Monero or whatever the crypto of choice is this week.

Sounds useless, but figure 0.5s of that across their farm of 1e4 scraping nodes and you're onto something.

The catch is not getting caught out by impacting the 0.1% of Tor-running, anti-ad "users" out there who will try to decompile your code when their personal Chrome build fails to work. I say "users" because they will be visiting a non-free site while espousing their perceived right to be there; to someone paying the bills they're no different from a bot.


> Simple limits on runtime stop crypto mining from being too big of a problem.

If they put in a limit, you've won. You just make your site require more compute than that limit allows, and the problem is gone.





