> If not, they tend to be trivial to scrape.

Not if the "web page" is just a skeleton that gets filled in by JavaScript. Like, you know, pretty much every web app out there.




Not sure where you have been for the last 10 years, but yes, even SPAs are trivial to scrape today. Better still, because SPAs tend to be powered by APIs, you can often just use the API directly instead. And even if you can't, the page is still trivial to scrape, even when it's flooded with JS magic.
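
To make the "use the API directly" point concrete, here is a minimal sketch in Python. It assumes a hypothetical JSON endpoint of the kind you might spot in the browser's network tab; the URL, the `items` key, and the `title` field are all invented for illustration and do not refer to any real site.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint (an assumption for illustration only).
API_URL = "https://example.com/api/items?page=1"


def extract_titles(payload: dict) -> list[str]:
    """Pull 'title' out of each entry in an assumed {'items': [...]} payload."""
    return [item["title"] for item in payload.get("items", []) if "title" in item]


def fetch_titles(url: str = API_URL) -> list[str]:
    """Request the same JSON the SPA would request, then extract the titles."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=10) as response:
        return extract_titles(json.load(response))


if __name__ == "__main__":
    # Offline demonstration with a canned payload of the assumed shape.
    sample = {"items": [{"title": "First"}, {"title": "Second"}, {"id": 3}]}
    print(extract_titles(sample))
```

The parsing is kept separate from the network call so it can be checked without hitting a server. A real API may also want cookies, tokens, or pagination handling, which this sketch leaves out.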


> even SPAs are trivial to scrape today

How? (I'm asking about the case where there is no API.)


With something like Puppeteer [1]. That said, if we're now headed towards a canvas + WebAssembly world, things could get far more difficult.

[1] https://github.com/puppeteer/puppeteer


Ah, ok. Yes, if you can remote control a browser you can of course "scrape" anything the browser can load. (Although even here, the Puppeteer README says it can load server-side rendered data, and not all single-page web apps do that.) To me that isn't quite the same as having a separate program, independent of the browser, that is able to just load the data from the URL and then operate on it, which is what I'm used to seeing referred to as "scraping".
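
A rough sketch of that distinction, using only Python's standard library: a browser-independent scraper parses whatever HTML the server sends and never runs scripts. The two HTML strings are made-up stand-ins (one server-rendered page, one JavaScript skeleton), not taken from any real site.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def visible_text(html: str) -> list[str]:
    """Return the text a static, non-JavaScript scraper would see."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Invented sample pages (assumptions for illustration).
SERVER_RENDERED = "<html><body><ul><li>Alpha</li><li>Beta</li></ul></body></html>"
SKELETON = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

if __name__ == "__main__":
    print(visible_text(SERVER_RENDERED))  # the data is in the markup
    print(visible_text(SKELETON))         # nothing to extract without running JS
```

On the skeleton page the static parser finds no text at all, which is the case where you either need the underlying API or a driven browser such as Puppeteer.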


I think what you are suggesting is often done using browser extensions. It's obviously not independent of the browser, but it serves the use case where you want to extend existing apps by adding interactive features.

My understanding of the word "scraping" is that it primarily refers to automated data extraction en masse from user interfaces.



