The point was that you don't have to wait for JS to rearrange the DOM; sometimes it's a simple request to example.com/api/endpoint?productid=123 and you have all the data you need. No worries about HTML markup.
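For example, a minimal sketch in TypeScript (Node 18+ built-in fetch; the endpoint shape and response are assumptions based on the URL above):

```ts
// Sketch: hit the (hypothetical) JSON endpoint directly, no DOM involved.
async function fetchProduct(productId: number): Promise<unknown> {
  const res = await fetch(`https://example.com/api/endpoint?productid=${productId}`);
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);
  return res.json(); // all the data you need, no HTML markup to parse
}

fetchProduct(123).then(console.log);
```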



I think btown's point was that sometimes what you're served from that request is not just "the data you need" but a pre-rendered portion of the page to be inserted as-is, rather than markup built client-side from raw data, so you still need to parse the HTML in the response, since it's an HTML snippet.

It's still generally easier, because you don't have to worry about zeroing in on the right section of the page before you start pulling the data out of the HTML, but not quite as easy as getting a JSON structure.


I'd disagree on "easier". Making two requests per thing you want to scrape instead of loading 100x assets is way better for both parties.

But perhaps if the data is spread across dozens of endpoints for some inefficient reason, the browser DOM would be the better source.


I think you're still misunderstanding. Some sites haven't adopted a pure data-driven model: when example.com/api/endpoint?productid=123 is requested, it doesn't return JSON for the product with id 123, but instead returns a div or table row of HTML with the data for that product already in it, which is then inserted directly where it belongs in the current page, rather than being built into HTML from JSON and then inserted.

What I was saying is that this method is not quite as easy to get data from as pure JSON, but it's still easier to parse and to find the specific data for the item you're looking for, since it's a very small amount of markup, all related to the entry in question.
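To make that concrete, here's a sketch assuming a hypothetical snippet-style response, using cheerio to parse it:

```ts
import * as cheerio from "cheerio"; // npm install cheerio

// Hypothetical HTML-snippet response for one product: a small chunk of
// markup entirely about the entry in question, so parsing stays simple.
const snippet = `
  <div class="product" data-id="123">
    <span class="name">Widget</span>
    <span class="price">$9.99</span>
  </div>`;

const $ = cheerio.load(snippet);
console.log({
  id: $("div.product").attr("data-id"),
  name: $("span.name").text().trim(),
  price: $("span.price").text().trim(),
}); // { id: '123', name: 'Widget', price: '$9.99' }
```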

My interpretation of btown's comment is along the same lines, that it's surprising how many sites still serve HTML snippets for dynamic pages.


So I have seen that indeed!

But also, some more modern sites with JSON API endpoints will have extremely bespoke session/auth/state management systems that make it difficult to create a request payload that will work without calculations done deep in the bowels of their client-side JS code. It can be much easier, if slower and more costly, to mimic a browser and listen to the equivalent of the Network tab, than to find out how to create valid payloads directly for the API endpoints.
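A sketch of that approach with Playwright, reusing the hypothetical endpoint from above; the site's own JS does the session/auth calculations, and we just record the responses:

```ts
import { chromium } from "playwright"; // npm install playwright

async function main() {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // The scripted equivalent of watching the Network tab.
  page.on("response", async (response) => {
    if (response.url().includes("/api/endpoint")) {
      console.log(response.url(), await response.json());
    }
  });

  await page.goto("https://example.com/product/123");
  await page.waitForLoadState("networkidle");
  await browser.close();
}

main();
```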


If you can see the request in the network tab, you can just right-click it, choose "Copy as cURL", replay the request from the command line, and noodle with the request parameters that way. Works great!
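Once copied, it's also easy to port the request out of cURL and into code for further noodling. A sketch, where the header names and values are stand-ins for whatever the copy actually captured:

```ts
// Sketch: replay a request captured from the Network tab with tweaked
// parameters. The headers below are hypothetical stand-ins.
async function replay(productId: number) {
  const params = new URLSearchParams({ productid: String(productId) });
  const res = await fetch(`https://example.com/api/endpoint?${params}`, {
    headers: {
      "User-Agent": "Mozilla/5.0", // copied from the original request
      Cookie: "session=abc123",    // ditto; valid only for that session
    },
  });
  return res.json();
}

replay(124).then(console.log); // same request, different product id
```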


Honestly, from prior experience, any scraping requirement that actually needs a browser tends to be due to CAPTCHAs and anti-scraping measures, nothing to do with the data layout.

It's either in the DOM or in one or two other payloads.

Isn't this sort of why people hide themselves behind Cloudflare: to weed out the lowest common denominator of scrapers?


Yes. Sometimes you can see that there's a static (per-session) header they add to each request, and all you have to do is find and record that header value (for example by shimming XMLHttpRequest's setRequestHeader) and append it to your own requests from that context...
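A sketch of that shim, run in the page context; "X-Session-Token" is a hypothetical header name:

```ts
// Shim XMLHttpRequest.setRequestHeader to capture the per-session header
// the site attaches to each request, so it can be replayed elsewhere.
const original = XMLHttpRequest.prototype.setRequestHeader;
XMLHttpRequest.prototype.setRequestHeader = function (name: string, value: string) {
  if (name.toLowerCase() === "x-session-token") { // hypothetical header
    console.log("captured session header:", value);
  }
  return original.call(this, name, value);
};
```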


I get you, and agree: it's not as easy to scrape. And it makes no sense to do it that way, whether for them, a scraper, a search engine, A. N. Other, or the user.

The old SSI includes of Apache would probably be just as efficient.

Reading the OP and the comments, it seems like a generational difference, with a younger generation not appreciating server-side generation in the same way.



