Scraping with LLMs directly is going to be quite slow and resource intensive, though obviously quicker to set up and get going. I can see it being useful for quick ad hoc scrapes, but as soon as you need to scrape tens or hundreds of thousands of pages it will certainly be better to go the traditional route. Using LLM to write your scrapers though is a perfect use case for them.
To put it somewhat in context, the two types of scrapers currently are traditional HTTP client based and headless browser based. The headless browsers are for more advanced sites, such as SPAs where there isn't any server side rendering.
However, headless browser scraping is on the order of 10-100x more time consuming and resource intensive, even with careful blocking of unneeded resources (images, CSS). Wherever possible you want to avoid headless scraping. LLMs are going to be even slower than that.
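As a rough illustration of the "careful blocking of unneeded resources" mentioned above, here is a minimal sketch assuming Playwright's sync API for Python. The set of blocked types and the function names (should_block, fetch_rendered_html) are my own assumptions for the example; only images and CSS come from the comment above.

```python
# Resource types to skip. Images and stylesheets come from the discussion;
# fonts and media are an extra assumption, so tune this per site.
BLOCKED_RESOURCE_TYPES = {"image", "stylesheet", "font", "media"}


def should_block(resource_type: str) -> bool:
    """Return True for requests the scraper does not need."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def fetch_rendered_html(url: str) -> str:
    """Load a page in headless Chromium, aborting unneeded requests."""
    # Imported lazily so should_block() is usable without Playwright installed.
    from playwright.sync_api import sync_playwright

    def handle(route):
        # Abort blocked resource types, let everything else through.
        if should_block(route.request.resource_type):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.route("**/*", handle)  # intercept every request on this page
        page.goto(url)
        html = page.content()
        browser.close()
    return html
```

Even with this in place you are still paying for a full browser per page, which is why the plain HTTP route wins whenever the data is already in the response.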
Fortunately most sites that were client side rendering only are moving back towards having a server renderer, and they often even have a JSON blob of template context in the HTML for hydration. Makes your job much easier!
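To show what pulling that hydration blob out can look like, here is a small sketch using only the Python standard library. It assumes the blob lives in a script tag with type="application/json", which is a common pattern but not universal; the sample page, the "__DATA__" id and the field names are made up for illustration.

```python
import json
from html.parser import HTMLParser


class JsonScriptExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._chunks = []
        self.blobs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so accumulate it.
        if self._in_json_script:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.blobs.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # not valid JSON after all, so skip it


def extract_hydration_data(html: str) -> list:
    """Return every JSON blob embedded in the page, in document order."""
    parser = JsonScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.blobs


# Made-up page for illustration only.
sample_html = """
<html><body>
<div id="app">Rendered on the server</div>
<script id="__DATA__" type="application/json">
{"product": {"name": "Widget", "price": 9.99}}
</script>
</body></html>
"""

blobs = extract_hydration_data(sample_html)
print(blobs[0]["product"]["name"])  # prints "Widget"
```

Once you have the blob you are working with plain dicts and lists, with no selectors to maintain.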
I did this for the first time yesterday. I wanted the links for ten specific tarot cards off this page[0]. I copied the source into ChatGPT, listed the cards, and got the result back.
I'm fast with Python scraping, but for scraping one page ChatGPT was way, way faster. The biggest difference is that it was quickly able to get the right links from context. The suit wasn't part of the link but was the header. In code I'd have to find that context and make it explicit.
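For comparison, making that header context explicit in code might look roughly like the sketch below, using only the standard library. The HTML structure, URLs and card names are invented for the example and are not taken from the actual page; the idea is just to remember the most recent heading and attach it to every link that follows.

```python
from html.parser import HTMLParser


class HeadingLinkParser(HTMLParser):
    """Record each link together with the most recent heading above it."""

    HEADINGS = {"h1", "h2", "h3", "h4"}

    def __init__(self):
        super().__init__()
        self.current_heading = ""
        self._in_heading = False
        self._href = None
        self._text = []
        self.links = []  # tuples of (heading, link text, href)

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._in_heading = True
            self.current_heading = ""
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._in_heading:
            self.current_heading += data
        elif self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False
            self.current_heading = self.current_heading.strip()
        elif tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append((self.current_heading, text, self._href))
            self._href = None


# Hypothetical markup: the suit appears only in the heading, not in the link.
sample_html = """
<h2>Cups</h2>
<a href="/cards/c01.html">Ace</a>
<a href="/cards/c02.html">Two</a>
<h2>Swords</h2>
<a href="/cards/s01.html">Ace</a>
"""

parser = HeadingLinkParser()
parser.feed(sample_html)

wanted = {("Cups", "Two"), ("Swords", "Ace")}
matches = {(h, t): href for h, t, href in parser.links if (h, t) in wanted}
print(matches)
```

That's about fifty lines to do what the LLM inferred from the page layout on its own, which is roughly the trade-off for a one-off scrape.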
It's a super simple HTML site, but I'm not exactly sure which direction that tips the balance.
These kinds of one-shot examples are exactly where this hit for me. I was in the middle of some research when I saw him post this and it completely changed my approach to gathering the ad hoc data I needed.
> Using LLM to write your scrapers though is a perfect use case for them.
Indeed... and they could periodically do an expensive LLM-powered scrape like this one and compare the results. That way they could figure out by themselves if any updates to the traditional scraper they've written are required.
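A minimal sketch of that comparison step, assuming both scrapes produce a flat dict of fields for the same page. The function names and sample records are hypothetical, and the normalization (whitespace and case) is just one reasonable choice.

```python
def normalize(value):
    """Collapse whitespace and case so cosmetic differences don't count."""
    if isinstance(value, str):
        return " ".join(value.split()).casefold()
    return value


def diff_scrapes(traditional: dict, llm: dict) -> dict:
    """Return fields where the traditional scraper disagrees with the LLM scrape."""
    mismatches = {}
    for key in traditional.keys() | llm.keys():
        a, b = traditional.get(key), llm.get(key)
        if normalize(a) != normalize(b):
            mismatches[key] = (a, b)
    return mismatches


# Hypothetical records for the same page from each scraper.
traditional_result = {"title": "Blue Widget", "price": None, "sku": "BW-1"}
llm_result = {"title": "blue  widget", "price": "9.99", "sku": "BW-1"}

drift = diff_scrapes(traditional_result, llm_result)
print(drift)  # prints {'price': (None, '9.99')}
```

A non-empty result on a sampled page is a hint that a selector broke. Since the LLM can be wrong too, it is probably best treated as a flag for review rather than ground truth.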
I'd invite you to check out https://www.usedouble.com/. We use a combination of LLMs and traditional methods to scrape and parse data to answer your questions.
Sure, it may be more resource intensive, but it's not slow by any means. Our users process hundreds of rows in seconds.