Personally, this feels like the direction scraping should move in: from defining how to extract to defining what to extract. But we're nowhere near that (yet).
A few other thoughts from someone who did his best to implement something similar:
1) I'm afraid this is not even close to cost-effective yet. One CSS rule vs. a whole LLM. A first step could be moving the LLM to the client side, reducing costs and latency.
2) As with every other LLM-based approach so far, this will just hallucinate results if it's not able to scrape the desired information.
3) I feel that providing the model with a few examples could be highly beneficial, e.g. /person1.html -> name: Peter, /person2.html -> name: Janet (see the prompt sketch after this list). When doing this myself, I tried my best to define meaningful interfaces.
4) Scraping has more edge cases than one can imagine, one example being nested lists or dicts or mixes thereof. See the test cases in my repo. This is where many libraries/services already fail.
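To make point 3 concrete, here is roughly the kind of few-shot prompt I have in mind. The pages, field names, and helper are made up for illustration and not tied to any particular library:

    import json

    # Two labeled examples (URL, HTML, expected output), then the target page.
    # Hypothetical data; the point is the shape of the prompt.
    EXAMPLES = [
        ("/person1.html", '<div class="bio"><h1>Peter</h1></div>', {"name": "Peter"}),
        ("/person2.html", '<div class="bio"><h1>Janet</h1></div>', {"name": "Janet"}),
    ]

    def build_prompt(target_html):
        parts = ["Extract JSON with the same shape as the examples. Output only JSON."]
        for url, html, expected in EXAMPLES:
            parts.append(f"Page {url}:\n{html}\nOutput: {json.dumps(expected)}")
        parts.append(f"Page to scrape:\n{target_html}\nOutput:")
        return "\n\n".join(parts)

    print(build_prompt('<div class="bio"><h1>Ada</h1></div>'))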
If anyone wants to check out my (statistical) attempt to automatically build a scraper by defining just the desired results:
https://github.com/lorey/mlscraper
I was most worried about #2, but I've been surprised by how much the temperature setting seems to have gotten that under control in my cases. The author added a HallucinationChecker for this but said on Mastodon he hasn't found many real-world cases to test it with yet.
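To be clear about what I mean by "temperature": I pin the sampling temperature to 0 so the model returns its most likely tokens instead of sampling. With the OpenAI chat completions API (pre-1.0 Python client) that looks roughly like this; the model name and prompt are placeholders:

    import openai

    prompt = "Extract the name as JSON from: <h1>Peter</h1>"

    # temperature=0 makes the output (near-)deterministic; in my cases this
    # reduced invented values, though it is no guarantee against them.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(response["choices"][0]["message"]["content"])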
Regarding 3 & 4:
Definitely take a look at the existing examples in the docs; I was particularly surprised at how well it handled nested dicts and the like. (Not to say there aren't tons of cases it won't handle; GPT-4 is just astonishingly good at this task.)
Your project looks very cool too btw! I'll have to give it a shot.
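For anyone who hasn't run into point 4 yet, this is the kind of mixed nesting that flat per-field CSS rules can't express. The schema here is hypothetical, not any library's actual format:

    # One profile page can yield a record like this: a list of dicts,
    # each of which contains another list. One selector per top-level
    # field stops being enough as soon as this shows up.
    expected = {
        "name": "Peter",
        "employers": [
            {"company": "Acme", "years": [2019, 2020]},
            {"company": "Initech", "years": [2021]},
        ],
    }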
This seems like part of the problem we're always complaining about: hardware keeps getting better and better, but software keeps getting more and more bloated, so performance actually goes down.
Yeah, it seems like it would make way more sense to have an LLM output the CSS rules. Or maybe output something slightly more powerful, but still cheap to compute.
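Something like this, as a rough sketch: pay for one LLM call per site to get a selector, then keep the model out of the hot path entirely. The selector and page here are invented for illustration:

    from bs4 import BeautifulSoup

    # Imagine a one-time LLM call per site: "given this sample page, return
    # a CSS selector for the person's name". Cache its answer, e.g.:
    selector = "div.bio > h1"  # hypothetical answer from that one call

    # Every subsequent page is then a cheap local lookup, no model involved.
    def extract_name(html):
        node = BeautifulSoup(html, "html.parser").select_one(selector)
        return node.get_text(strip=True) if node else None

    print(extract_name('<div class="bio"><h1>Peter</h1></div>'))  # Peter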