As someone who has been doing the same thing recently, here's how I got around the limitation that the page content has to be in the initial HTML.
The first thing I did was fall back to a headless browser: give the page 5 seconds to render, then snatch the `innerText`.
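For anyone who wants to try the same fallback, here's a rough sketch of the idea, not my exact code. It assumes Playwright as the headless browser (any driver would do), and `static_text`, `page_text`, and the `min_chars` threshold are names and numbers made up for illustration. The static path only uses the standard library.

```python
from html.parser import HTMLParser
from typing import Callable


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def static_text(html: str) -> str:
    """Text that is already present in the initial HTML."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def render_inner_text(url: str, settle_ms: int = 5000) -> str:
    """Headless-browser path. Assumes the third-party Playwright package."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            page.wait_for_timeout(settle_ms)  # fixed settle time, 5 s by default
            return page.inner_text("body")
        finally:
            browser.close()


def page_text(
    html: str,
    url: str,
    render: Callable[[str], str] = render_inner_text,
    min_chars: int = 200,  # hypothetical threshold for "the page is basically empty"
) -> str:
    """Use the initial HTML when it has enough text, else fall back to rendering."""
    text = static_text(html)
    if len(text) >= min_chars:
        return text
    return render(url)
```

The `render` argument is injectable so you can swap in a different browser driver, or a stub when testing without a browser installed.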
But 5-10% of sites do a good job of showing you the door for being a robot.
I wanted to try and solve those cases by taking a screenshot of the page and using GPT-4 visual inputs, but when I got access I realized that 1) visual inputs aren't available yet and 2) holy crap is GPT-4 expensive.
So instead I give a screenshot service the URL, get back a full-page PNG, and hand that off to GCP Cloud Vision to OCR it. The OCRed text then gets fed into GPT-3.5 like normal.
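Roughly, the screenshot route looks like the sketch below. Treat it as an outline under assumptions: the screenshot service and the GPT-3.5 call are passed in as plain callables (`take_screenshot` and `complete` are placeholder names, since every service has its own API), the prompt text and `max_chars` cap are invented for illustration, and the OCR step assumes the `google-cloud-vision` client library with credentials already configured.

```python
from typing import Callable


def ocr_with_cloud_vision(png: bytes) -> str:
    """OCR a PNG with GCP Cloud Vision (assumes google-cloud-vision is installed)."""
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    response = client.document_text_detection(image=vision.Image(content=png))
    if response.error.message:
        raise RuntimeError(response.error.message)
    return response.full_text_annotation.text


def extract_via_screenshot(
    url: str,
    take_screenshot: Callable[[str], bytes],  # placeholder: URL -> full-page PNG bytes
    complete: Callable[[str], str],           # placeholder: prompt -> model reply
    ocr: Callable[[bytes], str] = ocr_with_cloud_vision,
    max_chars: int = 12000,                   # illustrative cap to keep the prompt small
) -> str:
    """Screenshot the page, OCR the image, then hand the text to the model."""
    png = take_screenshot(url)
    text = ocr(png)
    # OCR output tends to be full of stray line breaks; collapse whitespace.
    text = " ".join(text.split())[:max_chars]
    prompt = f"Extract the main content from this page text:\n\n{text}"
    return complete(prompt)
```

Keeping the three stages as separate callables makes it easy to swap the OCR provider or model later, and to test the plumbing with fakes.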
I haven't tried this myself yet. But I'm surprised you didn't find it beneficial to pass the raw HTML to the chatbot (potentially after some filtering). Did `innerText` give better results than `innerHTML`?
My intuition is that the structure information in the HTML would be useful to extract structured data.
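To make "some filtering" concrete, here's one possible sketch, untested against a real model: keep the tag structure but drop attributes, scripts, styles, and comments so the markup costs fewer tokens. The `DROP` and `VOID` sets and the `slim_html` name are my own assumptions, and which tags are worth keeping would need experimenting.

```python
from html.parser import HTMLParser

DROP = {"script", "style", "noscript", "svg"}            # removed along with their contents
VOID = {"br", "hr", "img", "input", "meta", "link"}      # childless tags, omitted entirely


class _Slimmer(HTMLParser):
    """Re-emits tags without attributes and skips DROP subtrees and comments."""

    def __init__(self) -> None:
        super().__init__()
        self.out = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            self._skip += 1
        elif not self._skip and tag not in VOID:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP:
            self._skip = max(0, self._skip - 1)
        elif not self._skip and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.out.append(" ".join(data.split()))


def slim_html(html: str) -> str:
    """Return the HTML with structure kept and noise stripped."""
    parser = _Slimmer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Whether this beats plain `innerText` is exactly the open question; the sketch only shows that the filtered form can stay fairly compact.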
Heh, mostly as an experiment. I'd done a fair bit of scraping for some personal football apps over the past few years. I was curious how GPT might be used when starting from first principles, and how well it could solve specific challenges I'd run into with the traditional approach.