As someone who has been doing the same thing recently, here's how I solved the i...

geysersam · on March 26, 2023

I haven't tried this myself yet. But I'm surprised you didn't find it beneficial to pass the raw HTML to the chatbot (potentially after some filtering). Did `innerText` give better results than `innerHTML`?

My intuition is that the structure information in the HTML would be useful to extract structured data.

puglr · on March 28, 2023

Great question. The problem with the raw HTML was token count. :)

A rather high percentage of pages are far too much for a GPT prompt!

elendee · on March 28, 2023

why oh why

puglr · on April 5, 2023

Heh, mostly as an experiment. I'd done a fair bit of scraping for some personal football apps over the past few years. Was curious about how GPT might be used when starting from first principles, as well as its abilities to solve specific challenges encountered with the traditional approach.