This is interesting. How much difference is there (in cost and quality) between this approach and taking an image capture of the page and sending it off to a multimodal LLM?
Good question. I actually haven't tried the image capture approach. I'll give that a shot and see how it performs. I'm planning to try many different AI extractors and see which performs best.
So far, I've done some unscientific testing to compare text vs. HTML. Text is a lot more effective on a per-token basis, and therefore cheaper. However, some data is only available in the HTML.
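To make the text vs. HTML tradeoff concrete, here's a rough sketch using only the Python standard library. It is not the code from my testing: the sample markup, the `html_to_text` helper, and the 4-characters-per-token estimate are all assumptions for illustration, and a real comparison should use the tokenizer of whichever model you're sending to.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Hypothetical helper: reduce an HTML string to its visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def rough_token_count(s):
    # Assumption: roughly 4 characters per token. This is a crude
    # heuristic, not a real tokenizer.
    return max(1, len(s) // 4)


# Made-up sample markup, purely for illustration.
sample_html = (
    '<div class="product" data-sku="A-123">'
    '<h2 class="title">Blue Widget</h2>'
    '<span class="price">$19.99</span>'
    '<a href="/widgets/blue">Details</a>'
    "</div>"
)
sample_text = html_to_text(sample_html)

print(sample_text)
print("HTML tokens (approx):", rough_token_count(sample_html))
print("Text tokens (approx):", rough_token_count(sample_text))
```

The text version is much shorter, but the SKU in the `data-sku` attribute and the link target in `href` are gone, which is the kind of data that only exists in the HTML.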