
I follow some indie hackers online who are in the scraping space, such as BrowserBear and Scrapingbee, and I wonder how they will fare with something like this. The only solace is that this is nondeterministic, but perhaps you can simply ask the API to create Python or JS code that is deterministic, instead.
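(Very roughly, I'm imagining something like this with the OpenAI Python client; the model name, prompt, and file are purely illustrative:)

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Paste in one sample page and ask for a reusable parser, instead of
    # having the model extract the data itself on every run.
    html_sample = open("sample_page.html").read()

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "You write deterministic Python scrapers using BeautifulSoup."},
            {"role": "user",
             "content": "Write a function extract(html) that returns the product "
                        "names and prices from pages shaped like this:\n\n" + html_sample},
        ],
    )
    print(resp.choices[0].message.content)  # review the generated code before relying on it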

More generally, I wonder how a lot of smaller startups will fare once OpenAI subsumes their product. Those running a product that's a thin wrapper on top of ChatGPT or the GPT API will find themselves at a loss once OpenAI opens up the same capability to everyone. Perhaps SaaS products that differ only slightly from the competition really were a zero-interest-rate phenomenon.

This is why it's important to have a moat. For example, I'm building a product that has some AI features (an open-source email (IMAP and OAuth2) / calendar API), but it would work just fine even without any of the AI parts, because the fundamental benefit is still useful to the end user. It's similar to Notion: people will still use Notion to organize their thoughts and documents even without the Notion AI feature.

Build products, not features. If you think you're the one selling pickaxes during the AI gold rush, you're mistaken; it's OpenAI selling the pickaxes (their API), and you're the one panning for gold (trying to find AI products to sell).




Scraping using LLMs directly is going to be really quite slow and resource intensive, but obviously quicker to get set up and running. I can see it being useful for quick ad-hoc scrapes, but as soon as you need to scrape tens or hundreds of thousands of pages, it will certainly be better to go the traditional route. Using LLMs to write your scrapers, though, is a perfect use case for them.

To put it in context, the two types of scrapers currently are traditional HTTP-client based and headless-browser based, with headless browsers reserved for more advanced sites: SPAs where there isn't any server-side rendering.

However, headless browser scraping is on the order of 10-100x more time consuming and resource intensive, even with careful blocking of unneeded resources (images, CSS). Wherever possible you want to avoid headless scraping. LLMs are going to be even slower than that.
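(For reference, the resource blocking is only a few lines; here's a rough Playwright sketch, with the URL and blocked resource types as examples only:)

    from playwright.sync_api import sync_playwright

    BLOCKED = {"image", "stylesheet", "font", "media"}

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Abort requests for heavy, unneeded resources; let everything else through.
        page.route("**/*", lambda route: route.abort()
                   if route.request.resource_type in BLOCKED
                   else route.continue_())
        page.goto("https://example.com/some-spa-page")
        html = page.content()  # fully rendered DOM, ready for parsing
        browser.close()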

Fortunately, most sites that were client-side rendering only are moving back towards having a server-side renderer, and they often even have a JSON blob of template context in the HTML for hydration. Makes your job much easier!
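(As a concrete example, Next.js-style sites ship their hydration state in a __NEXT_DATA__ script tag, so you can often skip the DOM entirely. A sketch assuming that convention; other frameworks use different script ids and shapes:)

    import json
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/product/123", timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Next.js embeds the page's props as JSON for client-side hydration.
    blob = soup.find("script", id="__NEXT_DATA__")
    if blob is not None:
        data = json.loads(blob.string)
        # The exact path depends on the site; this is a hypothetical shape.
        props = data.get("props", {}).get("pageProps", {})
        print(props)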


I did this for the first time yesterday. I wanted the links for ten specific tarot cards off this page[0]. I copied the source into ChatGPT, listed the cards, and got the result back.

I'm fast with Python scraping, but for scraping one page ChatGPT was way, way faster. The biggest difference is that it was quickly able to get the right links by context: the suit wasn't part of the link but was in the header above it. In code I'd have to find that context and make it explicit.

It's a super simple HTML site, but I'm not exactly sure which direction that tips the balance.

[0]http://www.learntarot.com/cards.htm
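(For comparison, doing it "the explicit way" in code means walking the document and carrying the nearest header along as context. A rough BeautifulSoup sketch; the tag names and card names are guesses, not the actual structure of the page above:)

    from bs4 import BeautifulSoup

    # Hypothetical: the suit name lives in a header element and the card links
    # follow it, so track the most recent header while walking the links.
    html = open("cards.htm").read()
    soup = BeautifulSoup(html, "html.parser")

    cards = []
    current_suit = None
    for el in soup.find_all(["h2", "a"]):
        if el.name == "h2":
            current_suit = el.get_text(strip=True)
        elif el.get("href"):
            cards.append({"suit": current_suit,
                          "card": el.get_text(strip=True),
                          "url": el["href"]})

    wanted = {"The Fool", "The Magician"}  # the ten cards you care about
    print([c for c in cards if c["card"] in wanted])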


These kinds of one-shot examples are exactly where this hit for me. I was in the middle of some research when I saw him post this, and it completely changed my approach to gathering the ad-hoc data I needed.


> Using LLMs to write your scrapers, though, is a perfect use case for them.

Indeed... and they could periodically run an expensive LLM-powered scrape like this one and compare the results. That way they could automatically figure out whether the traditional scraper they've written needs updating.
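(Something like this, maybe: run the cheap deterministic scraper and an occasional LLM pass over the same URLs, then diff the records to flag drift. Both scrape functions are hypothetical stand-ins:)

    def check_for_drift(urls, traditional_scrape, llm_scrape):
        """Compare a cheap deterministic scraper against an occasional LLM pass.

        Both callables return a dict of field -> value per URL; they are
        placeholders for whatever scrapers you actually run.
        """
        mismatches = []
        for url in urls:
            baseline = traditional_scrape(url)
            reference = llm_scrape(url)  # expensive, so run on a small sample
            diff = {k: (baseline.get(k), reference.get(k))
                    for k in reference
                    if baseline.get(k) != reference.get(k)}
            if diff:
                mismatches.append((url, diff))
        return mismatches  # non-empty => the traditional scraper may need updating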


I'd invite you to check out https://www.usedouble.com/. We use a combination of LLMs and traditional methods to scrape and parse data to answer your questions.

Sure, it may be more resource intensive, but it's not slow by any means. Our users process hundreds of rows in seconds.


Exactly, semantically understanding the website structure is only one of many challenges with web scraping:

* Ensuring data accuracy (avoiding hallucination, adapting to website changes, etc.)

* Handling large data volumes

* Managing proxy infrastructure

* Elements of RPA to automate scraping tasks like pagination, login, and form-filling

At https://kadoa.com, we are spending a lot of effort solving each of these points with custom engineering and fine-tuned LLM steps.

Extracting a few data records from a single page with GPT is quite easy. Reliably extracting 100k records from 10 different websites on a daily basis is a whole different beast :)


Frustrating that the only option to learn more is to book a demo, and things like the API documentation are dead ends: https://www.kadoa.com/kadoa-api

The landing page does not provide nearly enough information on how it works in practice. Is it automated or is custom code written for each site?


In this particular case, GPT can help you mostly with parsing the website, but not with the most challenging part of web scraping, which is not getting blocked. For that, you still need a proxy. The value of using web scraping APIs is access to a proxy pool via a REST API.
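(In other words, the parsing might be GPT's job, but the fetching still goes through something like this; the proxy address and credentials are placeholders for whatever provider you use:)

    import requests

    # Route requests through a rotating proxy endpoint.
    proxies = {
        "http": "http://user:pass@proxy.example.com:8000",
        "https": "http://user:pass@proxy.example.com:8000",
    }

    resp = requests.get(
        "https://example.com/page",
        proxies=proxies,
        headers={"User-Agent": "Mozilla/5.0"},  # a realistic UA helps avoid trivial blocks
        timeout=30,
    )
    resp.raise_for_status()
    html = resp.text  # hand this to GPT (or your parser) for extraction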


You're correct that a lot of people are mistaken in this AI gold rush; however, they are also underestimating how weak their moat actually is and how much AI is going to impact it as well.

Notion does not have a good moat. The rise of AI usage isn't going to strengthen that moat; it's going to weaken it, unless they introduce major changes and make it harder for people to move their content away from Notion.

There are a lot of middlemen who are going to be shocked to find out how little people care about their layer when OpenAI can replace it entirely. You know that classic article about how everyone's biggest competitor is a spreadsheet? That spreadsheet just got a little bit smarter.


> perhaps you can simply ask the API to create Python or JS code that is deterministic, instead.

Had a conversation last week with a customer who did exactly that: they spent 15 minutes in ChatGPT generating working Scrapy code. Neat to see people solve their own problem so easily, but it doesn't yet erode our value.
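(The output of a session like that is typically a short spider along these lines; a hypothetical sketch, not the customer's actual code:)

    import scrapy

    class ProductSpider(scrapy.Spider):
        # Illustrative spider of the kind ChatGPT tends to generate on request.
        name = "products"
        start_urls = ["https://example.com/catalog"]

        def parse(self, response):
            for item in response.css("div.product"):
                yield {
                    "title": item.css("h2::text").get(),
                    "price": item.css("span.price::text").get(),
                }
            # Follow pagination until there is no "next" link.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)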

I run https://simplescraper.io and a lot of the value is in integrations, scale, proxies, scheduling, the UI, not having to maintain code, etc.

More important than that though is time-saved. For many people, 15 minutes wrangling with ChatGPT will always remain less preferable than paying a few dollars and having everything Just Work.

AI is still a little too unreliable at extracting structured data from HTML, but it's excellent at auxiliary tasks like identifying randomized CSS selectors, etc.

This will change, of course, so the opportunity right now is one of arbitrage: use AI to improve your offering before it has a chance to subsume it.


For the reasons others have said, I don't see it replacing 'traditional' scraping soon. But I am looking forward to it replacing current methods of extracting data from the scraped content.

I've been using Duckling [0] for extracting fuzzy dates and times from text. It does a good job, but I needed a custom build with extra rules to turn that into a great job. And that's just for dates, one of the 13 dimensions it supports. Being able to use an AI that handles them with better accuracy will be fantastic.

Does a specialised model trained to extract times and dates already exist? It's entity tagging but a specialised form (especially when dealing with historical documents where you may need Gregorian and Julian calendars).

[0] https://github.com/facebook/duckling
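(For anyone who hasn't tried it: Duckling ships an example executable that runs as a small HTTP server on port 8000, and you query its /parse endpoint with form data roughly like this; the sentence and locale are just examples:)

    import requests

    resp = requests.post(
        "http://localhost:8000/parse",
        data={
            "locale": "en_GB",
            "text": "let's meet the third Wednesday of next month at 3pm",
        },
    )
    for entity in resp.json():
        # Each entity carries the matched text span and a resolved value.
        print(entity["dim"], entity["body"], entity["value"])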


You're spot on that AI could commoditize indie hacking.

The problem with many indie hackers is that they just build products to have fun and try to make a quick buck.

They take a basic idea and run with it, adding one more competitor to an already jammed market. No serious research or vision. So they get some buzz in the community at launch, then it dies off and they move on to the next idea. Rinse and repeat.

Rarely do they take the time to, for example, interview customers to figure out a defensible moat that unlocks the next stage of growth.

Those that do, though, usually manage to build awesome businesses. For example, the guy who built Browserbear also runs Bannerbear, which is one of the top tools in its category.

The key is to not stop at "code a fun project in a weekend" and to actually learn the other boring parts required to grow a legit business over time.

Source: I’m an indie hacker


I agree, Dago (by the way, I enjoy your memes on Twitter). I think too many IHers are just building small features rather than full-fledged products. I mean, if they want to make a few k a month, I guess that's alright, but they shouldn't be surprised if they're easily disrupted by competitors and copycats.

A month or two ago, there was some drama (which I'm sure you've seen as well) about an IHer who found a copycat. I looked into it and it didn't seem like a copy at all, yet this person was complaining quite heavily about it. But I mean, it's the fundamental law of business: compete or die. If you can't compete, you're not fit to run your business, and others who can, will.


Thanks for the meme appreciation :D

Yeah I think some people confuse copycats with competitors:

- Copycats who just flat out copy your design / messaging / landing page: that's something to complain about

- Someone building a product that solves a similar problem but with their own solution and design: that's perfectly normal and acceptable




