
I follow some indie hackers online who are in the scraping space, such as BrowserBear and Scrapingbee, and I wonder how they will fare with something like this. The only solace is that this is nondeterministic, but perhaps you can simply ask the API to create Python or JS code that is deterministic, instead.
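(Very roughly, I'm imagining something like this with the OpenAI Python client; the model name, prompt, and file are purely illustrative:)

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Paste in one sample page and ask for a reusable parser, instead of
    # having the model extract the data itself on every run.
    html_sample = open("sample_page.html").read()

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "You write deterministic Python scrapers using BeautifulSoup."},
            {"role": "user",
             "content": "Write a function extract(html) that returns the product "
                        "names and prices from pages shaped like this:\n\n" + html_sample},
        ],
    )
    print(resp.choices[0].message.content)  # review the generated code before relying on it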

More generally, I wonder how a lot of smaller startups will fare once OpenAI subsumes their product. Those running a product that's a thin wrapper on top of ChatGPT or the GPT API will find themselves at a loss once OpenAI opens up the same capability to everyone. Perhaps SaaS products that differ only slightly from the competition really were a zero-interest-rate phenomenon.

This is why it's important to have a moat. For example, I'm building a product that has some AI features (an open-source email (IMAP and OAuth2) / calendar API), but it would work just fine even without any of the AI parts, because the fundamental benefit is still useful to the end user. It's similar to Notion: people will still use Notion to organize their thoughts and documents even without the Notion AI feature.

Build products, not features. If you think you're the one selling pickaxes during the AI gold rush, you're mistaken; it's OpenAI selling the pickaxes (their API), and you're the one panning for gold (trying to find AI products to sell).




Scraping using LLMs directly is going to be really quite slow and resource intensive, but obviously quicker to get set up and running. I can see it being useful for quick ad-hoc scrapes, but as soon as you need to scrape tens or hundreds of thousands of pages, it will certainly be better to go the traditional route. Using LLMs to write your scrapers, though, is a perfect use case for them.

To put it in context, the two types of scrapers currently are traditional HTTP-client based and headless-browser based, with headless browsers reserved for more advanced sites: SPAs where there isn't any server-side rendering.

However, headless browser scraping is on the order of 10-100x more time consuming and resource intensive, even with careful blocking of unneeded resources (images, CSS). Wherever possible you want to avoid headless scraping. LLMs are going to be even slower than that.
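(For reference, the resource blocking is only a few lines; here's a rough Playwright sketch, with the URL and blocked resource types as examples only:)

    from playwright.sync_api import sync_playwright

    BLOCKED = {"image", "stylesheet", "font", "media"}

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Abort requests for heavy, unneeded resources; let everything else through.
        page.route("**/*", lambda route: route.abort()
                   if route.request.resource_type in BLOCKED
                   else route.continue_())
        page.goto("https://example.com/some-spa-page")
        html = page.content()  # fully rendered DOM, ready for parsing
        browser.close()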

Fortunately, most sites that were client-side rendering only are moving back towards having a server-side renderer, and they often even have a JSON blob of template context in the HTML for hydration. Makes your job much easier!
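(As a concrete example, Next.js-style sites ship their hydration state in a __NEXT_DATA__ script tag, so you can often skip the DOM entirely. A sketch assuming that convention; other frameworks use different script ids and shapes:)

    import json
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/product/123", timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Next.js embeds the page's props as JSON for client-side hydration.
    blob = soup.find("script", id="__NEXT_DATA__")
    if blob is not None:
        data = json.loads(blob.string)
        # The exact path depends on the site; this is a hypothetical shape.
        props = data.get("props", {}).get("pageProps", {})
        print(props)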


I did this for the first time yesterday. I wanted the links for ten specific tarot cards off this page[0]. I copied the source into ChatGPT, listed the cards, and got the result back.

I'm fast with Python scraping, but for scraping one page ChatGPT was way, way faster. The biggest difference is that it was quickly able to get the right links by context: the suit wasn't part of the link but was in the header above it. In code I'd have to find that context and make it explicit.

It's a super simple HTML site, but I'm not exactly sure which direction that tips the balance.

[0]http://www.learntarot.com/cards.htm
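(For comparison, doing it "the explicit way" in code means walking the document and carrying the nearest header along as context. A rough BeautifulSoup sketch; the tag names and card names are guesses, not the actual structure of the page above:)

    from bs4 import BeautifulSoup

    # Hypothetical: the suit name lives in a header element and the card links
    # follow it, so track the most recent header while walking the links.
    html = open("cards.htm").read()
    soup = BeautifulSoup(html, "html.parser")

    cards = []
    current_suit = None
    for el in soup.find_all(["h2", "a"]):
        if el.name == "h2":
            current_suit = el.get_text(strip=True)
        elif el.get("href"):
            cards.append({"suit": current_suit,
                          "card": el.get_text(strip=True),
                          "url": el["href"]})

    wanted = {"The Fool", "The Magician"}  # the ten cards you care about
    print([c for c in cards if c["card"] in wanted])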


These kinds of one-shot examples are exactly where this hit for me. I was in the middle of some research when I saw him post this, and it completely changed my approach to gathering the ad-hoc data I needed.


> Using LLMs to write your scrapers, though, is a perfect use case for them.

Indeed... and they could periodically run an expensive LLM-powered scrape like this one and compare the results. That way they could automatically figure out whether the traditional scraper they've written needs updating.
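(Something like this, maybe: run the cheap deterministic scraper and an occasional LLM pass over the same URLs, then diff the records to flag drift. Both scrape functions are hypothetical stand-ins:)

    def check_for_drift(urls, traditional_scrape, llm_scrape):
        """Compare a cheap deterministic scraper against an occasional LLM pass.

        Both callables return a dict of field -> value per URL; they are
        placeholders for whatever scrapers you actually run.
        """
        mismatches = []
        for url in urls:
            baseline = traditional_scrape(url)
            reference = llm_scrape(url)  # expensive, so run on a small sample
            diff = {k: (baseline.get(k), reference.get(k))
                    for k in reference
                    if baseline.get(k) != reference.get(k)}
            if diff:
                mismatches.append((url, diff))
        return mismatches  # non-empty => the traditional scraper may need updating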


I'd invite you to check out https://www.usedouble.com/. We use a combination of LLMs and traditional methods to scrape and parse data to answer your questions.

Sure, it may be more resource intensive, but it's not slow by any means. Our users process hundreds of rows in seconds.


Exactly, semantically understanding the website structure is only one of many challenges with web scraping:

* Ensuring data accuracy (avoiding hallucination, adapting to website changes, etc.)

* Handling large data volumes

* Managing proxy infrastructure

* Elements of RPA to automate scraping tasks like pagination, login, and form-filling

At https://kadoa.com, we are spending a lot of effort solving each of these points with custom engineering and fine-tuned LLM steps.

Extracting a few data records from a single page with GPT is quite easy. Reliably extracting 100k records from 10 different websites on a daily basis is a whole different beast :)


Frustrating that the only option to learn more is to book a demo, and things like the API documentation are dead ends: https://www.kadoa.com/kadoa-api

The landing page does not provide nearly enough information on how it works in practice. Is it automated or is custom code written for each site?


In this particular case, GPT can help you mostly with parsing the website, but not with the most challenging part of web scraping, which is not getting blocked. For that, you still need a proxy. The value of using web scraping APIs is access to a proxy pool via a REST API.
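(In other words, the parsing might be GPT's job, but the fetching still goes through something like this; the proxy address and credentials are placeholders for whatever provider you use:)

    import requests

    # Route requests through a rotating proxy endpoint.
    proxies = {
        "http": "http://user:pass@proxy.example.com:8000",
        "https": "http://user:pass@proxy.example.com:8000",
    }

    resp = requests.get(
        "https://example.com/page",
        proxies=proxies,
        headers={"User-Agent": "Mozilla/5.0"},  # a realistic UA helps avoid trivial blocks
        timeout=30,
    )
    resp.raise_for_status()
    html = resp.text  # hand this to GPT (or your parser) for extraction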


You're correct that a lot of people are mistaken in this AI gold rush; however, they are also underestimating how weak their moat actually is and how much AI is going to impact it as well.

Notion does not have a good moat. The rise of AI usage isn't going to strengthen that moat; it's going to weaken it, unless they introduce major changes and make it harder for people to move their content away from Notion.

There are a lot of middlemen who are going to be shocked to find out how little people care about their layer when OpenAI can replace it entirely. You know that classic article about how everyone's biggest competitor is a spreadsheet? That spreadsheet just got a little bit smarter.


> perhaps you can simply ask the API to create Python or JS code that is deterministic, instead.

Had a conversation last week with a customer who did exactly that: they spent 15 minutes in ChatGPT generating working Scrapy code. Neat to see people solve their own problem so easily, but it doesn't yet erode our value.
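(The output of a session like that is typically a short spider along these lines; a hypothetical sketch, not the customer's actual code:)

    import scrapy

    class ProductSpider(scrapy.Spider):
        # Illustrative spider of the kind ChatGPT tends to generate on request.
        name = "products"
        start_urls = ["https://example.com/catalog"]

        def parse(self, response):
            for item in response.css("div.product"):
                yield {
                    "title": item.css("h2::text").get(),
                    "price": item.css("span.price::text").get(),
                }
            # Follow pagination until there is no "next" link.
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)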

I run https://simplescraper.io and a lot of the value is in integrations, scale, proxies, scheduling, the UI, not having to maintain code, etc.

More important than that though is time-saved. For many people, 15 minutes wrangling with ChatGPT will always remain less preferable than paying a few dollars and having everything Just Work.

AI is still a little too unreliable at extracting structured data from HTML, but it's excellent at auxiliary tasks like identifying randomized CSS selectors, etc.

This will change, of course, so the opportunity right now is one of arbitrage: use AI to improve your offering before it has a chance to subsume it.


For the reasons others have said, I don't see it replacing 'traditional' scraping soon. But I am looking forward to it replacing current methods of extracting data from the scraped content.

I've been using Duckling [0] for extracting fuzzy dates and times from text. It does a good job, but I needed a custom build with extra rules to turn that into a great job. And that's just for dates, one of the 13 dimensions it supports. Being able to use an AI that handles them with better accuracy will be fantastic.

Does a specialised model trained to extract times and dates already exist? It's entity tagging but a specialised form (especially when dealing with historical documents where you may need Gregorian and Julian calendars).

[0] https://github.com/facebook/duckling
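(For anyone who hasn't tried it: Duckling ships an example executable that runs as a small HTTP server on port 8000, and you query its /parse endpoint with form data roughly like this; the sentence and locale are just examples:)

    import requests

    resp = requests.post(
        "http://localhost:8000/parse",
        data={
            "locale": "en_GB",
            "text": "let's meet the third Wednesday of next month at 3pm",
        },
    )
    for entity in resp.json():
        # Each entity carries the matched text span and a resolved value.
        print(entity["dim"], entity["body"], entity["value"])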


You're spot on that AI could commoditize indie hacking.

The problem with many indie hackers is that they just build products to have fun and try to make a quick buck.

They take a basic idea and run with it, adding one more competitor to an already jammed market. No serious research or vision. So they get some buzz in the community at launch, then it dies off and they move on to the next idea. Rinse and repeat.

Rarely do they take the time to, for example, interview customers to figure out a defensible moat that unlocks the next stage of growth.

Those that do, though, usually manage to build awesome businesses. For example, the guy who built Browserbear also runs Bannerbear, which is one of the top tools in its category.

The key is to not stop at "code a fun project in a weekend" and to actually learn the other boring parts required to grow a legit business over time.

Source: I’m an indie hacker


I agree, Dago (by the way, I enjoy your memes on Twitter). I think too many IHers are just building small features rather than full-fledged products. I mean, if they want to make a few k a month, I guess that's alright, but they shouldn't be surprised if they're easily disrupted by competitors and copycats.

A month or two ago, there was some drama (which I'm sure you've seen as well) about an IHer who found a copycat. I looked into it and it didn't seem like a copy at all, yet this person was complaining quite heavily about it. But I mean, it's the fundamental law of business: compete or die. If you can't compete, you're not fit to run your business, and others who can, will.


Thanks for the meme appreciation :D

Yeah I think some people confuse copycats with competitors:

- Copycats who just flat out copy your design / messaging / landing page: that's something to complain about

- Someone building a product that solves a similar problem but with their own solution and design: that's perfectly normal and acceptable




