Show HN: I'm making an AI scraper called FetchFox

jackienotchan · 2024-09-04T05:39:59.000000Z

You have LinkedIn and Twitter examples, where you're very likely violating their TOS as they prohibit any scraping.

I also assume you don't check the robots.txt of websites?

I'm all for automating tedious work, but with all this (mostly AI-related) scraping, things are getting out of hand and creating a lot of headaches for developers maintaining heavily scraped sites.

- "Multiple AI companies bypassing web standard to scrape publisher sites" https://news.ycombinator.com/item?id=40750182

marcell · 2024-09-04T06:36:46.000000Z

Scraping is semi-controversial, but in this case it's just a user with a Chrome extension visiting the site. LinkedIn has lots and lots of shady patterns around showing different results to Google Bot vs. regular users to encourage logged in sessions. Many other sites like Pinterest and Twitter/X employ similar annoying patterns.

Imo, users should be allowed to use automation tools to access websites and collect data. Most of these sites thrive off of user generated content anyways, for example Reddit is built on UGC. Why shouldn't people be able to scrape it?

kaoD · 2024-09-04T07:38:56.000000Z

In hopes of saving someone a search: UGC = User Generated Content.

firtoz · 2024-09-04T08:29:39.000000Z

If let's say I built an extension that allows people to scrape things on demand and the extension sends that data also to my servers, removing PII in the process, would that be allowed?

padolsey · 2024-09-04T07:19:48.000000Z

Technically it's acting on behalf of a proactive user in Chrome so IMHO is non-"robotic". But heh tbf this was also the excuse of Perplexity where they argued they are a legitimate non-robotic-user-agent (thus don't need to respect robots.txt) because they only make requests at the time of a user query. We need a new way of understanding what it even means to be a legitimate human user-agent. The presence of AIs as client-side catalysts will only grow.

silvanocerza · 2024-09-04T10:53:42.000000Z

Scraping is not illegal. Note that this decision is before the AI craze.

https://www.forbes.com/sites/zacharysmith/2022/04/18/scrapin...

aguaviva · 2024-09-04T16:13:38.000000Z

The parent didn't say the scraping was "illegal", but that it violated ToS.

These are entirely different things. The upshot of the proceedings is that while the courts ruled there wasn't sufficient for an injunction to stop the scraping, it was nonetheless still injurious to the plaintiff and had breached their User Agreement -- thus allowing LinkedIn to compel hiQ towards a settlement.

From Wikipedia:

   The 9th Circuit ruled that hiQ had the right to do web scraping.[1][2][3] However, the Supreme Court, based on its Van Buren v. United States decision,[4] vacated the decision and remanded the case for further review in June 2021. In a second ruling in April 2022 the Ninth Circuit affirmed its decision.[5][6] In November 2022 the U.S. District Court for the Northern District of California ruled that hiQ had breached LinkedIn's User Agreement and a settlement agreement was reached between the two parties.[7]

robofanatic · 2024-09-05T14:25:55.000000Z

I see scraping to be equivalent to a Cherry Tree Shaking machine :-) If you are authorized to pick cherries from a tree then why not use a tree shaker and do the job in seconds but yeah make sure you don't kill the tree in the process. Also the tree owner must have right to deny you from using the tree shaker machine.

https://www.youtube.com/watch?v=miBk0lyMBC0

siamese_puff · 2024-09-07T13:05:37.000000Z

Hackers have gotten so boring these days. A fellow hacker builds a fun tool and we first gravitate toward the legal implications?

churros_train · 2024-09-04T05:34:23.000000Z

I am really curious how do people actually evaluate scrapers? There are so many options and I am dizzy just trying to read them...

Also wondering how does the OP think about comparing themselves and standing out in the marketplace of seemingly bazillion options

marcell · 2024-09-04T06:40:16.000000Z

Try it out and let me know if you like it :). If there's a bug I'll fix it!

More specifically, FetchFox is targeting a specific niche of scraping. It focuses on small scale scraping, like dozens or a few hundred pages. This is partly because, as a Chrome extension, it can only scrape what the user's internet connection can support. You can't scrape thousands or millions of pages on a residential connection.

But a separate reason is, I think that LLM's open up a new market and use case for scraping. FetchFox lets anyone scrape without coding knowledge. Imagine you're doing a research project, and want data from 100 websites. FetchFox makes that easy, whereas with traditional scraping you would have needed coding knowledge to scrape those sites.

As an example, I used FetchFox to research political bias in the media. I was able to get data from hundreds of articles without writing a line of code: https://ortutay.substack.com/p/analyzing-media-bias-with-ai . I think this tool could be used by many non-technical people in the same way.

Malidir · 2024-09-04T08:15:10.000000Z

why would these non technical people even use a tool when they can go to a internet connected llm and say 'go to this site and get this info'

e.g. Mr John Smith is a journalist, find his ten most recent articles via locating his personal website, news sites and social media.

so wondering if your tool will be obsolete in a years time?

marcell · 2024-09-04T08:18:59.000000Z

I can’t predict the future but right now you can use ChatGPT to scrape maybe one or two pages at a time, but it’s harder to do a dozen or a few hundred. So that’s the niche I’m going for.

Try it out and see if you like it. Curious if you think it’s better or worse than ChatGPT for scraping dozens of pages.

churros_train · 2024-09-04T07:07:07.000000Z

Ah thats really interesting! How do you evaluate large scale cloud scraping services, since its operations are entirely hidden from you?

Personally I am looking into options in this area, are you planning to offer a cloud based version of this at some point/could you tell which existing ones are good if not?

marcell · 2024-09-04T08:17:27.000000Z

If you don’t really care about the “morals” of the proxy services, just sign up for a few and see which one is reliable and has good cost. I used Luminati before, they have rebranded to Bright Data. Another is Oxy Labs.

I do want to offer a cloud version. If it’s something you’d be interested in, please email and maybe you’d be a good early user for it. You’d get 1:1 attention as one of our early users. Email is on the site and in my profile.

mhuffman · 2024-09-04T09:25:16.000000Z

>I am really curious how do people actually evaluate scrapers?

It is not easy to evaluate scrapers unless you have had to deal with lots of poorly written websites in your life. Just using it on a few highly structured well-maintained sites can be impressive but if you are using it to acquire data from many websites or large websites things get hairy fast.

Most scrapers today are some combination of extracting xpaths and reducing them to the loosest common form, parsing semantic (or easy to identify, like links) or highly structured content that has discoverable patterns, and LLMs.

The actual best way to scrape a site is to determine if they are populating the data you want with API calls and replicate those. People are usually more reluctant to completely change back-end code but will make subtle breaking changes (to your scraper) to front-end code all the time. For example small structural or naming changes. This has become more problematic since people have been moving to more SSR and semi-SSR injection. There can also be a problem with discovering all the pages on a site if it doesn't have poorly designed or implemented paging or search.

Some of the worst sites to scrape are large WP sites that have obviously been through a few developers. If you really want to test a scraper find some of those and they will put it to the test.

Cloudflare is another issue. Not necessarily an issue with this plugin, but because so many sites use it, you typically have to spin up multiple automated headless browsers using residential proxies for any type of large-scale scraping.

Some things that LLM does shine at related to scraping is interpreting freeform addresses, custom tables (meaning no TR,TD, but just divs and CSS to make it look like a table, and lists that are also just styled divs. Often there are no tags, attributes, keywords, or generalized xpath that will help you depending on how the developer put it together.

Surprisingly there is a pretty old library from Microsoft of all places called Prose if you can still find it (they keep updating but using the same name for different things and trying to inject AI) that is really good at pattern matching and prediction and is small, fast, and free and generally great at building a generalized scraper. Only drawback is I believe the only one I could find was .NET at the time.

smcleod · 2024-09-04T08:02:56.000000Z

Out of interest - why is it called FetchFox - but it doesn't work on Firefox?

marcell · 2024-09-04T08:19:33.000000Z

FireFox versión coming soon!

Malidir · 2024-09-04T08:09:05.000000Z

Indeed!

konata390 · 2024-09-04T19:36:45.000000Z

A bit off-topic, but why do people still use the GIF format? The "example-hn.gif" is 8.5MB, for 45 seconds of pretty stuttery video. I converted it to a similar looking VP9 video, and it was only 1.5MB, and with AV1 I got it down to 550KB with basically lossless quality.

stevenicr · 2024-09-05T15:52:51.000000Z

from what I have noticed - many of the "video instead of gif" sites / apps disable (right click-save , long press-save) saving.. so many users prefer and use gifs because they are easy to save and share and many times the video version is impossible.

CalRobert · 2024-09-04T06:58:09.000000Z

Of all the names to give something you built for chrome instead of Firefox…

rchaud · 2024-09-04T19:53:00.000000Z

Fitting considering it uses "Open"AI for the heavy lifting.

marcell · 2024-09-04T07:04:46.000000Z

I know! I used ChatGPT to get ideas for the name + logo, and didn't realize the issue until it was too late

netsharc · 2024-09-04T08:09:26.000000Z

Too late... how?

thegabriele · 2024-09-04T09:20:52.000000Z

It's not too late...

bearjaws · 2024-09-04T02:35:23.000000Z

This is really cool, I'd just go ahead and double check that max spend limit on your OAI key before going to bed :)

This kind of stuff gets expensive fast.

marcell · 2024-09-04T06:40:40.000000Z

Thanks for the kind words! I have a spend limit in place :)

benrules2 · 2024-09-04T13:23:45.000000Z

This is a really cool tool. Have been playing with similar scraping capabilities, so appreciate you sharing the source code as well. People who are saying "loads of scraping tools already exist" have likely not suffered through the current state of the art too, as heuristic based approaches absolutely pale in comparison to what an LLM can extract.

trog · 2024-09-04T04:11:00.000000Z

Would love something like this that allows users to trivially turn sites like Facebook/Twitter into RSS feeds. I'm sure this kinda thing is a useful stepping stone to doing that.

churros_train · 2024-09-04T05:35:37.000000Z

My impression is that facebook and twitter have really strong anti scraping measures. Is that wrong? And is there any reliable scraping services that can actually do scraping of those large companies' sites at a reasonable cost?

marcell · 2024-09-04T06:47:34.000000Z

One thing to note about FetchFox: it runs as a Chrome extension. This means it has a different interaction with anti-scraping measures than cloud based tools.

For one thing, many (most? all?) large sites ban Amazon IP's from accessing their websites. This is not a problem for FetchFox.

Also, with FetchFox, you can scrape a logged in session without exposing any sensitive information. Your login tokens/passwords are never exposed to any 3rd party proxy like they would be with cloud scraping. And if you use your own OpenAI API key, the extension developer (me) never sees any of the activity in your scraping. OpenAI does see it, however.

> And is there any reliable scraping services that can actually do scraping of those large companies' sites at a reasonable cost?

FetchFox :).

But besides that, the gold standard for scraping is proxied mobile IP requests. There are services that let you make requests which appear to come from a mobile IP address. These are very hard for big sites to block, because mobile providers aggregate many customer requests together.

The downside is mainly cost. Also, the providers in this space can be semi-sketchy, depending on how they get the proxy bandwidth. Some employ spyware, or embed proxies into mobile games without user knowledge/consent. Beware what you're getting into.

marcell · 2024-09-04T06:41:50.000000Z

I had another request for the exact same thing, actually. I'm planning to separate out the scraping library from the Chrome extension, and this project would be a good use case for that library.

echelon · 2024-09-04T02:58:16.000000Z

Maybe instead of selling scraping to end users, invert the problem and sell data to AI companies:

1. Pay users to install a browser extension that scrapes social media content they browse. Or ask them in exchange for a service, e.g. "remember everything I browse and make it searchable", etc.

2. Ship the data you scrape to your servers.

3. Sell training data to companies at a discount.

This gets past the new rate limiters and blocks that Reddit and others have installed.

roywiggins · 2024-09-04T04:27:39.000000Z

That's like 90% of the way to "free" VPNs that actually rent out your IP address to spiders, those have been around for ages.

What if we gave people some service, say a "browser toolbar", and in exchange we sell their browsing data to third parties?

You just reinvented spyware from first principles. This is basically BonziBuddy.

mkroman · 2024-09-05T15:36:24.000000Z

I've wanted to make something like this myself, so thanks and good job!

How does this work? Does it rely on GPT to extract the data or does it actually generate a bunch of selectors? If it's the former, then the results aren't reliable since it can just hallucinate whole results or even just parts.

marcell · 2024-09-05T22:55:48.000000Z

It uses GPT to extract the data: https://github.com/fetchfox/fetchfox/blob/public/src/lib/scr...

I haven't put together a good test framework yet, but qualitatively, the results are surprisingly good, and hallucinations are fairly low. The prompt tells GPT to say (not available) if needed, which helps.

I'm going to try the "generate selectors" approach as well. If you'd like to learn more or discuss just reach out via email (marcell.ortutay@gmail.com) or discord (https://discord.gg/mM54bwdu59 @ortutay)

HenryBemis · 2024-09-04T06:08:12.000000Z

Hmmm... assume one can harvest the financial news from some big website, then correlate it with market (historic) movements (when articles stating X Y and Z are posted, then after 24h Gold price dropped)(with a 80%-90% rate)(that could be used to predict and trade 'regularly')

smcin · 2024-09-04T00:26:39.000000Z

Interesting. How long did it take to figure out how to do this with ChatGPT?

> "By scraping raw text with AI, FetchFox lets you circumvent anti-scraping measures on sites like LinkedIn and Facebook. Even the the complicated HTML structures are possible to parse with FetchFox."

marcell · 2024-09-04T00:37:32.000000Z

The ChatGPT part is pretty easy actually. You can just dump text and HTML and ask it a question, and it usually answers correctly.

The trickier part is “everything else” to make the extension work.

FrenchDevRemote · 2024-09-04T05:25:23.000000Z

how do you deal with the fact that some basic pages can have tens of thousand of tokens?

marcell · 2024-09-04T16:37:03.000000Z

Right now, not much. The extension is fairly basic in that is just looks at the raw text + HTML and sends it to the LLM.

The benefit of this approach is it's very simple and easy, but the downside is it sends a lot of unnecessary tokens to the LLM. That drives up the cost, slows things down, and hurts accuracy.

I'm working on a few improvements now to improve this.

superasn · 2024-09-05T13:16:33.000000Z

I remember there was something called readability for chrome which is just what browsers have incorporated as reader view. And mozilla even had a stand-alone version of it (1). Might be of interest to you.

[1] https://github.com/mozilla/readability

smcin · 2024-09-04T00:43:39.000000Z

Even the parsing of obfuscated HTML + CSS + dynamic JSON content?

marcell · 2024-09-04T00:53:38.000000Z

Surprisingly yes, most of the time. I’ve put in a few optimizations:

1. Remove all <style> and <svg > tags. These rarely add value, and can dramatically increase token counts.

2. For the “crawl” step, I exclusively pull out <a> tags and only look at those. The “extract” step looks at full HTML

3. For now, it only looks at the first 50k text characters, and the first 120k HTML characters. This is to stay within token limits.

The last part will be what I focus on improving in the next version.

dillondoyle · 2024-09-04T02:50:53.000000Z

Could go the google way, capture an image screenshot of state, ocr, then parse it.

They keep throwing it in my url bar. I refuse to click (big warning it sends to google's servers)

ratata · 2024-09-04T01:55:32.000000Z

Nice! Any plans for Firefox support?

marcell · 2024-09-04T02:01:50.000000Z

Yea! Check back in a week or so

aayothered · 2024-09-05T07:56:59.000000Z

This is really cool, I don't know when I will need something like this, but I am sure the day will come! I hope the tech and the policies that govern scrapers are in place at that time! Best of luck

SomewhatLikely · 2024-09-04T07:39:22.000000Z

Can I recommend you provide some cost estimates next to the examples for using your own key? I tried a few custom extractions and then checked my usage dashboard and it was already over $2.

marcell · 2024-09-04T08:14:02.000000Z

Sorry about that! I’ll add this in the next version, expect it to be fixed in a few days.

starfallg · 2024-09-04T05:26:57.000000Z

This is interesting. How much difference is it (in cost, quality) by using this approach compared to taking a image capture of the page and then sending it off to a multi modal LLM?

marcell · 2024-09-04T06:43:18.000000Z

Good question, I actually haven't tried it with the image capture approach. I'll give that a shot and see how it performs. I'm planning to try many different AI extractors, and see which performs best.

So far, I've done some un-scientific testing to compare text vs. HTML. Text is a lot more effective on a per-token basis, and therefore lower cost. However, some data is only available in HTML.

platorob · 2024-09-04T01:32:10.000000Z

Useful! Keep up the good work!

hydrogenpolo · 2024-09-04T01:21:17.000000Z

Sweet ill take a look!

platorob · 2024-09-04T01:32:55.000000Z

Gm gm! This is a good tool. Tried it out to scrape for email from various professional sites. Thumbs up

natch · 2024-09-04T07:36:32.000000Z

Forgetting about law and copyright and robots.txt because hey most scrapers have to rely on fair use anyway, you even forget about basic consideration for the sites you hammer.

That sounds like a scold, but it's meant as an observation.

Now I will embed some implied scolding in what's to follow, but feel free to ignore that part; I wouldn't expect you to care.

But if you lack even a shred of human decency or morals, perhaps there's one more reason you might consider for spreading out your requests across sites and time, instead of absolutely pushing the abuse to the hilt, and that is that if you change what you are doing, and take a slower, more gentle approach, and I'm appealing to your selfishness here because clearly that is the only viable way into your head, then you will be less likely to cause countermeasures, and more likely to succeed.

marcell · 2024-09-04T08:13:25.000000Z

It’s actually a Chrome extension, and it runs in the user’s browser. You can configure the scraping rate and the app warns against higher rates (more than 5 tabs at a time).

I don’t think any sites are going to get hammered though, even at the fastest rates. The limiting factor is often LLM token rates.

natch · 2024-09-06T19:58:51.000000Z

Thanks for the clarification.