Hacker News
Perplexity's grand theft AI (theverge.com)
41 points by latexr 10 months ago | 50 comments



> At this point, Wired jumped in, confirming a finding from Robb Knight: Perplexity’s scraping of Forbes’ work wasn’t an exception. In fact, Perplexity has been ignoring the robots.txt code that explicitly asks web crawlers not to scrape the page

There's a distinction that is being missed between:

* Web crawlers automatically accessing pages, such as recursively following links to index them for search engines

* A tool accessing a URL in direct response to a user request

robots.txt is only intended for the former. For instance, archive.is state:

> [Why does archive.is not obey robots.txt?] Because it is not a free-walking crawler, it saves only one page acting as a direct agent of the human user. Such services don't obey robots.txt (e.g. Google Feedfetcher, screenshot- or pdf-making services, isup.me, …)

Perplexity do (as far as I've been able to find) respect robots.txt for their scraping. What the investigations confirmed is that it's ignored when a user enters a URL to summarise.
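To make the crawler-vs-user-agent distinction concrete, here is a minimal sketch (the rules and agent names are made up for illustration) of how a polite recursive crawler consults robots.txt before fetching, using Python's stdlib `urllib.robotparser` — the check that a user-driven, single-page fetch (as archive.is describes) typically skips entirely:

```python
# Sketch: how a polite crawler consults robots.txt before fetching.
# The rules and bot names below are hypothetical examples.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: ExampleSummaryBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A crawler identifying itself honestly is told to stay out...
print(rp.can_fetch("ExampleSummaryBot", "https://example.com/article"))  # False

# ...while other agents are unaffected; a user-initiated fetch
# that never runs this check sees no barrier at all.
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```

The point of the sketch: robots.txt is purely advisory, enforced only by the fetcher choosing to run this kind of check and to identify itself truthfully.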

There is still a conversation to be had. I think users should be able to browse the web with whatever tool they want to - whether that's a standard browser, a minimal reader mode that doesn't show ads, or a statistical summarisation tool - but it does pose a problem for sites that rely on standard browsers letting them show users advertisements, get them to sign up to mailing lists, collect data, etc. if users decide to choose tools that don't allow that.


I have been pretty adamant about the privacy issues and content-stealing issues with AI... but in this particular case, honestly, I am on the side of Perplexity here. Mostly...

Once we get to the point of this AI running locally, or you just have a browser extension that grabs the data and sends it to the server, what is the difference between that and what it is doing now?

If I use an AI and I give it a specific URL I think the expectation is that it is acting like a user and not like a crawler so the robots.txt is meaningless.

That doesn't mean that Perplexity is free of any blame here: if they are getting around paywalls, or using this loophole to gather up data to train on later, that isn't ok. I am also concerned about the reports of them using third-party crawlers to get around being blocked.

But, purely on what this seems to be focused on and the example given, I don't see anything wrong with it when it is triggered by a specific action of the user. That should be the expected behavior IMO.

Headline-grabbing articles like this, I feel, undermine the valid criticism of AI: they can be used to paint a picture of a vendetta against AI, or worse, held up as an example of why these arguments are wrong, discrediting large swaths of criticism.

We really need to be more mindful about how we criticize the real problems with the technology.


There’s a bit of a middle ground where the user may say something like “tech articles from news sites”, which would require the agent to move around reading and aggregating information to return to the user. Is this the former or the latter?


I think the dividing line is more between "happens in anticipation that there might be a user session which wants the data" and "happens as the result of a user session which wants the data" than it is about how it was done. E.g. prefetching content after a page load happens as a result of the user session reaching the page, as opposed to going around to any sites it can find to cache them on the chance a user might need them.


Exactly.

AI that answers questions using online sources isn't a web crawler, but a web researcher.


Isn’t it ironic that journalists themselves aggregate content in a way that discourages the reader from clicking through to the primary source?

This type of summarization and aggregation seems to be exactly what consumers want.


Bad journalists do. The good ones quote/link their sources. You can do the general idea of journalism in many ways. (Tabloid writers are journalists too... in theory)


And where does one find this ‘good’ journalism? It seems to be the exception rather than the norm, as the majority of articles are summarizations, opinions, rip-offs from Reddit, or clickbait quoting ‘experts’ or ‘sources’ who are not named.


The most recent on my mind: Polymatter videos. For example, the latest one https://youtu.be/H5EF8v0iGBs and its sources https://docs.google.com/document/u/0/d/1ph76k8iQVG5U1K2qpIup...


Is this about syndication or "facts can't be copyrighted"? Genuine question, I'm not sure how that works.


Clicking through to what, in person interviews with sources? Your comparison makes no sense


My impression is that most news now is just reported by a single source or a wire service, and 90+% of sites just write their own article based on that first news site's article.


Related:

Perplexity AI is lying about their user agent https://news.ycombinator.com/item?id=40690898


Thanks! Macroexpanded:

Perplexity AI is lying about their user agent - https://news.ycombinator.com/item?id=40690898 - June 2024 (531 comments)
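The mechanism behind that linked thread's claim can be sketched in a few lines (the header values here are hypothetical): the User-Agent is just a client-chosen string, so a fetcher can present itself as an ordinary browser instead of declaring a bot identity, and robots.txt rules keyed to a bot name never apply to it.

```python
# Sketch: the User-Agent header is whatever the client says it is.
# Both header values below are made-up examples.
from urllib.request import Request

declared = Request("https://example.com/article",
                   headers={"User-Agent": "ExampleSummaryBot/1.0"})
disguised = Request("https://example.com/article",
                    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})

# The server only sees the string it is sent; nothing in the request
# proves which software actually made it.
print(declared.get_header("User-agent"))   # ExampleSummaryBot/1.0
print(disguised.get_header("User-agent"))  # Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```

This is why user-agent-based blocking, like robots.txt itself, is best-effort: it only constrains clients that choose to identify themselves honestly.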


> So that’s Perplexity’s real innovation here: shattering the foundations of trust that built the internet. The question is if any of its users or investors care.

Ron Howard voiceover: They do not.


I think what Perplexity is trying to build is cool, but seems like they do quite a bit of dodgy shit.

In his recent appearance on the Lex Fridman podcast, Perplexity CEO Srinivas admitted that they'd abused Twitter's academic grant program and autogenerated thousands of grant applications with GPT.

If he so laughingly reminisces about doing that, I wonder what kind of other dodgy shit they do behind closed doors.


So when I pay an intern to do a Google search and make a short report of the facts he finds and give it back to me, that's not a problem, but when I ask an AI to do the same it suddenly is a problem. Well, I don't get it.

just my 2 ct


The complaint here is that the internet used to be "ad supported", and while showing ads to your intern isn't great, it's better than showing them to a web crawler.


I believe that sooner or later we will see something similar to Perplexity that runs directly in your browser, opens the pages for you, and answers your question.

Unless it has already been done.


And people will see less and less of the original material, popups, cookie warnings, floating videos and offers to join a mailing list. It's so sad (not)


Right, I just realised this debate is basically the same debate towards ad-blocking and paywall jumping (e.g. via archive.is).

Of course opinions vary on HN, but generally it feels like "ad-blocking is cool, paywall hopping is cool, but the server also has the right to not give us the information".

Given this, this is pretty sensational reporting by The Verge -- calling it theft, comparing it to crypto, "shattering the foundations of the internet" (... the internet is about sharing), etc. This is a non-story; we already have these tools, this just adds an AI summary layer on top.


>This is a non-story, we already have these tools, this just adds an AI summary layer on top.

This is a story. With ad blockers, the assumption was that the original page as a whole is valuable, but users just don't want to pay the cost of the ads and other obtrusive elements. But now we're opening up the question of what exactly is it on that page that's of actual value - is it just the text, or only some of the text, or only the "facts" expressed on the page, or maybe something even more abstract, such as how the page corroborates a fact on another page.

I find this to be a really interesting question, both on a practical level and on an epistemological level - what is it that we as readers actually (want to) get from a resource?


Fair enough, it is a very interesting question indeed. Though I think precedent might be found in newspapers copying each other's stories and "facts can't be copyrighted" kind of stuff.


Life is shades of gray. In this particular case I think it's reasonable to want to block big, giant, ugly ads (those infested with JS for tracking purposes) while still believing that the site should be allowed to display far less intrusive advertisement, like mentioning a restaurant in the article body, or some sidebar links to their social media or partner sites. Summarization of this kind nukes all possibility of clean, simple advertising, so no, I don't think it's as close to ad-blocker extensions as you believe.


I disagree slightly. Ad blockers do block non-intrusive advertisements too, for example Ethical Ads, or sponsorblock on YouTube. To me it's less about intrusiveness/privacy and more about "I control what I see".

And in-text ads can already be removed by summarisation tools (even from before the AI era!) or by apps like Boring Report.


Just to be clear, sponsorblock has less than 3% of the number of users of uBlock Origin alone, so while it's an interesting utility and an interesting subject for the ad-blocking debate ethics-wise, it has nowhere near enough prevalence to be significant when talking about ad-blockers used by the masses. And Boring Report is not even optimized for mobile viewing (Android), at least in my experience a few minutes ago: https://i.imgur.com/q6e3DzM.jpg


Except that they will see a lot less revenue and a lot less engagement, and the creators will lose incentives to keep creating. To be clear, I also wholeheartedly hate a lot of those practices, but it's a fact that many of them increase engagement.


If the incentives for creating content decrease then where will AI source from?


AI has millions of users; some have hundreds of millions. Those people bring data right into the AI's mouth. The collective effect of so many chat sessions will surpass what we publish today, and many more people will be in the loop.


Arc Search does that already. Can’t recall which provider(s) they rely on.


I should have specified that by "runs directly in your browser" I meant running it locally on the machine.

EDIT: You still have to rely on a server to index the web pages of interest.


OpenAI is on track to ship that:

https://multi.app/blog/multi-is-joining-openai


How is multi related to what the op is saying?


Assuming OpenAI wants the tech to control desktop applications using AI, the tech should be able to automate browser operations as well.


https://github.com/shreyaskarnik/DistiLlama (not fully integrated, you have to start ollama separately, but it is all local)


It doesn't need to be local. Google isn't local and the average Joe is fine with that.


The content mentioned in the article was behind a paywall, though; how are you going to circumvent that legally?


Bit confused about why Perplexity in particular?

“Leveraging” other people’s content is basically par for the course. Whether it’s training, Google News, Google Books, or Stability AI images, it’s all doing the same thing, just to different degrees.


Related:

Perplexity is a bullshit machine

https://news.ycombinator.com/item?id=40728732


Lying article lies.

> Though Forbes has a metered paywall on some of its work, the premium work — like that investigation — is behind a hard paywall

No it's not, see for yourself! I can read it just fine...

https://www.forbes.com/sites/sarahemerson/2024/06/06/eric-sc...


I can't read that, it just shows a full page about a flash sale and a link back to the homepage.


Interesting, I see exactly this: https://archive.is/aRPau


Copyright should die. It was already standing on just one foot once the internet made mass copying trivial; file sharing has been going on for decades. Now they want to extend copyright to restrict AI, which doesn't even replicate the source text, and instead aims to answer a user's question by combining information across multiple sources.

We have seen other models that don't rely on protection flourish - Wikipedia, open source, scientific publications, open-weights models, and even fashion. They are all permissive, and thriving.


Apply that idea to the journalism Perplexity stole, which obviously has value. Forbes worked for months to create a story, only to watch an IP launderer garner more financial reward.


I would agree if newspapers stopped reporting on social media posts, for which they don't ask permission or pay royalties. Not to mention that most articles are just rewordings of press agency news releases and other publications.

And isn't Google stealing the content from the whole internet and making a huge profit from it? Why the double standard. Perplexity has links to sources too.


If you pay attention to what I wrote vs what you wanted to read, you'd see the false equivalence is yours.

Pay-gating copyrighted content is different indeed from public-posted social media content. In the instance of the Forbes theft, Perplexity offered links to other copycats, but never the original [costly] reporting.


And how do you think all the content creators (either mainstream or independent) are going to keep churning quality content if they get one order of magnitude less traffic than before, while the AI "answer engines" sell premium subscriptions to provide answers based on scraped content bypassing paywalls?


AI engines are the future of search and task solving. They will absorb a lot of information from their users, much more than old search engines, such as problem solving strategies, what works and what doesn't.

Just consider the fact that ChatGPT has 180M users and probably solves 1B tasks per month. It collects contextual data, guidance, and feedback on a massive scale, going into the next model with some PII and copyright mitigations.

We will go to LLMs for advice because they will have the best data, refreshed by hundreds of millions of users. LLMs will naturally collect the most up-to-date feedback from the world; they don't even need to do anything, just wait for users to copy-paste the relevant data and provide iterative guidance.


Right, and how will LLMs determine whether the data they are "collecting" was written by themselves or other AIs? Eventually all AI content will be derived from AI content. Is that a good answer?


How did you determine that ChatGPT "solves 1B tasks per month"?



