Read this article if you want to know about Perplexity's approach of taking other people's content and thinking they can get away with it:

https://stackdiary.com/perplexity-has-a-plagiarism-problem/

The CEO said that they have some “rough edges” to figure out, but their entire product is built on stealing people’s content. And apparently[0] they want to start paying big publishers to make all that noise go away.

[0]: https://www.semafor.com/article/06/12/2024/perplexity-was-pl...




It's been debated at length, but to make it short: piracy is not theft, and everyone in the LLM space has been taking other people’s content and so far getting away with it (pending lawsuits notwithstanding).


> so far getting away with it (pending lawsuits notwithstanding).

I know it feels like it's been longer, but it's not even been 2 years since ChatGPT was released. "So far" is in fact a very short amount of time in a world where important lawsuits like this can take 11 years to work their way through the courts [0].

[0] https://en.m.wikipedia.org/wiki/Oracle_v_Google


In 9 years' time, robots will publish articles on the web, and they will put a humans.txt file at their root index to govern what content humans are allowed to read.

Jokes aside, given how models keep getting better, cheaper, and smaller, RAG classification and filtering engines like Perplexity will become so ubiquitous that I don't see any way for a website owner to force anyone to visit the website anymore.


I'd believe it if they were targeting entities that could fight back, like stock photo companies and Disney, instead of some guy with an ArtStation account or some guy with a blog. To me it sounds like these products can't exist without exploiting someone, and they're too cowardly to ask for permission because they know the answer is going to be "no."

Imagine how many things I could create if I just stole assets from others instead of having to deal with pesky things like copyright!


...which is a great argument for abolishing copyright:P


...which is a great argument for how unjust a law is when it only protects those who can afford it.

Cheaper processes to protect smaller creators in cases like these is what is really needed.


> pending lawsuits notwithstanding

That’s a hell of a caveat!


> piracy is not theft

Correct, but it is often a licensing breach (though that depends on the reading of some licenses, and again, these things are yet to be tested in any sort of court), and the companies doing it would be very quick to send a threatening legal letter if we used some of their output outside the stated licensing terms.


If using copyrighted material to train an LLM is theft, so is reading a book.


So if I get access to the Perplexity AI source code (I borrow it from a friend), read all of it, and reproduce it at some level, then Perplexity will say: "Sure, that's fine, no harm, no IP theft, no copyright violation, because you read it, so we're good"?

No, they would sue me for everything I've got, and then some. That's the weird thing about these companies: they are never afraid to use IP law to go after others, but those same laws don't apply to them... because?

Just pay the stupid license, and if that makes your business unsustainable, then it's not much of a business, is it?


Funny enough, their prompts leaked: https://www.reddit.com/r/perplexity_ai/s/kn6i20kMLH

And I’ve built a Perplexity clone in about a day - it’s not that hard: search -> scrape results -> parse results -> summarize results -> summarize aggregate results into a single summary.
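For the curious, here's roughly what that pipeline looks like in Python (a minimal sketch, not their actual stack: the search call assumes the Brave Search API, and llm() is a hypothetical stand-in for whatever model you wire up):

    import requests

    BRAVE_KEY = "YOUR-KEY"  # placeholder; Bing or any other search API works too

    def search(query, count=5):
        # Step 1: web search
        r = requests.get(
            "https://api.search.brave.com/res/v1/web/search",
            headers={"X-Subscription-Token": BRAVE_KEY},
            params={"q": query, "count": count},
            timeout=10,
        )
        r.raise_for_status()
        return [hit["url"] for hit in r.json()["web"]["results"]]

    def scrape(url):
        # Step 2: fetch each result page
        return requests.get(url, timeout=10).text

    def llm(prompt):
        # Hypothetical stand-in: point this at whatever chat-completions
        # endpoint/model you prefer.
        raise NotImplementedError

    def answer(query):
        # Steps 3-5: parse/summarize each page, then summarize the summaries
        summaries = [
            llm(f"Summarize as it relates to {query!r}:\n{scrape(u)[:8000]}")
            for u in search(query)
        ]
        return llm(f"Answer {query!r} using these notes:\n" + "\n---\n".join(summaries))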

I’m really not sure I even see their moat.


What have you used, if I may ask? It seems very simple indeed. What search API is best?

Also, there is a program called html2text to throw out the HTML formatting so as to use fewer tokens. Have you used this or something similar?
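For anyone curious, the Python html2text package does it in a few lines (a sketch; the sample HTML is made up):

    import html2text

    raw_html = "<h1>Title</h1><p>Some <a href='/x'>linked</a> text.</p>"

    h = html2text.HTML2Text()
    h.ignore_links = True    # drop link targets to save even more tokens
    h.ignore_images = True
    print(h.handle(raw_html))  # HTML in, markdown-ish plain text out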


Brave API (Bing is good as well). Here's a little gist (Elixir). It's pretty rudimentary so far and needs refining, but works alright enough (result at bottom): https://gist.github.com/cpursley/b4af2ff3b56c912f659bd5300e4...

The most useful part is probably the prompt and the usage of Phi 3 Mini 128K Instruct for web page summarization and Llama 3 for the final summary (of the summaries). I'm parsing out all but minimal content HTML, but might even remove that to keep context length down.


Very nice, thank you!


If Perplexity’s source code is downloaded from a public web site or other repository, and you take the time to understand the code and produce your own novel implementation, then yes. Now, if you “get it from a friend” illegally, _or_ you just redeploy the code without creating a transformative work, then there’s a problem.

> Just pay the stupid license, and if that makes your business unsustainable, then it's not much of a business, is it?

In the persona of a business owner: why pay for something that you don’t legally need to pay for? The question of how copyright applies to LLMs and other AI is still open. They’d be fools to buy licenses before it’s been decided.

More importantly, we’re potentially talking about the entire knowledge of humanity being used in training. There’s no-one on earth with that kind of money. Sure, you can just say that the business model doesn’t work, but we’re discussing new technologies that have real benefit to humanity, and it’s not just businesses that are training models this way.

Any decision which hinders businesses from developing models with this data will hinder independent researchers tenfold, so it’s important that we’re careful about what precedent is set in the name of punishing greedy businessmen.


> They’d be fools to buy licenses before it’s been decided.

They are willingly ignoring licenses until someone sues them? That's still illegal and completely immoral. There is tons of data to train on: the entirety of Wikipedia, all of StackOverflow (at least previously), all of the BSD- and MIT-licensed source code on GitHub, the entire Project Gutenberg. So much stuff, freely and legally available, yet they feel that they don't need to check licenses?


The legality of their behavior is not currently well defined, because it's unprecedented. Fair use permits transformative works. It has yet to be decided whether LLMs and their output qualify as transformative, or even if the training is capable of infringing copyright of an individual work in the first place if they're not reproducing it. In fact, there's a good amount of evidence which indicates that fair use _does_ apply, given how Google operates and what they've argued successfully (https://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com...).

Purchasing licenses when you are already entitled to your current use of the work is just bad business, especially when the legal precedent hasn't been set to know what rights might need to exist in said license.

You might not like the idea of your blog posts or other publicly posted materials being used to train LLMs, but that doesn't make it illegal (morality is subjective and I'm not about to argue one way or another). If it's really that much of a problem, you _do_ have the ability to remove your information from public accessibility, or otherwise protect it against LLM ingestion (IP restrictions, etc.).

edit: I am not a lawyer (this is likely obvious to any lawyers out there); this is my personal take.


Note that not all jurisdictions have the concept of "fair use" (use of copyrighted material, regardless of transformation applied, is permitted in certain contexts…ish). Canada, the UK, Australia, and other jurisdictions have "fair dealing" (use of copyrighted material depends on both reason and transformation applied…ish). Other jurisdictions have neither, and only licensed uses are permitted.

Because the companies behind large models (diffusion, LLM, etc.) have consumed content created under non-US copyright laws and have presented it to people outside of US copyright law jurisdiction, they are likely liable for misapplication of fair dealing, even if the US ultimately deems what they have done as "fair use" (IMO this is unlikely because of the perfect reproduction problems that plague them all in different ways; there are likely to be the equivalent of trap streets that will make this clearly copyright violation on a large scale).

It's worth noting that while models like GitHub Copilot "freely" use MIT, BSD (except 0BSD), and Apache licensed software, they are likely violating the licenses every time a reasonable facsimile pops up, because of the requirement to include copies of the licensing terms for full or partial distribution or derivation.

It's almost as if wholesale copyright violations were the entire business model.


You're right. I'm definitely taking a very US-centric view here; it's the only copyright system I'm familiar with. I'm really curious how jurisdictions with no concept of fair use or fair dealing work. That seems like a legal nightmare. I expect you wouldn't even be able to critique a copyrighted work effectively, nor teach about it.

When you speak of the "perfect reproduction" problem, are you referring to cases where LLMs have spit out code which is recognizable from source training data? I agree that that's a problem, but I expect the solution is to have a wider range of training data to allow the LLM to better "learn" the structure of what it's being trained on. With more/broader training data, the resulting output should have less chance of reproducing exactly what it was trained on _and_ potentially introduce novel methods of solving a given problem. In the meantime, it would probably be smart to have some kind of test for recognizable reproduction, with matching answers thrown out, perhaps with a link to the source material in their place.

There's also a point, however, where the same code is likely to be reproduced regardless of training. Mathematical formulas and algorithms come to mind. If there's only one good solution to a problem, even humans are likely to come up with the same code without ever seeing each other's output. It seems like there's a grey area here which we need to find some way to account for. Granted, this is probably the exception rather than the rule.

> It's almost as if wholesale copyright violations were the entire business model.

If I had to guess, this is probably a case where businesses are pushing something out sooner than it should have been. I find it unlikely that any business is truly basing their model on something which is so obviously illegal. I'm fully willing to believe, however, that they're willing to ignore specific instances of unintentional copyright infringement until they're forced to do something about it. I'm no corporate apologist. I just don't want to see us throw this technology away because it has problems which still need solving.


I live in a fair dealing jurisdiction, and additional uses would need to be negotiated with the rights holders. (I believe that this is part of the justification behind the Canadian law on social media linking to news organizations.) It is worth noting that in addition to the presence or absence of fair dealing/fair use, there are also moral rights which must be considered (which is another place where LLM tech — especially the so-called summarization — likely falls afoul of the law: authors have the moral right to not be misrepresented and the LLM process of "summarization" may come to the opposite conclusion of what the author actually wrote).

Perfect reproductions apply not only to software, but to poetry, prose, and images. There is a reason why diffusion model providers are facing lawsuits over "in the style of <artist>", because some of the styles are very distinctive and include elements akin to trap streets on maps (this happens elsewhere — consider the lawsuit and eventual settlement over the tattoo image used in The Hangover 2).

With respect to "training it on more data", I do not believe you are correct — but I have no proof. The public statements made by the people who have done the training have suggested that they have done such training on extremely wide and deep sources that have been digitized, including a number of books and the wider Internet. The problem is that, on some subjects, there are very few source materials and some of those source materials have distinctive styles which would be reproduced when discussing those subjects.

I’m now more than thirty years into my career. Some algorithms will see similar code written by humans, but most code has some variability outside of those fairly narrow ranges. Twenty years ago, I derived the Diff::LCS library for Ruby from the same library for Perl, but I look back on the original code I ported from and I cannot recognize the algorithms (this is a problem for wanting to consider how to implement things differently). Someone else might have ported it differently and chosen different trade-offs than I did. Even simple things like the variable names chosen likely differ between two developers for similarly complex pieces of code implementing the same algorithm.

There is an art to programming — and if someone has a particular coding style (in Ruby, think Seattle style as distinct) which shows up in copilot output, then you have a possible source for the training.

Finally, I believe you are being naïve about businesses basing their model on "something which is so obviously illegal". Might I remind you of Uber (private car hires were illegal in most jurisdictions because it is something that requires licensing and insurance), AirBnB (private hotel-style rentals were illegal in most jurisdictions because it is something that requires licensing and insurance and specific tax filings), Napster (all your music are belong to no one, at least until the musicians and their labels got involved), etc. I firmly believe that every single commercial LLM available now — possibly with the exception of Apple's, because they have been chasing licensing — is based on wholesale intentional copyright violations. (Non-commercial LLMs may be legal under fair use and/or fair dealing provisions, which does not address issues for content created where neither fair use nor fair dealing apply.)

I am unwilling to give people like sama the benefit of the doubt; any copyright infringement was not only intentional, but brazen and challenging in nature.

I'm frankly looking forward to the upcoming AI winter, because none of these systems can deliver on their promises, and they can't even exist without misusing content created by other people.


> Purchasing licenses when you are already entitled to your current use of the work is just bad business, especially when the legal precedent hasn't been set to know what rights might need to exist in said license.

Your take on how all this works is probably more in line with reality than mine; it's just that my brain refuses to comprehend the willingness to take on that type of risk.

You're basically telling investors that your business may be violating all sorts of IP laws, and that you don't know and have taken no action to determine whether it is. It's just a gamble that this might work out, while taking billions in funding. There's apparently no risk assessment in VC funding.


> If Perplexity’s source code is downloaded from a public web site or other repository, and you take the time to understand the code and produce your own novel implementation, then yes.

Even that can be considered infringement and get you taken to court. It's one of the reasons reading leaked code is considered bad and you hear terms like cleanroom[0] when discussing reproductions of products.

[0]: https://en.wikipedia.org/wiki/Clean_room_design


It certainly can be, but it's not guaranteed. Clean room design is one way to avoid a legally ambiguous situation. It's not a hard requirement to avoid infringement. For example, the US Supreme Court ruled that Google's use of the Java APIs fell under fair use.

My point is: just because certain source material was used in the making of another work does not guarantee that it's infringing on the rights of that original IP.


Reading a book is not theft. Building a business on processing other people's copyrighted material to produce content is.


I think that's called a school


Main issues:

1) Schools primarily use public domain knowledge for education. It's rarely your personal blog posts being used mostly to learn how to write blog posts.

2) There's no attribution, no credit. Public academia is heavily based (at least theoretically) on acknowledging every single paper you built your thesis on.

3) There's no payment. In school (whatever level) somebody's usually paying somebody for having worked to create a set of educational materials.

Note: like the above, all very theoretical. There are huge amounts of corruption in academia and education. Of Vice/Virtue, who wants to watch the Virtue Squad solve crimes? What's sold in America? Working hard and doing your honest 9 to 5? Nah.


1) If your blog posts are private, why are they on publicly accessible websites? Why not put it behind a paywall of some sort?

2) How many novels have bibliographies? How many musicians cite their influences? Citing sources is all well and good in academic papers, but there’s a point at which it just becomes infeasible. The more transformative the work, the harder it is to cite inspiration.

3) What about libraries? Should they be licensing every book they have in their collections? Should the people who check the books out have to pay royalties to learn from them?


> 1) If your blog posts are private, why are they on publicly accessible websites? Why not put it behind a paywall of some sort?

If I grow apple trees in front of my house and you come and take all the apples and then turn up at my doorstep trying to sell me apple juice made from the apples you nicked, that doesn't mean you had the right to do it just because I chose not to build a tall fence around my apple trees. Public content is free to read for humans, not free for corporations to offer paid content generation services based on my public content, taken without me knowing or being asked for permission.

> 2) How many novels have bibliographies? How many musicians cite their influences? Citing sources is all well and good in academic papers, but there’s a point at which it just becomes infeasible. The more transformative the work, the harder it is to cite inspiration.

You are making this kind of argument: "How much is a drop of gas? Nothing. Right, could you fill my car drop by drop?"

If we have technology that can charge for producing bullshit on an industrial scale by recombining sampled works of others, we are perfectly capable of keeping track of the sources used for training and generative diarrhoea.

> 3) What about libraries? Should they be licensing every book they have in their collections? Should the people who check the books out have to pay royalties to learn from them?

Yes https://www.bl.uk/plr


All of these responses were such quality that there's really nothing to add. I especially like the apple argument, about a product in your front yard: you still have no basis to take them from my front yard.

If there was the equivalent of what a lot of other sites have (gems, gold, ribbons) I'd give you one. Got a lot of gems, I'll send you an admittedly teeny heliodore, tourmaline, or peridot at cost if you want one. Gemstone market's junk lately with the economy.


You're both just repeating the "you wouldn't download an apple" argument. In the context of the Internet, you're voluntarily sending the user an apple and expecting them to not do various things to it, which is unreasonable. Nothing is taken. If it were, your website would be completely empty.

Remember, Copying Is Not Theft. Copyright law is just a temporary monopoly meant to economically incentivize you. Nothing more.

BTW, pro-AI countries do differentiate between private and public posts. If it's public, it's legally fair game to train on it. If it's private, you need a license to access it. So it does matter. Also see: https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn


Schools use books that were paid for and library lending falls under PLR (in the UK), so authors of books used in schools do get compensated. Not a lot, but they are. AI companies are run by people who will loot your place when you're not looking and charge you for access to your own stuff. Fuck that lot.


> AI companies are run by people who will loot your place when you're not looking and charge you for access to your own stuff.

Funnily enough, they do understand that having your own product used to build a competing product is uncool; they just don't care unless it's happening to them.

https://openai.com/policies/terms-of-use/

> What you cannot do. You may not use our Services for any illegal, harmful, or abusive activity. For example [...] using Output to develop models that compete with OpenAI.


Schools pay for books, or use public domain materials


If you think going to school to get an education is the same thing as training an LLM then you are just so misguided. Normal people read books to gain an understanding of a concept, but do not retain the text verbatim in memory in perpetuity. This is not what training an LLM does.


LLMs don’t memorize everything they’re trained on verbatim, either. It’s all vectors behind the scenes, which is loosely analogous to how the human brain works. It’s all just strong or weak connections in the brain.

The output is what matters. If what the LLM creates isn’t transformative, or public domain, it’s infringement. The training doesn’t produce a work in itself.

Besides that, how much original creative work do you really believe is out there? Pretty much all art (and a lot of science) is based on prior work. There are true breakthroughs, of course, but they’re few and far between.


Some people do memorize verbatim. Most LLM knowledge is not memorized. Easy proof: source material is in one language, and you can query LLMs in tens to a hundred-plus languages. How can it be verbatim in a different language?


If you buy a copy of Harry Potter from the bookstore, does that come with the right to sell machine-translated versions of it for personal profit?

If so, how come even fanfiction authors who write every word themselves can't sell their work?


Doujinshi authors sell their work all the time.


These "some people" would not fall under the "normal people" that I specifically said. but you go right ahead and keep thinking they are normal so you can make caveats on an internet forum.


> Normal people read books to gain an understanding of a concept, but do not retain the text verbatim in memory in perpetuity.

LLMs wouldn't hallucinate so much if they did that, either.


I think this is tricky because, of course, this is okay most of the time. If I produce a search index, it's okay. If I produce summary statistics of a work (how many words starting with an H are in John Grisham novels?), that's okay. Producing an unofficial guide to the Star Wars universe is okay. "Processing" and "produce content" are, I think, too vague.


You should be able to judge whether something is a copyright violation based on the resulting work. If a work was produced with or without computer assistance, why would that change whether it infringes?


It helps. If it's at stake whether there is infringement or not, and it comes out that you were looking at a photograph of the protected work while working on yours (or any other type of "computer assistance"), do you think this would not make for a more clear-cut case?

That's why clean room reverse engineering and all of that even exists.


As a normative claim, this is interesting, perhaps this should be the rule.

As a descriptive claim, it isn't correct. Several lawsuits relating to sampling in hip-hop have hinged on whether the sounds in the recording were, in fact, sampled, or instead, recreated independently.


There were also cases that (very broadly speaking) claimed that songs were sufficiently similar to constitute a copyright infringement https://en.wikipedia.org/wiki/Pharrell_Williams_v._Bridgepor...

This is interesting from the legal point of view, because AI service providers like OpenAI give you "rights" to the output produced by their systems. E.g. see the "Content" section of https://openai.com/policies/eu-terms-of-use/

Given that output cannot be produced without input, and models have to be trained on something, one could claim the original IP owners could have a reasonable claim against people and entities who use their content without permission.


If the LLM is automatically equivalent to a human doing the same task, that means it's even worse: The companies are guilty of slavery. With children.

It also means reworking patent law, which holds that you can't just throw "with a computer" onto something otherwise un-patentable.

Clearly, there are other factors to consider, such as scope, intended purpose, outcome...


Computers are not people. Laws differ and consequences can be different based on the actor (like how minors are treated differently in courts). Just because a person can do it does not automatically mean those same rights transfer to arbitrary machines.


Corporations are people. Not saying that’s right. But is that not the law?


Corporations are legal persons, which are not the same as natural persons (AKA plain old human beings).

The law endows natural persons with many rights which cannot and do not apply to legal persons - corporations, governments, cooperatives, and the like can enter into contracts (but not marriage contracts), own property (which will not be protected by things like homestead laws and such), sue, and be sued. They cannot vote, claim disability exemptions, or have any rights to healthcare and the like, while natural persons do.

Legal persons are not treated and do not have to be treated like natural persons.


Is reading a book the same as photocopying it for sale?

Which of the scenarios above is more similar to using it to train a LLM?


If I was forced to pick, LLMs are closer to reading than to photocopying.

But, and these are important: 1) quantity has a quality all of its own, and 2) if a human were employed to answer questions on the web, and someone asked them to quote all of, e.g., Harry Potter, and this person did so, that would still be copyright infringement.


But you pay money to buy a book and read it.


Not if you check it out from the library


The library paid. Similarly, you can't go to a public library, photocopy entire books, then offer them for sale behind a subscription based chatbot.


> Not if you check it out from the library

...who paid money for the book on your behalf


Is it the same as a human reading a book?

We don't even give the same rights to other mammals, so why should we give them to software?


How is a human reading a book in any way related or comparable to a machine ingesting millions of books per day with the goal of stealing their content and replacing them?


Directly.

What if while reading you make notes - are you stealing content? If yes, should people then be forbidden from taking notes? How does writing down a note onto a piece of paper differ from writing it into your memory?


The nice thing about law as opposed to programming is that legal scholars have long realized it's impossible to cover every possible edge case in writing so judges exist to interpret the law

So they could easily decide logically unsound things that make pedants go nuts: taking notes, or even an AI system that automatically takes notes, could be obvious fair use, while recording the exact same strings to train an AI is not.


> The nice thing about law as opposed to programming

in programming that is called "Undefined behavior"


Because humans cannot reasonably memorize and recall thousands of articles and books in the same way, and because humans are entitled to certain rights and privileges that computer systems are not.

(If we are to argue the latter point then it would also raise interesting implications; are we denying freedom of expression to a LLM when we fine-tune it or stop its generation?)


it's comparable exactly in the way 0.001% can be compared to 10^100

humans learning is the old-school digital copying. computers simply do it much faster, but it's the same basic phenomenon

consider one teacher and one student. first there is one idea in one head but then the idea is in two heads.

now add book technology: the teacher writes the book once, a thousand students read it. the idea has gone from being in one head (the book author's) into the heads of most of the readers!


> humans learning is the old-school digital copying. computers simply do it much faster, but it's the same basic phenomenon

This is dangerous framing because it papers over the significant material differences between AI training and human learning and the outcomes they lead to.

We all have a collective interest in the well-being of humanity, and human learning is the engine of our prosperity. Each individual has agency, and learning allows them to conceive of new possibilities and form new connections with other humans. While primarily motivated by self interest, there is natural collective benefit that emerges since our individual power is limited, and cooperation is necessary to achieve our greatest works.

AI, on the other hand, is not a human with interests; it's an enormously powerful slave that serves those with the deep pockets to train it. It can siphon up and generate massive profits from remixing the entire history of human creativity and knowledge creation without giving anything back to society. Its novelty and scale make it hard for our legal and societal structures to grapple with—hence all the half-baked analogies—but the impact that it is having will change the social fabric as we know it. Mechanistic arguments about very narrow logical equivalence between human and AI training do nothing but support the development of an AI oligarchy that will surely emerge if human value is not factored into how we think about AI regulation.


you're reading what I say in the worst possible light

if anything, the parallel I draw between AI learning and humans learning is all the opposite of narrow and logical... in my intent, the analogy is loose and poetic, not mechanistic and exact.

AI are tools; if AI are enslaving, it's because there are human actors (I hope....) deciding to enslave other humans, not because of anything inherent to training (for AI; learning, for humans)

but what I really think is that there are collections of rules (people "just doing their jobs") all collectively but disjointedly deciding that it makes the most sense to utilize AI technology to enslave other humans because the data models indicate greater profit that way.


Your response is fair and I hope you didn't take my message personally. I agree with you, AI is just a tool same as countless others that can be used for good or evil.


> humans learning is the old-school digital copying. computers simply do it much faster, but it's the same basic phenomenon

Train an LLM on the state of human knowledge 100,000 years ago - language had yet to be invented and bleeding edge technology was 'poke them with the pointy side.' It's not going to be able to do or output much of anything, and it's going to be stuck in that state in perpetuity until somebody gives it something new to parrot. Yet somehow humans went from that exact starting state to putting a man on the Moon. Human intelligence, and elaborate auto-complete systems, are not the same thing, or even remotely close to the same thing.


> bleeding edge technology was 'poke them with the pointy side.'

Relevant: https://www.smbc-comics.com/comic/rise-of-the-machines


I hate to argue this side of the fence, but when AI companies are taking the work of writers and artists en masse (replacing creative livelihoods with a machine trained on the artists' stolen work) and achieving billion dollar valuations, that's actual stealing.

The key here is that creative content producers are being driven out of business through non consensual taking of their work.

Maybe it’s a new thing, but if it is, it’s worse than stealing.


Right, it's ironic that we spent 30 years fighting piracy, and then corporations start doing it and now it's suddenly OK.


For me, the irony is the opposite side of the same coin, 30 years of "information wants to be free" and "copyright infringement isn't piracy" and "if you don't want to be indexed, use robots.txt"…

…and then suddenly OpenAI are evil villains, and at least some of the people denounced them for copyright infringement are, in the same post, adamant that the solution is to force the model weights to become public domain.


I broadly agree with you, but I don't see what's contradictory about the solution of model weights becoming public domain.

When it comes to piracy, the people who have viewed it as ethical on the grounds that "information wants to be free" generally also drew the line at profiting from it: copying an MP3 and giving it to your friend or even a complete stranger is ethical; charging a fee for that (above and beyond what it costs you to make a copy) is not. From that perspective, what OpenAI is doing is evil not because they are infringing on everyone's copyright, but because they are profiting from it.


To me, it's like trying to "solve The Pirate Bay" by making all the stuff they share public domain.

But thank you for sharing your perspective, I appreciate that.


The deal of the internet has always been: send me what you want and I’ll render it however I want. This includes feeding it into AI bots now. I don’t love being on the same side as these “AI” snakeoil salesmen, but they are following the rules of the road.

Robots.txt is just a voluntary thing. We’re going to see more and more of the internet shut off by technical means instead, which is a bummer. But on the bright side it might kill off the ad based model. Silver linings and all that.


but information wants to be free

I say this given what I understand information to be

information is about knowledge, what use is knowledge that nobody can know? useless, hence it must be the case that information wants to be copied everywhere it can, freely; for that is the essence of being information, being known.

information wants to be famous


Evil villains to individuals, if what they were doing was actually open.

Then sure, but they're getting a pass because of capitalism, and the DMCA got that same pass.


Aereo, Napster, Grokster, Grooveshark, Megaupload, and TVEyes: they all thought the same thing. Where are they now?


Heh, you're right, of course, but as someone who came of age on the internet around that era, it still seems strange to me that people these days are making the arguments the RIAA did. They were the big bad guys in my day.


They were massacred by well-funded corps. Who is on the side of the average Joe?


You wouldn't train an LLM on a car.


I cannot imagine how viewing/scraping a public website could ever be illegal, wrong, immoral etc. I just don't see the argument for it.


It's scraping content in order to serve that content up to users, who can now get it from you (via a paid subscription service, or maybe ad-sponsored) instead of visiting the content creator and paying them (i.e., via ads on their website).

It's the same reason I can't just take NYT archives or the Britannica and sell an app that gives people access to their content through my app.

It totally undercuts content creators, in the same way that music piracy -- as beloved as it was, and yeah, I used Napster back in the day -- took revenue away from artists, as CD sales cratered. That gave birth to all-you-can-eat streaming, which does remunerate artists but nowhere near what they got with record sales.


One more point on this, lest some people think, "hey, Kanye or Taylor Swift don't need any more money!" I 100% agree. But the problem with streaming is that it disproportionately rewards the biggest artists at the expense of the smaller ones. It's the small artists, barely making a living from their craft, who were most hurt by the switch from albums to streaming, not those making millions.


As a musician, Spotify is the best thing to happen to musicians. Imagine trying to distribute your shit via burned CDs you made yourself. The entitlement of thinking "I have a garage band and Spotify isn't paying me enough" is fucking ridiculous. 99.99% of bands have never made it. The ability to easily distribute your music worldwide is crazy. If people don't like it, you're either bad at marketing, or, more likely, your music is average at best. It's a big world.


Read up on how Spotify remunerates artists.


I have multiple Spotify artists. I get it and think it's a fantastic service. Anyone complaining about it probably gets a couple dozen monthly plays because they don't know how to market, gig, and tour, or more likely their music sucks.


Spotify pays $400 a month for 100,000 streams[0] - about $0.004 per stream. And that may be split between the artist and a publisher if the artist went through one (probably not if they're small). So an artist has to be extremely popular to get any real money from streaming.

The way smaller artists make money is through live gigs (nothing wrong with that).

[0] https://soundcamps.com/spotify-royalties-calculator/


Serve it in a better way or wall it. The Internet is supposed to be free. If you don't want unauthorized eyes to see it, you have the ability to hide it behind logins.


Free to access != free to copy and redistribute for profit


This will further push websites to paywalls, making the internet less free.


AI hysteria has made everyone lose their minds over normal things.


I guess people just LOVE twisting themselves in knots over some "ethical scandals" or whatnot. Maybe there's a statement on American puritanism hiding somewhere here...


Can’t wait for OpenAI to settle with The New York Times. For a billion dollars no less.


Only reason OpenAI would do that would be to create a barrier for smaller entrants.


> Only reason OpenAI would do that would be to create a barrier for smaller entrants

Only? No. Not even main.

The main reason would be to halt discovery and avoid setting a precedent that would fuel not only further litigation but also, potentially, legislation.

That said, OpenAI should spin it as that master-of-the-universe take.


A billion dollar settlement is more than enough to fuel further litigation.


> billion dollar settlement is more than enough to fuel further litigation

The choice isn’t between a settlement and no settlement. It’s between settlement and fighting in court. Binding precedent and a public right increase the risks and costs to OpenAI, particularly if it looks like they’ll lose.


Right, but a billion dollars to a relative small fry in the publishing industry (even online-only) like the NY Times is chum in the water.

The next six publishers are going to be looking for $100B and probably have the funds for better lawyers.

At some point these are going to hit the courts, an NY Times probably makes sense as the plaintiff as opposed to one of the larger publishing houses.


> NY Times is chum in the water

The Times has a lauded litigation team. Their finances are good and their revenue sources diverse. They’re not aching to strike a deal.

> NY Times probably makes sense as the plaintiff as opposed to one of the larger publishing houses

Why? Especially if this goes to a jury.


I, on the other hand, hope NYT refuses a settlement and OpenAI loses in court.


Be careful what you wish for, because, depending on how broad the reasoning in such a decision would be, it is not impossible that the precedent would be used to then target ad blockers and similar software.


Fair point, but it's a risk I'd be willing to take.


Same, for sure!


Settling for a billion dollars would be insane. They'd immediately get sued by everyone who ever posted anything on the internet.


> piracy is not theft

it was when Napster was doing it; but there's no entity like the RIAA to stop the AI bots


[flagged]


Something not being stealing isn't the same as it not being able to hurt people or companies financially. Revenue lost due to copyright breach is not money stolen from you.

I pay my indie creators fairly; it's big companies where I stop caring.


What indie game dev shut down because of piracy?


> and thinking they can get away with it

Can they not? I think that remains to be seen.


Exactly. It's like when Uber started and flouted the medallion taxi system of many cities. People said, "These Uber people are idiots! They are going to get shut down! Don't they know the laws for taxis?" While a small number of cities did ban Uber (and even that generally only temporarily), in the end Uber basically won. I think a lot of people confuse what they want to happen with what will happen.


In London, Uber did not succeed. Uber drivers have to be licensed like minicab drivers.


Perhaps. But a reasonable license requiring you to pass a test isn't the same as a medallion in the traditional American taxi system. Medallions (often costing tens or even hundreds of thousands of dollars) were a way of artificially reducing the number of taxis (and thus raising the price).


This. The medallion system in NYC was gamed by a guy who let people literally bet on medallions as if they were an asset. The prices went to a million apiece until the bubble burst. True story.


Uber is widely used in London, so they succeeded.

If they had waited decades for the regulatory landscape to even out they would have failed.


They succeeded commercially, but they didn't succeed in changing the regulatory landscape. I'm not sure what you mean by waiting for it to even out. They refused to comply, so they were banned, so they complied.


Uber is banned in multiple countries and pulled out of many more because they were told to follow the law, and that makes their business unprofitable.


So? They have a market cap of $150 billion. If at the start they had decided "oh well let's not bother since what we are doing is legally ambiguous" they would have a market cap of $0.


And that's great, they are making a lot of money in markets where they are allowed to operate and comply with local laws.

I'm just interested in seeing if AI companies can do the same, if they are going to be required to pay licenses on their training data.


Americans are incredibly ignorant of how the world actually works because the American living memory only knows the peak of the empire from the inside.



