Hacker News

There's a reason why you can't do this commercially and why Google isn't doing it already... Pulling the meat of the content from a site like StackOverflow ends up as a copyright/anti-trust violation.

I'm fairly certain that this was the reason why Google had to tamp down its rich results that were made mostly from Wikipedia entries.

More recently it was shopping...

"Google argues that ‘rich results’ in Search provide more direct experience in antitrust suit response": https://9to5google.com/2020/12/17/google-search-antitrust-re...

"Google loses appeal, faces €2.4 billion shopping antitrust fine": https://arstechnica.com/gadgets/2021/11/google-loses-appeal-...



You're confused, copyright != antitrust violations.

Both sources you provide have zero mentions of the word 'copyright' in them.

Those lawsuits have to do with Google's dominating the search market and using that to their advantage in ways that are allegedly unfair.

Copyright law actually allows a service like Google to exist in the first place.


According to OP's model, for anyone wondering:

>Copyright infringement and antitrust violations are two distinct types of improper use. Plagiarism is an ethical violation that occurs when someone attempts to pass off someone else's work or ideas as their own, without properly giving credit to the original source. It is not against the law, but can have serious consequences such as failing grades, termination, and difficulty finding new employment. On the other hand, copyright infringement occurs when a party takes an action that implicates one or more of the rights listed above without authorization from the copyright owner or an applicable exception or limitation in the copyright law, such as fair use. The most common antitrust violations fall into two categories: agreements to restrain competition and efforts to acquire a monopoly.


Yes, thank you. I'm not confused, nor do I disagree with you. I simply cited the most recent, easily found examples where Google ran into trouble over rich results.

There are plenty of examples out there, including, as mentioned, ones prior to the recent shopping cases. Feel free to dig.


The possibility that an LLM could trigger a copyright violation strengthens the narrative that Google is harming smaller businesses, and thus can easily be used as a data point in an antitrust lawsuit.


Thinking more on this... I don't think any of these sites will live if they get big enough. And if enough of them pop up it'll draw tons of attention from content sites.

If you want to show that data you'll end up having to work out a license from StackOverflow. Possible, but far more difficult than the current ease of plug-and-play GPT drop-in.

Do we really think Google hasn't thought of this exact thing already?


Google is already working on LaMDA and Imagen for conversational search experiences, which is why these projects also wax poetic about "AI safety" -- you don't want to synthesize a politically incorrect or socially unacceptable response to a question asked.

Apart from the copyright issues that parent mentions, there's also the issue of LLMs spewing BS confidently, which is why Google has been hesitant to roll it out as their default.


Agreed.

This post sums up the other issues outside of copyright that these types of services are certain to run into…

1. DeepMind 2. Infrastructure 3. Trust 4. Freshness 5. Habit Breaking

https://www.maxinomics.com/blog/fade-the-chatgpt-hype


Stack Overflow user content is licensed under a Creative Commons license, so it's possible you actually could satisfy the license terms. That said, IANAL, and I have no idea if it's possible to fulfill the SA clause without distributing the model, or something like that.


The reason is likely simpler:

- It is expensive (~0.5c per generated answer)

- It is (currently) slow (2-3 seconds to result)

- It is hard to place ads inside direct answers (probably the most important)


Those problems don't seem insurmountable, especially if it is 10-100x better than Google.


If it's a good result I'm sure there are many people that would pay 1c per search. I've made 16 searches today, far less for stuff I didn't find with ddg. If I was after something specific I could charge my account with $5 and search away.
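As a rough back-of-the-envelope sketch of those numbers (the 1c price, $5 top-up and 16 searches a day are from this comment, the ~0.5c cost per answer is the estimate from the comment above, and every name in the code is hypothetical):

```python
# Back-of-the-envelope arithmetic using figures quoted in the thread.
# All names here are hypothetical; nothing reflects a real pricing API.
PRICE_CENTS_PER_SEARCH = 1    # "1c per search"
COST_CENTS_PER_ANSWER = 0.5   # "~0.5c per generated answer" (a commenter's estimate)
BUDGET_CENTS = 500            # "$5"
SEARCHES_PER_DAY = 16         # "I've made 16 searches today"


def searches_for_budget(budget_cents: int, price_cents: int) -> int:
    """Number of whole searches a prepaid budget covers."""
    return budget_cents // price_cents


total_searches = searches_for_budget(BUDGET_CENTS, PRICE_CENTS_PER_SEARCH)
days_covered = total_searches / SEARCHES_PER_DAY
# What would be left for the provider after generation cost, under these assumptions.
provider_margin_cents = total_searches * (PRICE_CENTS_PER_SEARCH - COST_CENTS_PER_ANSWER)

print(total_searches, days_covered, provider_margin_cents)
```

Under these assumptions a $5 top-up covers 500 searches, or about a month at 16 searches a day.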


I agree, but that is not how Google search (currently) operates.


Great opportunity for someone to disrupt

Costs are only going to come down.


What's funny is that most of these groundbreaking LLMs you see now are based on Google-published research about transformers, and they have better performing models in house than anything publicly available on the market.


Note that pulling the meat of the content from StackOverflow isn't copyright violation though, as long as you follow the license (which is Creative Commons something-something but probably fine for this particular application).


But it's citing the sources, how is it a copyright violation?


Citing does not confer a license


You don't need a license to cite others.


But what about when you're also reproducing the content on your own page like what's being done here?


It's tricky but you don't need a license for that either.

By tricky I mean that you would be infringing copyright laws only under very specific circumstances, like maybe if the content was private in the first place; but then in that case you wouldn't be infringing copyright either, you would just be breaking privacy laws/terms.

I honestly can't think of an example where you would get in trouble by citing a piece of content that belongs to someone else, but I'm not closed to the possibility that it could happen.


> only under very specific circumstances you would be infringing copyright laws

It's the reverse of this. Any work, public or not, is by default all rights reserved to the owner. The fair use doctrine provides an exception from this if you meet specific criteria.

An extreme example: you cannot just upload a complete movie and add "credit to Disney".

> you wouldn't be infringing copyright either, you would just be breaking privacy laws/terms.

Maybe we're from different countries, but under US law it would fall under theft, the Computer Fraud and Abuse Act, and/or copyright violations. There are no privacy laws applicable here unless we're talking about PII.

Extracting very specific examples from an article or blog is almost certainly going to fall under fair use. However, I've seen several cases where it essentially just returns an entire article as the answer, which would certainly be legally ambiguous.


Look up the fair use doctrine. You can reproduce parts of content.


One of the four factors is market impact, and this case would likely fail on it.

In the words of ChatGPT:

> When determining the potential impact on the market for the original work, courts will consider whether the use of the copyrighted material is likely to harm the market for the original work. This may include whether the use of the copyrighted material would compete with the original work, such as if it is used as a substitute for the original work or if it would reduce the demand for the original work.

As such, this is at least not clearly a fair use case (and arguably not a fair use case at all).


It's not, I disagree with GP's argument. Safe harbors in copyright law exist to allow this.


It quite likely fails the fair use test...

1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; 2. the nature of the copyrighted work; 3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and 4. the effect of the use upon the potential market for or value of the copyrighted work.

Particularly 3 & 4.

https://en.wikipedia.org/wiki/Fair_use#U.S._fair_use_factors


Safe harbor covers hosts of user-uploaded content. The copyright owner can still pursue the infringer, who is not protected by safe harbor.


Though now that I think about it more, it might be damaging the site's source of income (ads, etc.). But it's still not a copyright violation.


"it might be damaging the site's source of income"

Which is one of the key factors in determining fair use, and fair use falls under copyright.


It could also be bringing new money to those sites by referring users to them.

So, ¯\_(ツ)_/¯.

The issue with Google had more to do with antitrust behavior than with copyright infringement.


I might be bringing new money to movie publishers by pirating their movies and sending clips to my friends, but that's not a valid basis for calling it fair use


> Pulling the meat of the content from a site like StackOverflow ends up as a copyright/anti-trust violation.

Then how did ChatGPT do it?


Non-profit, right?



