More

pierrefar · on Aug 3, 2023

The major problem with Brave search is their position about indexing and licensing content against the wishes of the website publisher. Their robot does not identify itself, meaning the publisher cannot use the standard robots.txt to block its crawling if the publisher so wishes. Incidentally, the robots.txt file has been used in court cases litigating if a search engine is legal or not.

Even worse, they state that Brave search won't index a page only if other search engines are not allowed to index it. It is morally not their right to make that call. A publisher should have full control to discriminate which search engine indexes the website's content. That's the very heart of why the Robots Exclusion Protocol exists, and Brave is brazenly ignoring it.

Even worse than that, the Brave search API allows you (for an extra fee) to get the content with a "license" to use the content for AI training? Who allowed them the right to distribute the content that way?

I wrote about all this here:

https://searchengineland.com/crawlers-search-engines-generat...

and more references elsewhere in this thread:

https://news.ycombinator.com/item?id=36989129

Amusingly, while I was writing my article, this got posted to their forums, asking about how to block their crawler:

https://community.brave.com/t/stop-website-being-shown-in-br...

No reply so far.

yreg · on Aug 3, 2023

Hmm, I don't know, it doesn't seem obvious to me that it is unethical to disobey the publisher wishes.

If you post something to the open web, what's it to you who reads it and how? You can block some IPs but that's about it.

I don't know if Brave has a knowledge graph - if they do, I would understand objecting if they filled it in with “stolen” content. But I don't see what's the problem with search.

By the way, isn't everyone's favourite archive.is doing the same thing?

I have no strong opinion on this, curious to hear counter arguments.

CaptainFever · on Aug 3, 2023

I’m just thinking that if website publishers are able to legally allow Googlebot but block other bots, it might contribute to the Google monopoly.

pierrefar · on Aug 3, 2023

That would be bad, and it is already bad that Google and Microsoft control so much of search queries, but the decision about which search engine indexes a website is purely the publisher's.

onli · on Aug 3, 2023

A publisher publishes. Once something is published, once something is public, the control a publisher has over the published thing is limited. For example, a publisher can not choose who reads a book after it is sold, who reads an article after it got printed. It is not a given at all that a publisher should have any say about being indexed. The search engine relies on a public fact - X wrote Y. That's legal (limitations apply).

The position of Brave of not accepting to be blacklisted if not all search engines are blacklisted is a pragmatic one. It works against Google's search monopoly, but still gives them some legal coverage since the robots.txt is not completely ignored, in case that indeed matters somewhere. I think it's an elegant approach well suited for the current state of the web, one that serves the greater good. And Brave, but that's completely fine.

pierrefar · on Aug 4, 2023

A search engine index is an economic exchange between the website and the publisher.

To massively (over)simplify the argument to its essence (and ignore other important points): the publisher goes through the trouble and expense of creating the content The publisher then allows its content to be copied by a search engine only because being shown in search results gets it traffic back. The traffic it gets in return has value, and the publisher is happy for this arrangement to continue as long as the value of the traffic is more than the cost of producing and serving the content.

Brave offering a "license", for its own financial benefit, to "allow" others to use the content for LLM training gives zero benefit to the original publisher. This is why I use words like "sleazy" to describe Brave's position.

This argument applies to Google and Microsoft. Right now both are failing at citing sources in their generative AI search results. That is terrible and I hope it's fixed soon, as otherwise they're being sleazy scrapers as much as Brave is.

Finally, I wholeheartedly disagree they what Brave is doing is for the "greater good". The fact they charge extra for the "license" to use the content for LLM training shows that.

onli · on Aug 4, 2023

> A search engine index is an economic exchange between the website and the publisher.

A search engine index is a search engine index. It can have an economic impact, but it can not be an economic exchange, since it is a technical artifact.

Though I think I understand what you are trying to say - this is also a commercial relationship where both sides can profit. You are free to interpret the relationship between publishers and search engines with such a capitalist lense, that does not mean those mechanisms govern the actual rules. That a publisher is happy with what happens here is of no real concern. If any rules apply we are talking copyright, maybe media law, where happy is not a relevant category (ok, that of course can matter, it wouldn't here if the search engine uses a right).

I did not touch the LLM training data in my comment, as I did not read up on what Brave is really doing there. If Brave were really to sell complete texts from others, that would not be legal under copyright laws I'm familiar with, so I kinda doubt they do that.

GoblinSlayer · on Aug 4, 2023

It doesn't look like there's a restriction. The content is publicly available to anyone.

user_named · on Aug 4, 2023

You are completely wrong. A publisher controls the publication of a work, text or other, including it's duplication and licensing. Other companies cannot xerox a book and sell it. Clear?

memefrog · on Aug 4, 2023

Indexing has clear analogues from before the Internet. It is obviously not copyright infringement. Quoting a small snippet to give context for a search result also obviously isnt copyright infringement.

onli · on Aug 4, 2023

So much so that publishers created special law to make unpaid snippets illegal in the context of global search engines, specifically to target Google, outside of copyright. Happened in the EU, Canada (I think) and Australia.

onli · on Aug 4, 2023

Clear?!

I specifically gave the example of selling it not being covered by what I'm talking about. But other parties can definitely "xerox" a book and read it. Use its content. Talk about what is written in it. Quote it. There are limits to the controlling rights of a publisher.

user_named · on Aug 4, 2023

Xeroxing a book and selling it isn't legal, you realize that right.

onli · on Aug 4, 2023

> I specifically gave the example of selling it not being covered by what I'm talking about.

user_named · on Aug 4, 2023

Useless. This discussion is about braver selling the indexed content via their api.

onli · on Aug 4, 2023

No it's not.

cvalka · on Aug 3, 2023

No, it's not.

yreg · on Aug 3, 2023

But why do you believe so?

pierrefar · on Aug 4, 2023

I replied in detail elsewhere:

https://news.ycombinator.com/item?id=36993739

jaharios · on Aug 3, 2023

This make me want to use Brave search now. When I use a tool I expect it that it serves me, not the material it provides.

> A publisher should have full control to discriminate which search engine indexes the website's content

If you want someone to not see what you publish block him yourself. Also why would you want to do that? Do you want google to own the web or something?

pierrefar · on Aug 3, 2023

There is a a difference between a human being able to access content vs a search engine indexing it (and in the case of Brave, "licensing" it on).

I share your concern about Google having this much power, and I'd add that Microsoft Bing is equally bad but gets away with it because they're smaller. Still, the final decision about which search engine indexes a website is purely the publisher's.

memefrog · on Aug 4, 2023

There is a difference between Americans and Chinese people but that doesnt mean discriminating on racial lines is justified. Just saying "there is a difference" isnt an argument. Indexing a website isnt the same as reading it, but it is a form of consumption and I see no reason why they should be treated differently.

And to use that analogy even further, if you want to block Chinese visitors you block Chinese IPs. You do not add a file called "countries.txt" containing "China block" and then expect Chinese users to see it and voluntarily cease to use it, and threaten to sue them if they don't.

Repeatedly asserting that "the final decision is with the publisher" is stupid. That is the point you seem to want to defend. Defend it! Give us a reason. Just saying the same thing over and over again doesnt make it true.

Grimburger · on Aug 4, 2023

> There is a a difference between a human being able to access content vs a search engine indexing it

Much of the problem with search today arises from websites showing googlebot what it wants to see and showing real users. I have to manually remove entire domains from google search as they often appear 1st yet don't show any content without me signing up for an account. Clearly that's not what they are showing to google.

There should be no differentiation between a crawler and a human being with regards to what is being served.

waithuh · on Aug 4, 2023

If the website owner wants google to own the web, they should be able to restrict their website. Nothing wrong with that.

bastawhiz · on Aug 4, 2023

Let's say I pay for Kagi. Kagi is a tool that I'm using to avoid doing hard work manually. With relatively few exceptions, I can probably accomplish what I use a search engine for manually, but with much more time and effort. So I'm paying for a tool to assist me with my use of the web. A "user agent", you might even say.

It simply doesn't sound right to say which tool a user can use. It's literally the same as arguing that you should be able to block Firefox from accessing your website and it's Mozilla's fault that they don't respect your wishes as a webmaster to block Firefox exclusively. Or that a VPN doesn't publish its IP addresses so that you can block it. Or a screen reader that processes the text to speech in a way that you disagree with.

Philosophically it seems intuitive to say "I should be able to block a third party that is abusing my site" but it's ignoring the broader context of what "open web" and "net neutrality" actually mean.

I run a service for podcasters. There are podcast apps and directories that either ignorantly make unnecessary requests for content or have software bugs that cause redownloads. I could trivially block them, but I don't because doing so penalizes the end user who is ultimately innocent, rather than the badly behaved service operator. The better solution is primitives like rate limiting, which I use liberally. Plus, blocking anyone literally has a direct effect of incentivizing centralization on Apple, Spotify, etc. and making the state of open tech in podcasting even worse.

> the Brave search API allows you (for an extra fee) to get the content with a "license" to use the content for AI training? Who allowed them the right to distribute the content that way?

I don't think there's any court at this point that would back you up that freely published content annotated with full provenance cannot be scraped and published for a fee. Services like this have existed for decades. If you don't want your content scraped, put it behind a login. Especially considering this only applies when you allow other search engines and if you think Google and Bing aren't using your content to train AI, you're off your rocker.

sangnoir · on Aug 4, 2023

> With relatively few exceptions, I can probably accomplish what I use a search engine for manually, but with much more time and effort. So I'm paying for a tool to assist me with my use of the web. A "user agent", you might even say

1. User agents should identify themselves

2. A crawler is not a User agent - it's an agent for Brave

>I don't think there's any court at this point that would back you up that freely published content annotated with full provenance cannot be scraped and published for a fee.

You can't end-run copyright like this: just because something is publicly available doesn't mean anyone can redistribute it. Look at the legal issues & cases relating to Library Genesis.

bastawhiz · on Aug 4, 2023

> User agents should identify themselves

There is no rule that this is true, and many user agents exist _specifically to not be identified_. See Tor and other privacy-centric user agents.

> A crawler is not a User agent - it's an agent for Brave

You know, I thought "what does Wikipedia have to say on this matter?" and sure enough:

> Examples include all common web browsers, such as Google Chrome, Mozilla Firefox, and Safari, as well as some email readers, command-line utilities like cURL, and arguably headless services that power part of a larger application, such as a web crawler.

I can't even make that up.

> just because something is publicly available doesn't mean anyone can redistribute it

You're mistaking reselling content with providing access to it. By your logic, caching proxy servers would be illegal on the grounds of copyright. The physical act of downloading files necessarily creates copies of the data every step of the journey from the source server to you. There's a material difference between paying someone for a copy of some content and paying someone to fetch content for you on your behalf. Nothing about copyright law specifically requires the person physically acquiring the content is the one who ends up consuming it.

memefrog · on Aug 4, 2023

Downloading something isnt redistributing it. It is your website. You provide what is on it to me. I send you an HTTP request. You dont have to respond. You do. I am not copying anything. Copyright simply isnt engaged at any point in this process.

cvalka · on Aug 3, 2023

They are right and you are wrong. If some web page is publicly available, it should be indexed. Scraping neutrality, please.

waithuh · on Aug 4, 2023

Heavily disagree. I own the server, thus the website. I should be able to allow or disallow any type of web crawler/scraper i want. Similar to how you cant easily regulate whats in a website without lawsuits and takedowns, you cant regulate how discoverable a website is.

ric2b · on Aug 4, 2023

> I should be able to allow or disallow any type of web crawler/scraper i want.

You're certainly allowed to try, but I don't see why indexers should be mandated to collaborate with you. They serve their users, not you.

tympious · on Aug 5, 2023

Will their users appreciate that they disregard the intent of the authors of what they index?

I mean, "allow" or "regulate" don't _really_ apply here - there was never any enforcement regime around robots.txt, just a convention based on the general expectation that you don't claim ownership of whatever passes your line of sight.

tympious · on Aug 4, 2023

What if I want what I publish to be known only by word of mouth?

What if I consider (some or any of) my ideas to be un-indexable, not directly suitable to representation in any hierarchy other than those I may set them in?

Grimburger · on Aug 4, 2023

Then you should hide them behind a url that isn't linked elsewhere on your site that you can easily propagate by word of mouth only.

    example.com/correcthorsebatterystaple

If you consider "word of mouth" to be public posts on a forum which millions can read at any time then block googlebot IP's

jaystraw · on Aug 4, 2023

...after decades of what i considered friendship, here you are on hn talking about my horse battery staple

tympious · on Aug 4, 2023

Yes, sorry, it was a rhetorical question in response to previous.

Taking either step you suggest (along with robots.txt or eqiv.), it would seem fair to expect that Brave, Bing, whomever, would not feel it their neutral/natural domain to include in a public index.

memefrog · on Aug 4, 2023

Then dont publish it.

vGPU · on Aug 4, 2023

> The Robots Exclusion Protocol is a mechanism for publishers to discriminate between what users and crawlers are allowed to access, and discriminate between different crawlers (for example, allow Bingbot to crawl but not Googlebot).

To me as a search engine end user, this kind of behavior is undesirable. Why would I want a website to selectively degrade my experience because of my choice in search engine or browser?

Brings back horrible flashbacks of “this website is only compatible with IE6”.

1vuio0pswjnm7 · on Aug 3, 2023

Curious why cannot selectively block using IP address instead of user-agent string. According to HTTP specification, UA is not a required header. There is certainly no technical requirement for it in order to process HTTP requests. Of course, any website could block requests that lack a UA header. I never send one and it's relatively rare IME to see a site require it, but it's certainly possible.

pierrefar · on Aug 3, 2023

This is explained more in the article I referred to, but briefly: Brave delegates crawling to normal Brave browsers, so it's a huge IP addresses pool, not a single IP address or range.

Also, these search crawls by the browser do not identify themselves beyond the Brave standard UA header, namely a plain Chrome user-agent string.

1vuio0pswjnm7 · on Aug 4, 2023

According to Brave Privacy Policy, participation in the Web Discovery Project is "opt-in". How many Brave users have opted in to sending data to Brave.

How many Chrome users have opted in to sending data to Google.

Sometimes uninformed consent is not actually consent. These so-called "tech" companies love to toe that line.

pierrefar · on Feb 18, 2023

This article is way out of date and wrong, although published recently (at least according to the timestamp).

Google's official position was published on 8 February here:

https://developers.google.com/search/blog/2023/02/google-sea...

It's a much more nuanced position that can be summarized as "make sure you create good content, however you create it". A focus on quality, not process, is reasonable.

Disclosure: ex-Googler in search.

hirako2000 · on Feb 18, 2023

Thanks for the actual source, seems like ex's are often the one who care the most.

About the news article title, looks like it's trendy to make Google seemingly worried about chatGPT. and it's been a trend to claim google penalises its potential competitor via the means it can (search).

With google quality content challenges, it makes sense they would focus on quality, and if AI generated is what make search users click more that's what it will feed most, especially if that's promoted content, as it has always been.

pierrefar · on Nov 18, 2019

In simplified terms, did you find everything you could have possibly found? Looking at the formula in the article, it includes the false negatives, that is, items you misclassified as negatives when you should have considered them positives. And because that happened, you didn't find them in the set, that is you "forgot them". The opposite of forgetting is... recall.

Another place this idea comes up is a search engine index. If the algo doesn't find, for a given query, documents in the index it should have (falsely classified as not matching the query), it will have lower recall.

nodoodles · on Nov 18, 2019

Ah, that makes sense - thanks! Been using the word for a while but never figured out the linguistic logic, TIL, awesome.

pierrefar · on Nov 17, 2019

Looks like the stages of mitosis (cell division):

https://www.nature.com/scitable/topicpage/mitosis-and-cell-d...

The faint lines are the cell walls and the bright spots in the middle would be the DNA. I can believe this is what they're going for with a bit of squinting.

pierrefar · on Nov 3, 2019

Very neat. I built a virtually identical internal tool for Blockmetry. A couple of tips from experience:

1. Add other browse extensions, and you'll see a big difference between their effects. Defaults matter a lot in this space.

2. Compare mobile vs desktop. Getting mobile emulation to be good enough is a bit of work, but worth it IMO.

Based on internal usage, the typical web page will load 35-45% faster with uBlock Origin installed.

My email address is my profile if you want to compare notes or whatnot.

pierrefar · on July 9, 2019

No that's not a solution. It's the tracking that counts, not the cookies. I commented elsewhere on this thread more details:

https://news.ycombinator.com/item?id=20394661

lucb1e · on July 9, 2019

And as I said six years ago https://news.ycombinator.com/item?id=6233362 :p (not to diminish your comment!)

pierrefar · on July 9, 2019

Before anyone thinks this (and similar) approaches are a way around the GDPR's cookie consent tracking crackdown: It's not.

The GDPR talks about online identifiers, of which cookies, IP address and fingerprints are examples. If you read any regulator's guidance carefully, you'll see they talk about "cookies and similar technologies", with just "cookies" being used alone for brevity.

To rephrase tracking of any kind is the issue, not cookies. Don't mistake the implementation for the activity.

Disclosure: Founder of a non-tracking web analytics service because of this exact issue.

lucb1e · on July 9, 2019

All true. Small note: whereas cookies are easy to identify as tracking, etags have a legitimate purpose and you might not know that you're being tracked by this method. It would be illegal not to disclose it, so I doubt any self respecting company would do it, but also hard to detect.

pierrefar · on Jan 27, 2019

Congrats on the launch.

The privacy policy is very not suited for this service. The most important point is that you're based in Germany based on the address in the policy, but there isn't a single mention of the GDPR. That and the ePrivacy Directive are what count for you the most. My recommendation is don't use a free policy generator and get proper advice. I appreciate this isn't something commonly seen as a launch blocker, but it's important to sort it out properly.

Find your German state data protection authority, and invariably you'll find they have great guidance.

ArdentZeal · on Jan 27, 2019

Thanks - I have not looked too deep into that yet, basically just scratching the surface and back to coding, but I have it on MY chaoslist :)

pierrefar · on Jan 8, 2019

Yes, and also cookie IDs. Both are called out as examples in recital 30:

“Natural persons may be associated with online identifiers provided by their devices, applications, tools and protocols, such as internet protocol addresses, cookie identifiers or other identifiers such as radio frequency identification tags. This may leave traces which, in particular when combined with unique identifiers and other information received by the servers, may be used to create profiles of the natural persons and identify them.”

Source: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex%3A...

pitaj · on Jan 8, 2019

I should have been more clear that I meant IP addresses alone.

It seems like this only addresses when IP addresses are combined with other data.

kasey_junk · on Jan 8, 2019

That particular guidance has provided a lot of gnashing of teeth because some readings of it have an implicit “because” before the final sentence. That is IP addresses are personal data as sometimes they uniquely identify people.

The guidance my firm received was to treat them, by themselves, as an ID. YMMV.

zaarn · on Jan 9, 2019

IP Addresses are Personal Data.

I think the easy way to check is to ask yourself if the data can directly link to someone's IRL identity.

If no, ask yourself if the police could identify them if they demanded and got the data.

If still no, ask yourself if the data is of a protected category (gender, religion, sexuality, etc.).

If you need any of this data, minimize your need first (ie this means storing IPs only for a limited timespan, german authorities have IIRC recommended 7 days as normal).

If you can't reduce your need, find another way to do what you do that has less need.

If all else fails, cover under legitimate interest and hope you're not Adtech.

pierrefar · on Jan 9, 2019

Here is a write-up of the decision from EU’s highest court on this topic: https://www.whitecase.com/publications/alert/court-confirms-...

It’s easy to see why quote I gave says what it says with this context.

Also, if you’re worries, talk to your lawyer.

dronescanfly · on Jan 9, 2019

DSGVO concerns 'directly' identifying information (Name, SSN...)

aswell as 'indirectly' identifying information like IP adresses where the technical possibilities exist to link them back to the person.

EVEN if you do not actively link them to the person DSGVO treats them the same way as the directly identifying information

pierrefar · on Dec 28, 2018

To add to the key point about privacy, this research from Princeton is really illuminating and scary: https://freedom-to-tinker.com/2017/11/15/no-boundaries-exfil...

jagracey · on Dec 28, 2018

I might be alone on this one, but I feel the Freedom-to-Tinker report was unfair to the analytics providers. I know the folks in the industry work really hard towards privacy and security. They go out of their way to make it clear that not everything is automatically censored, and provide easy tools to limit data and visualize what is and isn't recorded. Holding PII and other sensitive data truly is a liability- nobody wants it.

Companies like Walgreens should be entirely to blame.

I really do appreciate how they author(s) in that report uncovered how those services where used in practice.

[I'm not with any party listed in the report]