For a bit of historic accuracy if it ever matters for future readers:
Kagi was founded in 2019, and we operated for years in private beta with thousands of users before the public beta release this June.
Goggles were not inspired by Kagi's Lenses, and I can confirm seeing the whitepaper before we shipped the Lens feature last year.
Kagi's Lenses were inspired by Blekko's "slashtags", which is probably the original "prior art" for this kind of feature.
It looks like we arrived at a similar idea with a different execution. Kagi's Lens feature is a simple way to create filters for the web that anyone can build in a few clicks, plus a bunch of powerful built-in lenses like "noncommercial" or "discussions" search.
We applaud all these innovations and are glad to see what is being done with Kagi Lenses and Brave Goggles. The Web ecosystem and users need innovations in search.
We are currently bringing this back (in Beta). RollYo innovated too (private Beta August 2005). Google Custom Search launched in October 2006. So there were at least 3 services that predate Blekko (2010).
I'm a paying Kagi user because doublequotes works, i.e. if I search for something using doublequotes around it, Kagi actually makes sure those words are in the page, and if it fails they create a bug report and fix it.
It seems to be better in every other way too, but that is actually the single reason why I pay for it.
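For anyone curious what "honoring doublequotes" amounts to mechanically, here is a minimal illustrative sketch (my own, not Kagi's implementation): extract the quoted phrases from a query and verify each one actually appears in a result page's text.

```python
import re

def honors_quotes(query: str, page_text: str) -> bool:
    """Return True if every double-quoted phrase in the query
    appears verbatim (case-insensitively) in the page text."""
    phrases = re.findall(r'"([^"]+)"', query)
    text = page_text.lower()
    return all(phrase.lower() in text for phrase in phrases)

# A result containing the quoted phrase passes the check...
assert honors_quotes('"exact phrase" ranking', "An exact phrase appears here.")
# ...while one missing it would be flagged (e.g. as a bug report).
assert not honors_quotes('"exact phrase" ranking', "Nothing relevant here.")
```

A real engine would of course do this at index time rather than on fetched pages, but the contract to the user is the same.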
That is what I corrected, assuming you meant we were announced to the public in 2022.
Kagi was announced to the public in 2019, long before the public beta release this June. I understand it is hard to track small, bootstrapped startups with no mainstream exposure, but as I said, this is for historic accuracy.
Filter lists can be hosted anywhere and imported with the @ syntax:
# Make these domains stand out in results
+en.wikipedia.org
+stackoverflow.com
+github.com
+api.rubyonrails.org
# SPAM - never show these results
experts-exchange.com
# Pull filters from external source
@https://clobapi.herokuapp.com/default-filters.txt
This default list is the only one I distribute, but users have come up with their own lists.
It would be nice to have a GitHub repo with such lists (or meta-lists: the @ syntax works recursively, allowing lists to import other lists).
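As a sketch of how that recursive resolution could work (the details here are my assumptions, not the actual implementation: the grammar is just the lines shown above, and `fetch` is injectable so remote lists can be stubbed out in tests):

```python
import urllib.request

def _default_fetch(url: str) -> str:
    # Fetch a remote filter list as text.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def resolve(text: str, fetch=_default_fetch, seen=None) -> list:
    """Expand @-imports recursively into a flat list of filter rules,
    skipping blank lines, comments, and already-visited URLs."""
    seen = set() if seen is None else seen
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("@"):
            url = line[1:]
            if url not in seen:  # guard against circular imports
                seen.add(url)
                rules.extend(resolve(fetch(url), fetch, seen))
        else:
            rules.append(line)  # plain +domain / domain rule
    return rules
```

The `seen` set is what makes meta-lists safe even when two lists import each other.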
Your suggestion of having a standard for the list syntax is interesting.
An active choice is better than a passive one, if only because it requires effort; in that respect, the explicitness is an advantage over typical personalization.
The article also mentions that Goggles will not stop polarization; it suffices not to exacerbate it.
No technology or system in any period of time has been able to suppress it, censorship included.
I think they were going for "Active choice is [considered] bad [by] people who think they know better and want to control others." At least, that's my read.
I'd like to correct some factually incorrect information regarding Brave Search.
Brave Search crawls the web through the Web Discovery Project and with its own crawler, which fetches a bit more than 100M pages daily.
Brave Search uses the Bing API and a Google fallback for about 8% of the results shown to users; the remaining 92% are served from our own index. When we launched almost a year ago, the share of results from third parties was 13%.
There is no need to say "multiple sources" when a number can be given. The underlying theme here is not whether DDG provides value on top of Bing; it does, and no one is questioning that. The question is whether DDG would be able to operate if Bing were to shut DDG down tomorrow.
If Bing and Google were to disappear tomorrow, for whatever reason, Brave Search would continue to operate; that is the independence Brave Search is building.
What factually incorrect information was posted? Maybe I missed it.
Yegg said "they all rely _somewhat_ on either Google's or Bing's web crawling" and you confirmed it by saying "Brave search uses Bing API and Google fallback for about 8%". So... which part is factually incorrect?
Edit: Misread the second part, removed that portion of my statement.
> The question is whether DDG would be able to operate if Bing were to shut DDG down tomorrow.
No, that doesn't appear to be the question at all. The original post appears to be an attempt to smear DDG by posting misleading information that you know will confuse users into thinking that their search engine sends PII to Microsoft when you know it doesn't. The original tweet doesn't appear to mention Bing shutting down at all. Here's the entirety of the tweet:
"This is shocking. DuckDuckGo has a search deal with Microsoft which prevents them from blocking MS trackers. And they can't talk about it!
This is why privacy products that are beholden to giant corporations can never deliver true privacy; the business model just doesn't work."
I see nothing in there questioning whether DuckDuckGo will still be around if Bing goes under. I also see nothing in yegg's response above that has anything to do with this irrelevant question you mention.
There are plenty of comments discussing the provenance of DDG results, including from Gabriel himself, in the thread we have both participated in:
"it is misleading to say our results just come from Bing."
Discussing how many sources one can bring together is a distraction from discussing the degree of dependency between DDG and Bing. More so when claiming that others suffer from the same, which is factually incorrect for Brave Search.
Mixing in Google results can only happen after opt-in, and only in the Brave browser. You can see whether a single query has been mixed by clicking on `Info`, or check the independence metrics on the `Settings` tab.
The fact that you see results similar to Google's for popular queries is a by-product of our ranking being trained on anonymous query logs. There are plenty of references to the methodology (https://0x65.dev/).
The fact that we are similar to Google on certain types of queries is good (at least from the perspective of human assessment). It is easy to find other types of queries for which we are not similar to Google. It would be rather stupid if we were to "use Google" on easy-to-solve queries but not on the complicated ones, don't you think? In any case, very nice article besides a couple of misconceptions (like this one); will bookmark.
Disclaimer: I work at Brave Search and used to work at Cliqz.
That makes a bit more sense; I just read the blog posts. I'm concerned about the effects of optimizing against Google (namely, the extremely similar results); I don't think I understand the point of an alternative if it tries to replicate a competitor to this degree. The whole idea I was going for in that article was a diversity of information sources: if one engine isn't giving the results you want, try another.
Right now, users who want Google results and privacy can use a Searx instance or Startpage.
You bring up a very good point on the diversity of information sources, which is something we plan to attack in the near future with open ranking [0].
In my opinion having similar results to Google will facilitate adoption. After all, Google is pretty good for many types of queries (not all), and people in general have strong habits.
The fact that we are similar while serving from our own index is great. It means we have the power to deviate from it when needed, as we mature and evolve.
Allow me to repurpose your statement on why not use Startpage if you want Google-like results: if tomorrow Google disappears (or for some reason becomes unusable), Brave Search will continue to operate as normal (similar to the old Google). What will happen to Searx or Startpage? What will happen to DDG or Swisscows if the provider turning bad is Microsoft? IMHO, no matter how much reranking or how many nice features you put on top, unless you control the search results themselves, diversity can only be superficial.
Sorry for the "rant". Thanks a lot for the input and for updating the doc; appreciate it.
It is trivial to de-anonymize if records are linkable, which is the case you mention in your Dark Data DEF CON 25 talk. Another famous case was the de-anonymization of the Netflix data set.
However, you are assuming that HumanWeb data collection is record-linkable, which is not the case, precisely to avoid this attack.
If what is being collected is linkable, e.g. (user_id, url_1), ..., (user_id, url_n), then no matter how you anonymize user_id, it will eventually leak. A single URL containing personally identifiable information, e.g. a username, will compromise the whole session, no matter how sophisticated the user_id generation is. The real problem, privacy-wise, is that records can be linked to the same origin: an attacker (or the collector) can tell whether two records have the same origin.
The anonymization of HumanWeb, however, ensures that linkability across data points is not present. Hence, an attacker cannot know whether two records come from the same origin. As a consequence, even if one URL gives away user data, for instance a username, it does not compromise all the URLs sent by that person.
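A toy sketch of the distinction (entirely illustrative, not HumanWeb's actual message format): with a shared identifier, one PII-bearing record exposes the whole session, while identifier-free records cannot be grouped by origin in the first place.

```python
import uuid

visited = [
    "https://news.example/article",
    "https://forum.example/u/alice",   # URL leaking a username
    "https://docs.example/page",
]

# Linkable collection: every record carries the same pseudonymous id,
# so the one PII-bearing URL deanonymizes all three records.
session_id = uuid.uuid4().hex
linkable = [(session_id, url) for url in visited]
assert len({sid for sid, _ in linkable}) == 1  # one origin, fully linkable

# Record-unlinkable collection: each URL is an independent message with
# no identifier, so the collector cannot tell which records share an origin.
unlinkable = [(url,) for url in visited]
assert all(len(record) == 1 for record in unlinkable)
```

The hard engineering (which the toy version above ignores entirely) is stripping everything else that could act as an implicit identifier: timing, IP addresses, rare URL combinations, and so on.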
I still see a lot of ways in which users could be de-anonymized: sometimes a single URL is already sufficient, and side channels like the quorum mechanism might leak information as well. Maybe it really is anonymous, but personally I don't trust any mechanism that doesn't have a statistical anonymity guarantee, differential privacy being the preferred one, as it is the only anonymity model that hasn't been broken yet.
Anyway, it's great that Cliqz did this work and I don't want to diminish it; I'm just very cautious when companies claim they're only collecting anonymous data. There have been too many cases in which such promises were broken.
Mozilla never did such a thing. The browsing history was never sent in any shape or form. As the journalistic article you quote states, Mozilla put in place HumanWeb [1,2,3], a privacy-preserving data-collection framework that ensured record unlinkability, hence no session or history. Anonymity was guaranteed, and the framework was extensively tested by privacy researchers from both Cliqz and Mozilla.
Disclaimer: I worked at Cliqz.
The chosen excerpt omits the fact that it is predicated on HumanWeb. The technical papers above give a more precise description of what was collected and how. There was no user tracking; no session or history was sent, as all data points are anonymous and record-unlinkable by the receiver. The vague language, required for a general-audience publication, certainly does not help.
There was no tracking in Cliqz, nor will there be any in Brave. To learn more about the underlying tech of Cliqz, there are interesting posts at https://0x65.dev, some of them covering how signals are collected: data, but no tracking.
I did work at Cliqz and now I work at Brave. I can tell you for a fact that all data was, is, and will be record-unlinkable. That means that no one, not me, not the government, not the ad department, can reconstruct a session of your activity. Again: there is no tracking, full anonymity. Brave would not do it any other way.
Brave buying Cliqz is the first corporate acquisition that's actually made me feel better about the acquirer, ever. I have no idea how to react to that. Keeping up the dev blog would probably make me start recommending Brave, where before I recommended against it.
Incidentally, do you know what's happening to the Cliqz browser?
You are missing the entire point, "search" is sadly about advertising, not about the search itself :-)
Bing is interested in serving DDG, Qwant, Ecosia, and a lot of other lesser-known search engines because of the aggregated reach they provide for its ad network. Ad networks only work if the aggregated audiences are massive; otherwise advertisers do not bother putting their ads there, and only the top 3-5 ad networks see any action. So Bing wants/needs a bigger audience just to stay in the game. They can grow along two paths: 1) increase Bing search reach (difficult), or 2) use partners with different value propositions.
Bing charges little per 1K queries: 1 USD officially, but it gets cheaper, down to zero :-) The real deal, though, is that if you display Bing ads, you get a 70%-90% rev-share of the ad revenue, which varies from country to country, something between $5 and $20 per 1K queries.
So DDG basically nets around $5 to $10 per 1K queries, and can spend all that money on distribution and marketing to get even more users. Bing gets the rest of the money and, more importantly, its ad network continues to be competitive.
Everybody wins, right? :-/
Search is so cheap ($1 per 1K queries), and people make 2-3 queries per day, so roughly $1/year/user on average. It makes no economic sense to build an alternative, unless of course you are building out of "ideals".
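A quick back-of-envelope check of those numbers (using midpoints of the rough figures quoted in this thread, not official pricing):

```python
queries_per_day = 2.5            # "2/3 queries per day", midpoint-ish
backend_cost_per_1k = 1.00       # ~$1 per 1K queries, the official rate
revshare_per_1k = (5 + 20) / 2   # $5-$20 ad rev-share per 1K, midpoint

yearly_queries = queries_per_day * 365        # ~912 queries per user per year
yearly_cost = yearly_queries / 1000 * backend_cost_per_1k
yearly_revshare = yearly_queries / 1000 * revshare_per_1k

# Serving a user costs on the order of $1/year, while the ad rev-share on
# the same volume is roughly $11/year, which is the margin described above.
```

So the economics favor reselling a big index with ads over building one: the per-user spread goes to distribution and marketing.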
Just for archive reasons. There are some interesting points worth addressing (IMHO). Of course I worked at Cliqz :-)
"The company only survived because the investor threw a lot of money at it." 100% correct, and that speaks greatly of the investor. They believe, as many others do, that Google is a monopoly that needs to be fought. But instead of (or on top of) bitching and moaning, lobbying, etc., they put good money where their mouth was. Kudos for that.
Privacy was never Cliqz's primary product. Privacy was a strict design requirement of Cliqz, which can be marketed more or less. For data collection and browsers alike, we wanted them to be private, because that's the right thing to do, even if it was more difficult to implement. The whole data-vs.-privacy argument is fallacious. One of the reasons privacy was so important to us is relevant precisely now: whoever ends up owning the data cannot learn anything about any of the users. Imagine the government getting Google's data if they go belly up or upon a "legal" request (replace Google with any other company). The data of Cliqz poses no risk to any user, including myself.
The primary product of Cliqz was search, either as the typical results page or as instant search integrated into the browser. That is very difficult and expensive to build, a cost that DuckDuckGo, Startpage, Qwant, etc. do not have to pay because they rely on the backends of others (not 100%, but mostly). If we had been repackaging Bing/Google/Yandex with different ranking twists, our quality would have been better from the beginning, of course. But that is not building an alternative to Google, which is what we wanted. Still, that is not a dig at DDG and the others; what they provide has value to users, of course. But they are not a real alternative: kind of like an electric car that gets its electricity from burning coal.
Brave is a great browser; respect to Brendan and the team. We both "fight" against Google. For Brave it's Chrome; for Cliqz it was both Chrome and Search. Too much to chew? Yes, but we had plenty of fun. The only thing I regret after 6+ years working there is the loss of such a great team.
Did Cliqz ever consider bootstrapping with Bing/Google/Yandex results? Supplement Cliqz results with those backends until Cliqz results got as good as you wanted them to be?
I'll always support privacy conscious search engines (I'm a DDG daily user), but Cliqz didn't really feel like an option to me because of quality degradation (and this is coming from a person who puts up with manually approving JS with uMatrix on each page I visit).
Yes, but once you have such a strong dependency it's difficult to remove it. Others have tried the approach and are still stuck with them.
Sorry to hear that the quality was not good for you; it varied from country to country (depending on the user base, basically). For Germany, quality was good enough, and QA analysis on stratified queries backed that up. That being said, perceived quality is not properly reflected in NDCG-like metrics: you do not remember the 9 queries it got right, but the one that was totally off.
In any case, DDG is good, and let me emphasize: they (and others) provide a lot of value to users, privacy-concerned or otherwise. But the underlying problem is not getting fixed unless, hopefully someday, they come up with an independent index (let's hope).
Brave is based on Chromium, whereas Cliqz is based on Firefox (just to be precise). Note that ownership of code is not the same as ownership of a service... if Brave depended on Google services, then you would be right (which is what happens with the [meta]search engines). But the code is open and can be forked at will (with some caveats to that claim: licences, internal APIs, etc.).
You can collect data from users and still not compromise their privacy; it's how you do it that matters, and it becomes a design requirement. Collecting a visited URL can lead to building a user history (a privacy hazard) or not. It's a design choice. The whole mantra that data != privacy is doing a lot of damage (for anyone curious, we published plenty of material on the topic: https://0x65.dev/blog/2019-12-02/is-data-collection-evil.htm...)
> Note that ownership of code is not the same of ownership of a service... if Brave is depending on Google services, then you would be right
Unless Brave is prepared (i.e. has the necessary staff) to be able to independently develop their Chromium base without any help from Google whatsoever, then they are dependent upon at least one Google service - specifically, Google's development of Chromium.
> The whole mantra that data!=privacy is doing a lot of damage
No. The whole mantra that "privacy is possible when hoarding data" is what is doing damage. Every byte of data you collect is a liability - a privacy and security compromise waiting to happen. Even assuming your intentions were good and pure (which, as you might guess, I take with a hydrostatically-equilibrious and neighborhood-clearing grain of salt), even locally-stored analytics/performance data is a rich target for less-than-benign actors, and it's information that more often than not has no business being collected.
That is:
> You can collect data from users and still do not compromise their privacy
This is definitionally false. The very collection of data compromises one's privacy, by nature of it having been collected. Sometimes that compromise is necessary, but nothing Cliqz did seemed particularly necessary.
>> You can collect data from users and still do not compromise their privacy
> This is definitionally false. The very collection of data compromises one's privacy, by nature of it having been collected.
That's not definitionally false; if it sounds false to you, it is because you have an implicit assumption that does not apply.
Data from users does not imply user sessions on the collector side (a session being a set of multiple data points belonging to the same user).
If sessions are collected, then privacy is impossible to guarantee. We are well aware of that, having worked on these problems for almost 20 years. But that is precisely what Cliqz never did. All messages from our users are record-unlinkable to us, meaning we have no way to reconstruct any session.
> That's not definitionally false, if it sounds false to you is because you have an implicit assumption that does not apply.
That "implicit assumption" is awareness of what "privacy" and "data collection" mean, and it very much applies (arguing otherwise is revisionist). Ergo: "definitionally false".
In particular:
> Data from users does not imply user sessions on the collector side
Yes it does, because otherwise collecting that data is pointless. Further:
> All messages from our users are record-unlinkable for us, meaning that we have no way to reconstruct any session.
Not if a malicious actor (which may or may not include a future or even current version of you) taps into the locally-stored tracking data. The very existence of that data and its collection thereof is a fundamental security and privacy risk. Just because you ain't currently siphoning it to remote servers doesn't mean malware can't do so, or that a "critical security update" can't reprogram the Cliqz browser/addon to do so.
That is: whether the aggregation happens client-side or server-side does not change the basic fact that the aggregation is happening, and that aggregated data remains a juicy target (and to make matters worse, even if you did want to safeguard that data, it's effectively outside your control). That very aggregation itself is therefore a violation of my privacy.
And this is all taking Cliqz' claims at face value. We could certainly dig further into how we're supposed to take your word that you are indeed discarding unique identifiers (including IP addresses). We could (and should) certainly do the same for other sites claiming to discard such identifiers, but given DuckDuckGo (for example) ain't in the business of peddling sleazy-looking adware¹ (to my knowledge at least), I'm at least slightly more inclined to take their word for it.
I'll give Cliqz credit for at least trying to address these issues in the hopes of finding a creative solution that gives advertisers what they want without egregious privacy violations, but - having read the papers before, and reading them again - I'm still pretty thoroughly unconvinced. I'd much rather not have tracking at all, like how newspaper and magazine ads work (barring some substantial leap in technology, newspapers and magazines never tracked my "engagement" with the ads within or how long my eyeballs were looking at them or how quickly I turned the page or what have you).
Goggles white-paper was released more than a year ago, long before Kagi was even announced to the public.
Additionally, before Brave acquired Tailcat (Jan 2021) I had the pleasure to share the draft of the paper with Kagi's founder.
So no, there is no prior art.
Let me add that I do not claim that Goggles is prior art of Lenses either.
One of the key features of Goggles design is that the instructions, rules and filters are open and URL accessible.
A Goggle is not so much a personal-preference configuration, but a way to collaboratively come up with shareable and extendable search re-rankers.
Very different goals if you ask me. Of course, Goggles can be used for personal preferences exclusively, but that's not the use case we had in mind.