That Cliqz is trying to actually build a new search stack is commendable. This is way more exciting to me than DuckDuckGo and other services that just package up Bing search results under different branding.
I'm skeptical that they'll be successful, but I wish them the best. They should market (and engineer) strongly on privacy since that's where Google is weak.
And the "four hundred sources" link links to 400 special case replies. They are probably useful, but fire rarely. It's basically Bing, and that page is a bunch of spin.
I played with Yahoo BOSS a lot as an undergrad, and I could tweak it to get "better" results than Yahoo Search for certain queries. That Bing and DDG have different results doesn't really say anything.
I don't see how such a deal could possibly work. Microsoft's entire history and culture have been devoted to squashing nascent competitors whenever possible, so why in the world would MS license Bing to DDG? If DDG can monetize that traffic more effectively than Bing by not tracking (which seems unlikely), then why doesn't MS just stop tracking on Bing? And if they can't, then how would they afford the license fee, which surely must be at least as high as the lost revenue from the diverted traffic?
I imagine few people using DDG would use Bing instead, so there is little income loss from people moving Bing -> DDG. If the choice is "provide DDG with a search API and get increased reach for your ad program through an audience that otherwise wouldn't touch Bing" vs. "don't provide DDG with a search API, and either a competitor does or DDG is far worse and everyone uses Google", why is the latter the better choice for MS?
Make Bing tracker-free and lose (assumed) income from less precise targeting of all the current Bing users, just to maybe capture part of DDG's market, large parts of which are going to be distrustful of Microsoft? That seems not obviously better than keeping Bing as-is and being involved with DDG.
I didn't make myself clear. I didn't mean to get rid of Bing. MS could keep the Bing brand exactly as it is, but also present the same product under a different brand that competes directly with DDG. In other words, MS could be DDG. Never in its history has MS tolerated the existence of a competitor if there was something they could do about it, which in this case they obviously could. Why start now?
They would not have to pretend. Marketing a separate brand with its own identity is a common practice. How many people who stay at a W hotel or a Ritz Carlton or a Sheraton are aware that these brands are actually owned by Marriott? How many people who stay at a Waldorf Astoria know that it's owned by Hilton? These are not secrets. 80% of the world's economy is controlled by fewer than 1000 companies [1].
And then a few days later see a HN top story that NewPrivateSearchEngine is a secret Microsoft conspiracy to destroy the world? I doubt it would go over well.
If you're going to indulge in that level of paranoia, how do you know that DDG is not itself a secret Microsoft conspiracy to destroy the world? If you don't know the terms under which DDG licenses Bing's search results, how do you know that those terms don't give MS complete control over DDG?
I think I was too snide in my comment and it wasn't clear.
I meant that if Microsoft launched a privacy-focused product and hid their involvement with it, they would receive extremely negative publicity on the sort of websites that people who use privacy-focused products read.
Completely leaving aside the harm that would cause to the Microsoft brand, it would also be completely useless. Essentially no one would switch from DDG to Microsoft's (hypothetical) shady clone.
Your hotel examples aren't relevant, because people searching for hotels and people searching for private search engines don't weigh corporate ownership the same way.
Yes. Do you believe that an HN discussion of a Microsoft-owned company that doesn't draw attention to that ownership, and which markets towards privacy-focused users, would be received positively?
Do you believe that an HN discussion has the potential to move the market share needle for any MS product? More to the point, do you believe that Microsoft thinks this?
If that product is aimed at privacy-focused users, yes.
We've gotten so far into unlikely hypotheticals I don't find this conversation interesting any more. I won't reply further in this chain. Have a nice day!
If I recall, in the very beginning they were massively using Yahoo and to some extent Bing. It might be that over the years Bing has become a more reliable source.
By selling ads solely against the query, and figuring out a way to track conversions against a generated unique referring URL instead of using cookies. Maybe it's less profitable, but it might still be profitable, and it gives you a foothold.
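A hypothetical sketch of that referral-URL idea (the token scheme and all names here are invented for illustration, not any engine's real implementation):

    import hashlib, hmac, secrets

    SECRET = b"server-side-secret"  # known only to the search engine

    def make_click_url(landing_url: str, ad_id: str) -> str:
        # Mint a one-time token at click time; no user identifier involved.
        nonce = secrets.token_hex(8)
        tag = hmac.new(SECRET, f"{ad_id}:{nonce}".encode(), hashlib.sha256).hexdigest()[:16]
        return f"{landing_url}?ref={ad_id}.{nonce}.{tag}"

    def record_conversion(ref: str) -> bool:
        # The advertiser pings back with the ref token on purchase; we verify
        # it and credit the ad without ever learning who the user was.
        ad_id, nonce, tag = ref.split(".")
        expected = hmac.new(SECRET, f"{ad_id}:{nonce}".encode(), hashlib.sha256).hexdigest()[:16]
        return hmac.compare_digest(tag, expected)

    url = make_click_url("https://shop.example/widgets", "ad42")
    print(record_conversion(url.split("ref=")[1]))  # True: the ad gets credit, the user stays anonymous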
Or a sponsored and an unsponsored section. The only thing you pay for is being higher up in the sponsored pane. No need for tracking, no conversion clicks tracked; just a higher position in the sponsored listing.
Wouldn’t that prevent people with less money from buying their way up to the top? In other words, the ones with the most money (not startups, mom&pop shops, etc) would be the only ones at the top.
If one didn't mind breaking even rather than making money, one could create a search engine that is free software and have groups/people host their own instances.
Marc from Cliqz here: it's something I've been thinking about a lot for years; it's a very clean model. We even ran some tests, but unfortunately the majority of people are not like Hacker News. What is a reasonable price point for you per month? How would you pay? (I.e., would you feel comfortable giving us your credit card and creating an account? I don't think crypto is big enough yet to solve this.)
Certainly it can be done. Cuil did it with about 30 people a decade ago. No business model, they ran out of funding, and no one wanted to acquire them, but they did get the crawler and search engine running and publicly available.
It might be worthwhile to do the search part in-house and outsource the question-answering functions. Wolfram Alpha and IBM Watson could be used for answering common questions.
Hi, I read the thread and thought the answer was good enough, but it seems that you are not yet convinced. Let me try:
1) Here is a list of publications on privacy by Cliqz (including published scientific papers). It should have been fairly easy to find using a search engine :-) https://0x65.dev/pages/dissemination-cliqz.html
Hopefully, the papers will convince you that Cliqz's privacy commitment is serious.
2) Feel free to monitor your own traffic to see whether or not we are tracking you.
3) Honestly, if someone tells you that anolysis means anonymous + analysis, why do you not believe it? It does not take long to find references to the name in the source code. On a separate note, as a company (Cliqz) that offers anti-tracking and ad-blocking, I can tell you that blocklists are a bit more sophisticated than that.
A quick skim of those PDFs found no mention of "anolysis." Your colleague claimed that papers were going to be released "soon" on it. It's been at least two years since you started using it, so why hasn't there been one yet?
> It does not take long to find references to the name in the source code.
No references in any of your published source code, and your search engine isn't free software:
> Honestly, if someone tells you that anolysis means anonymous + analysis, why do you not believe it?
Cliqz has done unsavory things in the past (like the Firefox fiasco a few years back, for example, which I can't fault Cliqz entirely for: Mozilla is just as guilty).
> On a separate note, as a company (Cliqz) that offers anti-tracking and ad-blocking, I can tell you that blocklists are a bit more sophisticated than that.
"anolysis" gets around both uBlock Origin and uMatrix, despite both of them automatically blacklisting any URL with "analytics" in it, as an example. Getting around the most popular content filterers on the internet is a pretty strong signal.
> Cliqz has done unsavory things in the past (like the Firefox fiasco a few years back, for example, which I can't fault Cliqz entirely for: Mozilla is just as guilty).
Not sure how this is Cliqz' fuck-up. We are not hiding anything. On the contrary, we are very transparent and detailed about how everything we do is designed to not track users. All of this is on our new tech blog: https://0x65.dev; feel free to have a look between two comments on HN and give us some feedback!
> "anolysis" gets around both uBlock Origin and uMatrix, despite both of them automatically blacklisting any URL with "analytics" in it, as an example. Getting around the most popular content filterers on the internet is a pretty strong signal.
It's not called "getting around it" when there is no tracking or ads going on. (If you want to see how smart the "most popular content filters" are, check out this link and see that the image is blocked because it contains the substring "analytics": https://whotracks.me/blog/private_analytics.html. Wicked smart!)
Anolysis is not a typo; it's a project name. People tend to do that when they care about and spend a lot of time on a project: give it a name. So, at the risk of repeating myself, Anolysis = Analysis + Anonymous (at the time we thought it was a pretty neat name!).
Anolysis does not operate outside of Cliqz products (there are no website analytics here, and we do not rely on a third party; we built it in-house for exactly this reason). We put a lot of work into making sure it does not use a unique ID (unlike virtually every other analytics system out there) and, by design, it cannot track any single user (in fact the system does not even have the concept of a user). Sure, we have not written extensively about it, but I guess we have to start somewhere (in December we are writing about 24 different things we do, and we will be sure to consider Anolysis as a good candidate for a future technical blog post).
What you attribute to malice is simply a lack of time. As you have probably noticed, Cliqz is working on a lot of very hard problems (search, browsers, anti-tracking, ad-blocking, privacy-preserving telemetry, and much more), and writing a paper about the new system you just designed and implemented is not always the priority :)
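For what it's worth, a generic sketch of what ID-free telemetry can look like (this is NOT Cliqz's actual Anolysis code, which isn't public; it only illustrates the "no user ID, messages can't be linked" idea, with a placeholder endpoint):

    import datetime, json, urllib.request

    ENDPOINT = "https://example.invalid/telemetry"  # placeholder, not a real URL

    def send_metric(name: str, value: int) -> None:
        # Each message stands alone: no user ID, no session ID, and only a
        # day-granularity timestamp, so the server cannot link messages
        # from the same person into a profile.
        msg = {"metric": name,
               "value": value,
               "date": datetime.date.today().isoformat()}
        req = urllib.request.Request(
            ENDPOINT, data=json.dumps(msg).encode(),
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)  # ideally routed via a trusted proxy or Tor

    send_metric("search_performed", 1)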
I'm one of those people who remain skeptical about the anonymity of the tracking Cliqz does in general. Obviously, people have a hard time believing any company that is in the advertising space and preaches about privacy, mainly because they have been burned several times before.
For me, it was the data proxying through FoxyProxy that made me uncomfortable. I also remain unconvinced by the stated motivation for not using Tor: that it is hard to integrate into extensions. Cliqz has its own browser, not just an extension, where they could opt to use Tor, yet they route through FoxyProxy.
You must find a way to make it technically impossible to identify users; a legal or business structure is not enough. It wouldn't be unheard of for a company to secretly own the proxy company and relink user data.
Well, we released our beta search for Tor yesterday: search4tor7txuze.onion/ (works, obviously, only in the Tor Browser). That's as good as it gets in terms of making it technically impossible, isn't it? It's more complex for a browser, obviously. But the answer cannot simply be "only no data at all is good", because that's a destructive approach that only favors the worst privacy intruders; no one would ever be able to build up competition.

We work hard to be as transparent as possible about what we do. Show me any other company building a big-data product (like search) that is this transparent: "yes, we collect data, but it's non-personal; here is how we do it, please scrutinize us". Are we perfect? Hell no! But we try, and we go a long way to let ourselves be challenged to improve. If we were shady, would we be this naked in front of you, showing each step we take? We would not even interact with the tech folks (especially not on Hacker News, where people really know what they're talking about), but would scam people who know less (at least that sounds like a more reasonable strategy to me if I wanted to fool people, which we don't).
We appreciate your openness about how you collect data, but that's still not enough, because literally every other advertising company is deceptive when it talks about privacy.
The openness must be paired with privacy that is guaranteed under all circumstances, and the most common way to achieve that would be to route the anonymized data through Tor.
Your search engine also being available on Tor has nothing to do with the data collection Cliqz does on other sites. The search engine website is not the avenue through which the Cliqz browser and extension collect data as you browse the web, so I'm confused why you would even bring it up.
It would be really interesting to know your concerns with FoxyProxy. FoxyProxy is legally bound not to log or share the IP.
From the HPN protocol's perspective, data can be routed via any trusted party - in our case it's FoxyProxy.
Right now there is no way to configure this in the browser, but it should be doable.
It's actually one of the motivations to move to the newer version of HPN[1].
We do agree that sending data through the Tor network is the gold standard for anonymity.
- We did a lot of work on getting Tor running in Cliqz browsers. It's a hard problem but definitely doable, something we might pursue again in the future[2].
- We also experimented with a WebAssembly version of the Tor client to make it compatible with web extensions[3].
Having the ability to use the Tor network in Cliqz products is also good, because we can actually leverage its anonymity guarantees by sending data via .onion services.
You can find more details in the evaluation section of the paper[4].
In case you wish to inspect the network traffic, check the debugging section[5].
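As an illustration only (not the HPN implementation), routing a payload through a local Tor SOCKS proxy from a client is straightforward; the endpoint below is a placeholder:

    # pip install requests[socks]; assumes a local Tor client on port 9050.
    import requests

    TOR_PROXY = {"https": "socks5h://127.0.0.1:9050"}  # socks5h: DNS resolves inside Tor

    resp = requests.post(
        "https://example.invalid/collect",   # placeholder endpoint
        json={"metric": "page_load_ms", "value": 120},
        proxies=TOR_PROXY,
        timeout=60,
    )
    print(resp.status_code)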
You say you're copying Google, and you use that to promote your product; it's very hard to believe you're privacy-friendly.
You're also owned by a media company, which makes it even harder to believe that you're going to respect users' privacy.
Add to that the arrogant tone of articles such as "the world needs Cliqz", and you can see why it's a no-no.
Every few days there is a post on HN trying hard to convince readers that Cliqz is the best, even though the articles suggest, between the lines, that the Cliqz team does not have the capability to build its own search algorithm or handle slightly more complicated queries.
I am experienced enough to know where this comes from: managers who do not know what they are doing, and engineers drunk on glory who do not see their own mistakes.
Please, Cliqz, hire a search-engine expert. Hire great engineers; they will cost twice as much, but you'll get a search engine that actually works.
Please, or the HN community will have to bash you every time you post an article.
The article describes techniques used by all search engines, not just Google. Search engines have existed before Google, and despite Google's monopoly on search, "Google clone" is a poor term to describe all search engines when alternatives with unique features (e.g. DuckDuckGo) exist.
All search engines have search-engine results pages.
They're looking at how users use Google Search because the data's there. They're making a competitor to Google Search. That doesn't mean they're rebuilding Google Search's SERPs, or making a Google Search “clone”; I've got results from Cliqz for queries I'm confident have never been put into Google before, meaning it's functioning as an independent search engine.
> They're looking at how users use Google Search because the data's there
This. Having worked in a past life for one of their competitors, I can confirm: what users click on (in the SERP) is one of the most powerful signals for ranking. And who gets (almost) all the clicks in the world? Google!
That's why it's so damn hard to beat them. It's the unreasonable effectiveness of data: more data (and they have almost all of it) usually beats a smarter algorithm, and with 20 years of R&D, theirs is surely not dumb.
Whether the clicks belong to the users or to Google is an interesting question, though.
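A toy illustration of why click data is such a strong ranking signal: even a naive smoothed click-through rate separates good results from SEO spam (all numbers and URLs invented):

    # (clicks, impressions) observed for (query, url) pairs.
    clicks = {("cheap flights", "kayak.example"): (900, 1000),
              ("cheap flights", "spam-seo.example"): (30, 1000)}

    def ctr_score(query, url, prior_clicks=1, prior_imps=20):
        # Bayesian smoothing so rarely-shown URLs aren't over-rewarded.
        c, n = clicks.get((query, url), (0, 0))
        return (c + prior_clicks) / (n + prior_imps)

    results = ["spam-seo.example", "kayak.example"]
    print(sorted(results, key=lambda u: ctr_score("cheap flights", u), reverse=True))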
It isn't clear to what extent Qwant actually makes their own results.
In 2013 they said they were temporarily using Bing results they purchased as "training data" [wikipedia].
They haven't given any updates on that that I could find (in English, at least), but they recently proudly announced that they have 20B pages in their index, in an article about their partnership with Microsoft [betterweb]. Google's index is 1500 times larger, so I'm not sure how competitive their own index is [goog]. And if they no longer needed to rely on Bing results, wouldn't they announce that?
And even if it does work for a while, there still needs to be an original signal to copy.
I'd say they need to start somewhere. Using other search engines' results is a way to get things going, so that they can build their own index on crawled content later.
For now, it's great to see another competitor to Google coming up.
Indeed, using the (query -> result) mappings from other search engines in their 'Human Web' logs is a roundabout way of copying both the simple word-to-doc index and the complicated ranking logic of those other search engines.
A sufficiently strict intellectual-property regime might find this a copyright violation. But without any inside info, I strongly suspect Google & other incumbents already do similarly-indirect modeling of their competitors' behavior, via extensive query/click-trail mining, in ways that ultimately feed into improvements of their own systems. So, they might not want to press the issue.
Still, this creates a dependency on the competitor you were hoping to displace, where most of your early value comes from "drafting" in the easy path they've already cleared.
Copying... "learning" would be a bit more precise, and I'm not kidding. We do not answer queries 1-to-1; query logs are used to build a more concise model of the page. It's not just a cache. We started like that, but we quickly learned to answer unseen queries.
A sufficiently strict IP regime might find text snippets a copyright violation too. It's a tricky area. Personally, I'm at peace with it, as we get the content of web pages the same way everyone else does.
As for the dependency: there would be one if we were not able to generate our own synthetic queries, which we are. So even if all the other search engines in the world were to disappear, we would still be able to operate. That was not always the case, as you pointed out.
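One plausible shape of "synthetic queries", purely as illustration (the thread doesn't describe Cliqz's actual generation method): derive candidate queries from a page's own text, such as n-grams of its title.

    def synthetic_queries(title: str, max_len: int = 3) -> set:
        # Hypothetical: turn a page title into candidate queries by
        # enumerating its 2- and 3-word n-grams.
        words = [w.lower().strip(",.!?") for w in title.split()]
        out = set()
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                out.add(" ".join(words[i:i + n]))
        return out

    print(synthetic_queries("Cheap Flights to Tokyo from Berlin"))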
> A sufficiently strict IP regime might find text snippets a copyright violation too. It's a tricky area. Personally, I'm at peace with it, as we get the content of web pages the same way everyone else does.
Interesting that you bring this up. Wasn't the company funding Cliqz in favor of text snippets being copyright violations when the big G does it? (German/EU Leistungsschutzrecht) [0]
I mean, I think bootstrapping from Google query logs is fine, but following the money, it does seem like a double standard. Cliqz "stealing" from Google SERPs is fine, but Google News "stealing" text snippets isn't?
(Relevant quote: "Das Leistungsschutzrecht halte man nach wie vor nicht für falsch. Man setze sogar weiterhin auf ein europäisches Leistungsschutzrecht, mit dessen Hilfe man sich erhofft, endlich Geld von Google zu erhalten."
Translation: They still don't consider the [we-want-money-for-Google-News-snippets law] to be the wrong approach; in fact, they continue to count on a European version, with whose help they hope to finally get money from Google.)
[Disclaimer: I work for Cliqz] This is 100% personal opinion; I am not a spokesman for either Cliqz or Burda.
Like all German media? Yes, it seems they were lobbying for it; it's not clear whether they are still part of the consortium. Why? No idea, really. Perhaps they are diversifying: lobbying with one hand, trying to build a competitor with the other. But to assess the "goodness" of their intentions, I prefer to stick to the facts: besides complaining to regulators (or not), they do fund a potential alternative. That's very commendable; I cannot name many companies that are crazy/adventurous enough to put a ton of money into something as risky as what Cliqz is trying to do. To sum up: if the lobbying is a minus, the building is a massive plus, a clear net positive IMHO.
Being successful might not mean being at the top of the pyramid. DDG is an example of a successful search engine.
Further, they don't have to beat Google on the quality of their search results. As long as the results are reasonably relevant, it's good competition.
What matters is that they are "good enough" and that there is a hint of competition to the Google-Bing duopoly. Offering privacy-centric "competition" is what they say their main aim is, and that is what their success should be judged on.
DDG absolutely is a search engine; it's a system (engine) that lets you search. From Wikipedia[0]:
> A web search engine or Internet search engine is a software system that is designed to carry out web search (Internet search), which means to search the World Wide Web in a systematic way for particular information specified in a textual web search query.
Just because they don’t do the crawling like Google and Bing doesn’t mean they aren’t a search engine.
This is a tautology. By that definition, anyone could put up their own HTML page with a search box, proxy the results from Google or Bing, and claim to have built a search engine ...
Google Search worked well enough for me when it launched 22 years ago, so the patents on that early version should be expired. I'd be happy to use a competitor's service if they reintroduced a search API and PageRank; just give me a programmatic interface that doesn't throw up captchas after a few searches. I'd pay a reasonable price for the compute time plus a profit margin.
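For reference, the early PageRank computation alluded to here fits in a few lines (toy graph; this is the common (1-d)/N normalization variant of the published formula, not Google's production ranking):

    def pagerank(links, d=0.85, iters=50):
        # links: {page: [pages it links to]}; toy version, no
        # dangling-node handling, and all link targets must be keys.
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iters):
            new = {p: (1 - d) / n for p in pages}
            for p, outs in links.items():
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
            rank = new
        return rank

    print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))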
If Google went back to the original algorithms it would return even worse results. Much of the noise can be attributed to two factors: (1) the explosive growth of the Internet itself, which was much much smaller and had more focused and authoritative content 20 years ago; and (2) aggressive SEO tactics today that would easily fool the early versions of PageRank.
The point of a search API is that it opens the problem up to a world of third-party developers, as opposed to Google putting up competitive defenses by restricting programmatic access. A few years ago I built a web directory that organized and filtered Alexa's top 1MM websites using PageRank, semantic clustering, and natural language processing. It worked really well and gave me a holistic picture of the internet that was quite different from how Yahoo presented it through their directory. However, building it was time-consuming because Google kept blocking the IP address I was using, so I'd have to keep driving to the next Starbucks up the street to resume my scripts.
Now compare the world of search to the world of mobile phones. Imagine if mobile phones only came with proprietary apps, and there were no app stores. That's where search is right now.
I would strongly disagree. I was at a dinner party in the mid-90s and we had a long-running discussion about how impossible it was to find anything using AltaVista and the other search engines. Yahoo's directory was even worse.
The best idea anyone had, which I thought wouldn't scale but didn't have better ideas, was implementing a keyword registry à la AOL.
I'd never heard of Cliqz before, but I just did a couple of test searches, and I'm honestly super impressed with the results. The result relevancy seemed closer to Google's than to DDG/Bing's.
Maybe so, but honestly, the name is putting me off more than anything. It just doesn't sound professional, and causes me to perceive it as shady, even if it's not. The name also makes it sound like it's more about marketing "clicks" to advertisers than providing good results. None of that is necessarily true, but it's the impression the name gives. It needs to re-brand.
It's the one time where I can honestly say: the discussions we're having within Cliqz about our brand name are even more heated and controversial than the ones here on Hacker News (and on Reddit, for that matter) ... but then there is this saying: "Every brand name is shit until you surpass one billion users - then it becomes brilliant". More seriously: we do think about it a lot - happy to get ideas.
Thanks for responding. Honestly, knowing that there are humans behind it who read HN lends it credibility. Glad to hear there may be a viable competitor to Google, and I'll check it out, mostly because the big G has started demanding reCAPTCHAs on normal searches from my VPN IP, which is clean and which I have owned for years. Some of the marketing I've seen mentions "AI-powered anti-tracking technology"; could you please elaborate on what that means?
This is the anti-tracking tech we include in the Cliqz browser and also in the Ghostery extension. We have another post lined up in this series about it (in a couple of weeks), but it's also been described in our blog posts[1] and our 2016 paper[2].
Well, there are people who believe this and others who believe that ... as I said, very strong opinions (like yours) on both sides of the aisle, even internally. Interesting proposal with the Nigerian Inc.: if we ever launch an email service or, better, a spam filter, I'll make sure to name the company Nigerian Princes Inc.!
In 2003, a friend told me she ignored Firefox because "it sounds like it was named by an eight-year-old boy". My response: who cares? Why does that matter in the slightest? Is "Internet Explorer" really better? Would you prefer a search engine called "Web Searcher Site"?
That may be true, but Google started before I had such name-based preconceptions. The trend of calling something Trackr, Suprnova, etc. didn't exist yet.
Now any service with a name like that seems shit compared to more readable/writable names: Stack Overflow, Quora, Datadog.
Case in point is srht.co, which rebranded to Sourcehut for similar reasons.
Replacing the 'ol' suffix in 'googol' brings 'google' much closer to common English phonetics. These could all be pronounced the same but would have been a harder sell:
Googol can help:
"Bezos instead named the business after the river reportedly for two reasons. One, to suggest scale (Amazon.com launched with the tagline "Earth's biggest book store") and two, back then website listings were often alphabetical."
I really just wish exact-match search still worked. But now words are all vectorized as every search engine tries to determine "my intent", resulting in a deluge of fuzzy matches.
Maybe I'm old school, but I don't want software that fixes my spelling mistakes. I want software that fails when I make a mistake.
Honestly, what would be interesting is an open-source database of crawled webpages, available for anyone to search or use with their own algorithms. That would make a lot of things possible, really.
I feel like the web parsing / indexing, perhaps rather than the search algorithm itself, is the hardest part of rolling a new search engine (largely due to the associated computing and storage costs).
There is https://commoncrawl.org/, but it would be really cool if there were a more well-lit path toward building the rest of a simple search engine. For example, another commenter wanted "like Google, but without the spelling correction"; well, spin up one of these and just stub out the spelling module :)
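For the curious: Common Crawl exposes a public URL index (the CDX API), which makes the "spin one up" idea less daunting. A minimal sketch (the crawl ID is just an example snapshot; pick a current one):

    import json, urllib.parse, urllib.request

    CRAWL = "CC-MAIN-2019-47"  # example snapshot
    params = urllib.parse.urlencode(
        {"url": "example.com/*", "output": "json", "limit": "5"})
    index_url = f"https://index.commoncrawl.org/{CRAWL}-index?{params}"

    with urllib.request.urlopen(index_url) as resp:
        for line in resp.read().decode().splitlines():
            rec = json.loads(line)
            # Each record points at a byte range inside a public WARC file.
            print(rec["url"], rec["filename"], rec["offset"], rec["length"])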
If you're listening, Cliqz: I (and I believe others) would pay $50-100 a month for a search engine that gave me the ability to blacklist sites. I think this alone could solve the biggest problem I have with Google, which is its reversion to the (apparently regressive) mean of today's internet.
That is a good idea. If we can do it in a privacy-preserving way, which seems doable, we will most likely do it (although we are constrained by the resources available).
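A minimal sketch of how the client-side half of such a blacklist could work, so the preference never has to leave the user's device (result format and domains are hypothetical):

    from urllib.parse import urlparse

    def filter_results(results, blacklist):
        # results: [{'url': ..., 'title': ...}]; blacklist: set of domains.
        def blocked(url):
            host = urlparse(url).hostname or ""
            return any(host == d or host.endswith("." + d) for d in blacklist)
        return [r for r in results if not blocked(r["url"])]

    print(filter_results(
        [{"url": "https://www.pinterest.com/x", "title": "..."},
         {"url": "https://example.org/y", "title": "..."}],
        {"pinterest.com"}))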
How do search engines stay out of trouble with things like copyright trolls and FBI-operated honeypots for child porn? If I had a private Web crawler, I'd be terrified to run it.
In our case, the "queries" are also the index-creation components. Every time someone discusses something, we index it, so you can search media, documents, and people by context. We hint at how this works here:
https://austingwalters.com/fast-full-text-search-in-postgres...
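(Purely as illustration: a rough sketch of the generic Postgres full-text setup that post builds on, using a tsvector column and GIN index on Postgres 12+; this is not their exact schema.)

    import psycopg2  # pip install psycopg2-binary

    conn = psycopg2.connect("dbname=search")  # hypothetical database
    cur = conn.cursor()
    # A generated tsvector column plus a GIN index to make matching fast.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS messages (
            id   serial PRIMARY KEY,
            body text,
            tsv  tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
        );
        CREATE INDEX IF NOT EXISTS messages_tsv_idx ON messages USING GIN (tsv);
    """)
    conn.commit()
    # Query via the index; websearch_to_tsquery parses free-form input.
    cur.execute(
        "SELECT id, body FROM messages"
        " WHERE tsv @@ websearch_to_tsquery('english', %s) LIMIT 10",
        ("full text search",))
    print(cur.fetchall())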
The downside of our approach is that it needs lots of conversation data. From their TL;DR version:
"""
- Our model of a web page is based on queries only. These queries could either be observed in the query logs or could be synthetic, i.e. we generate them. In other words, during the recall phase, we do not try to match query words directly with the content of the page. This is a crucial differentiating factor – it is the reason we are able to build a search engine with dramatically less resources in comparison to our competitors.
- Given a query, we first look for similar queries using a multitude of keyword and word vector based matching techniques.
- We pick the most similar queries and fetch the pages associated with them.
- At this point, we start considering the content of the page. We utilize it for feature extraction during ranking, filtering and dynamic snippet generation.
"""
It appears 0x65 has similarly figured this out: the name of the game is forming proper search queries. In their case, results should be good as soon as they start indexing and creating synthetic queries. IMO it might be better for documents and the like.
Either way, it's interesting to compare notes! Kudos on the work.
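To make the quoted recall phase concrete, here's a toy sketch under its stated assumptions: pages are modeled only by queries, and an incoming query is matched against known queries rather than page content (plain token overlap stands in for their keyword and word-vector techniques; all data invented):

    query_to_pages = {
        "best ramen berlin": ["ramen-blog.example/berlin"],
        "ramen recipe easy": ["cooking.example/ramen"],
        "berlin ramen shops open late": ["ramen-blog.example/berlin"],
    }

    def recall(query, k=2):
        # Rank known queries by Jaccard similarity to the incoming query.
        q = set(query.lower().split())
        scored = sorted(
            query_to_pages,
            key=lambda known: len(q & set(known.split())) / len(q | set(known.split())),
            reverse=True)
        pages = []
        for known in scored[:k]:
            pages.extend(query_to_pages[known])
        return list(dict.fromkeys(pages))  # dedupe, keep order

    print(recall("late night ramen in berlin"))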
If you're ever looking for something to write about for a new blog post, I would love to learn more about how you implemented that estimate_count function.
We've had at least 2 posts from Cliqz in the past few days. I have genuine issues with my short-term memory after having recovered from a coma, so I don't know whether this is a glitch where I keep seeing the same posts or they keep getting reposted.
This is a problem for HN because users here are not used to this sort of repetition—indeed, we moderate HN explicitly to dampen repetition, because the point of the site is curiosity and curiosity withers under it (https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...). The more these posts show up with what for HN is a crazy frequency, the more likely users here are to experience it as a barrage and start to complain.
On the other hand, these articles are well-crafted, contain a lot of information, and would normally be fine HN submissions. The topic of building a new search engine is intrinsically interesting. It also resonates with a lot of themes that get discussed a lot on HN (concerns about big tech and so on). So this is a different situation than the usual marketing onslaughts that HN gets subjected to, where the content is crappy, users flag it away, and moderators squash what users missed.
>> "Our model of a web page is based on queries only. ... This is a crucial differentiating factor – it is the reason we are able to build a search engine with dramatically less resources in comparison to our competitors."
>> "The total size of our index currently is around 50 TB."
Could you share your current index size (number of pages, size of raw text) to put those 50 TB into perspective, and to give an idea of how many fewer resources you need compared to your competitors? This would also help compare your approach to Elasticsearch, Solr, and Lucene.
But is Cliqz a search engine, or is it a web browser with its search bar set to a default search engine (or a Firefox extension that accomplishes the same thing)?
Cliqz bought Ghostery to acquire a pool of privacy-conscious users. The goal is to show them ads. Not sure how excited they will be about that.
If Cliqz really is a search engine, can a user submit a query to the database using her own choice of TCP/HTTP client? It looks like submitting a query requires first downloading and installing software from Cliqz.
Do you use Common Crawl? It's a pretty big corpus, reasonably up-to-date, and free. It seems like a good way to supplement the page data from the browser extension.
"It may seem like Common Crawl would suffice for this purpose, but it has poor coverage outside of the US and its update frequency is not realistic for use in a search engine."
But I doubt you really need it all in RAM for a small, localized, single-user search engine. In that case you can back most of the index with NVMe, which at the prices I saw on Black Friday is probably less than $10k using consumer-grade QLC flash drives combined with cheap x1 PCIe expansion boards, etc.
The 10 PB of disk is also quite reachable, given that it's possible to buy bulk 10 TB disks at $150 each.
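Back-of-envelope on those numbers (raw capacity only; no redundancy, chassis, or power included):

    drives = 10_000 // 10   # 10 PB / 10 TB per drive = 1000 drives
    print(drives * 150)     # ~$150,000 for the raw disk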
Bottom line: I did some of these calculations a couple of years ago because I was interested in a topic-based search engine that indexed only certain topics and basically tossed any crawler results that didn't appear to fit the subject matter.
So, while the web is a lot bigger than when Google started, storage and compute are also a lot cheaper. A web search engine specializing in, say, cooking recipes might be entirely doable on a fairly limited budget.
Can someone comment on how knowledge graphs are used for search? I have seen some applications in NLP, but I am curious how they tie in with traditional search.
AFAIK, in Google's case, instant-answer cards come from Google's knowledge graph, not from the search results. I.e., if you see some info rendered on top of the search results or off to the side, it's most likely from the knowledge graph.
You can use a knowledge graph for query expansion. For example, if the query is "Dan Dan Noodles", you can expand it to "Asian Noodles", "Chinese", "Tan Tan Noodles" to achieve higher recall in your results.
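A toy sketch of that expansion step (the graph is hand-made purely for illustration):

    # Walk from the query entity to related nodes and add their names
    # as extra query terms.
    graph = {
        "dan dan noodles": {"also_known_as": ["tan tan noodles"],
                            "is_a": ["asian noodles"],
                            "cuisine": ["chinese"]},
    }

    def expand(query):
        terms = [query]
        for neighbors in graph.get(query.lower(), {}).values():
            terms.extend(neighbors)
        return terms

    print(expand("Dan Dan Noodles"))
    # ['Dan Dan Noodles', 'tan tan noodles', 'asian noodles', 'chinese']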
Thank you for bringing this up. Although this is not relevant in the context of the blog post: we are on that list by mistake. We do NOT collect any personal data in our browser (more details e.g. here: https://0x65.dev/blog/2019-12-02/is-data-collection-evil.htm... and https://0x65.dev/blog/2019-12-03/human-web-collecting-data-i...), and we go a long way to make sure not even implicit identifiers get through. We believe we ended up on that list because of a bad Firefox experiment, and we will reach out to the maintainers and make our case.
Google's suggested autocorrect is one of its most impressive features; I'd say the relevance of the search results comes in a near second to it.
So make a competitive "suggested autocorrect" and then I think you'd have a stew going.
A data release is not possible, but if people want to come and run experiments on the data or try to test it for privacy, we are more than happy to host them. There is no formal process in any way; it's best effort, and we have done it several times in the past. If you are very interested, contact us and we will see if we can accommodate you.
(Disclaimer: I work at Cliqz.) Extending on that, let me elaborate on why we cannot open the data, not even a subset of it. We have had this discussion in the past, but for two reasons it is not an option.
Although it is anonymous data (currently we are not aware of any de-anonymization attacks), it is still data that came from real people. We have a responsibility: once the data is out, we have to guarantee that no one will ever be able to identify a single person in it. Take into account also that attackers can combine multiple data sets (background-knowledge attacks); that even includes data sets that will be published (or leaked) in the future.
You should never be too confident when it comes to security, nor should you underestimate the creativity of attackers. What we can do, and did in the past, is simulate the scenario in a controlled environment by hiring pen-testing companies. If they find an attack, they will not use that knowledge to harm the people behind the identities they could reveal.
That is the main reason. We don't want to end up in the situation AOL or Netflix did when they published their data. By the way, Netflix is an example of a background-knowledge attack, one where the attackers needed to combine data sources.
There is also another argument: skeptics will most likely remain skeptics, as we cannot prove that we did not filter the data before publishing. In other words, there is nothing for us to gain; we can only lose. Trust is important, but for building trust it is better to be transparent about the data that gets sent from the client. You can verify that part yourself and do not have to rely on trust alone. That is the core idea behind our privacy-by-design approach.
Those are the arguments I'm aware of for why we will not open the data. However, getting access in a controlled environment is possible. If you are doing security/privacy research, you can reach out to us. In my opinion, having more people try to find flaws in our heuristics is useful; it gives us a chance to fix them before they can be used for attacks.
One notable exception: https://whotracks.me is built from Human Web, and all its underlying data can be freely downloaded. We know it has already been used for research.