Those aren't proper audits. And again, bringing up the fact that it's open source is a meaningless piece of information, since there is no way to verify it's the same code running in production. It only serves to trick the average user who doesn't understand how web servers work into trusting your service more.

The best thing you could do, if you actually care about privacy and not just $$$, is to open-source the entire search index db and accompanying webserver software, making it easy for users to set up their own local instance of DDG which is actually auditable. Additionally, posting a notice on-site which informs your users that their searches may be recorded and tracked in spite of what the privacy policy says (due to the company's USA jurisdiction, which makes it susceptible to National Security Letters and secret gag orders) would be the right thing to do.



> open-source the entire search index db and accompanying webserver software, making it easy for users to set up their own local instance of DDG which is actually auditable

Easy to self-host? How large do you suppose the Bing index is, for example? Simply storing the index would be an immense undertaking beyond the reach of probably everyone who has ever self-hosted anything, ever. This ignores the compute required to actually search it, as well as how it would get updated.

I'm not sure your request is remotely reasonable.


I was curious, so as a point of comparison, the latest Common Crawl [0] is 3.1 billion pages and 370 TB uncompressed. I would presume that Bing would be significantly larger given commercial interests.

[0]: https://commoncrawl.org/connect/blog/


If somehow Google and AskJeeves worked perfectly fine 20 years ago for millions of monthly users, I find it hard to believe a modern powerful computer lacks the resources to support a search engine for a single person.


What is the largest hard disk one can buy nowadays? I found a WD Gold 20TB. You'd need 19 of them plugged into your computer just to hold the uncompressed archive from Common Crawl.
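
A quick back-of-envelope check of that drive count, taking the 370 TB uncompressed Common Crawl figure from upthread and 20 TB drives as the assumptions (a sketch only; it ignores filesystem and redundancy overhead):

    import math

    # How many 20 TB drives to hold ~370 TB of uncompressed crawl data.
    archive_tb = 370   # uncompressed Common Crawl size cited upthread
    drive_tb = 20      # e.g. a WD Gold 20TB

    drives_needed = math.ceil(archive_tb / drive_tb)
    print(drives_needed)  # -> 19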


Yet somehow search engines like Google and AskJeeves existed and worked alright 20+ years ago on hardware 1/1000th as powerful as it is today.


Firstly, Google was founded in 1998, which is 23 years ago.

Secondly, from 2000 to 2018 the internet went from having ~17,000,000 unique domains to having ~1,600,000,000 unique domains. See: https://www.internetlivestats.com/total-number-of-websites/

The performance of desktop computers has actually not increased as much as you would think: https://www.karlrupp.net/2015/06/40-years-of-microprocessor-...

Your assumption is correct if you look at supercomputers, where the fastest in the world in 1999 could produce ~2.3 TFLOPS and in 2018 the fastest could produce ~122 PFLOPS, which is roughly a 50,000x increase in FLOPS (122 PFLOPS / 2.3 TFLOPS ≈ 53,000).

But I doubt most of the people you would want combing through this index have access to a supercomputer.


I wouldn't be surprised if the indexed subset of Facebook alone were more than 1000x larger than all of the indexed web 20 years ago. The web in general has probably expanded many millions or hundreds of millions of times.


Personally I wouldn’t mind if trash/spam sites like Facebook/Twitter were omitted from the database, as well as non-English content, seeing as I only speak English. Remove trash/spam/non-English content from the db and that 370 TB will be cut down substantially, to the point where it is feasible for a single person to store. After all, even storing the whole 370 TB would only cost about $6,000 in hard drives, which is not as out-of-reach as some people here are making it seem.
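
Rough math behind that cost estimate, assuming the 370 TB figure from upthread and the ~$129-per-8TB drive price mentioned further down the thread (both assumptions; prices and drive sizes vary):

    import math

    archive_tb = 370        # uncompressed Common Crawl size cited upthread
    drive_tb = 8            # assumed consumer drive size
    drive_price_usd = 129   # assumed price per 8 TB drive (figure from downthread)

    drives = math.ceil(archive_tb / drive_tb)   # -> 47 drives
    total_usd = drives * drive_price_usd        # -> 6063
    print(drives, total_usd)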


I think the web didn't have the same number of websites 20 years ago...


That was a very different internet. Search engines aren't something you build once and then you just have them. Constant, extensive work is necessary. It's quite literally a global-scale task to do this effectively.


> Those aren't proper audits. And again, bringing up the fact that it's open source is a meaningless piece of information, since there is no way to verify it's the same code running in production.

> The best thing you could do, if you actually care about privacy and not just $$$, is to open-source the entire search index db and accompanying webserver software, making it easy for users to set up their own local instance of DDG which is truly auditable.

Self-hosting isn't feasible for 99% of the population. DDG is aiming to be the mainstream privacy-protecting search engine; I used them for a while and can appreciate their efforts. If you want something nerdy and self-hosted, use a searX instance or host it yourself.


>Self-hosting isn't feasible for 99% of the population

It's only this way because companies have a vested interest in keeping it like that. It's how they make their money. It is absolutely within the realm of possibility for people to host their own search engine. 99% of people know how to install Google Chrome, right? This should be no different. The entire search engine & webserver stack it depends on could be bundled into a .exe/.app installer with simple instructions people can understand. Consider XAMPP, which already provides a webserver stack that is extremely easy to install on Windows/Mac via a simple .exe/.app that 'just works'. This hypothetical search engine could use similar methods as the XAMPP installer. There is no technical reason why this can't happen. It just isn't happening because it'd increase competition, cutting into DDG's profits.
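
To be fair to that point, the software half really is small. Here is a minimal sketch of a local full-text index using SQLite's FTS5 extension; the table name, sample documents, and query are made up for illustration, and FTS5 support in the bundled sqlite3 module is assumed:

    import sqlite3

    # Minimal local full-text search using SQLite's FTS5 extension.
    conn = sqlite3.connect("local_index.db")
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, title, body)")

    # Toy documents standing in for crawled pages (illustrative only).
    docs = [
        ("https://example.org/privacy", "Privacy basics", "how search logging and tracking work"),
        ("https://example.org/selfhost", "Self-hosting 101", "running your own services at home"),
    ]
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", docs)
    conn.commit()

    # Query, best matches first via FTS5's built-in bm25() ranking.
    for url, title in conn.execute(
        "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
        ("tracking",),
    ):
        print(url, title)

The hard part, as the replies point out, is not this code but acquiring, storing, and refreshing the index data behind it.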


Sure, the problem with installing a local search engine is the installer technology. It can't be the petabyte of index information that the search engine actually needs, and the petaflops of CPU it would need to search through it.

Everyone has a PB of SSD disk space, a few TB of RAM and a few thousand CPUs to throw at the search problem, or is happy to type in a search query and give a 16 core CPU a few days to execute it, right?


> or is happy to type in a search query and give a 16 core CPU a few days to execute it, right?

That is just a naive implementation. For the first 10 results you grab ads, the database of those is significantly smaller; for the next 20 results you look at Wikipedia and stackexchange clone sites. Everything after that is indexed using math.random(). If you want to get fancy, run the query through a fact-creating AI and present the results inline; people are always happy to know that the color of the sky is purple or that the ideal number of chess players is 5. Disclaimer: I have never seen Google's source code nor any patents related to it, any similarity with existing search engines is pure coincidence.


I don’t know why you are framing this as an impossible task. It doesn’t need to be on the scale of Bing/Google to function. There are already some self-hosted search engine solutions that work okay. Just filter out all the trash sites with low-quality content like Facebook/Twitter from the database and that 370 TB Common Crawl could probably be cut down to a more reasonable 200 TB. Filter out non-English results and that probably halves it again. I’m seeing 8TB drives on Newegg for $129. It absolutely does not take anywhere on the order of “days” to query a properly optimized db of this size.
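
For what it's worth, the filtering step described here is straightforward to express. The sketch below assumes a made-up per-page JSONL manifest with url, language, and size fields (the real Common Crawl index has its own schema) and simply tallies what a domain blocklist plus an English-only filter would keep:

    import json
    from urllib.parse import urlsplit

    # Hypothetical manifest lines such as:
    #   {"url": "https://twitter.com/...", "language": "en", "size": 58000}
    BLOCKED_DOMAINS = {"facebook.com", "twitter.com"}

    def keep(record):
        host = urlsplit(record["url"]).hostname or ""
        domain = ".".join(host.split(".")[-2:])   # crude registrable-domain guess
        return domain not in BLOCKED_DOMAINS and record.get("language") == "en"

    kept_bytes = total_bytes = 0
    with open("pages.jsonl") as f:                # assumed manifest file
        for line in f:
            rec = json.loads(line)
            total_bytes += rec["size"]
            if keep(rec):
                kept_bytes += rec["size"]

    print(f"keeping {kept_bytes / max(total_bytes, 1):.0%} of {total_bytes / 1e12:.1f} TB")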



