People are conditioned to think Search has to work online.
This was true when Google was created.
No one had the processing or memory available on their desktop to search an entire index of the "useful" web.
Not anymore.
How large is a "useful" index of the web today? And can it fit on your laptop? The answer is yes.
Can the entire thing be queried fast? The answer is yes.
As an example, take the entire Stack Exchange and Wikipedia dumps in their entirety (including images). Compressed, it comes to the 50-60 GB range. Think about that number. That's a rough approximation of all known human knowledge.
It's not growing too fast. It has stabilized. To query the content you need an index.
So how large is an index to a 100 GB file? Generally around 1 GB. Let's say you use covering indexes with lots of metadata and up that to 5 GB to support sophisticated queries.
With today's average hardware you can search the entire thing in milliseconds.
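To make that concrete, here is a minimal sketch of what a local full-text index can look like, using SQLite's FTS5. The schema and the sample article are illustrative assumptions, not a description of any existing project:

```python
# Minimal sketch: a local full-text index over article dumps using SQLite FTS5.
# Assumes your Python's sqlite3 is built with FTS5 enabled (most recent builds are)
# and that you have some way of iterating (title, body) pairs out of a dump.
import sqlite3

conn = sqlite3.connect("local_index.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS articles USING fts5(title, body)")

def add_article(title: str, body: str) -> None:
    conn.execute("INSERT INTO articles (title, body) VALUES (?, ?)", (title, body))

def search(query: str, limit: int = 10):
    # bm25() ranks matches; lower scores mean better matches in FTS5.
    return conn.execute(
        "SELECT title, bm25(articles) AS score FROM articles "
        "WHERE articles MATCH ? ORDER BY score LIMIT ?",
        (query, limit),
    ).fetchall()

add_article("Children of Men", "A dystopian film set in a world of global infertility.")
conn.commit()
print(search("infertility film"))
```

The point is that the query side is trivial once the index exists on disk; there is no fleet of servers anywhere in the picture.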
So why aren't we building better local search?
Because everyone is conditioned to believe, thanks to Google's success, that we need to do it online. Which means baking the problem of handling millions of queries a second into the Search problem. Guess what? This is not a problem that local search has.
Every time a chimp or a duck needs to build a protein in its cells, it doesn't query a DNA index stored in the cloud. Instead, every cell has the index. Every cell has the processing power to query that index on the nanosecond time scale.
The cloud-based search story is temporary.
If you want to index every reference to Taylor Swift's ass that every teenager in Norway, Ecuador, and Cambodia is making, then yes, you need a Google-sized index. But for useful human knowledge, we are getting to the point where we don't need Google scale.
If you don't believe me, look at what is possible TODAY with Dash/Zeal docsets for offline developer documentation search, with Kiwix, or with Mathematica.
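As far as I understand the standard Dash/Zeal docset layout, a docset is just plain HTML pages plus an SQLite index (docSet.dsidx with a searchIndex table), so an offline lookup is a single query. The docset path below is a made-up example:

```python
# Sketch: searching a Dash/Zeal docset offline. Assumes the standard docset layout:
# an SQLite file with a searchIndex(name, type, path) table next to the HTML pages.
import sqlite3

DOCSET_INDEX = "Python_3.docset/Contents/Resources/docSet.dsidx"  # hypothetical path

def lookup(term: str, limit: int = 10):
    conn = sqlite3.connect(DOCSET_INDEX)
    rows = conn.execute(
        "SELECT name, type, path FROM searchIndex WHERE name LIKE ? LIMIT ?",
        (f"%{term}%", limit),
    ).fetchall()
    conn.close()
    return rows

for name, kind, path in lookup("urljoin"):
    print(f"{kind:12} {name} -> {path}")
```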
I don't think search is as simple as grepping a ton of files. Search Google for "movie where people can't have babies" and the #1 result is Children of Men. That's one area where Google pulls ahead of its competition: accurate results for vague queries like that, and I think you need a lot of collected search data to provide that.
For the record, anyone who hasn't seen Children of Men: do yourself a favour and find it. One of the greatest and most underrated films of the last decade.
Edit: Oops, turns out it's 11 years old. Better make that the last two decades then. I'm getting old.
The reason sites like Google and YouTube are good is not that they search the larger sites well, but that they search the long tail of the web well. Any search engine can index the top 5-10k sites and build something workable for that. The money maker is returning lesser-known sites because, despite not being popular, they contain the best answer for the search query. YouTube got popular not because it has the most popular videos on the web (all video sharing websites have the popular videos) but because it contained and stored an insane amount of videos with like 100 views and would return them if you searched for something oddly specific.
I don't think search has much to do with YouTube's success. Just having all those long-tail videos is the main thing, and it has those videos because it's popular.
I've heard that a key early advantage of YouTube was that uploaded videos appeared immediately, not after a long processing delay. That helped it become popular, and once it was popular, it stayed popular and crushed the competition due to the network effect.
You vastly underestimate the amount of knowledge in random blogs and small websites. If you are only after the "mainstream" sites, sure, but that's quite the bubble. I want and need access to all the blogs, forums, wikis.
Sometimes it feels like Google is increasingly useless at finding those niche websites, because all the noise has time to do search engine """optimization""".
I’m glad I’m not the only one to think that. I wonder what a good way to measure that signal to noise ratio is?
This feels like an application of the 80/20 rule. I might not always find what I need in just that offline index, but the times I do would seriously disrupt Google.
Oh, I wish there were a way to rate the results in the opposite way. Maybe it's just me, but sometimes I even want to mark a site as 'a hidden gem'. Usually it's something rather niche though, like a bunch of great little articles about anti-aliasing filter design and sampling, full of engineering wisdom from decades of experience. So I think it would be perfect to have a way to tag it by topic.
Google is pretty terrible at finding that stuff (as is DuckDuckGo). Instead, the first two or three pages of results are the SEO kings, often with identical copy. https://millionshort.com is a search engine that is entirely a reaction to that.
Why not download an index of the 100,000 or 1,000,000 sites that appear most often, weighted by position? If they don't reach a certain threshold of quality (maybe the query is too obscure), then resend the query online?
I don't have a CS background, so I'm not sure: is this a reasonable solution?
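Roughly what I mean, as a sketch; the local index and the online endpoint here are stand-ins, not real services:

```python
# Sketch of the local-first / online-fallback idea: answer from the local index,
# and only hit the network when the best local score is below a threshold.
# local_search() and ONLINE_ENDPOINT are hypothetical stand-ins.
from urllib.parse import quote

SCORE_THRESHOLD = 0.5
ONLINE_ENDPOINT = "https://example-search.invalid/api?q="  # hypothetical

def local_search(query: str):
    """Pretend local index: returns a list of (url, score) pairs."""
    return []  # stand-in; a real version would query the on-disk index

def search(query: str):
    results = local_search(query)
    if results and max(score for _, score in results) >= SCORE_THRESHOLD:
        return results, "local"
    # Query too obscure for the locally cached top sites: fall back online.
    return [(ONLINE_ENDPOINT + quote(query), None)], "online"

print(search("some very obscure query"))
```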
The obscure sites are what make the core of the internet, at least in my circles.
But having a way of making sure that a certain topic or genre of sites is well indexed locally would work. I.e. "my" top 100k. Oh, how awesome would that be.
Laptops generally come in 128 GB or 256 GB capacities. So we're talking roughly 20-50% of an average laptop's storage just to hold the search index for Wikipedia alone.
Meaning it is not feasible to store a useful index of the web today on a laptop, unless you want to dedicate that laptop to pretty much exclusively searching Wikipedia's English content.
You're not searching that database in milliseconds on average laptop hardware, either, but you probably could get it to be "fast enough" for practical usage if you could somehow solve the harder parts like storage size, freshness, and indexing more than just Wikipedia's English content.
That's also assuming Wikipedia is worth having an offline index for. Wikipedia has its value for non-controversial, general knowledge, but something gets lost with 1000 chefs in the kitchen. The beauty of the internet is that it isn't Encarta.
Your idea doesn't go far enough. What good are a couple of stale indexes of Wikipedia and Stack Overflow going to be?
Instead, I am wondering why there isn't a federated, open-source search engine. It would be a cloud of nodes, each node spidering and indexing a small hash bucket of URLs. With a million such nodes we could have a live-updated, distributed search engine to replace Google. We could run millions of queries without paying someone - we'd pay back by serving as a node, just like in BitTorrent. With all the interest in privacy, I'd like to see more discussion of replacing Google with an open, non-censored, and private protocol for search.
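To make the hash-bucket part concrete, here's a tiny sketch of how URLs might be assigned to nodes; the node count and hashing scheme are arbitrary choices, not a spec:

```python
# Sketch: assigning URLs to nodes in a hypothetical federated search cloud.
# Each node spiders and indexes only the URLs that hash into its bucket.
import hashlib

NUM_NODES = 1_000_000  # the "million such nodes" above

def node_for_url(url: str) -> int:
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

# A node with id my_node only crawls URLs where node_for_url(url) == my_node.
print(node_for_url("https://example.com/some/page"))
```

A real system would probably want consistent hashing instead of a plain modulo, so nodes can join and leave without reshuffling every URL.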
Who would host the index? You can't contact one million peers to run a search, so fanout-on-read doesn't work. If the index is also distributed (by search term?), then the search+indexing nodes will need to fanout-on-write, which is less latency-sensitive but still pretty onerous.
In the same vein, what good is an index of the web going to be? Google's value isn't its index, it's how they translate your search terms into good results from that index. Anyone can index the web; not everyone can make sense of it from human questions.
> So how large is an index to a 100 GB file? Generally around 1 GB. Let's say you use covering indexes with lots of metadata and up that to 5 GB to support sophisticated queries.
You vastly underestimate index sizes. I'm not saying it wouldn't be realistic to download and index locally, but the indexes will be much larger than 1-5%.
So when there’s a new blog post on a topic of interest, how does that work? The internet isn’t an encyclopedia of a finite size.
Offline search is the equivalent of the Internet Yellow Pages from back in the day. New, relevant information is being added continuously. The search index from 10 minutes ago is different from the one right now.
One aspect of this is trying to find content that has already been viewed. There are tools to archive web browsing locally for future re-use. Any improvements to local search would benefit this use case tremendously.
Is there a way to easily sync search repositories, like Stackoverflow, Wikipedia etc., to your local computer with an automatically built search index?
Stabilised != final though, so you'll still need a layer in order to find all of the new knowledge, compare it against what everyone has locally (you're not going to download 50 GB every time someone posts a new useful SO answer), and update the local store. Why not keep that index centrally for everyone instead of replicating it for 2bn people?
How would you keep the index up to date? How quickly would news stories appear in it and stale links disappear? Are you going to read the entire internet every day to keep it current?
Maybe you could pull down changes on a daily basis? Not a complete reindex, but just updates that prune dead links and add new ones.
The problem is this would be completely useless for current events. But for querying sites you use regularly, this could be an option. I often end a Google search with modifiers like "wikipedia" or "reddit" or "stackoverflow". If I could store indexes to those sites offline, that might be useful.
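Something like this, as a sketch against a local SQLite FTS index; the delta format is made up purely for illustration:

```python
# Sketch: applying a daily delta (new pages, dead links) to a local FTS5 index.
# The {"add": [...], "remove": [...]} delta format is invented for this example.
import json
import sqlite3

conn = sqlite3.connect("local_index.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, body)")

def apply_delta(delta_json: str) -> None:
    delta = json.loads(delta_json)
    for url in delta.get("remove", []):   # prune dead links
        conn.execute("DELETE FROM pages WHERE url = ?", (url,))
    for page in delta.get("add", []):     # add new ones
        conn.execute("INSERT INTO pages (url, body) VALUES (?, ?)",
                     (page["url"], page["body"]))
    conn.commit()

apply_delta('{"add": [{"url": "https://example.com/new", "body": "fresh content"}], '
            '"remove": ["https://example.com/dead"]}')
```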
The creator of this offline search could potentially make money by indexing shopping sites like Amazon, eBay, etc with affiliate links.
Or I could let someone do it for me; say Google or DDG. I don’t have time for such nonsense. I’m not going to waste time hosting my own version of Apple Music, when I could just use Apple Music.
If you struggle to get any of your indexing working, let me know and I can Google it, whilst you grep through your stale Stack Overflow archive.
Your reply seems pointless. Yes, I could download an entire copy of Wikipedia, compress it, and index it - but why would I want to? Are you going to do this on every machine you own? What about the updates Wikipedia receives every minute?
You should start a company though to focus on this, maybe you could call it Encarta or something?
The problem is that knowledge gets outdated quickly as things change; I usually have to specify 'only show results from the last month', or else the information is not relevant. Having to constantly download an index would be demanding, and to be honest, why bother? It's not like Google is slow.
Yes, we can build a local encyclopedia. But from a 'green' viewpoint, which is more environmentally friendly: more terabyte hard drives sold, or a few MB of data transferred over the internet?
So, practically speaking, how would one go about setting this up for local use? I imagine this would also work very well for locally stored books and docs.