People are conditioned to think Search has to work online.
This was true when Google was created.
No one had the processing or memory available on their desktop to search an entire index of the "useful" web.
Not anymore.
How large is a "useful" index of the web today? And can it fit on your laptop? The answer is yes.
Can the entire thing be queried fast? The answer is yes.
As an example, take the entire Stack Exchange and Wikipedia dumps in their entirety (including images). Compressed, it comes to the 50-60 GB range. Think about that number. That's a rough approximation of all known human knowledge.
It's not growing too fast. It has stabilized. To query the content you need an index.
So how large is an index to a 100 GB file? Generally around 1 GB. Let's say you use covering indexes with lots of metadata and up that to 5 GB to support sophisticated queries.
With today's average hardware you can search the entire thing in milliseconds.
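To make that concrete, here is a minimal sketch of what a local full-text index can look like, using SQLite's FTS5. The schema and the sample article are illustrative assumptions, not a description of any existing project:

```python
# Minimal sketch: a local full-text index over article dumps using SQLite FTS5.
# Assumes your Python's sqlite3 is built with FTS5 enabled (most recent builds are)
# and that you have some way of iterating (title, body) pairs out of a dump.
import sqlite3

conn = sqlite3.connect("local_index.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS articles USING fts5(title, body)")

def add_article(title: str, body: str) -> None:
    conn.execute("INSERT INTO articles (title, body) VALUES (?, ?)", (title, body))

def search(query: str, limit: int = 10):
    # bm25() ranks matches; lower scores mean better matches in FTS5.
    return conn.execute(
        "SELECT title, bm25(articles) AS score FROM articles "
        "WHERE articles MATCH ? ORDER BY score LIMIT ?",
        (query, limit),
    ).fetchall()

add_article("Children of Men", "A dystopian film set in a world of global infertility.")
conn.commit()
print(search("infertility film"))
```

The point is that the query side is trivial once the index exists on disk; there is no fleet of servers anywhere in the picture.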
So why aren't we building better local search?
Because everyone is conditioned to believe, thanks to Google's success, that we need to do it online. Which means baking the problem of handling millions of queries a second into the Search problem. Guess what? This is not a problem that local search has.
Every time a chimp or a duck needs to build a protein in its cells, it doesn't query a DNA index stored in the cloud. Instead, every cell has the index. Every cell has the processing power to query that index on the nanosecond time scale.
The cloud-based search story is temporary.
If you want to index every reference to Taylor Swift's ass that every teenager in Norway, Ecuador, and Cambodia is making, then yes, you need a Google-sized index. But for useful human knowledge, we are getting to the point where we don't need Google scale.
If you don't believe me, look at what is possible TODAY with Dash/Zeal docsets for offline developer documentation search, with Kiwix, or with Mathematica.
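As far as I understand the standard Dash/Zeal docset layout, a docset is just plain HTML pages plus an SQLite index (docSet.dsidx with a searchIndex table), so an offline lookup is a single query. The docset path below is a made-up example:

```python
# Sketch: searching a Dash/Zeal docset offline. Assumes the standard docset layout:
# an SQLite file with a searchIndex(name, type, path) table next to the HTML pages.
import sqlite3

DOCSET_INDEX = "Python_3.docset/Contents/Resources/docSet.dsidx"  # hypothetical path

def lookup(term: str, limit: int = 10):
    conn = sqlite3.connect(DOCSET_INDEX)
    rows = conn.execute(
        "SELECT name, type, path FROM searchIndex WHERE name LIKE ? LIMIT ?",
        (f"%{term}%", limit),
    ).fetchall()
    conn.close()
    return rows

for name, kind, path in lookup("urljoin"):
    print(f"{kind:12} {name} -> {path}")
```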
I don't think search is as simple as grepping a ton of files. Search Google for "movie where people can't have babies" and the #1 result is Children of Men. That's one area where Google pulls ahead of its competition: accurate results for vague queries like that, and I think you need a lot of collected search data to provide that.
For the record, anyone who hasn't seen Children of Men: do yourself a favour and find it. One of the greatest and most underrated films of the last decade.
Edit: Oops, turns out it's 11 years old. Better make that the last two decades then. I'm getting old.
The reason sites like Google and YouTube are good is not that they search the larger sites well, but that they search the long tail of the web well. Any search engine can index the top 5-10k sites and build something workable for that. The money maker is returning lesser-known sites because, despite not being popular, they contain the best answer for the search query. YouTube got popular not because it has the most popular videos on the web (all video sharing websites have the popular videos) but because it contained and stored an insane amount of videos with like 100 views and would return them if you searched for something oddly specific.
I don't think search has much to do with YouTube's success. Just having all those long-tail videos is the main thing, and it has those videos because it's popular.
I've heard that a key early advantage of YouTube was that uploaded videos appeared immediately, not after a long processing delay. That helped it become popular, and once it was popular, it stayed popular and crushed the competition due to the network effect.
You vastly underestimate the amount of knowledge in random blogs and small websites. If you are only after the "mainstream" sites, sure, but that's quite the bubble. I want and need access to all the blogs, forums, wikis.
Sometimes it feels like Google is increasingly useless at finding those niche websites, because all the noise has time to do search engine """optimization""".
I’m glad I’m not the only one to think that. I wonder what a good way to measure that signal to noise ratio is?
This feels like an application of the 80/20 rule. I might not always find what I need in just that offline index, but the times I do would seriously disrupt Google.
Oh, I wish there were a way to rate the results in the opposite way. Maybe it's just me, but sometimes I even want to mark a site as 'a hidden gem'. Usually it's something rather niche though, like a bunch of great little articles about anti-aliasing filter design and sampling, full of engineering wisdom from decades of experience. So I think it would be perfect to have a way to tag it by topic.
Google is pretty terrible at finding that stuff (as is DuckDuckGo). Instead, the first two or three pages of results are the SEO kings, often with identical copy. https://millionshort.com is a search engine that is entirely a reaction to that.
Why not download an index of the 100,000 or 1,000,000 sites that appear most often, weighted by position? If they don't reach a certain threshold of quality (maybe the query is too obscure), then resend the query online?
I don't have a CS background, so I'm not sure: is this a reasonable solution?
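Roughly what I mean, as a sketch; the local index and the online endpoint here are stand-ins, not real services:

```python
# Sketch of the local-first / online-fallback idea: answer from the local index,
# and only hit the network when the best local score is below a threshold.
# local_search() and ONLINE_ENDPOINT are hypothetical stand-ins.
from urllib.parse import quote

SCORE_THRESHOLD = 0.5
ONLINE_ENDPOINT = "https://example-search.invalid/api?q="  # hypothetical

def local_search(query: str):
    """Pretend local index: returns a list of (url, score) pairs."""
    return []  # stand-in; a real version would query the on-disk index

def search(query: str):
    results = local_search(query)
    if results and max(score for _, score in results) >= SCORE_THRESHOLD:
        return results, "local"
    # Query too obscure for the locally cached top sites: fall back online.
    return [(ONLINE_ENDPOINT + quote(query), None)], "online"

print(search("some very obscure query"))
```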
The obscure sites are what make the core of the internet, at least in my circles.
But having a way of making sure that a certain topic or genre of sites is well indexed locally would work. I.e. "my" top 100k. Oh, how awesome would that be.
Laptops generally come in 128 GB or 256 GB capacities. So we're talking roughly 20-50% of an average laptop's storage just to hold the search index for Wikipedia alone.
Meaning it is not feasible to store a useful index of the web today on a laptop, unless you want to dedicate that laptop to pretty much exclusively searching Wikipedia's English content.
You're not searching that database in milliseconds on average laptop hardware, either, but you probably could get it to be "fast enough" for practical usage if you could somehow solve the harder parts like storage size, freshness, and indexing more than just Wikipedia's English content.
That's also assuming Wikipedia is worth having an offline index for. Wikipedia has its value for non-controversial, general knowledge, but something gets lost with 1000 chefs in the kitchen. The beauty of the internet is that it isn't Encarta.
Your idea doesn't go far enough. What good are a couple of stale indexes of Wikipedia and Stack Overflow going to be?
Instead, I am wondering why there isn't a federated, open-source search engine. It would be a cloud of nodes, each node spidering and indexing a small hash bucket of URLs. With a million such nodes we could have a live-updated, distributed search engine to replace Google. We could run millions of queries without paying someone - we'd pay back by serving as a node, just like in BitTorrent. With all the interest in privacy, I'd like to see more discussion of replacing Google with an open, non-censored, and private protocol for search.
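To make the hash-bucket part concrete, here's a tiny sketch of how URLs might be assigned to nodes; the node count and hashing scheme are arbitrary choices, not a spec:

```python
# Sketch: assigning URLs to nodes in a hypothetical federated search cloud.
# Each node spiders and indexes only the URLs that hash into its bucket.
import hashlib

NUM_NODES = 1_000_000  # the "million such nodes" above

def node_for_url(url: str) -> int:
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

# A node with id my_node only crawls URLs where node_for_url(url) == my_node.
print(node_for_url("https://example.com/some/page"))
```

A real system would probably want consistent hashing instead of a plain modulo, so nodes can join and leave without reshuffling every URL.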
Who would host the index? You can't contact one million peers to run a search, so fanout-on-read doesn't work. If the index is also distributed (by search term?), then the search+indexing nodes will need to fanout-on-write, which is less latency-sensitive but still pretty onerous.
In the same vein, what good is an index of the web going to be? Google's value isn't its index, it's how they translate your search terms into good results from that index. Anyone can index the web; not everyone can make sense of it from human questions.
> So how large is an index to a 100 GB file? Generally around 1 GB. Let's say you use covering indexes with lots of metadata and up that to 5 GB to support sophisticated queries.
You vastly underestimate index sizes. I'm not saying it wouldn't be realistic to download and index locally, but the indexes will be much larger than 1-5%.
So when there’s a new blog post on a topic of interest, how does that work? The internet isn’t an encyclopedia of a finite size.
Offline search is the equivalent of the Internet Yellow Pages from back in the day. New, relevant information is being added continuously. The search index from 10 minutes ago is different from the one right now.
One aspect of this is trying to find content that has already been viewed. There are tools to archive web browsing locally for future re-use. Any improvements to local search would benefit this use case tremendously.
Is there a way to easily sync search repositories, like Stackoverflow, Wikipedia etc., to your local computer with an automatically built search index?
Stabilised != final though, so you'll still need a layer in order to find all of the new knowledge, compare it against what everyone has locally (you're not going to download 50 GB every time someone posts a new useful SO answer), and update the local store. Why not keep that index centrally for everyone instead of replicating it for 2bn people?
How would you keep the index up to date? How quickly would news stories appear in it and stale links disappear? Are you going to read the entire internet every day to keep it current?
Maybe you could pull down changes on a daily basis? Not a complete reindex, but just updates that prune dead links and add new ones.
The problem is this would be completely useless for current events. But for querying sites you use regularly, this could be an option. I often end a Google search with modifiers like "wikipedia" or "reddit" or "stackoverflow". If I could store indexes to those sites offline, that might be useful.
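Something like this, as a sketch against a local SQLite FTS index; the delta format is made up purely for illustration:

```python
# Sketch: applying a daily delta (new pages, dead links) to a local FTS5 index.
# The {"add": [...], "remove": [...]} delta format is invented for this example.
import json
import sqlite3

conn = sqlite3.connect("local_index.db")
conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, body)")

def apply_delta(delta_json: str) -> None:
    delta = json.loads(delta_json)
    for url in delta.get("remove", []):   # prune dead links
        conn.execute("DELETE FROM pages WHERE url = ?", (url,))
    for page in delta.get("add", []):     # add new ones
        conn.execute("INSERT INTO pages (url, body) VALUES (?, ?)",
                     (page["url"], page["body"]))
    conn.commit()

apply_delta('{"add": [{"url": "https://example.com/new", "body": "fresh content"}], '
            '"remove": ["https://example.com/dead"]}')
```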
The creator of this offline search could potentially make money by indexing shopping sites like Amazon, eBay, etc with affiliate links.
Or I could let someone do it for me; say Google or DDG. I don’t have time for such nonsense. I’m not going to waste time hosting my own version of Apple Music, when I could just use Apple Music.
If you struggle to get any of your indexing working, let me know and I can Google it, whilst you grep through your stale Stack Overflow archive.
Your reply seems pointless. Yes, I could download an entire copy of Wikipedia, compress it, and index it - but why would I want to? Are you going to do this on every machine you own? What about the updates Wikipedia receives every minute?
You should start a company though to focus on this, maybe you could call it Encarta or something?
The problem is that knowledge gets outdated quickly as things change; I usually have to specify 'only show results from the last month', or else the information is not relevant. Having to constantly download an index would be demanding, and to be honest, why bother? It's not like Google is slow.
Yes, we can build a local encyclopedia. But from a 'green' viewpoint, which is more environmentally friendly: more terabyte hard drives sold, or a few MB of data transferred over the internet?
So, practically speaking, how would one go about setting this up for local use? I imagine this would also work very well for locally stored books and docs.