I do indeed index the web myself. Not the entire web, just a subset of it. The crawler quickly loses interest in javascript:y websites and only indexes at depth those websites that are simple. It also focuses on websites in English, Swedish and Latin and tries to identify and ignore the rest (best-effort).
You'd be surprised how much you can do with modern hardware if you are scrappy. The current index is about 17.7 million URLs. I've gone as far as 50 million and could probably double that if I really wanted to. The difficulty isn't having a small enough index, but rather having a relevant enough index, weeding out the link farms and stuff that just takes up space.
I only index N-grams of up to 4 words, carefully chosen to be useful. The search engine, right now, is backed by a 317 GB reverse index and a 5.2 GB dictionary.
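To make the n-gram bit concrete, here's a naive sketch of what "up to 4 words" means (not the real indexing code; the actual index only keeps a carefully chosen subset of these):

    # Naive illustration of "n-grams of up to 4 words" -- this just enumerates
    # all of them, whereas the real index keeps only a chosen subset.
    def ngrams(text: str, max_n: int = 4):
        words = text.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                yield " ".join(words[i:i + n])

    print(list(ngrams("how to hold a mutex in the linux kernel")))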
I have only one recommendation that might make the search a bit more relevant, e.g. when searching for 'linux locking' or 'kernel locking' kinds of things.
Try to upsort things that match near the top of the content, e.g. the top of a man page vs. the middle vs. the bottom.
One easy way to do it without having to store positions is to index the ngrams with sqrt of their line number, capped at 8 (i.e. min(sqrt(line), 8)); this gives the first 64 lines distinct buckets. You can also use log(), or just decide ad hoc on top/middle/bottom of the document so you only need 3 values.
Then just rewrite the query "kernel locking" as "dismax(kernel_1 OR kernel_2 OR kernel_3 ...) AND dismax(locking_1 OR locking_2 ...)" with some tiebreaker of 0.1 or so. You can also express "I want to upsort things on the same line, or a few lines apart" by modifying the query a bit.
It works really well and costs very little in terms of space. I tried it at https://github.com/jackdoe/zr while searching all of stackoverflow, man pages etc., and was pretty surprised by the result.
This approach is a bit cheaper than storing the positions, because positions are (let's say) 4 bytes per term per doc, while this approach has a fixed upper bound of 8*4 bytes per document (assuming 4-byte document ids) plus some amortized cost for the extra terms.
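Rough Python sketch of the whole idea, in case it helps (a toy in-memory version, not what zr actually does internally):

    import math
    from collections import defaultdict

    # Toy sketch of the bucketed-term idea above (not the zr code).
    # Each term is indexed as "term_<bucket>", where the bucket is sqrt of the
    # line number capped at 8, so the first 64 lines get distinct buckets and
    # everything further down collapses into bucket 8.

    BUCKETS = range(1, 9)
    index = defaultdict(set)  # "term_bucket" -> set of doc ids

    def bucket(line_no: int) -> int:
        return min(math.ceil(math.sqrt(max(line_no, 1))), 8)

    def index_doc(doc_id: int, lines: list[str]) -> None:
        for line_no, line in enumerate(lines, start=1):
            for term in line.lower().split():
                index[f"{term}_{bucket(line_no)}"].add(doc_id)

    def docs_matching(term: str) -> set[int]:
        return set().union(*(index.get(f"{term}_{b}", set()) for b in BUCKETS))

    def dismax(doc_id: int, term: str, tiebreak: float = 0.1) -> float:
        # Matches near the top (low bucket) score higher; take the best bucket
        # and add a small tiebreaker for the other buckets that also matched.
        scores = [9 - b for b in BUCKETS if doc_id in index.get(f"{term}_{b}", set())]
        if not scores:
            return 0.0
        return max(scores) + tiebreak * (sum(scores) - max(scores))

    def search(query: str) -> list[tuple[int, float]]:
        terms = query.lower().split()
        candidates = set.intersection(*(docs_matching(t) for t in terms))  # AND
        return sorted(((d, sum(dismax(d, t) for t in terms)) for d in candidates),
                      key=lambda x: -x[1])

    index_doc(1, ["kernel locking primitives", "some body text"])
    index_doc(2, ["introduction"] * 80 + ["kernel locking mentioned at the bottom"])
    print(search("kernel locking"))  # doc 1 ranks well above doc 2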
Cool, I've been thinking on this topic a bit lately. Crawling is indeed not that hard of a problem. Google could do it 23 years ago. The web is a bit bigger now of course but it's not that bad. Those numbers are well within the range of a very modest search cluster (pick your favorite technology; it shouldn't be challenging for any of them). 10x or 1000x would not matter a lot for this. Although it would raise your cost a little.
The hard problem is indeed separating the good stuff from the bad stuff; or rather labeling the stuff such that you can tell the difference at query time. PageRank was nice back in the day, until people figured out how to game things. And now we have bot farms filling the web with nonsense to drive political agendas, create memes, or drown out criticism. PageRank is still a useful ranking signal, just not by itself.
The one thing no search engine has yet figured out is the reputability of sources. Most content isn't anonymous. It's produced and consumed by people, and those people have reputations. Bot content is bad because it comes from sources without a credible reputation. Reputations are built over time and people value having them. What if we could weigh people's appreciation by their reputability? That could filter out a lot of nonsense. A simple like button plus a flag button, combined with verified domain ownership (SSL certificates), could do the trick. If you like a lot of content that other people disliked, your reputation goes down the drain. If you produce a lot of content that people like, your reputation goes up. If a lot of reputable people flag your content, your reputation tanks.
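To make that loop concrete, here's a toy sketch of what reputation-weighted likes and flags might look like (all the weights and names are made up, and none of the hard anti-gaming work is in here):

    from collections import defaultdict

    # Toy sketch of a reputation-weighted like/flag loop; the weights and decay
    # factors are arbitrary placeholders, not a worked-out design.
    reputation = defaultdict(lambda: 1.0)  # author or voter -> reputation weight

    def record_vote(voter: str, author: str, kind: str) -> None:
        w = reputation[voter]
        if kind == "like":
            reputation[author] += 0.1 * w   # likes from reputable people count more
        elif kind == "flag":
            reputation[author] -= 0.5 * w   # flags from reputable people hurt more
        reputation[author] = max(reputation[author], 0.0)

    def penalize_contrarian_voter(voter: str, community_flag_ratio: float) -> None:
        # If you keep liking content that reputable people flagged, your own
        # weight as a voter decays.
        if community_flag_ratio > 0.8:
            reputation[voter] *= 0.9

    def rank_score(relevance: float, author: str) -> float:
        # Blend query-time relevance with the author's (or domain's) reputation.
        return relevance * (1.0 + reputation[author])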
The hard part is keeping the system fair and balanced. And reputability is of course a subjective notion and there is a danger of creating recommendation bubbles, politicizing certain topics, or even creating alternative reality type bubbles. It's basically what's happening. But it's mostly powered by search engines and social media that actually completely ignore reputability.
> The hard part is keeping the system fair and balanced.
It is, which is why I think the author should stay away from anything requiring users to vote on things.
The problem with deriving reputability from votes over time is in distinguishing legitimate votes from malicious votes. Voting is something that doesn't just get gamed, it gets gamed as a service. You'll have companies selling votes, and handling all the busywork necessary to game the bad vote detector.
Search engines and social media companies don't ignore this topic - on the contrary, they live by it. The problem of reputation vote quality is isomorphic to the problem of ad click quality. The "vote" is a click event on an ad, and the profitability for both the advertiser and the ad network depends on being able to tell legitimate clicks and fake clicks apart. Ludicrous amounts of money went into solving this problem, and the end result is... the surveillance state. All this deep tracking on the web doesn't exist just - or even primarily - to target ads. It exists to determine whether a real would-be customer is looking at an ad, or whether it's a bot farm (including a protein bot farm, aka people employed to click on ads en masse).
We need something better. Something that isn't as easy to game, and where the mitigations don't come with such a high price for society.
> It also focuses on websites in English, Swedish and Latin and tries to identify and ignore the rest
When I search for Japanese terms, it says "<query> needs to be a word", which isn't the best error message. Maybe the error message should say something like "sorry, your language isn't supported yet"?
Not OP, but if I was to do this, I'd start by downloading Wikipedia and all its external links and references, and crawling from there. You should eventually reach most of the publicly visible internet.
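Something along these lines could bootstrap the frontier (a standard-library toy; the article URL and User-Agent are placeholders, and a real crawler would also use the dumps, respect robots.txt, rate limit, and so on):

    # Toy sketch: seed a crawl frontier from a Wikipedia article's external links.
    from html.parser import HTMLParser
    from urllib.parse import urlparse
    from urllib.request import Request, urlopen

    class ExternalLinks(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = set()

        def handle_starttag(self, tag, attrs):
            if tag != "a":
                return
            href = dict(attrs).get("href") or ""
            host = urlparse(href).netloc
            # Keep absolute links that point outside Wikipedia/Wikimedia itself.
            if host and "wikipedia.org" not in host and "wikimedia.org" not in host:
                self.links.add(href)

    def seed_urls(article: str) -> set[str]:
        req = Request(article, headers={"User-Agent": "toy-crawler/0.1"})
        html = urlopen(req).read().decode("utf-8", errors="replace")
        parser = ExternalLinks()
        parser.feed(html)
        return parser.links

    frontier = seed_urls("https://en.wikipedia.org/wiki/Web_crawler")
    # ...then pop URLs off the frontier, fetch, extract links, and repeat.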
I feel a little embarrassed that I didn't think of something like that.
When I did some crawler experimenting in my younger years, I thought I was pretty clever using sites that would let you perform random Google searches. I would just crawl all the pages from the results returned.
Your method would undoubtedly be more interesting, I think. It would certainly lead to interesting performance problems quicker, I bet.