Yeah, there are certainly more problems these days. For one, the web is much larger and much more of it is spam, which causes problems for pure PageRank unless you can detect the networks of sites that heavily link to each other.
Important sites have a bunch of anti-crawling detection set up (especially news sites). It's even worse that the best user-generated content is behind walled gardens in facebook groups, slack channels, quora threads, etc...
The rest of the good sites are JavaScript-heavy and you often have to run headless Chrome to render the page and find the content - but that is detectable, so you end up renting IPs from mobile number farms or trying to build your own 4G network.
On the upside, https://commoncrawl.org/ now exists and makes the prototype crawling work much easier. It's not the full internet, but gives you plenty to work with and test against so you can skip to the part where you figure out if you can produce anything useful should you actually try to crawl the whole internet.
> but that is detectable, so you end up renting IPs from mobile number farms or trying to build your own 4G network.
Something is deeply wrong with such an adversarial ecosystem. If sites don't want to be found and indexed, why go to any effort to include them? On the other hand, there are millions of small sites out there keen to be found.
The established idea of a "search engine" seems stuck, limited, and based on 90s technology that worked on a 90s web that no longer exists. Surely after 30 years we can build some kind of content discovery layer on top of what's out there?
https://blogsurf.io/ is an example of a small search engine that just stuck to a directory of known blogs instead of indexing the big sites or randomly crawling the web and ending up with mostly gibberish pages from all the spam sites.
Thank you for sharing this!
I read through the site's About page, and I really like how the creator stuck to a specific niche, going for quality over quantity.
> Something is deeply wrong with such an adversarial ecosystem. If sites don't want to be found and indexed why go to any effort to include them? On the other hand there are millions of small sites out there keen to be found.
I work on a small-to-medium ecommerce website and my code just... sucks. I kind of don't want to admit it, but it's true. When some Chinese search engine crawls all the product detail pages during our day (presumably their night?), it slows the site to a crawl. Technically I should have the pages set up so crawlers can't pierce through the Cloudflare cache, but it's easier to just ask Cloudflare to challenge visitors (with a CAPTCHA) once there are more than n requests per second from any single source (I think n is currently set to something small, like ten).
I don't understand all the business decisions but yeah, I'd suspect the biggest reason is we simply have poor codebases and can't spend too much time fixing this while we have so many backlog items from marketing to work on...
Why are page loads so slow or demanding? I can't imagine how a web crawler could be DoS'ing you if it's in good faith. What is the TPS? What caching are you doing? What's your stack like?
Not GP, but from having run a small/niche search engine that got hammered by a crawler in the past:
Webserver was a single VM running a Java + Spring webserver in Tomcat, connecting to an overworked Solr cluster to do the actual faceted searching.
Caches kept most page loads for organic traffic within respectable bounds, but the crawler destroyed our cache hit rate when it was scraping our site and at one point did exhaust a concurrent connection limit of some kind because there were so many slow/timing-out requests in progress at the same time.
Does the served HTML contain everything: the product, related products, comments, reviews, etc? If so, caching the entire page might be counterproductive.
But if the page is designed in such a way that it's broken up into fragments, then caching heavily is your friend.
Those fragments can either be loaded client-side (JavaScript) or server-side (ESI, SSR, etc.), but page fragments are key to increasing how much and how long you can cache.
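To make that concrete, here is a minimal sketch of server-side fragment caching, with made-up render helpers and a toy in-process TTL cache (a real site would plug its template engine into Redis, Varnish, or ESI, but the shape is the same):

```python
# Toy sketch of server-side fragment caching (all names are made up).
# The stable "core" of a product page gets a long TTL, while volatile
# fragments (reviews, related products) get their own shorter TTLs.
import time

_cache = {}  # key -> (expires_at, html)

def cached(key, ttl, render):
    """Return cached HTML for `key`, re-rendering via `render()` after `ttl` seconds."""
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]
    html = render()
    _cache[key] = (time.time() + ttl, html)
    return html

# Stand-ins for whatever template engine the site already uses.
def render_core(pid):    return f"<h1>Product {pid}</h1><p>description, price...</p>"
def render_reviews(pid): return f"<ul class='reviews'><!-- reviews for {pid} --></ul>"
def render_related(pid): return f"<div class='related'><!-- related to {pid} --></div>"

def product_page(product_id):
    core    = cached(f"core:{product_id}",    86400, lambda: render_core(product_id))     # rarely changes
    reviews = cached(f"reviews:{product_id}",   600, lambda: render_reviews(product_id))  # changes often
    related = cached(f"related:{product_id}",  3600, lambda: render_related(product_id))
    return core + reviews + related
```

A crawler hammering the product detail pages then mostly hits the long-lived core fragment instead of regenerating the whole page on every request.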
And with this, you've added a layer of complexity that a small to medium e-commerce site may not be able to fulfill with their in-house talent.
> If sites don't want to be found and indexed why go to any effort to include them?
The news sites etc. want to be indexed by Google, and they whitelist Google so they show up in search results. They don't want to be crawled by random aggregators and scrapers which don't bring them ad views and subscriptions. Indexing for me and not for thee.
Other places like Slack channels are private or semi-private and shouldn't be indexed, I agree. There was some kind of scheme a while back (before Freenode went down the tubes in a completely different and unrelated way) to index Freenode. People rightfully freaked out about it and the scheme was shot down.
I think that's not what they mean. What they meant is that the site can detect your headless robot and serve it good content, while serving spam and malware to everyone else (cloaking). Crawlers need their own distributed networks of unrelated addresses to detect or prevent this behavior.
Maybe we need a categorized, hand curated directory of sites that users can submit their own sites to for inclusion and categorization. Maybe like an open directory. Perhaps Mozilla could operate it, or maybe Yahoo!
We could also make a website where people can submit links to great websites they find, and also allow them to vote on the submissions of other users. That way you have a page filled with the best links, as determined by users. Maybe call it "the homepage of the internet".
You could even add the ability to discuss these links, and add a similar voting system to those discussions.
I would love to see a combo of a search engine and a portal. Basically at any step drilling down into links for categories and subcategories, I could type in a search bar and every site at that level and below would be searched.
I've been sort of considering going this path with my own search engine, but I feel whenever there is any sort of interactivity, there is the serious problem of spam and abuse; I'm not quite sure how to solve that.
Feels like accepting user content no longer flies on the web in 2022 due to the effectiveness of bot spam and lack of viable countermeasures :-/
I think non-crowdsourced manual curation (by you/your trusted volunteers) would be a feature. Sure, search would have more limited results, but really with how much cruft there is I don't see that as a bad thing. Start with the "known good/useful" domains and branch out from there.
Maybe. The upside of integrating it with a search engine is that you kinda can get dead link detection for free. Otherwise link rot is a real bitch to deal with for web directories maintained by few people.
Could probably back it with some human-readable format in a git repo or something; pull requests might help with moderation, and recruiting volunteers might be easier if the data is open and available/forkable for other projects.
>> If sites don't want to be found and indexed why go to any effort to include them? On the other hand there are millions of small sites out there keen to be found.
Then they should treat all bots equally and block Google as well. If they block Google as well, then yes, we should leave them alone.
Why give unfair treatment to Google? That's anti-competitive behavior and it just prevents new search engines from being created.
I think I understand, combined with jeffbee's answer, that these sites are behaving selectively according to who you are. So we're back to "No Blacks or Irish" on the 2022 Internet?
What do you think they have against smaller search engines? I can't quite fathom the motives.
There are a lot of crawlers out there, and many of them are ill-behaved. When GoogleBot crawls your site, you get more visitors. When FizzBuzzBot/0.1.3 comes along, you’re more likely to get an overloaded server, weird URLs in your crash logs, spam, or any other manner of mess.
Small search engines getting blocked is just collateral damage from websites trying to solve this problem with a blunt ban-hammer.
> So we're back to "No Blacks or Irish" on the 2022 Internet?
I think this is the inevitable result of surveillance capitalism. Once you can reliably be identified as holding certain views or meeting some arbitrary criteria pages can be dynamically altered to restrict what you have access to. A racist social platform can look completely benign to anyone who isn't part of the in-group, forums for teens can automatically lock out anyone suspected of being over a certain age, companies can show products or services only to people within a certain class, etc.
We already restrict a lot of access to content based on guesses about location. YouTube doesn't even let you see that certain content exists on their platform unless you're logged in from an IP in specific countries (as opposed to showing you the videos exist and throwing up an error saying they're not available when you try to view them), and Netflix blocks VPNs just to keep a bunch of their content away from the wrong kinds of people.
These are the kinds of barriers the internet should have freed us from, but instead it's being used to put up more gates and to force people into them.
- text processing (unicode-normalization, slugify, sanitation, lossless and lossy hashing like metaphone and document fingerprinting)
- etc...
I'm sure there is plenty more I've missed. There are lots of generic structures involved, like hash tables, linked lists, skip lists, heaps, and priority queues, and that's just to get to 2000s-level basic tech.
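For a taste of the text-processing bucket, here is a rough stdlib-only sketch of normalization, slugify, and a heavily simplified shingle-based document fingerprint (a real system would use proper metaphone and simhash/MinHash implementations):

```python
# Stdlib-only sketches of a few text-processing pieces (simplified for illustration).
import hashlib
import re
import unicodedata

def normalize(text: str) -> str:
    """Unicode-normalize and lowercase, dropping combining marks (accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def slugify(text: str) -> str:
    """Turn a title into a URL slug, e.g. 'Ecrire un moteur' -> 'ecrire-un-moteur'."""
    return re.sub(r"[^a-z0-9]+", "-", normalize(text)).strip("-")

def fingerprint(text: str, k: int = 5) -> set:
    """Hashes of k-word shingles; near-duplicate documents share most of them."""
    words = normalize(text).split()
    shingles = (" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1)))
    return {hashlib.md5(s.encode()).hexdigest()[:8] for s in shingles}

if __name__ == "__main__":
    print(slugify("Why Writing Your Own Search Engine Is Hard"))
    words = [f"word{i}" for i in range(200)]
    page = " ".join(words)
    near_dup = " ".join(words[:100] + ["edited"] + words[101:])  # one word changed
    a, b = fingerprint(page), fingerprint(near_dup)
    print(len(a & b) / len(a | b))  # ~0.95 Jaccard similarity for the near-duplicates
```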
A lot of people new to this space mistakenly think you can just throw Elasticsearch or Postgres full-text search in front of terabytes of records and have something decent. That might work for something small, like a curated collection of a few hundred sites.
Yeah, this is the great part about working with search, you really do get to sort of go to the gym when it comes to software engineering breadth. From hardware, to algorithms, to networking, to architecture, to UX, there is really an interesting problem everywhere you turn to look. Even just writing a file to disk is a challenge when the file is several dozen gigabytes and needs to be written byte-by-byte in a largely random order.
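One generic way to tame that kind of write pattern (purely an illustration, not necessarily how it's done here) is to buffer the (offset, data) writes, sort each batch by offset, and apply it in a forward-only sweep so the OS and drive can coalesce the seeks:

```python
# Illustration only: turn scattered random writes into sorted, mostly sequential batches.
import os

class BatchedRandomWriter:
    def __init__(self, path, batch_size=1_000_000):
        self.path = path
        self.batch_size = batch_size
        self.pending = []  # list of (offset, bytes) writes not yet on disk

    def write_at(self, offset, data: bytes):
        self.pending.append((offset, data))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        self.pending.sort()  # seeks now only move forward within the batch
        mode = "r+b" if os.path.exists(self.path) else "w+b"
        with open(self.path, mode) as f:
            for offset, data in self.pending:
                f.seek(offset)
                f.write(data)
        self.pending.clear()
```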
This does go some way toward explaining why Google interviews looked the way they've looked. It's just a shame everywhere else has copied their homework, without actually needing the same skills.
Yes, yes, yes :D There are so many topics in this space that are so interesting it's like a dream. I would add to your list:
- sentiment analysis
- roaring bitmaps
- compression
- applied linear algebra
- AI
In a Venn diagram intersecting all of these topics is search. Coding a search engine from scratch is a beautiful way to spend one's days, if you're into programming.
True story: I once had a discussion with a developer about search in general (for your website, for the internet) and the difficulties involved: precision vs. recall, relevancy vs. popularity, ranking, etc.
He was dumbfounded that I would want to spend two weeks 'tuning Solr queries' for a project. He asked (nay, stated)
Even though Common Crawl exists, if you wanted to get the data locally to have a bit of a poke around, it's practically impossible.
The next search engine will never be made by a few kids playing around in the shed.
The latest Common Crawl is roughly 120TB, enough to fit on 48 LTO-6 tapes. LTO-6 is the perfect middle ground expense-wise if you did want to play with the entire dataset at home. The tapes cost ~$30 (AUD) each, and a cheap loader from eBay can be had for around $1,000-4,000 (like a Dell PowerVault or Fujitsu Eternus).
You're looking at around $8k just to have the data close to your computer. Either that, or you just run everything on AWS/Google Cloud/Azure. Two of those are your competitors.
No one is going to copy 48 tapes for you, either :)
120TB would take around three months to download - if my ISP didn't cut me off first for probably being the heaviest consumer user of bandwidth.
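The arithmetic roughly checks out; a quick back-of-envelope (LTO-6 native capacity is 2.5 TB, and "three months" corresponds to averaging roughly 125 Mbit/s):

```python
# Back-of-envelope for the figures above.
corpus_tb, lto6_native_tb = 120, 2.5
print(corpus_tb / lto6_native_tb, "LTO-6 tapes")      # 48 tapes

for mbit_s in (1000, 250, 125):                       # sustained download rates
    days = corpus_tb * 1e12 * 8 / (mbit_s * 1e6) / 86400
    print(f"{mbit_s} Mbit/s -> {days:.0f} days")
# ~11 days at a sustained gigabit, ~3 months averaging 125 Mbit/s.
```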
https://www.chatnoir.eu/ is a search engine built from the Common Crawl data, running on Elasticsearch across 130 nodes [0].
I would love to be able to run my own search engine, but unless there are a number of breakthrough algorithms, I don't see how it can be easily achieved.
I did ask some guys from the Internet Archive if they would be happy to copy some data for me onto some tapes. I wonder if there is a service for this?
I've sort of mostly dismissed common crawl for the same reason when building my search engine. It's simply too unwieldy. It's far easier (and cheaper) to do my own crawling at a manageable scale, than it is to work with CC's datasets.
I'm also not entirely sure which problem CC is intending to solve. It's not like their data is in any way more complete than my own crawl sets; what I can't access to crawl, they don't seem to crawl either.
I don't know how people can use the data. There's so much of it! I don't see any hard drives that are 80TB. It seems like people would need some kind of RAID setup that can handle 200+TB of uncompressed data.
You don't need to download the whole thing. You can parse the WARC files from S3 to only extract the information you want (like pages with content). It's a lot smaller when you only keep the links and text.
A search index is often made of smaller independent pieces, often called segments. So you can progressively download and process the data locally, upload it to object storage, and run queries on it. That's what we did for this project: https://quickwit.io/blog/commoncrawl
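For anyone wondering what "keep only the links and text" looks like in code, here is a rough sketch using the warcio and beautifulsoup4 libraries against a locally downloaded WARC file (the filename is a placeholder):

```python
# Rough sketch: stream a Common Crawl WARC file and keep only the text and
# outgoing links of each HTML response record. Requires warcio and beautifulsoup4.
from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

def extract(warc_path):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            content_type = record.http_headers.get_header("Content-Type", "") if record.http_headers else ""
            if "text/html" not in content_type:
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            text = soup.get_text(" ", strip=True)
            links = [a["href"] for a in soup.find_all("a", href=True)]
            yield url, text, links

for url, text, links in extract("example.warc.gz"):  # placeholder filename
    print(url, len(text), len(links))
```

Each WARC file can be processed independently, which is what makes the progressive download-and-index approach above workable.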
Glad to see this on the front page. One of those posts I reread every now and then. Better yet, it's written by Anna Patterson, who in addition to the search engines mentioned at the bottom wrote chunks of Cuil (interesting even if it failed) and works on parts of Google's index, both before Cuil and, I think, now.
Sadly it's a little out of date. I'd love to see a more modern post by someone. Perhaps the authors of Mojeek, Right Dao, or someone else running their own custom index. Heck, I'd pay for something by Matt Wells of Gigablast or those behind Blekko. The whole space is so secretive that for those really interested in it, only crumbs of information are ever released.
Thanks for the mention @boyter. Maybe we should write something like "Why Writing Your Own Search Engine to Index Billions of Pages is Hard". It would make interesting reading about challenges we have overcome. For us it's not being secretive as much as finding bandwidth. Getting awareness is a massive challenge too, which is why we've written a lot more content in the last two years. I'm sure you are looking for something more meaty than this: https://blog.mojeek.com/2021/05/no-tracking-search-how-does-...
Hey folks, I am one of the co-founders of neeva.com
While writing a search engine is hard, it is also incredibly rewarding. Over the past two years, we have brought up a meaningful crawl / index / serve pipeline for Neeva. Being able to create pages like https://neeva.com/search?q=tomato%20soup or https://neeva.com/search?q=golang+struct+split which are so much better than what is out there in commercial search engines is so worth it.
tldr; there are two big challenges in crawling the web:
* quantity -- how do you crawl the web at O(B) pages per day
* quality -- how do you make sure you are crawling the high quality parts
On the quantity side, the web is a much trickier place to crawl than 10-15 years ago. Even if you build a system that is well behaved, respects rate limits and work w/ webmasters and CDNs to make sure you are treated as a good bot, the following things will bite you:
* some sites only allow googlebot/bingbot via robots
* even when sites allow all search crawlers, you'll see 429s, 503s, crawl-delay directives that limit you to very low qps, and other mysterious errors (most retailers and aggregators)
* working w/ webmasters works only if you manage to get a response from them (good luck)
* many sites require JS rendering (which is 100x more expensive in terms of the number of assets you are crawling)
On the quality side, crawl prioritization works best when you have click data, which most small search engines don't have enough of. In the absence of that, good seed sets and high quality inlinks are your friend.
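For illustration (not Neeva's crawler, just the Python standard library), the "well behaved" baseline is roughly: check robots.txt, honor any crawl-delay, and back off on 429/503. A real crawler also needs per-host queues, robots.txt caching, Retry-After parsing, JS rendering, and much more.

```python
# Minimal sketch of a polite fetch: robots.txt, crawl-delay, backoff on 429/503.
import time
import urllib.error
import urllib.request
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "ExampleBot/0.1 (+https://example.com/bot)"  # placeholder identity

def polite_fetch(url, max_tries=4):
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    robots = urllib.robotparser.RobotFileParser(root + "/robots.txt")
    robots.read()
    if not robots.can_fetch(USER_AGENT, url):
        return None                                   # the site said no; respect it
    delay = robots.crawl_delay(USER_AGENT) or 1       # default to 1s between requests

    for attempt in range(max_tries):
        time.sleep(delay * (2 ** attempt))            # crawl-delay, doubled on each retry
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code in (429, 503):                  # rate-limited or overloaded: back off
                continue
            raise
    return None                                       # give up and reschedule much later
```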
In other news, eng blogs are being written up as we speak. Will post here on HN as they roll out. Also, like marginalia_nu pointed out, we share whatever we've been working on on a weekly basis on our Twitter account.
Useful context for this is that Anna Patterson started the search engine company Cuil in 02008, four years after writing this article (when she was still at Google). Its results were bad enough that the "Cuil Theory" meme was launched on Reddit and Tumblr making fun of it: https://knowyourmeme.com/memes/sites/cuil-theory:
> One Cuil = One level of abstraction away from the reality of a situation.
> Example: You ask me for a Hamburger.
> 1 Cuil: if you asked me for a hamburger, and I gave you a raccoon.
> 2 Cuils: If you asked me for a hamburger, but it turns out I don't really exist. Where I was originally standing, a picture of a hamburger rests on the ground.
> 3 Cuils: You awake as a hamburger. You start screaming only to have special sauce fly from your lips. The world is in sepia.
> 4 Cuils: Why are we speaking German? A mime cries softly as he cradles a young cow. Your grandfather stares at you as the cow falls apart into patties. You awake only to see me with pickles for eyes, I am singing the song that gives birth to the universe.
Two years later, in 02010, the founders shut down the search engine, laid off all the employees, sold its patents to Google, and became Google employees again. Now all that remains of Cuil is the Cuil Theory Wiki.
I'd be really interested to see a retrospective on what went wrong. I guess writing your own search engine really is hard, but I'd like to know what turned out to be so much harder than they expected.
The hardest part of building your own search engine is not that it is technically hard (it's actually pretty easy) but that the bar for success is just really high. There are a few existing engines and they offer their services for free. So, not only is the bar really high, you have no viable revenue model and this stuff gets expensive quickly.
Or put differently, if you are going to replicate what existing search engines already do, you are probably not going to be as good initially and you are going to struggle making money. Fixing the money part is the hard part.
It would be interesting to see stats from that time on how many people were working on search engines and how it turned out for them. Did they end up getting acquired, at least funded for a while, exiting, or just bootstrapping themselves until they realized there would only be one winner?
This is pretty reasonable for 2004, but the problems have changed. Everything on that page is totally doable for a serious engineering team, and has been done many times. The real hard part is what they briefly touch on with:
> Don't do page rank initially. Actually don't do it at all. For this observation I risk being inundated with hate mail, but nonetheless don't do page rank. If you four guys in your garage can't get something decent-looking up without page rank, you're not going to get anything decent up with page rank. Use the source, Luke—the HTML source, that is. Page rank is lengthy analysis of a global nature and will cause you to buy more machines and get bogged down on this one complicated step—this one factor in ranking. Start by exploiting everything else you can think of: Is the word in the title? Is it in bold? etc. Spend your time thinking about anything you can exploit and try it out.
The web is full of sites that want to rank, since traffic makes money and appearing high in search results gets you traffic. Simple handling of HTML source is incredibly gameable, and while it might have worked okay on the web of 2004, it definitely is not enough now. It's you and your team versus an enormous number of SEO people.
As best I recall, one of the first SEO tricks took advantage of simple algorithms that ranked pages based more or less on word frequency, with certain synonyms collapsed, e.g. "movies, films, flicks, and cinema" would all be considered synonyms.[1] It took exactly seven minutes for site authors to figure out keyword stuffing.
1. Synonym collapsing isn't even necessarily a good thing. As Clay Shirky noted in "Ontology is Overrated -- Categories, Links, and Tags", people who enjoy watching movies and people into cinema are different crowds. https://oc.ac.ge/file.php/16/_1_Shirky_2005_Ontology_is_Over...
I’ve been puttering away at making a search engine of my own (I should really do a Show HN sometime); let’s see how my experience compares with 18 years ago:
Bandwidth: This is now also cheap; my residential service is 1 Gbit. However, the suggestion to wait until you've got indexing working well before optimizing crawling is IMO still spot-on; trying to make a polite, performant crawler that can deal with all the bizarre edge cases (https://memex.marginalia.nu/log/32-bot-apologetics.gmi) on the Web will drag you down. (I bypassed this problem by starting with the Stack Exchange data dumps and Wikipedia crawls, which are a lot more consistent than trying to deal with random websites.)
CPU: Computers are really fast now; I’m using a 2-core computer from 2014 and it does what I need just fine.
Disk: SATA is the new thing now, of course, but the difference these days is HDD vs SSD. SSD is faster: but you can design your architecture so that this mostly doesn’t matter, and even a “slow” HDD will be running at capacity. (The trick is to do linear streaming as much as possible, and avoid seeks at all costs.) Still, it’s probably a good idea to store your production index on an SSD, and it’s useful for intermediate data as well; by happenstance more than design I have a large HDD and a small SSD and they balance each other nicely.
Storing files: 100% agree with this section, for the disk-seek reasons I mention above. Also, pages from the same website often compress very well against each other (since they’re using the same templates, large chunks of HTML can be squished down considerably), so if you’re pressed for space consider storing one GZIPped file per domain. (The tradeoff with zipping is that you can’t arbitrarily seek, but ideally you’ve designed things so you don’t need to do that anyway.) Also, WARC is a standard file format that has a lot of tooling for this exact use case.
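As a sketch of the one-file-per-domain idea (layout and field names made up): push all of a domain's pages through a single gzip stream, so the shared template HTML compresses against itself; reading it back is a pure linear scan, which fits the "stream, don't seek" advice above.

```python
# Sketch of "one gzipped file per domain" as JSON lines. No random access:
# to get one page back you stream the domain's whole file, which is fine
# if your pipeline is built around linear scans anyway.
import gzip
import json
import os

STORE_DIR = "pages"  # hypothetical layout: pages/<domain>.jsonl.gz

def write_domain(domain, pages):
    """pages: iterable of (url, html) pairs for a single domain."""
    os.makedirs(STORE_DIR, exist_ok=True)
    path = os.path.join(STORE_DIR, domain + ".jsonl.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for url, html in pages:
            f.write(json.dumps({"url": url, "html": html}) + "\n")

def read_domain(domain):
    path = os.path.join(STORE_DIR, domain + ".jsonl.gz")
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```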
Networking: I skipped this by just storing everything on one computer; I expect to be able to continue doing this for a long time, since vertical scaling can get you very far these days.
Indexing: You basically don’t need to write anything to get started with this these days! I’m just using bog-standard Elasticsearch with some glue code to do html2text; it’s working fine and took all of an afternoon to set up from scratch. (That said, I’m not sure I’ll continue using Elastic: it has a ton of features I don’t need, which makes it very hard to understand and work with since there’s so much that’s irrelevant to me. I’m probably going to switch to either straight Lucene or Bleve soon.)
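For illustration, the glue code really can be that small; a sketch along these lines (made-up index and field names, using the html2text package and Elasticsearch's plain REST API):

```python
# Sketch of the html2text + Elasticsearch glue (index and field names made up).
import requests    # pip install requests
import html2text   # pip install html2text

ES = "http://localhost:9200"
INDEX = "pages"

def index_page(url, html):
    doc = {
        "url": url,
        "text": html2text.html2text(html),  # plain-ish text extracted from the page
    }
    requests.post(f"{ES}/{INDEX}/_doc", json=doc).raise_for_status()

def search(query, size=10):
    body = {"query": {"match": {"text": query}}, "size": size}
    hits = requests.post(f"{ES}/{INDEX}/_search", json=body).json()["hits"]["hits"]
    return [(h["_score"], h["_source"]["url"]) for h in hits]
```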
Page rank: I added pagerank very early on in the hopes that it would improve my results, and I’m not really sure how helpful it is if your results aren’t decent to begin with. However, the march of Moore’s law has made it an easy experiment: what Page and Brin’s server could compute in a week with carefully optimized C code, mine can do in less than 5 minutes (!) with a bit of JavaScript.
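The whole computation is just power iteration over the link graph; a toy sketch (in Python here, for illustration) looks something like this:

```python
# Toy power-iteration PageRank. `links` maps page -> set of pages it links to.
def pagerank(links, damping=0.85, iterations=50):
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            if not outs:
                continue  # dangling pages: ignored here, handled properly in real code
            share = damping * rank[page] / len(outs)
            for out in outs:
                new[out] += share
        rank = new
    return rank

ranks = pagerank({"a": {"b", "c"}, "b": {"c"}, "c": {"a"}, "d": {"c"}})
print(sorted(ranks.items(), key=lambda kv: -kv[1]))  # "c" should come out on top
```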
Serving: Again, ElasticSearch will solve this entire problem for you (at least to start with); all your frontend has to do is take the JSON result and poke it into an HTML template.
It’s easier than ever to start building a search engine in your own home; the recent explosion of such services (as seen on HN) is an indicator of the feasibility, and the rising complaints about Google show that the demand is there. Come and join us, the water’s fine!
It’s an interesting idea, and I have a feeling I’ve read somewhere about Google doing something similar, but these sorts of heuristics are a rabbit hole I’m trying not to go down at the moment—currently I have far more serious reasons why my ranking isn’t great, so I’m trying to prioritize those rather than getting distracted by whatever interesting algorithm I happen to bump into. (Definitely on my list to investigate in the future, though!)
Doesn't mention the hardest part I found when developing a crawler: dealing with pages whose content is mostly dynamic and generated client-side (SPAs). Even using V8, it's hard to do reliably and performantly at scale.
That was about when I was writing my crawler (not for search but for rules-based analysis). Even in 2004 a lot of key DOM elements were created/modified client side.
Though I do remember now that we solved it with a separate mechanism for pages that required logging in or had significant client-side rendering: the user could record a macro that was played back in a headless browser. Within a few years, though, it was obvious a crawler would need to be able to automatically handle client scripts.
Is this an actual problem, though? It seems like not even Google deals well with these types of pages.
Often those types of pages are difficult to link to as well, as they're often highly stateful, applications rather than documents, and what you're seeing may not be what you get when you click the link. Not really what you want in a search engine.
This is from the "doesn't scale" quadrant, but if you are not confident that your bot will behave well, shouldn't you supervise everything it does closely until you become confident?
Text search on the web will slowly die. People will search video based content, and use the fact that a human spoke the information, as well as comments/upvotes to vet it as trustworthy material. Google search as we know it will slowly die, and then will decline like Facebook. TikTok will steal search marketshare as their video clips span all of human life.
Returning text results in response to queries will continue to decline in favor of returning answers and synthesized responses directly. I don't want Google to point me to a page that contains the answer somewhere, when it could provide me an even better summary based on thousands of related pages it has read.
Right, but the main flaw with Google is that people increasingly don't trust the results, whether they're synthesized or not. And Google is in the adversarial position of wanting to censor certain answers as well as present answers that maximize its own revenue. An alternative (like video-based TikTok) will arise and crush them eventually.