I want to make my own search engine, one day, with my own crawler.
There is an SEO-proof way to determine what the ranking of a site should be - penalise it for each advertisement, penalise further for delivering different content to the crawler[1], allow logged-in users to down-rank a site, etc.
Basically, a site starts off with a perfect score, then gets penalised for each violation, for each dark-pattern, for each anti-user decision they took.
This sort of ranking cannot be gamed by SEO spammers, because ... why get their site to the top of the rankings if they can't put any advertisements on it?
Even if they do manage it, their time in the sun will be brief.
[1] Random spot-checks with different user agents with different fingerprints should work.
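As a very rough sketch of what I mean (the penalty weights here are placeholders, and the ad/cloaking/downvote counts are whatever the crawler and the logged-in users report):

```python
def site_score(ad_count: int, cloaks_to_crawler: bool, user_downvotes: int) -> float:
    """Every site starts with a perfect score and can only be penalised."""
    score = 1.0
    score -= 0.10 * ad_count          # penalty per advertisement
    if cloaks_to_crawler:             # caught serving the crawler different content
        score -= 0.50
    score -= 0.01 * user_downvotes    # down-ranks from logged-in users
    return max(score, 0.0)            # the score never goes back up

# A clean, ad-free, honest page keeps its perfect score:
assert site_score(0, False, 0) == 1.0
```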
It's your ranking system, so it's correct by definition (for you), but I've got to point out you're not considering the quality of the text. In your model, an empty page is perfect.
> It's your ranking system, so it's correct by definition (for you), but I've got to point out you're not considering the quality of the text.
Yes. This is by design![1]
> In your model, an empty page is perfect.
Yes. What's the problem with that? An empty page won't match any search terms, will it?
[1] It's impossible to quickly and cheaply determine the quality of the text, in an age where it's cheaper for a blog-spammer to use AI to write 3k articles than for the crawler to use AI to check 3m articles.
You might be onto an interesting new view of the web here. If you've ever seen https://www.builtwith.com, something similar that also processes script tags and down-weights based on what the script is would add to this effect.
At the very least I'd be interested in looking at such a site and seeing how it differs from https://www.millionshort.com =)...
Nah, you need some way to consider the quality of your text. No way around that.
Let's say I publish a page -- nothing fancy to it, no dark patterns at all -- just the word "architecture". For anybody searching for the term "architecture", this is now the perfect page -- a 100% match to their search, with no dark patterns at all. Completely useless, but perfect.
What would my motivation be for doing something like this? Well, imagine that I now deploy 100,000 identical or very similar pages. Your search engine is now fully _nuked_ for those terms. Does your competitor have a product that you don't want being found via the search engine? I'll happily nuke its search terms for you, for a fee of course. That's my motivation. Remember that advertising is not just a competition for you to be seen, but also for your competitors to not be seen. Your search engine, as-is, would be great at enabling the latter.
(Maybe you could fix this by allowing down-votes from logged-in, IP-logged users... but with 100k pages to downvote, that's not gonna help. So maybe you take pages which have been downvoted and do similarity comparisons to pages which haven't. A simple Levenshtein distance is too crude and easily-gamed, but maybe you could feed the downvotes into a neural-network classification system... and down the rabbit hole you go.)
Doing this only makes sense if the search engine gained some nontrivial traction, at which point it becomes feasible to hire a small number of students to manually delist obvious garbage.
I guess this will hold up as long as you're not the №1 default search engine.
If it is actually as much better as it sounds, even Apple users will find a way to set their search engine to what works once word of mouth gets around. (Until Apple turns that option off and waits for the EU to pass another gatekeeper law, anyway.)
If it is actually as much better as it sounds, one wonders why Google or Duckduckgo doesn't do this already. For Google, killing the advertisement incentive is probably counterproductive in the long run, but why doesn't DDG?
Karma, occasional moderation rights, and occasional meta-moderation rights worked pretty well at Slashdot, for instance. It depends heavily on user involvement though.
Yeah, there is a dilemma. To get a wider audience, you need very good search results. But if your search quality depends on the number of users, how do you get there in the first place?
People like power, that keeps them moderating (looking at Facebook groups). What kind of reward system works here?
Well the baseline search quality should not rely on users. It should be great out of the box until spammers begin targeting it directly. Relying on users to moderate should happen organically as the number of users grow, which ideally would be timed to offset the growth in spam targeted at the engine.
The issue is that the baseline needs to be really, really good before people start switching from other services.
But if you are able to provide a better baseline than others, then you might not even need the users to be "better". With users, you would then be exceptional.
I suppose services like Gmail benefit a lot from users who mark the occasional spam email in their inboxes. While each such marking is inconsequential on its own, correlated over many users they make it possible to detect new and unusual waves of spam, and to start filtering them for everyone.
A similar mechanism could work for a search engine. It would be harder to game if it required high karma (many other users matching your judgment before it starts to count), and a paid account, like with Kagi.
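A minimal sketch of what that aggregation could look like (the karma threshold and the minimum number of agreeing users are made-up numbers):

```python
def spam_signal(votes, min_karma=100, min_voters=5):
    """votes: list of (user_karma, flagged_as_spam) pairs for one search result.
    Only count flags from established (high-karma) accounts, and only act
    once enough of them independently agree."""
    trusted_flags = sum(1 for karma, flagged in votes if flagged and karma >= min_karma)
    return trusted_flags >= min_voters

# A single high-karma flag is inconsequential; several of them start to count:
assert not spam_signal([(500, True)])
assert spam_signal([(500, True)] * 5)
```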
There's also the issue of users trying to downrank any polarizing site because they disagree with it. To pick an example, think of the search term "vaccine" and how strongly people on both sides feel about it. Eventually, you'll probably end up with results for "vaccine" that have nothing whatsoever to do with vaccines because everything vaccine-related will end up downranked.
I keep thinking of a hierarchical invite model for these sorts of problems. I'd be very curious if someone could give a second opinion on this idea.
The mechanism:
Everyone has to be invited by someone, so it all traces back to the creator. The creator knows they themselves are legit, but let's say someone online asked for an invite and bad inputs keep coming from somewhere down that branch of the invite tree. Either the person the creator invited is a spammer, or someone they've invited in turn is the spammer. All accounts leading up to them can be progressively killed (and their inputs nullified), starting with the ones actually causing trouble → if it keeps happening in that branch then kill one layer up, and so on.
Incentives:
People risk losing their own account when they invite someone they don't trust to be a good netizen. Maybe there needs to be an incentive to care about your account in the first place, or maybe (looking at Wikipedia or OpenStreetMap, or HN with its voting system and the homepage meaning a lot of attention for your page) a majority of people are simply honest and happy to contribute, and that suffices.
Problem it solves:
Wouldn't such a hierarchical invite system work around the online identification problem?
If you DNA-checked everyone and banned people who abused the system in the last decade, you'd also not have any spam online, but that's way too invasive (besides not being legal and being prohibitively expensive, it's also not ethical). However, a pseudonymous (all that is known about you is a random user ID) invite tree seems to me like it would have similar properties. It requires banning the same person perhaps a hundred times until they run out of people who will give them invites, but wouldn't it eventually distill the honest people from the population? (Which is probably almost everyone, if there is no gain from systematic cheating and there's social pressure not to ask for multiple invites, because account holders know that means you were either messing with the system they enjoy using or invited someone onto it who did that.)
(Implementation details: One bad input isn't an instant ban: people misclick or misunderstand interfaces, but eventually it gets to a point where, if they can't click the right buttons, there's also no point in having them be moderators of the search engine (or whatever this is used for), and so their account is removed. If multiple removals happen in a tree that's deep and recent, remove more than one layer at a time to get rid of malicious sockpuppet layers. The tree's maximum depth can be limited to something on the order of 50: it doesn't take many steps to find a chain of relationships that links two random persons on the planet, so a fairly low depth is enough for the whole world. People should be told on the invitation page how many bad apples were removed in each layer below them, so if they're one bad apple removed from having their own account pruned, they know to only invite people they're very sure about. One problem I see with the system is that it reveals social graphs, which not everyone is happy about. If that means being able to kill virtually all spam, content farms, etc., maybe it's worth it, but of course more research is needed beyond an initial proposal.)
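Here's a rough sketch of the pruning logic in code (the strike limit and the "how many banned invitees before the inviter goes too" threshold are arbitrary placeholders, not tuned values):

```python
class Account:
    def __init__(self, user_id, parent=None):
        self.user_id = user_id
        self.parent = parent        # who invited this account (None for the creator)
        self.children = []          # accounts this one invited
        self.strikes = 0
        self.banned = False
        if parent:
            parent.children.append(self)

    def ban(self):
        """Remove this account and nullify everything below it in the invite tree."""
        self.banned = True
        for child in self.children:
            if not child.banned:
                child.ban()

    def report_bad_input(self, strike_limit=3, bad_children_limit=2):
        """One bad input is not an instant ban; repeated trouble prunes the
        account, and a branch that keeps misbehaving loses the layer above it."""
        self.strikes += 1
        if self.strikes < strike_limit:
            return
        self.ban()
        parent = self.parent
        if parent and sum(c.banned for c in parent.children) >= bad_children_limit:
            parent.ban()    # trouble keeps coming from this branch: kill one layer up
```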
This is the traditional way societies enforce totalitarianism. You're not allowed to sleep outside, because your landlord will punish you. He has to, because if he doesn't, the city will punish him. It has to, because if it doesn't, the state will punish it.
Maybe totalitarianism is what you want for search engine moderation. But it does feel a bit messed up when you frame it like I just did. Another thing that happens in these power delegation hierarchies is that people undermine each other in order to move up the hierarchy.
Isn't that what's happening to websites today? Everyone jostling for a higher position in the ranking to stay afloat on advertisement returns, and dancing to the advertisers' tune or they'll get kicked out of the system.
I'm also not sure I agree that a system with this amount of leeway (anyone can invite you onto the system, and it asks nothing of who you are) is comparable to a system where your landlord gets to say how you must behave in frivolous detail. All it wants from you is to not promote spam. Furthermore, physically moving to another landlord, who in such a scenario would want to get a reference, is a whole other ordeal compared to going online and asking one of any number of people you're close to for a code that takes them 5 seconds to generate.
"I invented a ranking system that cannot be gamed!"
In my experience, ALL ranking systems can - and will - be gamed, no matter what. I'm still willing to give you the benefit of the doubt - maybe you are the one who can finally come up with a truly un-game-able system. But I'll consider that an extraordinary claim - and will require extraordinary proof for it.
> why get their site to the top of the rankings if they can't put any advertisements on it?
You would need to penalize any website that accepts money from users. There are a lot of SaaS companies out there writing blog spam to promote subscription products.
> You would need to penalize any website that accepts money from users. There are a lot of SaaS companies out there writing blog spam to promote subscription products.
I don't see that as a problem, because blog-spam without advertisements is going to get down-ranked until it is exactly where it needs to be. If a single site dominates too much, then a hard cap on how often a specific site can appear in the results is feasible, or weighting each page by the fraction of all the pages on its site (thereby penalising sites that have millions of pages). Specific exceptions can be made for wikipedia, or similar.
For example, let's look at digital-ocean: is that constant blogging something that users find helpful? Well, if a page is relevant to the search term then there's no problem. If it's not, then the volume of 'irrelevant to this search' feedback from logged-in users will gradually down-rank it until it is correctly ranked. If it's relevant but is one of 30k pages on the site that match relevancy with that search term, then it gets down-ranked.
Basically, the motivation to do nothing but spam is completely eliminated. People spamming "How to do $TRIVIAL-THING" to drive engagement with their SaaS product is, I feel, a completely valid result if the search term matches relevancy with the specific blog post, and the site isn't doing a barrage of postings.
Humans respond to incentives. Taking away the incentive to generate endless drivel and replacing it with the incentive to provide good content is the intention behind my proposal.
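For the hard cap mentioned above, something this simple would do (a sketch, assuming the results come in as relevance-ordered (url, domain) pairs; the cap of 3 per domain is arbitrary):

```python
from collections import Counter

def cap_per_domain(results, max_per_domain=3):
    """Keep relevance order, but stop any single site from flooding the page."""
    seen = Counter()
    capped = []
    for url, domain in results:
        if seen[domain] < max_per_domain:
            capped.append((url, domain))
            seen[domain] += 1
    return capped
```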
> Specific exceptions can be made for wikipedia, or similar.
Once you go down the dark path of white-listing sites, you’ve admitted defeat. How are you going to find that upstart Reddit competitor or federated Wikipedia competitor?
Some of that blog spam is genuinely useful, so you need to differentiate by content quality between the useful posts and brands writing content purely to promote their SaaS product.
I have tried it before, but it appears to me that it dismisses all commercial content altogether, which is not what I want. That is throwing out the baby with the bathwater.
For example, I searched for "The Vietnam of computer science" (Without the quotes) and it returned zero results.
Fixed the bug. Some of the results are a bit weird though; it seems to be more Vietnamese computer science than Atwood. Interesting case, will probably be very useful for optimizing the result relevance.
> > but it appears to me that it dismisses all commercial content altogether, which is not what I want
> Isn't it?
Of course not. Appearing lower in the results than non-monetised content is very different from not appearing in the results at all.
> How is commercial content supposed to be funded?
They'll find a way. After all, if more people turn to ChatGPT for queries than to search engines, those sites are under the same sink-or-swim pressure that they would be under if demonetised search engines were dominant.
If a site doesn’t appear on a Google results page, it dies. If a site doesn’t appear on your results page, people switch back to Google to search for it.
As a new search engine you’re going to be faced with this sort of thing for a long time before users begin to trust you. If a user actually wants to find a commercial site but simply can’t, they’re not going to stick around for long.
> Appearing lower in the results than non-monetised content is very different from not appearing in the results at all.
I'd say it's only technically different unless there are only a handful of results. It probably might as well not appear in results if it's past the twentieth position. On Google, click through rate is already down to ~1% by the tenth result.
I wish you luck and I hope you succeed, but you make it sound much much easier than what it would be.
First of all, you're going to drown in hardware costs if you run your own hardware. If you run on AWS, you will be the largest AWS customer. Two years ago, when Google was still displaying result counts, I got 1.3 billion results for "sushi"[1]. This means that if you use a reverse index to look up your results, the "sushi" entry will be ~19 GiB large, assuming you use UUIDs. If you think 90% of this is spam, and you only index the non-spam (detecting spam/SEO is far from trivial, but let's say you figure it out), you still need ~2 GiB just for mapping "sushi". With 755,865 words in the English dictionary, according to wikipedia[2], you'll need ~1.5 PiB (yes, pebi/peta, 1,536 TiB) just to store relationships for English pages. This is assuming you don't support other languages, you discard 90% of pages, and you don't cache the content of pages for re-indexing.
In addition to this, you also need to store the metadata for each page (vote counts from your voting system, whether it's serving different content, etc...). The order of magnitude has to be O(100 TiB), from my conservative gut feeling (still assuming you discard 90% of the web, and I'll assume you aggregate the metadata at the domain level, not per individual page).
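For reference, here is the arithmetic behind the index numbers above (16-byte UUIDs per posting-list entry and the 90% spam discard are the assumptions already stated; it also assumes every word's entry is as big as "sushi", which is deliberately pessimistic):

```python
UUID_BYTES = 16
sushi_results = 1_300_000_000

sushi_entry_gib = sushi_results * UUID_BYTES / 2**30   # ≈ 19.4 GiB for one keyword
kept_entry_gib = sushi_entry_gib * 0.10                # ≈ 1.9 GiB after discarding 90%

english_words = 755_865
index_tib = english_words * kept_entry_gib / 2**10     # GiB -> TiB

print(f"{sushi_entry_gib:.1f} GiB, {kept_entry_gib:.1f} GiB, {index_tib:,.0f} TiB")
# ~19.4 GiB, ~1.9 GiB, ~1,430 TiB -- i.e. on the order of the ~1.5 PiB above
```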
The second challenge is your ranking. Now that you've become the dominant search engine with your awesome ranking system, you will become the main target for swaths of motivated click-farms which are exploiting workers from low income countries. They will be trying to register accounts, vote and game your ranking. You can most likely detect this behaviour, but their behaviour will be very similar to a significant portion of your real users. So you'll be fishing in a pond with a rocket launcher, and some of your legitimate users will be collateral victims. Otherwise, you'll spend most of your time playing a cat-and-mouse game with the SEO spammers instead of improving your search engine and fixing bugs.
I also fall into the "I could rewrite that in a weekend" trap sometimes, but for a search engine... I would love to see decent competition, but it's near impossible.
I was just idly threatening to do something, not actually starting a venture to do it.
But, in any case, let's go with your numbers for running costs: in 2024 money, what is your estimate of the running costs?
I ask because there have been a number of new search engines pop up, and they have nowhere near the expenditure power of google, yet they still have devoted followings.
> The second challenge is your ranking. Now that you've become the dominant search engine with your awesome ranking system [snipped problems that follow]
TBH, that's the best kind of problem to have. Let's become dominant first before we say there's no point in becoming dominant.
Regarding your question of cost: if you buy the cheapest hardware possible and manage it yourself, because you don't want to pay AWS' premium, then for storage (again, assuming you discard 90% of the pages and only do English) you'll be at at least €80k of upfront cost (12 × €3000 disk servers with 300 × €150 HDDs); then, if you colocate at Hetzner, just for storage, you'll be at €500/month plus ~€3k of electricity that you need to pay yourself.
My gut feeling is that this (only 10% of the English-speaking web) would cost ~€150k upfront and €5k/month to run, assuming you buy the cheapest everything and do your own sysadmin work. And this is not forecasting growth, serving ads, etc...
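For what it's worth, the upfront storage figure checks out:

```python
servers = 12 * 3_000   # €36,000 for the disk servers
hdds = 300 * 150       # €45,000 for the HDDs
print(servers + hdds)  # €81,000 -> the "at least €80k" upfront figure above
```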
Absolutely spot on. Additionally, it's worth mentioning that a lot of content is now locked behind a few major platforms (eg. Facebook, LinkedIn, Medium, YouTube, etc.) or CDNs like Cloudflare, which often block crawling from non-Google IPs or well-known search engines.
While the other costs mentioned here can be optimized with current hardware prices and a good database, anti-crawling measures necessitate thousands of IPs/proxies, making the process even more challenging and costly.
> Additionally, it's worth mentioning that a lot of content is now locked behind a few major platforms (eg. Facebook, LinkedIn, Medium, YouTube, etc.) or CDNs like Cloudflare, which often block crawling from non-Google IPs or well-known search engines.
I think this is fine. If I want to find something on one of those big sites I just go there directly. However if I want to search the web for a site I’ve never been to before then I’m stuck with the bad results of the current search offerings. It’s quite depressing!
Kagi obviously manages it. The Internet is big, but most of it is spam, which you can discard. You don't need relationships between all pages, just all websites. You should track the quality of the website, not each page. Google result counts are fake.
I don't know how Kagi manages it. I suspect that, like DuckDuckGo, they index a little bit on their own and use GBY [1] as a backup. According to Seirdy[1], they use Brave in the background. Brave just burns their cryptomoney to build a search engine. Don't get me wrong, I like what they're doing, but it was easy for them to start since they bootstrapped from Cliqz' index[2].
And for the second part, you do need to store the relationship between keywords and pages; that's what I was talking about. You cannot store a relationship between "types of water" and "reddit.com"; you need to store it between "types of water" and "reddit.com/r/hydrohomies/...".
But that web doesn’t exist anymore. The dominance of Google means all the creators and authors went elsewhere. Filter out the crap and you’ll realise that there’s very little substance left.
> But that web doesn’t exist anymore. The dominance of Google means all the creators and authors went elsewhere. Filter out the crap and you’ll realise that there’s very little substance left.
Yes, and that signal/noise ratio will drop even further until some filtering gets done.
But there's a difference between an embedded ad and a native ad. You can allow websites to just add an image or a section of an ad that does not hit a third-party endpoint.
Native ads are not as intrusive and if they become intrusive, like a popup, you can penalize that.
Love this. Aligns with some of my recent musings.
Even better, it would be largely immune to certain recently discovered problems with AI, including ingesting AI-generated content.
There’s quite a lot of dark patterns to watch out for. One is including text on a page meant to be read by your crawler but which is hidden from users by various CSS tricks. To get around this you’d either need to specifically blacklist certain CSS / DOM structures or do something very radical: render the page and then attempt to OCR it back into text to match against the original text.
> One is including text on a page meant to be read by your crawler but which is hidden from users by various CSS tricks.
Shouldn't matter too much, I think. If the score starts off perfect, and can only go down with each thing found (i.e. get penalised), then the problem with adding human-invisible content is irrelevancy, which you are letting humans score anyway.
Remember when Google said they would penalize websites that use dark patterns, like Quora, Pinterest, Twitter, etc.? Google can easily penalize those websites for their dark patterns so that websites and blogs by actual people rise to the top, but they don't.
I don’t have any proof. I’m just taking it as a given that this is what they’re doing because those sites are still engaging in cloaking despite Google making it clear they have a policy against cloaking.
Is their cloaking policy still active? Felt like they enforced it at the beginning, but when big sites started using the same dark patterns, they stopped? If it’s an official policy still, would love to hear why they selectively enforce it.
Each char of css scores 1 bad, a char of js is 100 bad, one small advertisement is one million bad, a cookie banner is 5 million bad, news letter modals 123 456 bad, links and embeds pointing at big tech 100 trillion bad.
The pro version comes in 3 kinds: 50 cent, 10 usd and 100 usd per search. For this money you get one or more employees modifying the search results in real time. Pressing the "work harder" button increases the fee by .50,10,100 respectively. You can press it as often as you like.
If you allow logged-in users to down-rank a site, you already did the hardest part, which is to differentiate bots and spammers from real users.
From there, you can easily add positive feedback as well, IMO.
But the first part is really hard. Maybe it needs to piggyback on some centralized government-issued number, which is sad, but maybe it's the only solution.
I'd love some biometric hash, or similar, but those decentralized IDs have been a promise for years and have yet to catch on, for whatever reason.
Counterpoint: I'm spending thousands on stuff like APIs to graph out data for my users and would like to be compensated.
The existing system could work with tweaks. I like metrics like time-on-page, load times, and "domain authority". With the right weights on the algorithm, real practical guidelines and a robust system of manual actions (hopefully as transparent as possible) I believe something like Google can last for as long as the internet does.
I used to play with this, a long time ago. I set up Apache Nutch on my home Linux Desktop system to crawl about 20 of my favorite and most useful sites. That was fun, and I was playing with some knowledge management ideas.
If I try that again, I would probably use a Hetzner server and not run it at home. I would probably still stick with spidering only about 20 web sites.
If this search engine ever gets popular, it will just push advertisers from ads to product placements.
That's it. Good luck writing your own algorithm that detects product placements from genuine human experience. (It would need an AI that is smarter than the average human.)
"Product placement" sounds vague/subjective - are the posts on Hacker News 'product placement'? I think, on average, most people would prefer a form of product placement to privacy-abusing, experience-debilitating, irrelevant ads.
> "Product placement" sounds vague/subjective - are the posts on Hacker News 'product placement'?
This is my point. You can't write an algorithm to filter them out.
> I think, on average, most people would prefer a form of product placement to privacy-abusing, experience-debilitating, irrelevant ads.
Perhaps. Personally I'd like my ads to look like ads instead of like content. I'd rather watch a 15-sec irrelevant ad before a tutorial than a video titled like a free tutorial that you need to watch halfway through to realize a paid plugin is needed.
> I'd rather watch a 15-sec irrelevant ad before a tutorial than a video titled like a free tutorial that you need to watch halfway through to realize a paid plugin is needed.
If I'm going to be wasting my time, I'd rather it be a single direct waste that I can downvote on search results and even block the creator from my future searches, rather than having my time wasted across all videos, forever. (Additionally without ad-driven algorithms videos might be shorter and more to the point.)
> If this search engine ever gets popular, it will just push advertisers from ads to product placements.
For me, that would be an improvement over the current 'cover content with advertisements', 'interstitials', 'user-tracking', 'privacy-abusing', 'displace legitimate search results' situation we have now.
> That's it. Good luck writing your own algorithm that detects product placements from genuine human experience.
Pretty trivial to detect if the content-author (whether ChatGPT or human) is going overboard. A single page exhorting the benefits of 20 different products?
Unless that page is performing a comparison or review, pretty easy to pick it up just from the links alone. And if it is performing a review or comparison, it's going to get excluded from search results for people not looking for reviews or comparisons.
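A rough sketch of "from the links alone" (the affiliate-parameter hints and the threshold of 20 are made up for illustration, not a tested rule):

```python
from urllib.parse import urlparse, parse_qs

AFFILIATE_HINTS = {"ref", "tag", "affid"}   # common affiliate/tracking parameters

def looks_like_product_spam(outbound_links, max_products=20):
    """Flag a page that links out to an implausible number of distinct shops."""
    promoted_domains = set()
    for link in outbound_links:
        parsed = urlparse(link)
        if set(parse_qs(parsed.query)) & AFFILIATE_HINTS:
            promoted_domains.add(parsed.netloc)
    return len(promoted_domains) >= max_products
```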
> For me, that would be an improvement over the current 'cover content with advertisements', 'interstitials', 'user-tracking', 'privacy-abusing', 'displace legitimate search results' situation we have now.
Now, we have the possibility to separate ads from real content.
With product placements, there isn't any way to do that, and you are forced to absorb all the content to figure out what is actually ad and what is real content, wasting your time endlessly.
I've long felt that if we separated the problem we could open the floodgates for a lot of innovation. Crawling should be its own thing. That is much more easily distributed and decentralized.
We could then freely experiment with algorithms and strategies on that open database.
> There is an SEO-proof way to determine what the ranking of a site should be
This comment deserves a "my sweet summer child". Everything can be gamed; if you don't see how, then you really shouldn't be suggesting ideas.
How do you define an advertisement? How do you prevent downvote spam (if captchas worked, there wouldn't be bots/astroturfers on the Internet)? How will you fund your search engine?
But many ads and much of the crap today are actually not traditional ads. Don't they count? If not, then GPT-generated text ads or whatever will become more prevalent.
> This comment deserves a "my sweet summer child".
I think being condescending really removes any impact from any point you want to make. After all, I posed this as an idle musing, not as an all-in venture, didn't I?
> Everything can be gamed; if you don't see how, then you really shouldn't be suggesting ideas.
If people are gaming the engine by avoiding all the penalties, then that's the exact sort of gaming I would like to see happen.
I mean, you have read through the rest of the thread, here, right?
"This sort of ranking cannot be gamed by SEO spammers, because ... why get their site to the top of the rankings if they can't put any advertisements on it?"
Because it links to pages with advertisements on them?
There are easier and more objective methods for ordering search results.
Alphabetical, chronological, relevance, etc.
People who grew up in a world before Google used these extensively. They were everywhere. I used these in libraries for decades.
"Developers" want to pretend these methods were inferior.
When I think about people trying to game alphabetical business listings by naming their business "A-1", it seems delightfully quaint compared to the sociopathic "dark patterns" of software developers.
"Rankings", a subjective measure, powered by a secret, ever-changing, hand-tweaked algorithm, are superior because they can be used strategically for commercial gain by intermediaries like Google.
When the intermediary becomes wealthy and influential enough, acting as gatekeeper to vast amounts of information, then the other, objective methods might seem inferior.
Google accelerates the death of objectivity. It's an advertising company, not an academic search think tank.
I had to do this a while back and it nearly made me punch my monitor. The lowest circle of hell for the person/people who decided to put this basic functionality behind a fucking arcane flag. Few things make me more mad than having my finite life wasted hunting down solutions to problems that have no reason to exist.
In Firefox, there is a way that doesn't require a flag: create a bookmark with the url, and set its "keyword" field to the trigger word you want for the engine.
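For example, using the udm=14 parameter from the article, a bookmark whose location is `https://www.google.com/search?q=%s&udm=14` with keyword `gw` lets you type `gw some query` in the address bar and get the plain web results.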
That works in Chrome/Edge too. (Even better, Edge matches URLs with %s very well, and the URL bar offers past searches as suggestions better than Firefox does.)
They removed most bookmark and search engine settings, but bookmarks with keywords work fine at the moment, for both major desktop browsers.
Also, you can store them in a file(!) and import them for a nice cross-browser experience, keeping track of them, and not losing them.
I solved this Google problem by paying someone else for search. I have switched to kagi.com, a so-called, "paid, ad-free search engine".
I have not yet found one drawback to using it. Unlike previous attempts to switch from Google, I have not ever felt like I was getting substandard answers or thought of switching back. It's been months.
And, it is lovely. The lack of ads and of any manipulative motivation makes me very happy. It has other features that are nice.
Same, it's crazy how easy of a W paying for Kagi is. For a lot of the privacy focused "Google alternatives" you're typically getting privacy in exchange for worse results but on Kagi it's the opposite. They have the !g bang like DuckDuckGo but unlike DDG I've never reached for it. I learned really quick that if Kagi had bad results that Google had even worse ones.
I have low willingness to setup an account and high suspicion of subscriptions. But Kagi was promising hearing protection for Google's ceaseless jackhammering.
My fingers seemed to setup an account on their own and then I couldn't send Kagi money fast enough. My Google inflicted migraine has been dialed down to 3 or 4.
The downside to having my senses back is that Microsoft's repulsive creepiness is suddenly everywhere. I can't keep MS CP+ and gropey Edge out of my personal desktop space. As if their passive-aggressive Azure/360 wasn't enough abuse.
I have been using Brave Search for a few years now and it's fantastic as well. I don't feel the need to use Google anymore and even if I do for images, I just add a !g to the search query and it automatically redirects me to Google.
Honest question - why still use Google at all? Whenever DDG became good enough (seven? ten? years ago?) I used it exclusively. Lately I use a mix and have moved on from DDG, but I still never went back to Google and don't understand how people can tolerate it; I find the results really bad.
> Honest question - why still use Google at all? Whenever DDG became good enough (seven? ten? years ago?) I used it exclusively. Lately I use a mix and have moved on from DDG, but I still never went back to Google and don't understand how people can tolerate it; I find the results really bad.
To me, DDG's results seem relevant-ish, while Google clearly knows my profile and what I'm looking for. I get more relevant results with Google than I do with DDG. That of course comes with the price of my search history and other behaviour online, but I'm okay with that.
Google also comes with some features that are convenient when searching. Doing a "take away [town]" search, I get a minimap with pins so I can see the takeaway places near me; the menus are easily available, as well as information about opening hours. DDG just gives me an (apparently) unordered list of restaurants in the town I put in, and with my sample search, the first two results don't actually provide takeaway.
Google to me is a convenient choice. I like convenience.
Odd. I don’t get relevant results using Google at all. Sometimes I get results that are in the same topic, but I can’t remember the last time a Google result gave me a result on the first page without a VERY specific query.
Kagi knows what I want (and don’t want) because I get to tell it. My preferences are what I tell it, not what it surmises from my activity.
For many of my searches DDG seems to completely ignore one or more of my keywords, usually giving me something more popular but less relevant. Almost every time when I try the same search on google it works.
Keep in mind that if you take only a specific type of searches to Google, in this case the ones DDG struggles with, it may just be that Google is good at that class specifically (like natural language or a query doing well with long literal string matching or something) and that makes it appear more competent while others, using Google all the time, are annoyed by it a lot of the time
I get the opposite effects on google. As an example, I tried to search for the documentation on watcom’s wasm assembler. It kept giving me results for web assembly no matter how I modified the query nor the Boolean search operators. To me it’s a symptom of search engines trying too hard to be smart and predictable.
DDG is not as good as Google. I made an experiment now. An Italian company I never heard about sent me a commercial email this morning (I'm in Italy.) I looked for it with Google: first result. I looked for it on DDG: not even on the second page, then I stopped looking.
DDG is ok-ish for technical searches, but I remember that I often run those searches again on Google. For anything else it seems confined to the USA. By the way, I'm not logged in into Google even if it probably knows where I am. If DDG knows where I am, it doesn't care or doesn't know how to serve more relevant results.
I've been using DDG for years but I find myself unconsciously appending "!g" to almost all my searches. DDG results are just really not good, especially if I'm looking up something local.
I switched to Ecosia, but I’m basically in the same boat. What I’m finding more and more often is that I need both Ecosia and Google results to actually find what I’m looking for.
Because both places give me so much useless crap in the results. Hell… I wanted to buy two Nerf guns recently, and I ended up having to go to a 3rd-party site called pricerunner.dk because both Google and Ecosia kept giving me shit results featuring none of the Danish toy stores I might actually use. That pricerunner site is terrible by the way, so it speaks volumes when it's more useful than Google search. I'm not sure why Google search, for instance, will link me to Amazon.com or Amazon.uk and not an Amazon front within the EU where I won't have to pay import taxes, as an example. I have German listed as a language I speak, so Amazon.de would seem obvious. Not that I'd ever buy anything from Amazon if it was available from a Danish web store, but it's an example of just how bad it is.
Finding information was the first thing to disappear from search engines, but now it seems that I may as well bookmark a bunch of stores and go directly to them to begin with. Maybe it's better in the US but it's ridiculously bad in Denmark.
Pricerunner is not great, but I'm in the same boat. I find myself using it more and more, because it's pretty much the only good way to find products while filtering out the most ridiculous storefronts, like Fruugo or VidaXL, at least in Denmark.
I switched to Ecosia two years ago, from DuckDuckGo, and I really don't find myself going back to Google all that often. There are some searches where I figure something should have shown up, but mostly Google can't find it either.
Ecosia started mixing in Google results a few months ago, but I don't feel like it improved the results, if anything it's perhaps a little worse. That's anecdotal and not based on any actual tests though.
Kagi does look more and more interesting, because the issue seems to be the focus on ads and pushing stuff for people to buy. I just feel like Ecosia should be doing more to limit purchasing, given their environmental focus.
I think you're right about local because small business owners have an incentive to optimise for Google local results in a way they don't for DDG or any other search engine.
Generally though, I find DDG good enough for everyday use. However, a friend of mine once told me she looked up a breed of bird on google for more information and I could barely contain my laughter when I suggested she should have used Duck Duck Go instead. The strange look I got in return reminded me that, for 99.9% of people, Google is 'Search' and that's the end of it.
Weird, I never ever use Google. When I do out of desperation (like when DDG was down two days ago) I find its results atrociously bad to the point of being almost useless.
Not to discourage you from paying for Kagi, but I use https://github.com/iorate/ublacklist to remove Pinterest and some other domains from my search results and it works perfectly.
DDG is my default search engine, but there's some categories of things it's still not great at and I fall back to google (with some ublocking to disable "features" I don't want) for that.
`||google.*/complete/search$xmlhttprequest,important` turns off the autocomplete, for example.
Try using Google like an average person. I Googled "best places to visit in Peru", and Google gave me a little drop down right there on the page that listed a few, without me having to go elsewhere. DDG did not.
Google wins. Even if it didn't, at this point DDG would have to be a lot better for the average person to switch (with that name, especially)...it just isn't.
HN is such an echo chamber with this that sometimes you might forget that Google is the best search engine for a normal person. Maybe they will footgun it with AI, though.
You are just used to Google; it doesn't mean it's better than the others. I switched to DDG several years ago, and now I'm used to it. Sometimes when I go back to Google I think the results are just bad, because I'm used to how DDG works now. Google always just shows you what you'd like to see, and this is not good at all; you just live inside a walled garden.
Worth emphasising, IMO. People are conditioned to Google and its quirks, and that can make alternative engines feel a little less intuitive. Google is also often criticised for overrunning verticals with its own offerings, my favourite example being celebritynetworth.
Beyond the 10 blue links, in certain niches organic results have been pushed down the page and placed under G-specific results (and when I say certain, nowadays it's most).
Celebritynetworth is an example of this: a site had some unique content, G apparently realised that lots of people search for such stuff, asked the owner for an API, and then eventually scraped it. That's a rough version of events; more details below.
There are lots of examples of well performing sites/niches where similar has happened.
You're better off using Google Maps for this use-case. At some point, the heavy query rewriting becomes unbearable.
The end result of the algorithmically generated SEO spam which Google keeps indexing is that fewer people in enterprise environments will ultimately jump through the hoops to change their default search engine in edge.
It is the same thing as discussing BSD/Linux distributions as alternatives: that same average person goes down to the shopping mall in the city center and buys whatever is on display in the computing stores, after getting some counseling from a high school student working part-time at the store.
The entire article is about how there is a new and easy way to get rid of Google's "little drop down right there on the page that listed a few, without me having to go elsewhere."
Yeah, recently I was researching some articles on the Beatles, and I used Google search to find the lyrics - they were wrong, which I knew because I knew the right lyrics; I just needed to copy-paste them.
But really, I think the days when Google was the best search engine for the normal person are gone, because Google will authoritatively return stuff that is wrong.
The normal person is by definition not a subject matter expert and cannot know if what they got from a search is wrong without further research.
Not to mention that Google is also becoming hostile to VPNs. I always run my phone through one and Google w/ FF Mobile has just been putting me in the captcha loop.
Happy Kagi customer here and pushing it where I can. The results are much better than DDG or Brave search, which I both tried for a few months, and it's got some really nice built-in tools to bump domains up or down in results, block domains, etc.
They are also better than Google search. When I was using DDG, I still used Google from time to time if DDG's results were not helpful. After I switched to Kagi, I did the same at first but soon realized that when Kagi gives you bad results, Google's are even worse.
It won't be long before Google removes or butchers this feature. Like the feature that used to exist where you could easily block domains from being returned in a search - I guess it was too popular, as this "web" view likely will be too.
Unfortunately it seems that Safari hijacks all google.com/search?q= URLs and strips all query params (or at least udm) except q, when Google is set as the default search engine!
Even pasting https://google.com/search?q=apple+sauce&udm=14 into the URL bar doesn't work. The only way I was able to get this to work was to set DuckDuckGo (or anything non-Google) as the default search engine, then use the extension "Smart Keyword Search" to set up a shortcut. I have !g <query> for normal Google search and !gw <query> set up for Google "web" search.
Is it worth it? Should we fight against the search engine with workarounds like this? It looks like AI is just another problem with Google Search. But maybe I don't feel this pain like you do. I fixed them all about a year ago by switching to Kagi and never looked back.
Well, maybe at the beginning a little bit, but I like Kagi results more than Google now.
It gives some signal that users want a simple web search, but ultimately the HN crowd is a small minority, so it'll be statistically insignificant. I wonder how many strings were pulled to get the web search in in the first place.
I don't mind the info boxes... I get mad at Google when I input two words into the search field, press enter, and the first few results don't include one of the words (50% of the search! even with the "show only links which include..." option), and it then uses synonym results for the second word, which gives totally wrong results.
Adding this udm=14 param does seem to do a better job with many queries I tried. Tried "python oracle extension". Without it, the top results are all for cx_oracle, which is deprecated and has a different name now. It's still #1 with udm=14 added, but the right answer jumps up where it's at least visible on the page without scrolling down.
> I wish you could make verbatim search the default option.
You could create a local html file containing a form that posts to google, then add a minimal amount of js to intercept submission and add the quotes, and set it as your new tab page. I do this myself for a site:-specific search.
It infuriates me that exact match searches don't work anymore. I understand that for the average person/average query, the modern google results are better (with info boxes, AI type stuff going on), but please let me type in "I want this exact phrase, including punctuation!". I don't even care if it tells me there's nothing found and then makes some suggestions below that.
Google has restored 'verbatim' search, as a drop-down beneath "all results." If you'd prefer to add it to a bookmarklet, the URL parameter is tbs=li:1.
I made something that achieves the same thing[1] except it uses Google's Programmable Search Engines. It's self-hosted and you can configure it the same way as the article using the `%s` string substitution.
I like the web feature, specifically when you are researching. I can understand the "hiding" at this point, as I guess they are still figuring out the best place for it. Sundar Pichai said in some interview that generative search only appears on some specific queries. If that is true, and they have some logic to determine which experience to show (I think I might be wrong here), then I would love to see the "web" option as the default, or in the top strip, for research-type queries where it's most needed.
For Firefox I recommend installing an extension, "search engine helper", and following the steps in its help menu. This will add a modified Google to your Firefox settings. You just need to select that as the default before using it.
I also use the Vivaldi browser, and the trick from the article (adding %s&udm=14 to the search URL in the search settings) works just as advertised. Such a pleasant little QOL improvement.