
I'm just glad I'm not alone; sometimes I second-guess whether I'm the problem.

Currently, just to get basically workable results, I find myself putting "every" "keyword" "in" "quotes" (to make sure they actually appear in the result pages at all), using the site: modifier to restrict results to real sites, and adding negative modifiers (-"keyword") to filter out some SEO results.
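A typical query for me now ends up looking something like this (the site and keywords here are made up, just to illustrate the pattern):

    "openbsd" "pf" "macros" site:man.openbsd.org -"best vpn routers"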

Google used to be "magic" in that it knew what you were thinking, and gave you what you wanted instead of what you asked for. These days it's just page after page of auto-generated results, pages that don't contain anything relevant to your query, or just low-quality results.

I'm not going to pretend I'm an expert in Google's search, and I'm sure they're meeting some metric or another, but from my perspective things have gone seriously downhill to the point where I am looking elsewhere.

It used to be THE technical search engine du jour. Now it feels like a search engine you have to hack to get it to work well. Not a good place to be.

PS - I have read in HN comments (so pure rumor) from self-proclaimed ex-Googlers that Google's internal culture punishes people for improving/maintaining existing products, and that promotions come from developing new products/features. If even semi-accurate, that might go a long way to explaining why Google Search feels neglected, aside from changes which seem to exist to improve their bottom line or promote sister products.




The problem isn't just SEO, it's that Google itself aggressively rewrites queries to produce more results (which I suspect they want to do to show more "relevant" ads).

On the most extreme end of this, I've seen four-word queries produce results in which three of the words were struck out. More often it's just one word, but it's usually exactly the one that makes the difference between a very specific query and a very generic one.

Worse yet, they try to do synonym substitution, but their algorithm has a ridiculously low bar for it. Like, you might be searching for "FreeBSD", and it will swap in "Linux", or even "Ubuntu". Or you search for a specific firearm model, and it finds "gun".

Quoting keywords suppresses all of that, but synonyms would actually be useful - if it handled them accurately...
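To make the "low bar" concrete, here's a toy sketch of my guess at the mechanism (not Google's actual code): if substitution is driven by embedding similarity with a low threshold, related-but-distinct terms get swapped for each other.

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Made-up vectors standing in for learned word embeddings.
    # Related OS terms plausibly sit close together in embedding space.
    freebsd = np.array([0.8, 0.5, 0.1])
    linux   = np.array([0.9, 0.4, 0.2])

    THRESHOLD = 0.7  # hypothetical, deliberately low bar
    if cosine(freebsd, linux) > THRESHOLD:
        print("rewriting 'freebsd' -> 'linux'")  # the over-eager behavior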


I left Google in 2010, so this is just a wild guess, but I suspect a big part of the issue is that learn-to-rank is probably being trained on everyone's searches. It would probably do much better if they used the presence or absence of search operators as a simple heuristic to separate power-user searches from common searches, and trained a separate ranking model on the power-user searches.

Maybe they're already doing this, but it sure acts like learn-to-rank is always ranking pages as if the query were very sloppy.

It's been a long time, and I certainly never read the code, but I vaguely remember a Google colleague mentioning something (before learn-to-rank) about a back-end query optimizer branch that would intentionally disable much of the query munging if there were any search operators in the query. There was some mention about using cookies / user account information to do the same if the same browser/user had used any search operators in the past N days, but I'm not sure if that was implemented or just being floated as a useful optimization.
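If something like that existed, it might have looked vaguely like this (pure speculation on my part; the regex and names are invented):

    import re

    # Hypothetical heuristic: treat any query containing search operators
    # as a "power user" query, and route it to a stricter ranker that
    # disables query munging (synonym substitution, term dropping).
    OPERATOR_RE = re.compile(r'"[^"]+"|\bsite:|\s-\S|\bOR\b')

    def route(query: str) -> str:
        if OPERATOR_RE.search(query):
            return "power_user_ranker"  # no rewriting, honor the query as typed
        return "general_ranker"         # full query rewriting enabled

    print(route('"freebsd" site:man.freebsd.org'))  # power_user_ranker
    print(route('how do i fix my printer'))         # general_ranker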


Google image search also doesn't search for duplicates, but for whatever object the ML recognized in the picture. For image search I switched to Yandex.


I wonder if this synonym substitution was the use case that led to the invention of word2vec.


I think it has to do with shifting expectations. All of us who use the Web seriously, and have been for years, want a full text search engine. The average user wants what Ask Jeeves promised to be: something that takes vaguely question-shaped queries and spits back a fast answer. Or a glorified URL bar to outsource memory and effort.


No, you don't want a full text search engine. If you think you do, you don't remember the pre-Google world. It was impossible to use the older search engines to find a reasonable explanation of a common topic, because to AltaVista and the other search engines of that era, every page that contained a given term was considered equal to every other page, and they would give you all of them in effectively random order. You could add lots of AND and OR to try to exclude what you didn't want, and this might cut you down to 40 or 50 pages to go through to maybe find what you want.

But when Google first came out, it was a shock. You could just search for something like "Linux", and the most authoritative sites all showed up on the first page.


> and this might cut you down to 40 or 50 pages to go through to maybe find what you want.

At least those search engines gave you that many results to go through... now Google gives you less than that, full of spam (despite the index probably containing far more), and you'll be in CAPTCHA hellban if you try harder to get to the rest.


A full text search with a good ranking is still a full text search. The point here is that Google used to do the job just fine, but no longer is.


AltaVista's killer feature was the NEAR operator.
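For anyone who never used it, NEAR matched documents where two terms appeared within some distance of each other. A rough sketch of the idea (illustrative only; AltaVista worked over its index, not raw text like this):

    # True if terms a and b occur within `window` tokens of each other.
    def near(doc: str, a: str, b: str, window: int = 10) -> bool:
        tokens = doc.lower().split()
        pos_a = [i for i, t in enumerate(tokens) if t == a]
        pos_b = [i for i, t in enumerate(tokens) if t == b]
        return any(abs(i - j) <= window for i in pos_a for j in pos_b)

    print(near("the linux kernel driver model explained", "linux", "driver", 5))  # True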


Yeah, I'm sort of surprised that there isn't a semi-popular "web grep" tool for people who would rather use regex, some understandable ranking algorithm with knobs to tweak, etc.

Of course, you'd have to read a manual to use it and it would have a ton of spam, but some people just want lower-level control - they still sell stick-shift cars.
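The ranking side of such a tool could be as simple as this sketch (everything here is invented for illustration, and assumes you already have a crawled corpus):

    import re

    # Score pages by regex hit count, with user-visible knobs instead of
    # an opaque learned model.
    def score(text: str, pattern: str, *, hit_weight: float = 1.0,
              length_penalty: float = 0.001) -> float:
        hits = len(re.findall(pattern, text))
        return hit_weight * hits - length_penalty * len(text)

    corpus = {
        "a.example/jails": "FreeBSD jails FreeBSD ports FreeBSD handbook",
        "b.example/spam":  "best cheap hosting " * 40,
    }
    query = r"FreeBSD"
    for url in sorted(corpus, key=lambda u: score(corpus[u], query), reverse=True):
        print(url, round(score(corpus[url], query), 3))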


Not just that, but the sheer scale of such an index. The size of the web now makes anything small next to impossible without a lot of funding. And probably none of the existing search engines will allow you programmatic/data access to their index without a metric ton of cash.


How is it that spiders/bots are able to "index" copyrighted content? Is it just one of those things where the ends justify the means or a holdover/tradition or some such?


It’s some combination of fair use and raw data not being copyrightable. My understanding is that it’s only the creative expression that’s copyrighted, not the actual words. So, if you distill out all of the creativity into something that’s purely information about the work, you’re probably fine copyright-wise.

There’s a long tradition of compiling and publishing concordances, which are just indices of every place each word appears in the original text. They’re generally not useful without access to the original, so nobody seems to mind them very much. Google’s index is just a modern form of the same thing.
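To make that concrete, a concordance is essentially this toy sketch:

    from collections import defaultdict

    # Map every word to the positions where it occurs. Like a search
    # engine's index, it's useless without the original text.
    def concordance(text: str) -> dict:
        index = defaultdict(list)
        for pos, word in enumerate(text.lower().split()):
            index[word].append(pos)
        return dict(index)

    print(concordance("to be or not to be"))
    # {'to': [0, 4], 'be': [1, 5], 'or': [2], 'not': [3]}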


I wonder how many pages would be worth indexing for such a tool though.


Probably a tiny subset. But the problem is finding that small subset!


Basically now we're back to forcing Google to work the way AltaVista did in 1997.


I think there are a few things to consider here:

1) Google CANNOT provide you with technical search, because the choice of index/query filters is always limited (e.g. should exact matches be preferred over multiple partial matches?)

2) Google has a responsibility to shareholders and the public. That means the service (and its 'algorithms') is adjusted towards the biggest categories of queries performed.

All of this is a constant battle between precision and recall for a given query. Adding to the complexity, Google needs to account for:

* An extraordinary number of users running searches

* An extraordinary amount of data on webpages

* The importance of authority

In smaller search engines (e.g. a shop's full-text search) you usually tune towards one use case. That in itself is already hard.

Google does that for all possible use cases and all possible queries, while still fighting the same precision/recall battle.
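For anyone unfamiliar with the trade-off, a toy example with made-up numbers:

    # Precision: what fraction of returned pages are relevant.
    # Recall: what fraction of relevant pages are returned.
    relevant = {"a", "b", "c", "d"}                # pages actually relevant
    strict   = {"a", "b"}                          # exact matching
    loose    = {"a", "b", "c", "x", "y", "z"}      # aggressive query expansion

    def precision(ret, rel): return len(ret & rel) / len(ret)
    def recall(ret, rel):    return len(ret & rel) / len(rel)

    print(precision(strict, relevant), recall(strict, relevant))  # 1.0 0.5
    print(precision(loose, relevant),  recall(loose, relevant))   # 0.5 0.75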

To be clear: I think Google is terrible, but I also think there is no other option for them at this point.

All of this became clear to me the moment I got interested in building search and relevance engines.


Presumably Google is in an arms race with the spammers, like all search engines are.

Is there a superior alternative to Google?


DuckDuckGo does something fancy with smooshing together Google and Microsoft databases with their own to create a half-decent search engine.

Cliqz has hardly anything indexed at the moment, but it actually gets relevant results from what it does have. (e.g. "zoom privacy" brings up the Zoom privacy policy first, then three news headlines from the last 12 hours, then a news article from yesterday, then an IT@Cornell guide for making Zoom meetings private, then some more news articles, some stuff about HIPAA…) I really like it, even if it isn't great for programming at the moment.

[DuckDuckGo]: https://duckduckgo.com/

[Cliqz]: https://beta.cliqz.com/



