Google’s “millions of search results” are not being served (serpapi.com)
265 points by vincent_s on Oct 3, 2022 | 125 comments



I was curious, so I tried to do this with GitHub search.

The search “Golang” returns 225k+ repositories. Each page of results has 10 repositories. I queried page 1000 to see the 10001st-10010th results and got a 404.

I searched Wikipedia (using its own search page) for “America” and was told there were 1.9M results. I requested the 10001st result and it failed with this message:

“An error has occurred while searching: Could not retrieve results. Up to 10000 search results are supported, but results starting at 10000 were requested.”

There was an HN thread along these lines about web search somewhat recently, with people making confident accusations about false advertising etc.

But it really seems that everyone does this. I don’t know, maybe it’s still wrong or misleading.

But personally I’m ok knowing “there are X hits and we will serve you some of them”.

EDIT: I just searched the Library of Congress for “god” and was told I was viewing results 1-25 of 324,782. Sure enough when I asked for page 10,000 I was rebuffed. I really don’t think this is a Google thing


It is easy to think that this is normal. And I can think of many non crappy reasons that it became so.

That said, I do think it verges on dishonest. Even if well intended, nobody can believe in good faith that there were actually that many meaningful hits on any of these searches. Nor is there really much gained by presenting such large numbers.

I can almost buy that it is intended directionally to help refine search terms. But, the numbers are so silly large that I don't. Especially as there is no way to see the last results to know exactly why they count as hits, but are not worth serving.


If you look at the number of results rather than the results themselves, you clearly care about the number. The number is the information. It can answer a question like: How many books in the Library of Congress contain the word "God"?[1]

Similarly whenever I've looked at the number of results in Google it was to judge how common something is. For example, if I want to know the most commonly used spelling between two options, I Google both and check the number. That's what the number is helpful for, and really the only thing it could be helpful for. It is its own information.

[1] I know the question might need to be revised for the number to reflect the answer accurately; you get the idea.


But that number has no way of inspecting it to know what it actually means. Is it including prefix searches? Suffix? Whole word only? Possible misspellings? Acronyms? Translations?

I get that it can be somewhat directional, but I question that anyone is getting meaningful data out of it. :(


Also, some pages are mirrored all over the internet.


You've just opened a can of worms. It raises a very interesting question: when asking something like

> if I want to know the most commonly used spelling between two options

do those mirrors count as "use" of the word? Or does "use" refer only to the original writer?

If a neologism is written once and read by millions, it carries more weight than one read by only a few people. Perhaps the same goes for spelling variants: the more people see it, the more people agree it's the typical spelling. Lots of mirroring could be a decent proxy for lots of readership.


You can in some cases compare different keywords to each other.


The number is not _the_ information. It's just _some_ information. When I search the web I want the content, and very very rarely care about the actual number of results. So yeah, showing a search result count they can't actually serve is dishonest and gives a false sense of reality.


No, in context of what I said (when the number is what you're looking at) the number is the information. If the content is what you care about, the number being 2 million or 5 million is meaningless because you'll never get through all the content.


This reads more like a sign saying "over x customers served" on a fast food restaurant. It doesn't really matter, nobody will ever care about the accuracy of the information, but it's a pride thing and makes a point about the establishment.

Obviously there aren't millions of meaningful hits on most terms, but you don't need to go to page 10001 for that. You can get to page 3 and know it for sure.


I'd argue that, to a layman, if it is obvious that there aren't that many meaningful hits, then there weren't that many hits. :(

This does call back to the odd false confidence that our industry bakes into the interview process. "Design a realtime chat program that can notify any number of followers that you posted and let them respond with sub millisecond latency."


I can’t push a single key in less than 20ms latency. Those poor interviewees are building Twitter but for bots. Unfortunately, that is probably a relevant skill…


The number of search results is plenty useful. Even for grammar! If spell checker isn't of help, use Google and see which option is more popular (I used to use it often when I wrote more in my native language, which has a few peculiarities). It can give valuable insight in popularity of different things.

I do hold the number of results trustworthy (not exact, but a good estimate). Having the index in the Inverted Index format, and efficient joins (solved problem), it's just about getting the size of the term list and summing over the shards. Something that Google anyway has to do for the regular retrieval.
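
To make the "size of the term list, summed over the shards" idea concrete, here is a minimal sketch. The toy documents, `build_shard`, and `count_hits` are made up for illustration; a real engine handles multi-term queries, deduplication, and approximation very differently.

```python
from collections import defaultdict

def build_shard(docs):
    """Build a tiny inverted index for one shard: term -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def count_hits(shards, term):
    """Single-term hit count: sum the posting-list length on every shard.
    No ranking and no document fetching is needed for this number."""
    return sum(len(shard.get(term, [])) for shard in shards)

# Hypothetical two-shard corpus.
shards = [
    build_shard({1: "go is fun", 2: "golang tools"}),
    build_shard({3: "golang compiler", 4: "rust compiler", 5: "golang golang"}),
]
total = count_hits(shards, "golang")
print(total)
```

Each shard only reports one integer, which is why the count is cheap compared with actually retrieving the documents.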


right, I would understand "found 10k results but can only show the first 1k, please refine your search".

That's reasonable. But as you said, what they're doing now is bordering on dishonest.


Library of Congress is quite explicit about this on their first search page, rather than showing an irrelevant number. So, having limits isn’t unusual, but not everyone is deceptive about it.

You Searched: ALL: God

Your search retrieved more records than can be displayed. Only the first 10,000 will be shown.

Titles List: 1-25 of 10000 https://catalog.loc.gov/vwebv/search?searchArg=God&searchCod...


Yeah - as someone that has run production search clusters before on technologies like Elasticsearch / OpenSearch, deep pagination is rarely used and an extremely annoying edge case that takes your cluster memory to zero.

I found it best to optimize for whatever is reasonable and useful for users, while also capping any seriously resource-intensive but low-value queries (mostly bots / folks trying to mess with your site) at some number that will work with your server main node memory limits.


I guess they have multiple search interfaces.

I navigated to loc.gov and searched directly from the home page.

I was sent to this page: https://www.loc.gov/search/?in=&q=god&new=true&st=

At the top, it says “Results 1-25 of 324,782” and does not mention the limit of 10,000 anywhere.


It does once you get to the 4,001st page of the results (which is actually 100k items): "Sorry! We can't process this request. This request exceeds the maximum search results depth." The error page even includes a link to "LC for Robots" (https://labs.loc.gov/lc-for-robots/), a list of APIs, since few humans are going to get that far on their own.

The number here also seems less misleading because all 324,782 results do seem to exist. It doesn't want to generate a pagination for the entire set, but you could get to them by choosing different formats, date ranges, collections etc. The number Google reports, as far as I can tell, needs to be taken on faith.


Ahh, looks like the limit on results is 100,000 items on that page. https://www.loc.gov/search/?q=god&sp=4000

But you can subdivide the search results by date and get every item on your original list: https://www.loc.gov/search/?dates=1890/1899&q=god&sp=1808

So, they have and can show you every single one of those 324,782.


My guess is the estimate comes from term frequency index which is pretty easy to build. Estimates of that can come from HyperLogLog or similar.

Asking for page 10,000 is asking the search engine to search and _rank_ 10,000 * 10 results and give you the last one. That's very expensive and ultimately useless - search is about finding what you're looking for on page 1, not on page 10,000 :)

So it is true that there are 225k+ repositories using Golang (you can compute that with an index scan once a week), but searching them is an entirely different problem.
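
A rough sketch of why the two problems differ, assuming a toy list of `(score, doc_id)` hits (the data and the function names are invented for illustration): counting is a single pass with a running total, while serving page N means ranking at least N * page_size candidates first.

```python
import heapq

def count_matches(hits):
    """Counting needs one pass and a single running total."""
    return sum(1 for _ in hits)

def page(hits, page_number, page_size=10):
    """Return one page of hits, best score first.
    To serve page N we must rank the top N * page_size hits,
    then throw away everything except the last page_size of them."""
    needed = page_number * page_size
    top = heapq.nlargest(needed, hits)
    return top[needed - page_size:needed]

# Hypothetical hits: the score is just the doc id, so the order is easy to check.
hits = [(i, i) for i in range(1, 101)]
print(count_matches(hits))
print(page(hits, 2))
```

The work in `page` grows with the page number even though the caller only ever sees `page_size` results.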


If they cannot give you the page results, how can you trust page 1 is better than page 34557?


Huh? It's a ranked retrieval model. Each result has scored a little bit worse on their relevancy function than the one above it.

To not trust that the results on page 1 are better than those on page 34557 is the same as saying that their ranking function does not work at all, which would mean that it's at best as good as random chance. That's clearly not the case, therefore I can trust that page 1 indeed has more relevant results than page 34557.

With that said, Page 34557 doesn't exist. And that's fine. The result count estimate is not based on the actual ranking that has taken place (at least not directly). It would be an absolute waste of resources to rank that many results. If you cannot find what you're looking for on the first page, then it's much easier to reformulate your query. Easier for you because it gives you more control over what you want your search results to be and easier for google because it only needs to rank a couple hundred results instead of a bajillion.


Well, one could claim that indeed their ranking function does not work well - at least recently. Stuff that is relevant rarely is showing up on the first page as it is losing to various spam sites having articles written by AI so that it can match the typed search query and get a good ranking, but articles themselves being misleading and incorrect. I remember the days when you could really dig deep into results pages. Maybe not 34557, but above 100 you could get niche human written content on the interesting topic.


Yeah, I used to pretty frequently click through, and find relevant/useful content, some 20+ pages in on paginated web search results. Well, unfortunately the major search engines have changed their functionality such that there's usually no point in even looking beyond a couple pages (or, arguably, even looking at the first page for that matter).


If you search a famous name like Joe Biden or Donald Trump, it might be useful to read the 10,000th thing written about them, wouldn't it?


Like the others have said, Google wants you to refine your search (e.g. "Joe Biden foreign policy" instead of "Joe Biden"), rather than dumping low quality results on you. You don't have to agree with Google's rule by switching to other search engines, but if you do use Google, abide by that.


Ehhh, I write very specific search terms, with quoted elements, and I still get shockingly irrelevant results. I'd literally rather get a "no useful results" response rather than hundreds of pages of results that literally DO NOT CONTAIN the terms I typed into the search box.


If you put the search terms in quotation marks it will only show results where that term appears verbatim on the page. Also searching -keyword will exclude search results where that word is present.


> But it really seems that everyone does this

If everyone lies, that doesn't make it right to lie, no matter the technological barrier.

We shouldn't accept and normalize this behavior. If a search claims there are six million results, I should be able to see any or all of those claimed.

If Google (and the others) can't let me see all the results, be honest and tell me. "Google found 6,553,500 results. Showing the 5,000 most relevant."

Google advertises that it has x results, but there is no way to know if that is true, or a lie.

Is it really so hard to not lie to your users?


The issue is that the number of results is often an estimate for performance reasons. So the results should say something like "around X results" or "approximately X results" to make that clearer.

If Google or some other search-based service had to check every document for a match to verify that the results were accurate (to avoid false positives and negatives) then 1) search would be slow; 2) it would be expensive; and 3) the search couldn't handle many requests (as that would kill the database server).

What search engines/databases tend to do is make use of fast and efficient lookup tables or indices, then perform operations on those results (such as joining between different tables) depending on the particular search terms/options (e.g. if you are searching for a specific content type or timeframe).

There is a lot of complexity in making the numbers and results both accurate and efficient.


Because as you skip pages with most pagination implementations the DB has to scan more index keys. If they allowed paginating so far, without some kind of expensive preaggregation, you could easily create a DDoS attack.

?afterId=x will perform much better than page based pagination and doesn't have this problem as much, but it can still cause undesired io and cache activity in the DB when you start paginating a lot to unpopular content.
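
For anyone who hasn't seen the two styles side by side, here is a minimal sketch using an in-memory SQLite table. The `results` table and both helper functions are made up for illustration; the point is only the shape of the two queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO results (id, title) VALUES (?, ?)",
    [(i, f"result {i}") for i in range(1, 1001)],
)

def page_by_offset(conn, page, size=10):
    """Page-number style: the engine walks past (page - 1) * size rows before returning any."""
    return conn.execute(
        "SELECT id, title FROM results ORDER BY id LIMIT ? OFFSET ?",
        (size, (page - 1) * size),
    ).fetchall()

def page_after_id(conn, after_id, size=10):
    """afterId style: seek straight to the key in the index, then read `size` rows."""
    return conn.execute(
        "SELECT id, title FROM results WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, size),
    ).fetchall()

# Both calls return the same ten rows (ids 491 to 500).
print(page_by_offset(conn, 50) == page_after_id(conn, 490))
```

The catch is visible in the signatures: the second form needs the last id the client saw, so you can't jump to an arbitrary page number.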


> But personally I’m ok knowing “there are X hits and we will serve you some of them”.

That's not what it is, or at least that's not what they're trying to convince people it is. Otherwise they wouldn't change the numbers on the last page.

    Page 21 of about 16,890,000,000 results (0.82 seconds)
    Page 22 of about 214 results (0.95 seconds) 
I don't think other companies doing the same thing make it okay if it's not reasonably known puffery. That said I suppose it's not false advertising because you're not buying a search.


Not sure what GitHub uses internally, but Elasticsearch has a default limit of 10,000 records unless you update the index with a parameter. I imagine a lot of apps have a limit for this reason.


Loading that many results only to drop the first N is a lot of overhead... 10,000 is arbitrary, but pretty common as a limit, since no reasonable person really wants to see this far into weighted search results.


> I queried page 1000 to see the 10001st-10010th results and got a 404.

Yeah, the GitHub API also only serves 100 pages with 100 results each max. If you want more, you need to use dirty hacks like slicing the search into blocks by sorting and filtering based on the creation date or the number of stars of a repo.
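
The slicing hack can be sketched roughly like this. Everything here is hypothetical: `count_in_range` stands in for whatever "how many hits with stars between lo and hi" query the real API offers, the fake `stars` data replaces real repositories, and the cap of 1000 is just an arbitrary number for the demo.

```python
def split_ranges(count_in_range, lo, hi, cap):
    """Recursively halve the range [lo, hi] until every slice has at most `cap` hits.
    Returns a list of (lo, hi, hit_count) slices that together cover the range."""
    n = count_in_range(lo, hi)
    if n <= cap or lo == hi:
        # Small enough to page through, or cannot be split any further.
        return [(lo, hi, n)]
    mid = (lo + hi) // 2
    return (split_ranges(count_in_range, lo, mid, cap)
            + split_ranges(count_in_range, mid + 1, hi, cap))

# Fake data set: 5000 repositories with star counts 0..49, 100 repos per value.
stars = [i % 50 for i in range(5000)]

def count_in_range(lo, hi):
    """Stand-in for a filtered search that only returns the hit count."""
    return sum(1 for s in stars if lo <= s <= hi)

slices = split_ranges(count_in_range, 0, 49, cap=1000)
for lo, hi, n in slices:
    print(f"stars {lo}..{hi}: {n} hits")
```

Each slice can then be paged through normally. This does nothing about the inconsistency between servers mentioned above, so re-querying is still on you.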

Be careful though, in my experience the results are not consistent anymore as soon as you enter low star or old repository territory, probably depending on the actual API server you hit, forcing you to query multiple times to really get (hopefully) all results. :)


If you're using Elastic then it's inefficient and discouraged to allow accessing random pages (ex. requesting page 100 before having visited pages 1-99). I.e. this could be a technical limitation and there aren't use-cases that require accessing those pages arbitrarily. For Shodan, we allow random access for the first 10 pages but if you want to go beyond that then you need to go in order so we can use a search cursor on the backend to more efficiently go through the results.


I think if you jumped forward by tens, you might see some results on some of those sites. Skipping forward in a search index is exactly the same on the server side as moving incrementally, and they likely didn't want to do 10000 pages for one request.


Reminds me of a rookie mistake I made - caching.

I had a key-value store populated from map-reduce jobs. Look up a key and return the cached result. Refresh the cache every X days. With increasing workloads, each job took longer, pushing X higher and higher. So some results you'd see were days old. Not noticeable to most.

Is it possible that the "count" here is stale?


I noticed you get the thousands of results when you search logged out, incognito mode, and just a few hundred results when searching logged in with a Google account.

Maybe they are saving storage on the amount of results as they serve and track you on the web :-)


Mentioned previous related discussion is here https://news.ycombinator.com/item?id=32777737


From LOC website:

"Note about deep paging limitations

Due to the technical limitations of search engine technologies, it is not recommended that users page through a large number of result pages. If the number of result pages is excessive, it will be better to use faceting or more specific search terms to reduce the result set. Paging past the 100,000th item in a search result is not supported at this time. In some searches, responses may fail before 100,000 items."

https://www.loc.gov/apis/json-and-yaml/

This is what the OP states:

"A misconception regarding Google's search results is that all of the results are being served to the user conducting that particular search. Those 2 billion search results can't be gotten through Google's pagination, and it seems that this number is somewhat arbitrary to the search, or commonality of the keyword."

That seems accurate. One cannot retrieve the full number of results. The parent demonstrated this with some examples.^1

The pertinent question IMO is how many results can one retrieve. As someone who started using www search engines in 1993, that number keeps shrinking. IMO, this is a reflection of companies like Google seeking to commercialise the web for their own benefit. Google wants the web and paid advertisement to be synonymous.

With Wikipedia or LOC, one can retrieve more results than one can using Google. More importantly, the sorting order (ranking) is different. Not sure about Wikipedia but no one uses "SEO" for LOC. It is a curated collection, unlike the unwashed web. In the pre-Google era, people would choose the name "Acme" for their businesses so they could be listed first in the Yellow Pages. Google does not allow alphabetical listing. The reader should be able to figure out why. It's because the web is not curated. It is not a library. Google is only interested in what sells advertising.

Using Google from the command line (no cookies, no Javascript)

   "golang" 457 results max
   "god" 576 results max
   "america" 448 results max
Those numbers are laughable if one sees Google as some kind of "oracle" for open-ended questions, a gateway to the world's information, or even to the contents of the www, as many seem to do. It is not even close. It is a filter. The filter has a purpose. The purpose is commercial.

Of course Google is useful but one is kidding themselves if they believe Google is anything like a library. Libraries (public, academic) generally do not subsist on selling advertising services. Library websites and Google may use computers and similar software that have limitations but that does not mean they share the same principles. There is no reason to believe LOC would not serve all www users with more than 10000 results if the software allowed it. LOC is not promoting some results over others based on commercial objectives.

The parent may be referring to this recent HN comment allegedly from a former Google employee:

https://news.ycombinator.com/item?id=32785079

Google fans are happy if the reader conflates (a) technical limitations with (b) commercial objectives, e.g., "secret algorithms" for ranking results. Anyone using computers and software will be subject to (a) but not every entity using computers and software to assist patrons with searching its catalog (database) must engage in Google-like behaviour.

1.

YMMV, but I found GitHub would only return 90 results max for "golang" when searching from the command line (no cookies, no JavaScript).

Wikipedia caps results at 10000. This is stated on the website.



> EDIT: I just searched the Library of Congress for “god” and was told I was viewing results 1-25 of 324,782. Sure enough when I asked for page 10,000 I was rebuffed. I really don’t think this is a Google thing

This is a common problem called deep pagination, and it basically boils down to how most search systems work. When you make that search request on loc.gov, it eventually turns into a complex request for Apache Solr. Answering that request requires Solr to find all of the documents which match the query and any other active filters (e.g. document type, date, etc.), and calculate their rank in the search result set so it can determine which specific results are numbers 10,000-10,050. Solr/Lucene are really mature and well-tuned but when you think about how many millions of documents there are in that index, this is still calculating a pretty large temporary set only to serve 50 records, throw away the other 324k, and go on to the next query. Your next request might arrive a second later but get routed to a different replica so it’s a cold cache there, too. Click one facet on the filter list, and the search engine has to recalculate the matching documents again.

Don’t forget that you’re often seeing the results of more than just the metadata. You often get the best search results by treating entire books, newspapers, etc. as a single group for the purposes of ranking and then displaying it as a single entry on the top-level search. I wrote about this a while back: https://blogs.loc.gov/thesignal/2014/08/making-scanned-conte...

There’s an alternative which is considerably more efficient: cursor-based pagination, where instead of saying count=50 offset=10000 you say after=<unique ID/time stamp on the last result>. That allows the search engine to efficiently ignore all of the records which don’t match that, avoiding the need to calculate their rank, but comes at the cost of breaking the pagination UI: sites like loc.gov give you numbered lists of pages which humans like but you don’t have a way to calculate those after= values so your faster search isn’t as popular with advanced users.
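
A minimal sketch of that after= idea for ranked results, assuming hits are plain `(score, doc_id)` tuples and using the doc id as a tie-breaker. The names `sort_key` and `search_after` are made up here; this is not how Solr implements its cursors internally.

```python
import heapq

def sort_key(hit):
    """Order by score descending, then doc id ascending to break ties."""
    score, doc_id = hit
    return (-score, doc_id)

def search_after(hits, cursor=None, size=3):
    """Return the next `size` hits that sort strictly after `cursor`.
    `cursor` is the last hit the client saw, or None for the first page.
    Hits at or before the cursor are skipped without ever being ranked."""
    candidates = (
        h for h in hits
        if cursor is None or sort_key(h) > sort_key(cursor)
    )
    return heapq.nsmallest(size, candidates, key=sort_key)

# Hypothetical unsorted hits; note the tie on score 0.7.
hits = [(0.9, "a"), (0.5, "d"), (0.7, "b"), (0.7, "c"), (0.2, "e")]
first = search_after(hits, size=2)
second = search_after(hits, cursor=first[-1], size=2)
print(first, second)
```

Note that the cursor is a value taken from the previous page, which is exactly why a UI can't precompute the links for "page 7" or "page 200".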

That last part gets around to the other big factor in decisions: relatively few people do this. Most of the time, people are going to adjust their search if they don’t find what they’re looking for in the first few pages. What does do this a lot are robots which crawl every URL on a page, generating millions of permutations for every filtering option you offer. Some of those robots honor robots.txt (perhaps even correctly implemented), some have accurate user-agents, etc. but many do not and basically everyone offering a search service with a large corpus ends up having to make decisions about rate-limiting, limiting pagination depth, capping the number of filter options, etc.


So clearly as a society we've forgotten how to count ¯\_(ツ)_/¯


That's a general thing I've been noticing.

Maybe it's long COVID or quiet-quitting or was it ghosting? I know it's not gaslighting. I'm trying to keep up, bear with me.


I can't even tell what this blog post is complaining about, it seems so badly organized and written.

But it seems obvious that if "coffee" gives 2 billion results that, no, you're not going to be able to browse to page 187,398,384 to get those results. There's no use case for that for any normal consumer (as opposed to competitor, researcher, etc.). If you're capped at browsing the first 10 pages or whatever, that's entirely reasonable.

> A misconception regarding Google’s search results is that all of the results are being served to the user conducting that particular search.

That misconception lies only with the author. Nobody's being "served" 2 billion results, I don't even know what that would mean. The number of results being reported is quite obviously in order to allow users to judge the breadth of search queries. If it says 2 billion, you might want to refine. If it says 15 and they're all useless, go broader.

(It's useful for researching item popularity too, although that's been superseded by Google Trends which is built specifically for that.)


> If you're capped at browsing the first 10 pages or whatever, that's entirely reasonable.

Not to me. I’ve tried a bunch of times to look for obscure shit I’ve seen before (and know exists) but have bumped up against the limit. This is especially annoying when the thing I’m looking for vaguely sounds like a more popular topic, and so the first 10 pages are just about the more popular version.


If you can't find it after 10 pages (100 items) you likely couldn't find it even after 100 (1,000 items). It's diminishing returns, so more pages isn't the solution.

The solution to avoid the more popular version as much as possible is to exclude keywords associated with it, and/or to add required keywords associated only with the thing you're looking for. Exact string matches ("go programming language" rather than "go") help too.


Any kind of news event can be repeated by dozens of articles. It’s a really absurd system and it also means you can’t easily find alternative perspectives on events.

Searching by date range doesn’t work either since it brings up current articles as well.


For many search terms the first few pages are nothing but advertisement and sites gaming Google search. I'd love to go beyond page 10. Not with Google, even if it says "100k results found". It just doesn't bring them money to link to those pages.


Pre-Google, it was pretty common to have to dig several pages into the results from early search engines to find what you were looking for. Seems like we’re just returning to the bad old days.


Even earlier Google had lots of useful results on the 5th page. Google was better at finding me what I was looking for when they were a search company breaking into the ad space. Now they are an ad company with a vestigial search segment whose only modern purpose is to be a page to serve ads on.


> I can't even tell what this blog post is complaining about

They’re not complaining, they’re advertising their own (paid) service (which serves Google results programmatically) while at the same time being able to point out to customers why they’re getting fewer results than a regular Google search (they aren’t, because Google’s number doesn’t reflect what you can look at).


I agree that it's very poorly written. The heading is especially confusing, it took me a long time to understand that "SerpAPI" was the search term, not a Google service. It is also the name of the company he's promoting, which makes it even more confusing.

But I don't agree that it's "obvious" that "you're not going to be able to browse to page 187,398,384 to get those results." I would argue that a search result is only a search result if you can actually view it. If not, it's just marketing, or statistics.


Feels like SEO blogspam trying to capitalize on Google's currently negative public profile after the Stadia cancellation, and general negative google sentiment.

Really easy to get clicks (and upvotes on HN apparently) by complaining about literally anything Google related.


Maybe but it's from 2021.

Personally I'd prefer something like "Returned top N of 166,000 results" and easy access to those N results (first/best match, last/worst match, paginate left/right)


why do you think this? nothing in the article indicates this, and Google's questionable search results (compared to what they once were) has been a nearly constant topic of discussion for years now


> Google's questionable search results

This is exactly my point. This article spends a thousand words to say "google doesn't actually let you see every result from the billions it claims to have."

Okay cool, I'm not gonna read a billion search results anyways, that's why I ask google about a topic.

The writing is poor, the "findings" mundane, it's a marketing fluff piece that tries to convince you there's a problem and then plug their service as a means to solve it.


I’m not conspiratorial about this, but I’m confident that an article like “Google doesn’t actually serve you all the results” drives more clicks, engagement, and outrage than “literally no sufficiently large search tool gives you all the results, including Bing, DuckDuckGo, Wikipedia, GitHub, the Library of Congress, etc.”


I don't think that's obvious at all. This first came up a while ago here I think. Even after having the time to think about it, it's still not obvious.

The search page for coffee says "Page 1 of about 3,600,000,000 results"

How are we to know that we can't load page 360,000,000? Maybe it's obvious if you're familiar with search engine internals/algorithms, which the vast majority of people won't be.


> I can't even tell what this blog post is complaining about, it seems so badly organized and written.

That's because it's an ad for the service, not a real blog post.


Most people never go past page 1 of the results.


With: https://www.bing.com/search?q=dog+food+nutrition

-----

https://www.akc.org/expert-advice/nutrition/soy-in-dog-food-...

is the 22nd result, 31st, 45th, 59th, 73rd, and plenty from there on out.

-----

https://www.petmd.com/dog/nutrition/can-dogs-eat-peaches

is the 10th result, 20th, 34th, 48th, 51st, 66th, 70th, 88th, 93rd...

... and the pages start repeating from page 9 on with the previous "Soy in Dog Food?" on top.

-----

Have the search engines given up on search? Also, can they really say they're being gamed if they're serving the same results up multiple times? Seems more like they're picking winners.

This search is pretending like <40 results is hundreds of results, and someone had to make the effort to make sure that the same link wouldn't show up in the same page of results. That strikes me as a deliberate falsification.

edit:

As far as I can tell from

https://www.google.com/search?q=dog+food+nutrition

it seems like google are doing a hell of a lot better. Google is also banned from setting cookies or using localstorage on my machine, so the result isn't search bubbled. Nothing about soy or peaches, and no repeats are jumping out at me even 100 results in.


I was expecting the number of search results here to be much higher - like who cares if Google only serves the first million results out of a billion?

Very interesting to see that Google will only serve a few hundred links when they claim to have hundreds of thousands of relevant results indexed.

I'm very curious where Google is getting that count and why the reality is so different. Systematic overcounting? Suppressing hundreds of thousands of results?


The problem is generally called "deep pagination". It's extremely inefficient to compute.

Specifically, counting requires very low memory. When data is spread across 10,000 computers, all of them counting returns just 10,000 numbers i.e. 4 bytes * 10,000 = 40KB. It's easy for 1 computer to count those 10,000. Even at 100,000 computers 400KB.

Merging sorted search results is extremely memory intensive. Even with just the Id+Score pair, let's say 8 bytes. To get the 10,000th search result, each computer needs to create a list of 10,000 results, that's 10,000 * 10,000 * 8 bytes = 800 MB. For the 100,000th search result, 10,000 * 100,000 * 8 bytes = 8 GB. Or if your data grows to 100,000 computers, that's 100,000 * 100,000 * 8 bytes = 80 GB of intermediate results to process at the end.

As you can see this doesn't scale well. You're required to retain context (i.e. sessions) of the search in memory instead, and get the search engine to better coordinate across all 100,000 computers. This also has scaling limitations based on memory of the session, the number of computers, the number of sessions, and their TTL (someone can leave the search page open for a day and hit "next page" - should the sessions still be open? That's a question each search engine has to answer for itself).

The reality is, if a customer wants deep pagination, they are better suited to a full data dump (i.e. full table scan) or using an async search API, rather than a sync search API.
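
The arithmetic above, plus the merge step it refers to, as a small sketch. The 4-byte count and 8-byte Id+Score sizes are the figures from this comment; `count_bytes`, `merge_bytes`, `nth_result`, and the toy shard lists are made up for illustration.

```python
import heapq
from itertools import islice

def count_bytes(shards, bytes_per_count=4):
    """Counting: every shard sends back one integer."""
    return shards * bytes_per_count

def merge_bytes(shards, depth, bytes_per_hit=8):
    """Deep paging: every shard sends back its own top `depth` (id, score) pairs."""
    return shards * depth * bytes_per_hit

def nth_result(shard_lists, n):
    """Merge per-shard lists (each already sorted best score first) and
    return the global n-th hit, counting from 1."""
    merged = heapq.merge(*shard_lists, key=lambda hit: -hit[0])
    return next(islice(merged, n - 1, None))

print(count_bytes(10_000))             # 40 KB of counts
print(merge_bytes(10_000, 10_000))     # 800 MB to find result 10,000
print(merge_bytes(10_000, 100_000))    # 8 GB to find result 100,000
print(merge_bytes(100_000, 100_000))   # 80 GB with ten times the machines

# Toy merge: two shards, hits are (score, doc_id).
shard_a = [(0.9, 1), (0.4, 2)]
shard_b = [(0.8, 3), (0.7, 4)]
print(nth_result([shard_a, shard_b], 3))
```

The coordinator can't know in advance which shard holds the n-th hit, so each shard has to ship its full top-n list even though almost all of it is discarded.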


Well, at that point, who really cares if the content of the 1001st page is deterministic, or in perfect order? Get the first 100 or so pages right, and thereafter just request the nth results from each of those m computers. No merge and no memory explosion, you'll just get them slightly out of order.
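
A minimal sketch of that idea in Python, assuming documents are spread roughly evenly across the m computers so that global rank r sits near local rank r/m on each of them. The function name and the data layout are hypothetical.

```python
import math

def approximate_page(per_computer_lists, offset, size):
    """Approximate a deep page without a global merge.

    Each computer contributes only a small slice around local rank
    offset / m, so the coordinator handles about `size` entries in total.
    Ordering is only approximate, and entries may be skipped or repeated
    between pages when scores are not evenly distributed.
    """
    m = len(per_computer_lists)
    local_start = offset // m
    local_size = math.ceil(size / m)
    batch = []
    for results in per_computer_lists:
        batch.extend(results[local_start:local_start + local_size])
    batch.sort(key=lambda entry: -entry[1])
    return batch[:size]
```

With two computers holding scores [10, 8, 6, 4] and [9, 7, 5, 3], asking for offset 4 and size 2 pulls one entry from each and returns the entries scored 6 and 5, which happens to match the exact answer because the data is perfectly interleaved.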


You still need to filter based on the other indexes. If you search for [bitcoin mining] you don't want to find pages related to coal mining. So this data still needs to be joined.


the search term for this is intersection. The posting lists for the two terms are intersected, then the results are ranked. But there are a lot more steps in a production search engine.
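
As a toy illustration of intersecting posting lists and then ranking, here is a sketch in Python. The index contents, the scores, and the function names are all invented for the example; a production engine has many more steps, as noted above.

```python
def intersect(a, b):
    """Intersect two posting lists of ascending document ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search(index, scores, terms):
    """Return ids of documents containing every term, best score first."""
    # Start with the shortest list so intermediate results stay small.
    lists = sorted((index.get(term, []) for term in terms), key=len)
    if not lists:
        return []
    matches = lists[0]
    for posting_list in lists[1:]:
        matches = intersect(matches, posting_list)
    return sorted(matches, key=lambda doc_id: -scores.get(doc_id, 0.0))

# Hypothetical data: term -> ascending document ids, and a score per document.
index = {"bitcoin": [2, 5, 9], "mining": [1, 2, 5, 7], "coal": [1, 7]}
scores = {1: 0.3, 2: 0.4, 5: 0.9, 7: 0.2, 9: 0.6}
print(search(index, scores, ["bitcoin", "mining"]))  # [5, 2]
```

Documents 1 and 7 (the coal mining pages in this made-up index) never reach the ranking step, because they are missing from the "bitcoin" list.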

The long and short of it is if you really want the full results, just join google, join the search team, and then get enough experience so that you can do full queries over the docjoins directly. This was part of Norvig's pitch to attract researchers a while ago. For a research project, I built a regular expression that matched DNA sequences and spat out the list of all pages containing what looked like DNA and then annotated the pages so in principle you could have done dna:<whatever sequence> but obviously that was not a goal for the search team.


I used to work at Google but not in search, these are just my own guesses.

> where Google is getting that count

This is very likely a fairly accurate count of the number of pages in Google's index that "match" the search query. Basically exactly what you would expect when you see the number.

> why the reality is so different

Cost reasons. Most search engines are more or less scanning down a sorted list of pages. The further you need to scan the more expensive it is. Just like running "OFFSET 1000" is usually slow in SQL. At some point the quality of results is generally very low and the cost is growing so it makes sense overall for Google to just cut it off to prevent it becoming an abuse vector (imagine just asking Google for the 10 millionth page of results for "cat").
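
To make the SQL comparison concrete, here is a small sketch using Python's built-in sqlite3 module. The table and column names are hypothetical. The second function is keyset pagination, which the comment does not mention; it is included only as a contrast, since it needs the client to remember where the previous page ended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (pos INTEGER PRIMARY KEY, url TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    ((i, f"https://example.com/{i}") for i in range(1, 5001)),
)

def page_by_offset(offset, size=10):
    # The engine has to walk past `offset` rows before it can return any,
    # so the work grows with how deep the requested page is.
    return conn.execute(
        "SELECT pos, url FROM pages ORDER BY pos LIMIT ? OFFSET ?",
        (size, offset),
    ).fetchall()

def page_after(last_pos, size=10):
    # Keyset variant: seek straight to the row after the last one seen.
    return conn.execute(
        "SELECT pos, url FROM pages WHERE pos > ? ORDER BY pos LIMIT ?",
        (last_pos, size),
    ).fetchall()
```

Both calls return the same rows for the same position; the difference is only in how much the database has to skip to get there.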

The fact that few people realize that Google has a page limit shows how rarely people actually want more pages.


> The fact that few people realize that Google has a page limit shows how rarely people actually want more pages.

I used to (years and years ago) go past the first page pretty often, but results are so bad now that it rarely helps, so I almost never even click "2", let alone later pages. It's all gonna be obviously-irrelevant crap google "helpfully" found for me or the auto-generated spam that google used to try to fight (circa 2008 and earlier) but no longer seems to, just letting it gunk up and dominate any results you get that aren't from a handful of top sites.

So this is in part one of those "we broke a thing and now no-one uses it, guess they didn't want it!"


What's really weird is that sometimes you get results that are outright repeating on those first N pages. Sometimes, more than once.

It's almost as if it tries to pad the output to be long enough that you'd lose patience before you reach the end of "effective pagination".


The thing has always been "broken". Google has had a page limit for at least a decade.


No, by "broken" I mean "let lazy auto-generated spam take over the results almost completely". So now those of us who used to browse past page one (which, to be fair, may not have been many people) don't bother anymore.

[EDIT] For those who weren't around for it, Google used to play cat-n-mouse with spam-site operators. It'd go through cycles where results would slowly get worse, then suddenly a ton better, though never as bad as they are today. Around '08 or '09 they (evidently, I'm just judging from the search engine's behavior starting around then and continuing to this day) seemed to give up and just boosted a relatively small set of sites way up the results, abandoning the rest to the spammers.


Part of the difficulty is, if very few people are browsing to page 2, deciding what to put on page 2 becomes harder and harder.

Google has a lot of user behavior signals to decide what should be in results 1-10. Deciding if a page should be ranked 20, 200, or 2000 without any user clicks to check if you're right is really difficult.

I would bet that since 2008/9, the relative numbers of spam site operators, Google engineers, second-page searches have changed significantly.


Kagi has been working very well for me as an alternative


I find search results are frequently even worse than this, in that the first page will have nothing useful, with about three good links split between the second and third page. If I'm lucky.


If you've ever read Larry Niven's Fleet of Worlds series, there's a Bussard Ramjet with an AI programmed to hide any information that could help a hostile enemy/force find their way back to Earth.

A small cadre of humans who were raised by an Alien Race who came across a human seed ship cross paths with this Ramjet, and one of the protagonists realizes something is off when they do a query on the size of presentable search results in the astrographic/navigational dataset, and realizes that the number of starmaps the AI will produce is far smaller than the amount of space the system actually dedicates to storing said maps.

Point being, you can't trust that any system that restricts results to a subset wasn't actually designed to leave out results. Furthermore, it makes a great, plausibly deniable way to drop search results: force their ranking to 10001+.

You'll forgive me, I'm sure, if I suspect a company well known for cooperating with an anti-humanitarian regime (Project Dragonfly), and that regularly black-holes other undesirable datapoints, of engaging in less than up-front search result presentation, I hope?


This isn't the revelation you act like it is. Because of course Google hides results. They don't pretend not to, and they even inform webmasters when it happens. The Search Console calls it a "Manual action" when they do so.

More importantly, the people asking for a "censorship-free search engine" are expressing an incoherent desire. The whole point of a search engine is to take the zillions of web pages that have matching keywords, push the crap to the bottom, and leave the gold on top. A system that does this is inherently censorious. We're just quibbling over what the criteria should be.

What our world lacks is a reasonably quick way to hold Google accountable when they fail to represent the interests of the public who searches with them. The real-world consequences of their filtering decisions need to filter back to the people making these decisions. Because "just don't make any filtering decisions" isn't going to result in a usable information retrieval system.


> "censorship-free search engine" are expressing an incoherent desire

That's not really true. `grep` is a censorship-free search engine. It just reports every matching result.

Of course that wouldn't generally be useful over the web, however even with sorting it is possible to be censorship free. You just need to include every matching result eventually.

Of course you would find that generating later pages likely also becomes expensive, so you may also add a page limit and ask the user to refine the query instead. Of course then you are back to the problem that it can be very difficult to find every result, because you need to guess what words are on the page.

But all of this is basically moot because Google doesn't claim to be censorship-free, so they have a much simpler way of hiding results.


> even with sorting it is possible to be censorship free. You just need to include every matching result eventually

Do you honestly think that the people who complain about their favorite website being censored by Google would be satisfied with showing up on page 200*? I wouldn't.

It's only "not censorship" in the same sense that having your emails sent to the Spam folder isn't censorship. The spam folder, and low-scoring SERP results, are so full of items that every reasonable person acknowledges to be crap that getting banished to that area is pretty much equivalent to having someone blast your roadside protest with strobe lights and a sonic cannon. Surrounding you with so much garbage data that nobody can see or hear you any more is only "not censorship" on the dumbest technicality.

* Ignore, for sake of argument, the fact that page 200 won't even load in our universe. I'm imagining a parallel world where Google pretends to be censorship-free because they only push things far down in the results instead of removing them entirely.


"Do you honestly think that the people who complain about their favorite website being censored by Google would be satisfied with showing up on page 200*?"

My complaint has nothing to do with my favorite website. My complaint has to do with not being able to discover information and websites because Google won't allow me to dig very far into their search results. They're spidering the vast majority of the internet, and all I get are crumbs.


They're doing more than "push the crap to the bottom". They're pushing the crap to the bottom and then limiting how far you can dig into the pile. I am sometimes interested in that crap.


I agree. If you really want to see every result for a topic this system hurts you. However I think that use case is vanishingly rare. Most users would be better served by refining their query for what they are interested in than paging through hundreds of pages of results.

Google isn't designed to be an archive of every webpage matching a search result; it isn't what their infrastructure is optimized for.


"Google isn't designed to be a archive of every webpage matching a search result, it isn't what their infrastructure is optimized for."

I believe that's exactly what Google is. Limiting search results probably has to do with being able to serve more queries and respond quicker.


>The fact that few people realize that Google has a page limit shows how rarely people actually want more pages.

the fact is I just want the long tail or weird results to escape content farms, but I guess if it were possible for google to serve those, content farms would spring up to game the long tail or weird results market.


Google tries to ignore them already, so the long tail is probably littered with old and mitigated content farms because they "match" but have a low page rank


Has anybody noticed that a Duckduckgo search (at least with infinite scroll) will serve you the same first page of search results over and over again? I've been trying to figure out whether that's a bug or if they're really fluffing one page into the appearance of multiple pages.

I think the internet is closing up.


It seems DDG has recently started ignoring entire keywords in a 3 or 4 word search phrase without informing the user a keyword has been excluded.

Something like "2015 camry leaking exhaust" could return nothing but "2015 camry exhaust" results until "leaking" is put in quotes (not a working example). Google search ignoring keywords was the straw that broke the camel's back - is DDG following suit?


Same with Google Images. If you keep hitting more and more results, it'll just recycle the top ones.


How can it be anything but intentional to give you the same image over and over again in an image search? That's not something you could possibly miss when checking for the quality of results.


another search means another opportunity to serve you more ads for different keywords


I've noticed plenty of websites now do this.


Maybe they used a char to store the array index and you hit an integer overflow?


Related comment from someone from Google from not too long ago:

> It's because the counts are very fast, rough estimates. And when you go into additional pages, we start to refine them.

https://news.ycombinator.com/item?id=32354785


Try searching for "january 6" on bing for some insight in how results are both manipulated as well as censored:

https://www.bing.com/search?q=january+6

Once you get past page 6 of the 96 200 000 results they keep on repeating the same results, page after page, with either a Guardian (January 6 committee postpones Wednesday hearing over …) or (every now and then) Yahoo (Jan. 6 hearings to resume following bombshell revelation about …) article on top of the page. The rest of the page is largely identical, page upon page until it comes to page 32 (307-316 of 96 200 000 results). Whatever you tell it to do beyond that page it will always serve page 32 with that Guardian article on top and the other - similarly slanted - results below it.

I don't know whether this is just another example of typical Microsoft incompetence in that they make their meddling with the results so incredibly obvious or whether they're just telling visitors this is what they should read and nothing else but it does show these search engines are as unreliable when it comes to politically sensitive topics as e.g. Wikipedia is.


DDG draws at least the vast majority of its results from Bing, and does the same thing. Also, all of their news results seem to be links to MSN and Yahoo versions of stories that were published on other websites.

I don't know that what you're seeing has anything to do with politics. For example, I just searched for "dog food nutrition," (https://duckduckgo.com/?q=dog+food+nutrition) and

https://www.akc.org/expert-advice/nutrition/soy-in-dog-food-...

Was the 14th result, and also the 36th, 68th, and 87th. It is also the 108th result, where it first starts topping the page, and from then on it's at the top of the rest of the pages, which are repetitions, so 128th, 148th, 168th, unto infinity.

I just counted this result because it was topping the repeated page of results, but all of the other results were repeating arbitrarily until it was down to the same 20 results repeating. I'm tempted to sum the number of appearances of every result. I'm not sure there are more than 40.

Why "dog food nutrition?" I figured there would be a lot of hits. I was wrong.

-----

edit: https://www.bing.com/search?q=dog+food+nutrition

https://www.akc.org/expert-advice/nutrition/soy-in-dog-food-...

22nd result, 31st, 45th, 59th, 73rd, and plenty from there on out.

https://www.petmd.com/dog/nutrition/can-dogs-eat-peaches

10th result, 20th, 34th, 48th, 51st, 66th, 70th, 88th, 93rd...

... and the pages start repeating from page 9 on with our old friend "Soy in Dog Food?" on top.


Yup, it definitely reproduces, but it doesn't seem to be restricted to political queries.


Really strange behavior. For me the pages seem to be all identical after page 2.


That is not how it is behaving at all for me


What do you see for that query? Are you logged in? Which country? Which browser?

Here: Not logged in, Sweden, Firefox on Linux/Firefox on Android. Same results every time, on both platforms.

I took screenshots for pages 6-15, compare these to what you see and let us know in what way your results are different:

https://imgur.com/a/QGT044T


Google is hardly the only search engine which exhibits this behaviour, and the information of total possible matches is useful --- it indicates a grossly generic search phrase --- even where all results are not presented.

As I noted recently in a similar thread (<https://news.ycombinator.com/item?id=32923468>), HN's search through Algolia is similar.

An unqualified search reports (at this writing) somewhat north of 30 million results (30,121,402). It will display only 34 pages' worth of results, I believe 1,000 in total.

See: <https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...> Final page: <https://hn.algolia.com/?dateRange=all&page=33&prefix=false&q...>

As discussed recently on HN, the match count is useful information even where all matching results are not displayed as it indicates whether or not a query is generic.

Contrast search: <https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...> (one result as of this writing).

I can understand some confusion on this point, but implying some sort of gross fraud or deception suggests far more wrong with the complainant, not their subject.


It makes sense if you separate the indexing process from the result ranking/retrieval process. Building up the index it has encountered the word “dog” millions (billions?) of times, but the Google search API’s job is NOT to operate as a catalogue that will list every single web page on the internet with the word “dog”. Instead its job is to show you what it thinks are the most relevant dog results.

That said I think this could be communicated more clearly when you get to the end of the results. The fact that the results number _changes_ is an own goal that makes it seem like they have something to hide. Instead it should just say “Google Search shows a selection of the most relevant occurrences of your search term. If you didn’t find what you need, try XYZ.”


I don't know where that quote is, but I remember reading one from someone at The Google who, in response to a question about DuckDuckGo, said something along the lines of "We're not competing on search." Turns out what they were saying was pretty truthful, assuming I'm not misremembering. Their "search", last I looked, shoves a bunch of crap above the fold that aren't exactly the results per se. The actual results usually aren't plentiful like I remember them once being. YouTube search has become astonishingly worse to such an extent that I get maybe 3 relevant results and the rest being videos I've already watched or stuff I'm clearly not interested in.


Incidentally, conspiracy theorists have used this behavior as evidence of the "Dead Internet" theory.

https://www.reddit.com/r/conspiracy/comments/xt8jzj/complete...

For those who are behind the times on the "Dead Internet" theory:

https://www.theatlantic.com/technology/archive/2021/08/dead-...


These posts are just stupid. Search is for finding what you are looking for. I feel like most of the posts in here think that it should be for browsing content with the word in it. There is nothing to see here.


>Search is for finding what you are looking for.

And at that, it performs miserably.


I want an index, like in a library, with the ability to filter on it.

I want to know all content, add many different filters, sort by different properties, and get all results.

Just like I'd do with an auto sort/filter table in excel


Oh, then you want to use this instead of a search engine:

https://medium.com/@brevityinmotion/search-the-html-across-2...


Of course they aren't, because it makes zero technical, economic or product sense to literally serve hundreds of thousands of pages to people. Everyone understands the number to mean something like the rough popularity of the query. I don't think I've ever seen an angry protester in front of the Googleplex complaining they couldn't actually get to page 250k, undertaking some sort of Jules Verne-esque journey to the center of the earth.

The post is basically a one year old ad for an alternative search product


What I have an issue with is that Google's numbers change if you click on a few pages of results. Search for a keyword. Go to the bottom and click on the last page of results. Then keep clicking as deep as you can... notice that the number of results for that keyword will change.

For one keyword search query, first it's 9,480, then 11,800 and then 105. Changes if you click on the next page of results (page 2, 3, 4, ... 10, etc.)


It shows not the number of results actually served, which is pretty much the same for any query, but the number of all pages that contain the keywords. It is not really feasible to rank and show millions of pages, but it is possible to do that for a few hundred pages. This is why search engines work fast.


It was once possible to trawl through results pages 'near indefinitely'.

How I see it is that it is like how the music search works with Google Assistant. It can do it for 90% of the songs you are likely to hear without an online connection.

For anything else it can do the search on the server.

When you search something it is not as if a billion Google servers are consulted with the results delivered accordingly. The nearest box just knows 90% of the 'answers' and you get whatever that is, rather than a 'real search'.

If you enter a specific code, e.g. an ISBN, then that is like the 'unusual song' that needs actual looking up.

However it works, it is a good hustle. Actually nearer to 'The Hitchhiker's Guide to the Galaxy' in having a compiled book of answers.


I actually like this feature. I don't want to see the 1,000th results (who would?), but it does help with perspective. If I'm searching for something that should be common, having an order of magnitude helps me realize I'm going in the wrong direction.


yea, google has been terrible at this for a while now, the most annoying thing is that you can't get to those results, i haven't found a way which is quite frustrating if you can't find what you're searching for in those few results


Imagine believing there's a sensible discrimination between the 200000th result and the 200001st result. The relative ranking among the top results is supportable but below the top 10 and especially the top 100 the relative ranking has no signal.


I might not be able to discern between the 1000 and 1001 but there’s no reason I can’t discern between 1001 and 100001. Things can be fuzzy nearby while still being coherent overall.


I used to believe that, until I played https://semantle.com/.


Isn't it great that google has got rid of so many bugs?

After all, more than 1 result is a bug: https://www.youtube.com/watch?v=XeIIpLqsOe4


Google will never return more than 400 search results. You may think you are getting more because you use the default 10 per page and so that's ~38 pages of results. But if you really look it will never be more than 400.

This pissed me off so much I actually registered and made http://googlesearchonlyreturns400results.lol/

And as I say on the page, the only way to approach this is to send feedback to google. We have to absolutely spam them with it so they know this is not acceptable.

Bing, btw, only ever returns 900 results.


Perhaps it's meant to limit what people can scrape from search results? A human user shouldn't need to see every result.


TL;DR: "You know that Google problem everyone knows about, how it reports millions or even billions of 'results', but actually serves a couple hundred at most? Well, shocker of shockers, it turns out that the API has the same problem!"

Google really needs to stop using the word 'result' for whatever it supposedly denotes, which is probably something like the estimated number of pages crawled by Google which contain hits for those terms, the majority of which will not be a search 'result' (something delivered by the search function) under any circumstances.


Take note anyone looking to make any kind of search. Just do Math.rand on the number of total results.


Very interesting - tell us more please.


The FCC needs to step in...


Google is garbage. Bing is a bitch. DuckDuckGo doesn't do great, Yandex is yawn. search sucks.



