How Search Works (google.com)
324 points by vijaydev on March 1, 2013 | 101 comments



I missed most of the content on this ... page ? Exhibit ? Installation ? whatever it's called, because it told me to scroll, I did, and I scrolled through a bunch of what looks like empty space and arrived at the end ("and that's how search works"). The user is apparently supposed to stop and watch some animation at certain places, but it's not clear where to stop scrolling.

Perfect example, near the top there's some text about "It's made up of over[........] 30 TRILLION[.........] INDIVIDUAL PAGES[........] and it's constantly growing." But there's nothing to indicate that I should stop somewhere and wait for some more text to show up.

Maybe they should limit how far down you can scroll by setting the height of some element, and only increase it when the animation is finished.

Edit: the key problem here isn't the "scrolling makes things happen" gimmick that's popular lately. The problem is that it starts certain animations or fade-ins some time after I've already skipped past an apparently blank space.


I gave you a +1 because I had the same "scrolling issue".

After your comment I noticed a lot of comments on the same issue, so I decided to try it again. The second time I noticed that a blue arrow flashes at the bottom of the screen after all the content has populated, almost prompting you to scroll down. I suppose most everyone, including me, initially scrolled too fast to even see the first "arrow/prompt". Despite the discovery of the prompt feature, some of the issues remain. For example, wondering how far to scroll down before stopping (maybe pgdn?), and wondering which parts of the "page ? Exhibit ? Installation ? whatever it's called" are interactive.

>page ? Exhibit ? Installation ? whatever it's called

I too was unsure what to call it, but if they listen to the feedback, I think "whatever it's called" is really awesome and could be a legitimate substitute to the ppt platform. At least I would be interested in making a few presentations with it.


>I too was unsure what to call it, but if they listen to the feedback, I think "whatever it's called" is really awesome and could be a legitimate substitute to the ppt platform. At least I would be interested in making a few presentations with it.

Agreed, I like it. It looks like the stuff that goes on in futuristic movies, with a lot of things happening on the screen and no one really looking at anything in particular, but it's impressive. This has potential. Who's up to make this WIC software?


Yep, happened to me too. No indication of where to pause and watch, and had to keep back-scrolling as I noticed text appear just below the chrome. Bad implementation.


Scrolling with single 'space' key presses works fine on a desktop.


The most interesting thing there is the live view of the most recently deleted webspam. I wonder what blackhat SEO firms can learn from that to better avoid the filters.


Exactly. And someone somewhere is writing a script to hammer those screenshots and collect as many as they can.


I haven't looked too closely, but that view says it is only giving examples of "pure spam," as opposed to the more sophisticated forms of spam described. One might imagine that "pure spam" is the easiest to detect, so giving examples of pure spam might not be giving much away.


It's nice overall, but the timing for making items appear is a little slow. I was past most headers by the time they appeared, and I don't think I scroll too incredibly fast.


It might be your machine/browser/os. The same thing happened on my laptop (OSX/Chromium), but it worked fine on my desktop (Arch/Chromium).


If you read the top voted comment for this post (as I write this), it describes the same experience though.


thx matt and the google search team for doing this. it's nothing new for technically inclined people, but every little bit helps. helps for what? teaching people to worry about the right aspects of search and the impact on their business, instead of worrying about bullshit phrases that were planted in their head by a SEO agency key account or a blogpost from 2008. so well yes, thx for doing this. i will send it to my clients (and tell them to click on the bubbles, even though they don't look clickable)

now an anecdote (because i feel like telling one): this week started for me with an interview that finally got published http://werbeplanung.at/news/marketing/2013/02/interview-mit-... (it's german) in that interview i claimed that

* 80% of everything written about SEO and Google is bullshit

* that all the rumors, tips and trends are actually hurting business

* that we should treat SEO as a numbers based craft of constant optimizations

* instead of the esoteric bullshit art it is currently

* and, if search traffic is important for the success of a business, they must rid themselves of external (agency) dependencies and develop internal structures

nothing too far-fetched i think. everybody knows the SEO vertical is full of bullshit, i just took some time to estimate a number (based on a random sample of collected blogposts (that at least one person tweeted about))

yeah, i got a lot of angry emails, skype messages, linkedin messages, xing messages after the interview was published.

most of them mentioned at least one of these words

  * pagerank
  * whitehat
  * blackhat
  * grayhat
  * linkjuice
  * panda
  * penguin ...
so yeah, thx google for educating people about search. keep up the good work.


Your 80% is based on what exactly? A tiny sample size. Please, if you don't have solid data, don't quote percentages. It just encourages people to spread the number like it's a fact, which it isn't.

If you read the right sources a majority of seo advice is correct.

Www.seomoz.org http://static.googleusercontent.com/external_content/untrust... Www.inbound.org (homepage stuff that has been voted up.)


>> If you read the right sources a majority of seo advice is correct.

That's a contradiction. If you have to read the right sources, then by definition the majority of advice is not correct.


Why does it mean the majority of advice is not correct? That is a myth.


Because the majority of advice is coming from the _wrong_ sources


You say majority like it's fact. It's not.

Here is a list of the top 100 SEO blogs, find the BS in there. www.branded3.com/seo-blogs/


  "80 Prozent von allem, was über SEO geschrieben wird, ist Bullshit"
some things just cross the language boundaries.


Has anyone deciphered the fat-mustache diagram in the "Query Understanding" circle? It's in the Algorithms section.

At first I thought it was supposed to represent a Gaussian-like probability distribution. But when I clicked on it, the resulting animation showed a series of such distributions getting flattened by some kind of distribution-flattening hydraulic press. The accompanying caption: "Gets to the deeper meaning of the words you type."

If I was confused before, now I was completely lost.

How is deeper meaning represented by distribution flattening? I'd think it would be just the opposite, raising probability mass around the likely meanings, not spreading it out into a uniform distribution over all meanings.

Baffling.

If anyone has figured it out, please do share.

(Maybe I'm taking the diagrams too seriously.)

EDITED TO ADD: New option: If you don't have any clue what it means either, come up with an entertaining yet plausible story that fits the hydraulic-press-vs-mustaches animation and share that story instead.

EDITED TO ADD: Example: At Google’s new eco-friendly data centers, NLP computations are performed by genetically enhanced inchworms. Difficult queries, however, can cause the inchworms to get cricks in their backs. In such cases, Google’s innovative back-massager descends and restores the inchworms to their preferred position (prone), from which they can return to their computations with renewed vigor.


You're taking diagrams too seriously.

But the way I interpreted it was, before, the query was short, scrunched up, and slightly ambiguous. The algorithm then lengthened it, representing expanding it to find the deeper meaning.


I was confused for a little bit by that. The way I took it was: "Google removes the wrinkles from your query so as to make it processable."


That actually seems to be what Google does to your keyword searches: replace the specific with the general, turn proper names into redundant phrases ("schannel socket" -> "channel socket"), suggest dropping keywords, etc.


I think Query Understanding might trigger the weather, conversion and the other widgets to be displayed at the top of your search results. It's just a guess, since I don't work for Google. (:


I don't know what to take from this.

That search is very complex (I knew that, but not with this technical detail).

Or...that Google is trying very hard to maintain user interest with gimmicky shows of why it's cool and cutting edge and necessary.

Not that Google isn't those things...this just seems like an unnecessary expenditure of time. We know it's complex, Google. Improve some other features and stop shutting others down instead of making these web 2.0 animations.


I too find this completely void of any useful information.

There are many things that are still broken in search; I talk about one specific experience here:

http://urgeous.com/p87t3aaa40g-for-some-queries-all-first-10...

("For some queries, all first 10 results on Google are spam").


Interesting that they show the approximate number of searches / second at the bottom. Is that an otherwise publicly available number?


I was halfway through before I realized that some of the content was clickable.

Very nice page, though.


Their characterization of their spam procedures is grossly misleading. They do not send emails to most people that have been penalized, nor do they give clear instructions on how people can fix their sites.

Thousands of small sites were killed by Panda for no good reason, and have little hope of getting their traffic/incomes back. Google's spam policy is skewed heavily in favor of large sites and their own properties.


Didn't read that way to me. Doesn't it say that the webmaster tools page is the primary way to get notifications?

Crap factor = %advertising on page.


I keep checking every so often, but searching for "this phrase" or +absolute +requirement is still broken. Even "Verbatim", isn't. If they can't even get simple search right, who would trust them with anything more?


I agree, it is super irritating to no longer be able to do precise searches on google like I used to. Is there another search engine you would recommend which provides this functionality?


Same here. Tried reporting it a number of times in different ways but nothing ever happens. Have posted a few examples as a response to moultano's answer further down in this thread.


Do you have some example queries to debug?


I wish I had saved the results every time over the last few years that Google showed me a page it claimed had what I was looking for, when neither searching the visible text nor even the source code of the page produced any such string. I am sure that it's happened to me hundreds of times by now, if not thousands. For a long time, it was surprising and ranged from annoying to infuriating. Now I just sigh and accept it as the cost of Googling.


I have: - http://techinorg.blogspot.com/2013/03/what-is-going-on-with-...

edit: Here are just the queries:

"sublime text 2" "focus group"

cisco "anyclient" - this one gets silently rewritten to cisco anyconnect

shopify "deduplicate" - with verbatim activated -


Scrolling is really becoming the new thing in UX design. It's an interesting contrast to the 'movie-like' flash animations of a few years ago that required no interaction on behalf of the user.


At some point users started closing the tab instantly as soon as it becomes clear something non-interactive like a flash movie is the central element of the page. The scrolling page is the optimal way of letting the user read the content at the desired pace.


A youtube clip would have done just fine. You can pause that if you need a slower pace. Scrolling is very slow here (and I'm using Google's own Chrome on mainstream hardware), and it's never clear where I should stop scrolling to view the page. I don't get the fuss about scrolling websites.


> Scrolling is really becoming the new thing in UX design.

Am I the only one who finds it irritating as hell to scroll when it renders slow? I don't think this is the end game. There has to be something better.


It's not only you, I found this irritating as well.


Since forever, Chrome has been scrolling at a lower fps than Firefox, where I can read comfortably while scrolling.


I think this will improve over time.


They left out the part where they index your emails and choose items you agree with over items you don't :)


I think you're joking, but just for those that don't think that, we don't actually do that.


The following is direct from Google's Security and Privacy:

"In order to provide some of the core features in Google Apps products, our automated systems will scan and index some user data. For example:

-Email is scanned so we can perform spam filtering and virus detection.

-Priority Inbox, a Gmail feature, scans email message to identify which messages are considered important and which are considered not important.

-If you are using Google Apps (free edition), email is scanned so we can display contextually relevant advertising in some circumstances.

-Some user data, such as documents and email messages, are scanned and indexed so your users can privately search for information in their own Google Apps accounts.

*Google Apps data is not part of the general google.com index, except when users choose to publish information publicly."


Then how can you scroogle and boobble people without that information?


Where is the stuff about the creepy invasion and abuse of our privacy?

I know, I know, you don't do that. Nope, no one does. Everything is fine and dandy. Smile everyone, no problem here.


There definitely is a form of a search bubble though, right?


I don't remember if hotmail used to run ads.


It did and they were display ads. Incredibly distracting.

At one point the "homepage" of Hotmail was a huge ad space, stories from MSN, and a tiny link to "Inbox."

The new Outlook is so much better. If Hotmail had evolved that way earlier, I would not have switched to Gmail.


So it is a bit hypocritical of MS talking about ads in gmail. But again, were those ads contextual?


The "Scroogled" campaign has nothing to do with products or customers; the point is to broaden the PR base for Microsoft's ongoing campaign to convince the feds to initiate anti-trust proceedings against Google. That is why they hired a political PR executive to create the campaign.


Grammatically, that makes no sense :)


38,800 requests/second according to their estimation.


Seems I'm not the only one who found that interesting :)


Is this just PR for Google? Would rather see a more technical approach - although great for forwarding to clients when asked :)


Apparently, perhaps the 'scroogled' campaign is having an effect.

However it does give a better insight into the challenges of building a search product. It is a series of really challenging problems. So many people take search for granted these days.


Yes and no - 'scroogled' is bringing up stuff like selling ads based on context - but are you paying $20/year for outlook.com email? Gmail is free and pretty awesome (haven't gotten spam in years).


It may just be PR, but I do think a lot of non-technical people can benefit from going through the animation. It's pretty amazing that that is what's happening. It could just be an attempt by Google to expose their craft to the technical layperson.


Whoa... really, 100 MILLION gigabytes to store "The Index"? Wow. That's big.


aka 95+ petabytes.


100 million gigabytes = 100 petabytes ~= 88.8 pebibytes

100 million gibibytes ~= 95 pebibytes
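
For anyone who wants to check the arithmetic above, here is a minimal sketch. The helper name `convert` and the constant names are made up for this example; the only input taken from the page is the "100 million gigabytes" figure.

```javascript
// Decimal (SI) and binary (IEC) unit sizes, in bytes.
const GB = 10 ** 9;   // gigabyte
const PB = 10 ** 15;  // petabyte
const GiB = 2 ** 30;  // gibibyte
const PiB = 2 ** 50;  // pebibyte

// Convert an amount from one unit to another by going through bytes.
// (Hypothetical helper, named for this sketch only.)
function convert(amount, fromUnit, toUnit) {
  return (amount * fromUnit) / toUnit;
}

const indexSize = 100e6; // "100 million" of whichever unit the page meant

console.log(convert(indexSize, GB, PB).toFixed(1));   // "100.0" (GB -> PB)
console.log(convert(indexSize, GB, PiB).toFixed(1));  // "88.8"  (GB -> PiB)
console.log(convert(indexSize, GiB, PiB).toFixed(1)); // "95.4"  (GiB -> PiB)
```

So the "95" figure only comes out if the page's "gigabytes" are read as gibibytes.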


I see the value of this distinction, but I can't shake the feeling that a word used for years has been co-opted by marketing and replaced with something that sounds silly when spoken out loud.


I prefer 100 MegaGigaBytes, son.


There are some good facts and numbers hidden in a rather toy explanation:

1. Spam detection is automatic

2. There are 6 types of spam

-Unnatural outbound links (link selling)

-Content copy/manufacturing

-Keyword stuffing

-Forums/user generated spam

-Parked domains

-Sites hosted on spammy DNS

-Different content for humans and bots

-Hacked sites

3. Google is removing as many as 50K spam sites per month, they get 8K reconsideration requests

4. Google's machine learned relevance model may be using about 200 features


> By the way, in the 47 seconds you've been on this page, approximately 1,813,260 searches were performed.

Aren't these just some random numbers that they pull out of the air?


Here's the unminified JS on the site responsible for the numbers updates.

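    var kd = function () {
        // Recomputes the elapsed seconds and the estimated search count, then updates the page.
        function a() {
            // Q(...) is defined elsewhere on the site (presumably an element lookup); results are cached in e, d, f.
            e = e || Q("number_of_seconds");
            d = d || Q("searches_count_num");
            f = f || Q("searches_count_unit");
            // Whole seconds since page load, wrapped at one day (86400 s); ~~ truncates to an integer.
            var a = ~~ (((new Date).getTime() - h) / 1E3 % 86400),
                k = a * b + ""; // estimated count as a string
            // Unit word chosen by digit count, e.g. 7 digits -> c[3] = "million".
            f.innerHTML = " " + c[Math.ceil(k.length / 3)] || "";
            e.innerHTML = a;
            // Insert thousands separators, e.g. 1813260 -> 1,813,260.
            d.innerHTML = k.replace(/(\d)(?=(\d\d\d)+(?!\d))/g, "$1,")
        }
        // Per-second rate: 1E11 divided by 2,592,000 (the number of seconds in 30 days), truncated.
        var b = ~~ (1E11 / 2592E3),
            c = " hundred thousand million billion trillion quadrillion quintillion sextillion septillion octillion nonillion decillion undecillion duodecillion tredecillion quattuordecillion quindecillion sexdecillion septendecillion octodecillion novemdecillion vigintillion".split(" "),
            e, d, f, h = (new Date).getTime(); // h: page-load timestamp in ms
        return {
            hc: a,
            rb: function () {
                a();
                setInterval(a, 100) // refresh every 100 ms
            }
        }
    }();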
It's just running on an interval and doing in-page calculations, so it's entirely estimated. The value of "b" in this function evaluates to a little over 38,000 (https://www.google.com/search?q=1E11+%2F+2592E3) which they're using as the basis for the calculation.
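
A minimal sketch of that calculation, to check it against the figure quoted from the page. It assumes 2592E3 is meant as the number of seconds in 30 days, which matches numerically. The names `SECONDS_IN_30_DAYS`, `searchesPerSecond` and `estimatedSearches` are made up for this sketch.

```javascript
// 30 days in seconds: 30 * 24 * 60 * 60 = 2,592,000, i.e. the 2592E3 in the snippet.
const SECONDS_IN_30_DAYS = 30 * 24 * 60 * 60;

// Same truncation the snippet does with ~~ (drops the fractional part).
const searchesPerSecond = Math.trunc(1e11 / SECONDS_IN_30_DAYS);

// Estimated searches after a given number of whole seconds on the page.
function estimatedSearches(secondsOnPage) {
  return secondsOnPage * searchesPerSecond;
}

console.log(searchesPerSecond);     // 38580
console.log(estimatedSearches(47)); // 1813260
```

The second number is exactly the "1,813,260 searches in 47 seconds" quoted upthread, so the counter really is a fixed rate multiplied by elapsed time.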


Not sure why you're being downvoted, as it seems like a legitimate question, but...

I don't think so. It seems logical that Google's been keeping statistics about this sort of thing, so it doesn't surprise me that they keep track of such things as 'average queries per second'.


That would be about 38K searches per second. Does this include Google instant searches?

Google search results show a time value for each search. E.g.: About 2,210,000,000 results (0.12 seconds). Is this time machine time per search? This number is often around 30 ms, give or take a factor of two. If so, each machine can handle about 30 searches per second. If so, 38K searches per second need about 1000 machines. Sounds a bit too low... so my interpretation must be wrong at least somewhere.
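
The back-of-envelope estimate above can be written out as a small sketch. Both inputs come from the comment; the idea that one machine handles exactly one search at a time for 30 ms is the comment's own assumption, not a known fact about Google's setup. The function names are made up for this sketch.

```javascript
const totalSearchesPerSecond = 38000; // "about 38K searches per second"
const millisPerSearch = 30;           // "around 30 ms"

// Searches one machine could finish per second if it handled them one at a time.
function perMachineRate(serviceTimeMs) {
  return 1000 / serviceTimeMs;
}

// Average number of searches in flight at once (arrival rate x time per search).
// Under the one-search-per-machine assumption, this is also the machine count.
function machinesNeeded(ratePerSecond, serviceTimeMs) {
  return Math.ceil((ratePerSecond * serviceTimeMs) / 1000);
}

console.log(Math.floor(perMachineRate(millisPerSearch)));             // 33
console.log(machinesNeeded(totalSearchesPerSecond, millisPerSearch)); // 1140
```

That lands at roughly the "about 1000 machines" in the comment. Strictly, the result is the average number of searches being processed at any moment, so it only equals a machine count if the one-at-a-time assumption holds.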


It's probably the wall time for the various backend services to respond to the query. If you think about it, a Google search result is actually many things; it has results from various sources, such as the web, images, videos, news, social signals from G+, etc. All of those are different services that are aggregated to build your result page.

Since all of those queries are fired at the same time, the only metric that matters at the end is the wall time, not the CPU time used during the query.

I also seriously doubt that the servers that handle the Google front page can only do one query at a time; at the very least, they're multithreaded, but probably concurrent. It probably works as below:

1. Parse query

2. Send query to backend servers

3. Wait until all backends replied or at most 250ms (or some other timeout)

4. Assemble the result page and ship it back to the client

While the server is idling for the backends to reply, it probably processes other queries; it wouldn't make sense to waste that much CPU power.
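
The four steps above can be sketched roughly as follows. This is only an illustration of the fan-out-with-deadline idea, not how Google's frontends are actually built. `fakeBackend`, `withTimeout` and `search` are hypothetical names, the backends are simulated with timers, and the delays are scaled down from the 250ms in the comment so the example finishes quickly.

```javascript
// Hypothetical backend: resolves with some results after a simulated delay.
function fakeBackend(name, delayMs) {
  return (query) =>
    new Promise((resolve) =>
      setTimeout(() => resolve({ name, hits: [`${name} result for "${query}"`] }), delayMs)
    );
}

// Resolve with the backend's reply, or with null if it misses the deadline.
function withTimeout(promise, timeoutMs) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve(null), timeoutMs);
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

async function search(rawQuery, backends, timeoutMs = 250) {
  // 1. Parse query (a trivial placeholder for real parsing).
  const query = rawQuery.trim().toLowerCase();
  // 2. Send the query to every backend at the same time.
  const pending = backends.map((backend) => withTimeout(backend(query), timeoutMs));
  // 3. Wait until every backend has replied or hit the deadline.
  const replies = await Promise.all(pending);
  // 4. Assemble the page from whatever arrived in time.
  return replies.filter((reply) => reply !== null);
}

const backends = [fakeBackend("web", 5), fakeBackend("images", 10), fakeBackend("news", 200)];
const page = search("  How Search Works ", backends, 50);
page.then((results) => console.log(results.map((r) => r.name))); // "news" misses the 50 ms deadline
```

While `search` is awaiting the backends, the event loop is free to handle other queries, which is the point made about not wasting CPU while idling.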

Finally, your example says 0.12s (a random query on my end gave a response time of 0.69s), which is 120ms (or 690ms for mine), which is more than twice 30ms.


You didn't define 'machine'. If the 'machine' is Google's supercomputer grid cloud cluster, then yes, each search takes 30 ms of machine time.


Is there any publicly known information about what the 30 ms number means (or alternatively what the machine is)? Given the 30 ms number and the number of searches per second, the number 1000 means something; I just don't know what.


It probably just increments by some fixed amount each second, but it seems like a statistic they would have an estimate of.


From the beautified app.min.js:

    function a() {
        e = e || Q("number_of_seconds");
        d = d || Q("searches_count_num");
        f = f || Q("searches_count_unit");
        var a = ~~ (((new Date).getTime() - h) / 1E3 % 86400),
            k = a * b + "";
        f.innerHTML = " " + c[Math.ceil(k.length / 3)] || "";
        e.innerHTML = a;
        d.innerHTML = k.replace(/(\d)(?=(\d\d\d)+(?!\d))/g, "$1,")
    }
    var b = ~~ (1E11 / 2592E3),
So yes.
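
For the display side of that snippet, here is a small sketch of what the regex and the unit lookup do. The regex is copied from the snippet; `withThousandsSeparators`, `unitLabel` and the shortened `units` table are made up for this sketch.

```javascript
// Same regex as the snippet: put a comma after any digit that is followed by
// one or more complete groups of three digits and then no further digit.
function withThousandsSeparators(digits) {
  return digits.replace(/(\d)(?=(\d\d\d)+(?!\d))/g, "$1,");
}

// Shortened version of the snippet's `c` table. The leading space makes
// index 0 an empty string, so index 1 is "hundred", 2 is "thousand", etc.
const units = " hundred thousand million billion trillion".split(" ");

// Pick the unit word from the number of digits, as the snippet does.
function unitLabel(digits) {
  return units[Math.ceil(digits.length / 3)] || "";
}

console.log(withThousandsSeparators("1813260")); // 1,813,260
console.log(unitLabel("1813260"));               // million
console.log(unitLabel("38580"));                 // thousand
```

One difference from the original: here the `|| ""` fallback applies to the table lookup. In the snippet it applies to the already concatenated string, which is never empty, so the fallback there looks like it can never fire.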


A beautifully designed page more than anything else


Nice scroll-UI! Took some time to see the clickable items. Interesting bits about spam pages.


An awful way to learn anything.


The better people understand their tools, the more effectively they can use them.


"We write programs & formulas to deliver the best results possible."

No kidding.


Some of the live listed 'spam' pages appear to be genuine to me.


Answer: It uses a bunch of skip lists.

Source: I do hacking on top of lucene.
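
To give a feel for why skipping helps, here is a minimal sketch of intersecting two sorted lists of document ids with skip pointers. It is a simplification: it uses fixed-stride skips over plain arrays, not a real multi-level skip list and not Lucene's actual on-disk format. `intersectWithSkips` and the sample id lists are made up for this sketch.

```javascript
// Intersect two sorted arrays of document ids. Each list gets a skip stride of
// about sqrt(length), so the lagging side can jump ahead several entries at once
// instead of stepping one id at a time.
function intersectWithSkips(a, b) {
  const skipA = Math.max(1, Math.floor(Math.sqrt(a.length)));
  const skipB = Math.max(1, Math.floor(Math.sqrt(b.length)));
  const result = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      result.push(a[i]);
      i++;
      j++;
    } else if (a[i] < b[j]) {
      // Take the skip only if it does not overshoot b[j]; otherwise step by one.
      if (i + skipA < a.length && a[i + skipA] <= b[j]) i += skipA;
      else i++;
    } else {
      if (j + skipB < b.length && b[j + skipB] <= a[i]) j += skipB;
      else j++;
    }
  }
  return result;
}

// Documents containing term A and term B (made-up ids).
const docsA = [1, 2, 3, 5, 8, 13, 21, 34, 55];
const docsB = [2, 3, 5, 7, 11, 13, 17, 19, 23, 34];
console.log(intersectWithSkips(docsA, docsB)); // [ 2, 3, 5, 13, 34 ]
```

The payoff shows up when one list is much longer than the other, since most of the long list can be jumped over.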


vijay: very interesting link. thought it was interesting, despite the obvious slant.


This is not how search works!!


This is brilliant !!!


"We write programs & formulas to deliver the best results possible"

There's a slight oversight, it should be: "We write programs & formulas to deliver the most profitable results possible for this quarter"


This is completely false. The effect on revenue is not used to make launch decisions for ranking changes.


So when Google's Panda update killed tons of user-generated-content sites like Mahalo, eHow, HubPages, etc., and greatly improved YouTube's (which is 99% garbage) rankings, that was pure co-incidence?

What about when Google rolled out universal search only after buying YouTube?


    "What about when Google rolled out universal search only after buying YouTube?"
Because the technology wasn't built then? Google had a video platform before they owned Youtube, you know.


I don't miss any of the crap content factory sites. Do you?


Nope, and I wouldn't miss YouTube's crap factory site from the SERPS either.


Says who?

Search is rank and display. Products is 100% bought, and that you had to "disclose". I say "disclose" because it's not apparent unless searchers click on a link. That's how ethical you are.

What else is bought and paid for behind the scenes? Why should we trust you?


I do not know why you were down-voted, perhaps for not fully forming an argument and making a tongue-in-cheek comment as an immediate response to the arrogant statement Google unnecessarily included in the description of how search works (yes, the same statement stood out to me as out of place and arrogant, but not necessarily untrue).

As to your point, yes, Google does utilize its power, leverage, dominance to favor itself and its own products - and don't feel too bad that others are demanding you show your proof - how quick they are to forget (and apparently Google's own employees who replied to your comment forgot also) that the FTC spent the last year investigating Google's behavior on this front - some of those charges into Google's behavior include using its knowledge of search and advertising to determine the most profitable online businesses, entering the space with their own product to compete directly or just drive up the price of the advertising terms (sometimes 1000%). So imagine you were buying keyword "y" for "$x"/click - Google comes along and competes, now their product is at the top of the organic results and you will need to pay (1000 x "$x") for the same advertising - oh by the way, when they pay (1000 x "$x") for the same ad space that money just goes back into their own pocket.

So do not feel too bad - the FTC spent millions investigating Google to find said evidence and ultimately allowed Google to settle for $22.5 million, Google allowing others to use the Motorola patents it acquired, and changing their AdWords API. And keeping with their motto: "Don't be evil" it appears in the last 24 hours media has gone wild alleging Google spent $25,000 to honor the FTC Director during the investigation - I know when I am being investigated for federal anti-trust allegations I too like to honor the investigator, and like Google I do not give the investigator the money directly, I give it to a 3rd party who in turn gives it to the investigator's office, this allows the investigator time to close the case before allegations are made and when allegations are eventually made it allows the investigator the opportunity to say at the time it was unknown who "donated" money for the honorarium.


> There's a slight oversight, it should be: "We write programs & formulas to deliver the most profitable results possible for this quarter"

Says... you, right? Based on which examples of Google pursuing quarterly profits at the expense of users?


"Says... you, right? Based on which examples of Google pursuing quarterly profits at the expense of users?"

Duh! Google local was a joke compared to better pages from Yelp. Google+ even worse. Pages are now filled with ads, because Google discovered that ads yield better results (how convenient!) Need I go on?


Yes, actually. And with real evidence other than your skewed opinions. Or you can keep being overly dramatic, your choice.


"And with real evidence"

what would qualify as "real" evidence to you? A hidden camera catching Googlers talking about this?


It seems you're making an argument of the form "it's too difficult to collect real evidence. In fact, here's an example of a way to collect real evidence that is ridiculously difficult. Since we cannot collect real evidence, we need not use it to make arguments."

Just because it's difficult doesn't mean you can jump to conclusions based on speculation.


No it's not difficult at all. Re-read my comment and see how I gave three clear examples that explain Google's "best results for the users" bullsh*t.


Well, with the certainty you speak of this with, yes. Oh also, I'm being facetious because I work there and I'm interested in this secret pile of evidence you have. All things being fair, I won't deny that you've seen the things you've seen, it's just that you might be attributing them to the wrong places. Hence, the need for actual evidence.


I am certain and so is everyone with an open mind. Look at results pages and it's clear why all the Google pages and ever-growing number of ads were inserted, MONEY! Mo money for the always greedy Google.


Google has for all but the very beginning of its existence been an ad company. This is not new.



