What on earth are Google thinking? (re: Binggate) (puremango.co.uk)
234 points by user24 on Feb 2, 2011 | 129 comments



> At first I thought there could be three possible explanations to Google’s handling of this situation:

I think he forgot option 5: whether Bing intentionally copied Google's search results or not, if Bing continues this practice then their index will contain a de facto copy of Google's index for tricky queries. Bing will inherit Google's spelling correction, its long tail, its top results, and any other enhancements Google develops to return more relevant results.

Google invests so much into these things; if Bing absorbs these improvements without lifting a finger, Google loses its ability to stay ahead through technical merit and innovation.

I think Amit is telling the truth: "And to those who have asked what we want out of all this, the answer is simple: we'd like for this practice to stop." http://googleblog.blogspot.com/2011/02/microsofts-bing-uses-...


I think I would put it more succinctly this way:

Bing and its boosters have been touting their search engine results as better and more relevant than Google's. As it turns out, the reason is that they are tracking all of Google's results without reporting that they are doing so.

Additionally, the article posted here is disingenuous. There is a difference between general click tracking and what Bing is doing, which is scraping queries and click results. That is not a general browsing feature; it is a search-engine-specific feature, designed to benefit from the results provided by others.


How do you know it's the former and not the latter? Google's experiment doesn't discriminate between the two.


Sorry, you're right. The other possibility is that Bing is scraping the query term and all of the links out of the search page, and correlating them immediately. That means they're totally susceptible to Google bombing (which is in effect what Google did).

The key question is whether they have special rules in place for Google. I am indeed presuming that if I were using the same tools Google did on my personal website, I would not achieve the same result.


Even if they are susceptible to Google bombing, I don't think Google would reduce the quality of its results for real searches because of this.

Bing probably has a ranking algorithm saying that links from google.com search pages have x weight and links from facebook profiles are valued at some other value etc.


They're not just susceptible to Google bombing. They are at the mercy of some easy-to-automate SEO with the Bing toolbar installed. If I were a black hat, I'd install the Bing toolbar, craft some plausibly human-looking Selenium code, and start the clicking. Over a period of a few weeks I'd be owning the top results in Bing.


I have seen some iMacros scripts for this in the past, also for the Google toolbar. Does it help to improve ranking? I dunno.


I think that you have identified the biggest flaw with Microsoft's approach.

They are also susceptible to Google "poisoning" the well, by creating a bot-net to provide fake data to Bing.


Does the Bing toolbar algorithm associate all query string parameters with the subsequent click action? Or did the Bing toolbar have code that specifically targeted click actions on the Google search engine?

Was it just the top result that was copied or were more of the results copied as well?


To scrape, wouldn't Microsoft need to follow up on the observed clicks by sending its own search query to Google? If that were happening, I'm pretty sure that Google would have said so.


The scraping could be done client-side: after users perform the search, just send all the data to MS.

It would be interesting to see where the law comes down on tracking clicks for a competitive advantage. I doubt anything will get past handbags at dawn, as it could draw attention to click tracking on ads, and no-one wants that.


Good point. This would require that the Bing toolbar be sending requests to Google, right?


I don't think this is as unethical as you say. I sincerely believe all Bing is doing is using the toolbar to get (long) search trails.

Using search trails is a near-standard practice--to guess intent, find deep links, yada yada. Does Google want Bing to special-case links going out from their site? Or all search-engines maybe? What about techniques like this? http://www2005.org/cdrom/docs/p66.pdf

The paper authors sure call it innovative, not cheating :) (Also, techniques like this search+wrapper are great for Product Search, and I'd be surprised if Google doesn't use them).

I think Google's stance is rather unfair. While Bing may have "copied" results, I feel this was just a side-effect of trying to build a good index, rather than a case of "oooh... let's see what Google returns for this, and just copy the result".


I disagree with you and agree with the author of the blog post. Microsoft is not copying Google's search results; rather, they use the behaviour of their Bing toolbar users as one signal for their own ranking. And if clicking through the Google SERPs was the only signal for a certain made-up search term, then it appears like copying Google's results. But technically it's not.


I wonder if this can be tested somehow, just like Google did.


Sure. As Harry Shum said, this is a new kind of clickfraud. Just like when Google bombing became a public technique, Google would have been foolish not to change the behaviour of their algorithm. I'd be amazed if Bing didn't swiftly change their click tracking behaviour to explicitly blacklist Google.

And if Google had waited for Bing to reply privately before making any of this public, I'm sure we'd have avoided all this drama.


It's nice that Google would like Microsoft to stop copying them. Everyone else who generates unique content (news site articles, TripAdvisor reviews) would like Google to stop copying their content into Google News and Google Places.

Apparently Google only cares about sites copying other sites' content when it's their own stuff getting copied.


Any site that wants Google to stop copying their content to Google news can ask Google news to stop doing so. It's very simple.


The real issue is that they don't want their content in Google News, but they also don't want to kill the golden goose that is directing traffic to their site. I.e. Google News is generating traffic for them, but they don't want the content to be on Google News either. They want to have their cake and eat it too.

edit: 'they' here refers to the old media newspapers, more specifically to Mr. Murdoch.


There's a big difference between copying content and linking to content. I don't think anyone has a problem with Google linking to them: that is after all the purpose of a search engine. But Google doesn't just link, they copy full content into their site, giving users no reason to click through to the original creator.

And Google doesn't let creators opt-out of just the content-stealing part of their scheme. If you want to be indexed by Google, you have to allow them to copy whatever they want from your site. And no one can actually opt-out because Google is a monopoly in search. So they get away with it, stealing pageviews from content creators.


Most sites are very happy to have traffic from Google. Those that aren't can simply configure robots.txt.


And Google seems to be happy that their search engine is accessible by IE users.


They could even just block users of the toolbar if they object to its behavior, since it's identified in the user agent.


If it were just click data, how would they get the terms?

They're either parsing the query out of the url, or violating robots.txt to fetch the result page, almost certainly the former. This seems like a pretty clear indication that they've special-cased clicks from google. It's theoretically plausible that they are treating all query parameters the same for all sites, but very unlikely given how much noise that would introduce into their results. Even so, they would have to know that most clicks with meaningful query parameters come from Google. This isn't something that's going to happen by accident.
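
For what it's worth, the two possibilities differ by about one hostname check. A hedged sketch of both in TypeScript (names and structure are mine, not anything Microsoft has published):

    // Generic variant: pull ?q= out of any referrer, with no site awareness.
    function genericQuery(referrer: string): string | null {
      try {
        return new URL(referrer).searchParams.get("q");
      } catch {
        return null; // not a parseable URL
      }
    }

    // Special-cased variant: the same extraction, gated on the referrer host.
    function googleOnlyQuery(referrer: string): string | null {
      try {
        const u = new URL(referrer);
        return u.hostname.endsWith("google.com") ? u.searchParams.get("q") : null;
      } catch {
        return null;
      }
    }

From the outside, the honeypot sting can't tell these apart, which is exactly why the noise argument matters: the generic version only makes sense if you accept noise from every other site's query strings.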


> If it were just click data, how would they get the terms?

Exactly. It's not "click data" at all. It is monitoring user behavior on search engines, using both the clicks and the queries. Maybe it's not monitoring just Google as a search engine (although we have no proof of that yet: it seems it's just watching Google) but given Google's market share in search it doesn't make much difference.

From the article:

> I don’t even work in search and I could spot the real situation

This sentence is at the same time arrogant and funny. He doesn't work in search, but he's certain he's spotted "the real situation". How did he fact check it, besides asking himself if he was correct and answering "yes, obviously I'm right. I'm always right -- and I don't even know anything! I amaze myself."


> certain he's spotted "the real situation". How did he fact check it

By reading the official MS blog post[1], watching the Farsight conference live, and being confident in my own judgements based on the observed evidence. No other conclusion is supported by the evidence - you say "It is monitoring user behavior on search engines" - where's the evidence that it's exclusively search engines they're monitoring? Let alone exclusively Google? There has been no evidence yet presented. Therefore all we can conclude from Google's sting is that Bing use URL/click data. Which is exactly what I said in the original thread[2], in my blog post, and lo and behold it's what MS later said. That's how I know it's the real situation.

[1] like where Harry says "A small piece of that is clickstream data we get from some of our customers, who opt-in ... To be clear, we learn from all of our customers".

[2] http://news.ycombinator.com/item?id=2165947


You checked it against the assertion of the most interested party?

It's not "click" data, it's the correlation between search term and SERP. That someone visited such and such a page after a click isn't the issue. That someone Googled for a unique term before clicking is at issue.

Grandparent: http://news.ycombinator.com/item?id=2168377

Googling with Bing: http://www.collegehumor.com/video:1915736


> You checked it against the assertion of the most interested party?

There are currently exactly two sources: The Google post and the MS post. Who are more likely to know about what MS are doing? MS. I check my theory about what MS are doing with the source most likely to contain correct information.

> It's not "click" data

It's 'clickstream' data (MS's term). A 'click' comes from a page and goes to a page. That's the data MS were capturing. The page the click happened on (query happens to be included in URL), and the page the click went to. It's click data.


Your assertion Bing is most likely to be accurate about Bing ignores self-interest and spin.

However, agreed -- clickstream means series of clicks, and the actual data is a series of URLs.

The query "happening" to be in the URL has no "search > result" meaning without a parser being told to look for Google's particular keyword query indicators and correlate the subsequent page. As most URLs are not searches, this is not emergent behavior; it's programmed.

People also talk about this being a "weak" signal, but given search volume (or clickstream volume if you prefer) on Google versus other sources, even if this code is generic (e.g., recognize all "q=blah" or "search=blah" as keywords and correlate the following URL), it seems the signal would be strong indeed. Google's weak signal would provide several times more correlative data to Bing than Bing's own clicks.

Not that there's anything wrong with that! But Bing's blog assertions feel disingenuous -- they play this game well:

http://www.wired.com/epicenter/2009/06/kayak-bing/


> Your assertion Bing is most likely to be accurate about Bing ignores self-interest and spin.

We can't apply skepticism to one source and not the other. Either Google and Bing are not blogging with self-interest and spin - and thus Bing are more trustworthy because they're blogging about themselves, or they both are blogging with self-interest and spin - and still Bing are more trustworthy because they're blogging about themselves.

You just can't legitimately discount what Bing say because of self-interest and spin without also discounting what Google say for the same reasons.

> The query "happening" to be in the URL has no "search > result" meaning without a parser being told to look for Google's particular keyword query indicators and correlate the subsequent page.

No. Remove non-alpha characters from the entire URL with no preconception about search queries or any of that. You're left with "google com search q QUERYTERM". All the words apart from QUERYTERM have plenty of other signals in Bing's system. If QUERYTERM is a highly unusual word then all Bing have to go on is the data they gleaned from Google.
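
A minimal sketch of that normalization in TypeScript (just the transformation described above, assumptions mine):

    // Strip non-alpha characters and split the URL into bare words.
    function urlWords(url: string): string[] {
      return url
        .replace(/[^a-zA-Z]+/g, " ")
        .trim()
        .split(" ");
    }

    // urlWords("http://www.google.com/search?q=zxczxczxczxczx")
    //   => ["http", "www", "google", "com", "search", "q", "zxczxczxczxczx"]

Every token except the query term is a common word with plenty of independent signal; the rare term is the only thing left pointing at the clicked page.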


> We can't apply skepticism to one source and not the other. Either Google and Bing are not blogging with self-interest and spin - and thus Bing are more trustworthy because they're blogging about themselves, or they both are blogging with self-interest and spin - and still Bing are more trustworthy because they're blogging about themselves.

Libel laws prohibit indiscriminate accusation, while there's no law against puffery.

The premise that a company's own public relations messaging is more trustworthy than an outsider's because the company's PR is about themselves seems without merit -- otherwise we would deem companies more trustworthy whenever an outsider points fingers, and send all the journalists, whistleblowers, and wiki-leakers home. "Nope, sorry, I believe the company, because they're talking about themselves."

Google presented incontrovertible data. Bing's PR tactic is "Google does this or that worse thing and profits off it" -- distracting hand waving -- "plus we're not copying anyway" -- deliberately disingenuous.

Responsive would be:

Of course our toolbar is recognizing search terms across the top N search sites, and correlating human selected results as an indicator of search intent and result quality for those search terms. This is the same thing you do when you look at your own web stats and check inbound search terms for your own pages: 'How relevant are my pages, and am I showing my users what they are looking for?'

This is the very definition of 'improving your search experience' as outlined when you install our toolbar. We're thrilled so many of you chose our Internet Explorer browser and Bing Toolbar that this provides us meaningful data on user search intent. We want to thank Google for demonstrating we are truly 'improving your search experience' using well accepted Internet crowd-sourcing techniques.

We agree, however, that generating correlations solely from competitor listings -- when we have no existing correlation in our own data corpus -- could be misperceived, so going forward, we will not create correlations solely from competitor results where none existed in our data. However, like every webmaster, we will continue to use crowd-sourced search term and results data from across the web to refine our suggestion order towards best predicting the information you want to find.


Nice response. I wrote my own "What Bing should have said" response too - http://www.puremango.co.uk/2011/02/what-bing-should-have-sai... - I think we pretty much agree.

> The premise a company's own public relations messaging is more trustworthy than an outsider because the company's PR is about themselves seems without merit

That sounds right. Hmm, fair point.


> where's the evidence that it's exclusively search engines they're monitoring? Let alone exclusively Google? There has been no evidence yet presented

The evidence that has been presented by Google does show that MS is monitoring Google's search results, or at the very least users' behavior when using Google.

I agree that there's no evidence that Google is the only search engine that's being monitored, but it makes little difference since for all practical purposes Google == search.

There's also no evidence that MS monitors websites or behavior other than search engines, but it's unclear how that would even work. You can't just associate two websites because some users go from one to the other (correlation vs. causation, etc.)

- - -

At this point we're still in the "he said / she said" phase, but Google has more evidence and MS is coming out as incredibly defensive (e.g., raising an inquiry from the European Commission[1]: what does that have to do with anything!!?!)

[1] http://twitter.com/fxshaw/status/32519996852674560#


> The evidence that has been presented by Google does show that MS is monitoring Google's search results, or at the very least users' behavior when using Google.

I said "exclusively".

> there's no evidence that Google is the only search engine that's being monitored

There's no evidence that it's only search engines.

> You can't just associate two websites because some users go from one to the other

Sure you can, if they go via a link. Now you can just notice the link and say "great, there's an association", or you can monitor which links people click on and weight the graph accordingly: "great, there's an association, and this link is more popular than that one". Nice data to capture. Not constrained to search.


Google (and pretty much every other search engine) puts your search terms right up in the page title, and they show up in several other places around the page. The Bing scraper couldn't miss them.

Honestly, if Bing isn't giving Google special treatment, this whole debacle shows that their click tracking model works.


IE (with certain settings on) is sending page data back to Microsoft. If it sends the URL, title and referrer back, then the following session is pretty easy to reverse engineer.

    1. URL: "test - Google"
       Title: http://www.google.com/search?q=test 
	   
    2. URL: http://test.org.us
       Title:  "Test"
       Referrer: http://www.google.com/search?q=test 
       Time: 2 secs

    2. URL: http://en.wikipedia.org/wiki/Test_cricket 
       Title:  "Test cricket - Wikipedia, the free encyclopedia"
       Referrer: http://www.google.com/search?q=test 
       Time: 249 secs
It's really just an extension of PageRank: seeing which links are being clicked on, not just which links exist. Whether MS should be capturing this data under false pretenses is another issue.
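
To illustrate, a hedged sketch of that reverse engineering in TypeScript (the record shape is inferred from the session above, not from anything Microsoft has disclosed):

    interface Visit {
      url: string;
      title: string;
      referrer?: string;
      secs?: number; // dwell time before the next navigation
    }

    // Pair each visit that carries a search-style referrer with the
    // query term embedded in that referrer.
    function pairClicks(visits: Visit[]) {
      const pairs: { query: string; clicked: string; secs?: number }[] = [];
      for (const v of visits) {
        if (!v.referrer) continue;
        let q: string | null = null;
        try {
          q = new URL(v.referrer).searchParams.get("q");
        } catch {
          // referrer wasn't a parseable URL; skip it
        }
        if (q) pairs.push({ query: q, clicked: v.url, secs: v.secs });
      }
      return pairs;
    }

Run over the session above, this yields ("test", test.org.us, 2 secs) and ("test", the Test cricket article, 249 secs) - the weighted-click signal described.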


If this is the case, then it's rather easy to stop Microsoft from doing this. Just use POST instead of GET in the search page if you detect the browser is IE8. The referrer will always be the generic http://www.google.com/search with no search term information.
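
Roughly this, client-side (a hypothetical sketch; in practice you'd do the UA check server-side when rendering the form):

    // Switch the search form to POST for IE8 so that result clicks
    // carry a bare referrer with no ?q= parameter in it.
    const searchForm = document.querySelector<HTMLFormElement>("form[action='/search']");
    if (searchForm && /MSIE 8/.test(navigator.userAgent)) {
      searchForm.method = "post";
    }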


Using POST instead of GET is not a good idea for a search results page. Most likely, the user would have to click through a dialog box ("are you sure you want to resubmit this form") every time the browser back button is used to return to the search results. Even if Google used redirects to circumvent this problem, searches won't be saved in the browser's history, which is a bit inconvenient.

Changing from GET to POST in IE8 would stop Microsoft from mining data in the short term, but would drastically decrease the UX of Google for a large portion of its users, who could very easily switch to Bing.


OK, so for IE8 users return a single-page AJAX app instead, as a variant of what's already done with Instant. Still no referrers, but no POST warning messages (which, BTW, has to be one of the most annoying things about ASP.NET web forms - postback was a boneheaded design decision from someone who didn't understand HTTP.)


Topic drift: but I'm convinced that ASP.NET postback was a very deliberate design decision from someone who perfectly understood HTTP - and is intentionally obscuring HTTP in order to keep programmers and implementers ignorant of it and dependent on the Microsoft ecosystem.


Twenty Google engineers were charged with getting Bing to hit the honeypots and given several weeks to do so. Balancing the oddity of the query terms with the level of resources Google threw into the operation, I'm not sure if 7/100 is impressive SEO or unimpressive.

What I do know is that the original suspicious result for "torsoraphy" is returned by Wikimedia's search algorithm. http://en.wikipedia.org/wiki/Special:Search?search=torsoraph...


No, Wikipedia admin Nihiltres intentionally created the "torsoraphy" page (at 3:03 today, no less) and made it redirect to "Tarsorrhaphy". It's not the search algorithm being clever.

http://en.wikipedia.org/w/index.php?title=Torsoraphy&act...


"a pretty clear indication that they've special-cased clicks from google" - just like pretty much every piece of web analytic software since awstats has done for over 10 years, right? To imagine that _any_ half serious search engine isn't special-casing any referrer information it can get it's hands on from the top half of http://www.alexa.com/topsites is surely naive?


Well, no.

Imagine I am the Bing toolbar and I am spying on my users.

The Google search result for fake word zxczxczxczxczx is then a web page with a single link, and the user I am observing is clicking that link. So I am Bing, and I make a connection: zxczxczxczxczx (which appears on this web page) is related to said link.

Since the word is artificial and doesn't appear anywhere else, it's eventually going to produce that page as a search result.

So I think the author is right - what Google has done is they have proven that Bing toolbar tracks what users are doing, and sends the results back to Bing to improve their search.

They have not proven that Bing treats google.com differently from any other page. For that they should have seeded 100 random web pages from different domains with artificial words, and have these pages contain a link which the user with Bing toolbar installed then clicks. I bet that it would yield the same result as the Google test, i.e. those fake words would get matched to the linked web pages.


zxczxczxczxczx isn't on the web page. It's on the search results page, but that page is off-limits to crawlers per robots.txt. Either MS is ignoring robots.txt, or they're parsing the URLs.


From my reading of the article, zxczxczxczxczx isn't in the search results page either; it was special-cased at Google to display the honeypot page for that specific query.

But is Bing "parsing the url" any different to what Google's doing when I go into Google Analytics -> Traffic Sources -> Search Engines, and select Keyword in the second dropdown (after source)? Google are clearly showing me the search terms parsed out of the referrer urls from people who found my site in Bing.

I think this is perhaps not obvious behaviour to the general public, but surely pretty much anyone who goes to the trouble of installing the Bing (or Google) toolbar has worked out for themselves that they're choosing to send data like this to Microsoft (and Google), and that it'll get used to "improve search" if it's found to be useful for that?


No, they haven't worked that out. They're just "making their search experience better". They have no idea what this means.


> parsing the query out of the url

Or parsing any words found in the URL. Not unreasonable to believe.


If that were the explanation, I would think they would come out and say it. (Maybe they will shortly.) Even so, I find it hard to believe that they wouldn't notice that most of the benefit of the technique comes from recreating google results, at which point we're back at the original ethical question of whether that's ok.


> most of the benefit of the technique comes from recreating google results

I'm not 100% sure this is the case.

Consider that a user wants a chicken enchilada recipe.

1. They go to a site that they trust for high quality data. (I like food&wine, saveur, epicurious, rick bayless for mexican, ...)

2. They search for "chicken enchilada"

3. They select the most appealing result (not necessarily the first one), based on author, snippet, rating, photo, etc. Domain knowledge.

If you can associate (via the referrer) "chicken enchilada" with that page, you've encapsulated a lot of information - manual selection of site, manual selection of a page within that site. It's potentially useful (especially for long tail - people who have to go directly to sites because search engines fail them).

Is there too much noise? Perhaps - you'd really have to look at the raw data to find out. Maybe you only want to look at "q=", "search=", and whatever phpBB, etc use. Maybe the signal that rises out of that noise is valuable.

One big downside is that the results would be biased towards the preferences of people who install toolbars. :) The other downside is that SEOs could game the system by feeding MS bad data. (True if they're only looking at google data, too.)


Not only does it not seem unreasonable, but for a company like Google, I find it incredibly hard to believe that at some point they wouldn't have tried it themselves, at least as an experiment.


If I were in a less charitable mood, I would suggest that the reason Google is upset is that they didn't think of it first.

More likely, though, it's a play to get Microsoft to come out and say more about what they track and how.


Actually, unless I'm mistaken, parsing the URL path may well yield good search terms, but the query string is much less likely to. Look at e.g. HN - you'd parse out id = some long number. I may be wrong, but I don't recall many sites that would have good information for a search engine in that string.

What seems far more likely is that Bing is using click tracking on G's results. This was explicitly not denied by their VP on the search engine panel today. If not many sites except search engines have useful keywords in the query string, that pretty much validates Google's complaint.

In fact, if you go to Google's blog post [1], the Bing toolbar specifically calls out monitoring "the searches you do, the websites you visit, [...]" [1]. And the MS guy doesn't deny using clicks on G's search results [2]. In fact, he pretty much just says they copy G on long tail searches.

[1] http://searchengineland.com/google-bing-is-cheating-copying-...

[2] http://www.bing.com/community/site_blogs/b/search/archive/20...


> I don't recall many sites that would have good information for a search engine in that string.

Well, Google is one such site - as in, if you have the Bing toolbar installed and are on google.com/search?q=keyword and click on example.com, then Bing can easily extract "google com search q keyword" and associate it with example.com - without anything explicitly or intentionally relating to Google in their code.

They may also be looking at referrer info.


> I may be wrong, but I don't recall many sites that would have good information for a search engine in that string.

Any site which has a search engine of its own is going to have keywords in the query string such as: http://www.reddit.com/search?q=blah


I think their terms of use say they may pass along the contents of form fields. On a Google search results page, your search query is sitting in a text field.


Suppose it uses all text fields on the web without treating Google specially.

Being as charitable as possible here, I'm willing to believe that some well-meaning engineer coded this up without special-casing Google, tested it out, found that it worked amazingly well, and then launched it. This seems unlikely, but possible.

However, somewhere along the line someone must have known that the biggest benefit of this signal was recreating Google results. I'm not willing to believe that no one figured this out even if it wasn't the initial intent. At which point there's an ethical dilemma. At Google, a system like this wouldn't launch.

Regardless of the mechanism, I don't believe that nobody at Bing knows that this is what's going on. Maybe it's a cynical attempt to get around robots.txt. Maybe it's an honest mistake that gradually became a dishonest mistake, but I'm not willing to believe that they are oblivious.


If you type the query [site:nytimes.com] into Google News, you've recreated a different presentation of a feed of latest news from the NYTimes. It is inherent in the search business that you're collaging material from elsewhere. And for certain heavily-qualified searches – long-tail, few mentions, hapax legomenon/'googlewhack' – a single source is likely to stick out.

Google is unavoidably a giant signal-source on the web. Even if Microsoft instead sent unique keywords to contract writers to build out findable summary/directory web pages one-by-one, what would those writers do? Research via other search engines, starting with Google, and be heavily influenced by the few (or top) results they found, highlighting the same sites. So your results would still percolate outward, via a slower, more expensive, more manual process. (Would that process, laundered through time and multiple agents, meet your ethical standards?)

Such is the nature of Google's position today. As Rich Skrenta of Blekko has put it: "The net isn't a directed graph. It's not a tree. It's a single point labeled G connected to 10 billion destination pages."

A little of Google's proprietary wisdom is leaking back out. The amount seems small compared to all the freely-offered info Google sucked in to create that wisdom. And the proprietary wisdom is leaking back out via the same sort of bulk, automated mining of implicitly expressed preferences for which Google itself is famous. So to me this seems more like karmic balance than an ethical transgression against Google.


There is a huge difference between recreating Google results and incorporating a Google honeypot link into their results through legitimate means. One link isn't a search result; a search result is an ordered list of results. If it's whole search results, then Bing has some questions to answer and deserves some bad press. Otherwise, not so much.


Really? I doubt replicating Google long tail queries is the biggest benefit. User clicks seem like a useful signal of relevance for the same reason links from other sites are a useful signal of relevance.


Long tail queries are the ones which Google could easily demonstrate the effect without any confounding variables. I'd imagine this is affecting all of their ranking, as much as their own click data weighted by volume.


"They're either parsing the query out of the url, or violating robots.txt to fetch the result page, almost certainly the former."

Where's your evidence for any of that?

There's another way. Store the input of any text box which is immediately followed by a form submit (the search term), and then store the href of the subsequent click (the result which the user finds helpful).

The above is a way of handling it (naive and abstract) which doesn't target Google, and in fact would work for insite searches as well, or any search engine.
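
A minimal browser-side sketch of that naive approach in TypeScript (hypothetical throughout; a real toolbar would persist state across page loads and use its own opt-in reporting channel):

    let lastQuery: string | null = null;

    // Stand-in for whatever reporting channel a toolbar would use.
    function phoneHome(data: { query: string; clicked: string }): void {
      console.log("would send:", data);
    }

    // Remember the last text-box value submitted with any form...
    document.addEventListener("submit", (e) => {
      const form = e.target as HTMLFormElement;
      const box = form.querySelector<HTMLInputElement>("input[type='text']");
      if (box && box.value.trim()) lastQuery = box.value.trim();
    }, true);

    // ...and report the next link the user clicks.
    document.addEventListener("click", (e) => {
      const link = (e.target as HTMLElement).closest("a");
      if (link && lastQuery) {
        phoneHome({ query: lastQuery, clicked: link.href });
        lastQuery = null;
      }
    }, true);

Nothing in it mentions Google, yet fed enough users it would reproduce popular Google results for rare terms.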

Since you've already made up your mind that your two ways were the only way of doing things, what's your opinion given the above?


I'm not sure why you are being voted down; the situation could be this. Also, the Bing toolbar could simply be collecting tuples "(search_string_entered_in_toolbar, subsequent_href_clicked)" regardless of the search engine the user has specified and phoning these home... this seems like the easiest implementation (no parsing nonsense) and does not target Google in particular.


Considering they're a search engine, whose job is to crawl the internet and determine the main keywords of any random page on the internet, I would imagine they have text parsing abilities that make it possible to infer the topic of any arbitrary page.


Google's outrage seems silly to me. They made a fortune harvesting other people's judgments on relevant webpages. That's what PageRank is.

So Microsoft takes their search results - it's all in the game! And Google wrote the rules!


But this is a matter of core functionality. Bing should not be using clickstream data from Google searches as a signal. The core offering of Google search is getting used by its competitor, without any citation. Yes, it's a side effect of the Bing bar's clickstream tracking, but it's still an observable, quantifiable effect.

It’s like if Developer A takes Developer B's code and passes it off as his own work without any citation. Does it matter if it was malicious or incidental? Not really, it's still wrong. And if Developer A gets caught doing it, he should own up to his mistake and fix it.


The way I see it, it's like if B builds entire applications by pasting snippets from Stack Overflow without citation, and then pitches a holy fit upon realizing that one of his/her amalgamations of Stack Overflow snippets has in turn been cut and pasted.

The boundary just seems arbitrary to me, which makes the whole thing seem hypocritical.


How so? Bing didn't steal any code from Google to create this. The most you can say is that A uses B's site responses and puts them into their own results, something like 8% of the time.


"Bing should not be using clickstream data from Google searches as a signal."

Why not?


And take a look at inline dictionary definitions, page previews, images - the entire business is based on 'copying' other people's info. My question was genuine - what are you thinking, Google?


Well gosh, developing ever more elaborate versions of the algorithm must have cost them tens, maybe hundreds, of millions of dollars over the years. How the hell are they supposed to cover that kind of expense with just billions of dollars per quarter?


OK, out of curiosity I actually tried installing the Bing bar (on IE9 beta, which, btw, makes you manually activate add-ins after they're installed). The installer for the add-in itself is pretty upfront about sending your click data and stuff - it's right there on the one and only options page, next to one of three checkboxes - though the box is checked by default, which I think is dubious.

I haven't been able to influence the Bing search results (no surprise there, since I've only spent a few minutes on it and not weeks like the Google folks) but one thing I did find very interesting was that if you search for something on another site, the Bing bar actually lights up and populates its own search field with your query so that with another click you can search for it on Bing.

As far as I can tell, the bar doesn't seem to be using any heuristic to tell what is a search query but just has a list of sites/URL patterns it knows about. Besides Google and Bing itself, these include Wikipedia, Yahoo, Ask.com, Amazon, Facebook, eBay, YouTube, MSDN and IMDB, but not Twitter or DuckDuckGo. This doesn't prove anything about how it's feeding search results of course but it does at least suggest that the Bing bar is very interested specifically in search queries, though not only on Google.


I think the key to all of this lies in the code for the Bing toolbar (and the code that parses its data).

If I search twitter for "binggate" then click on a link, the referrer would be:

http://search.twitter.com/search?q=Binggate

It wouldn't be hard to write a generic parser to detect URLs that look like search queries, and it'd be a novel way to gain a lot of information from private indexes (stack overflow, twitter, lucene/solr setups, reddit, etc).
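
For instance, a sketch of such a generic detector in TypeScript (the parameter names are common web conventions, not anything confirmed about the Bing bar):

    // Does this URL look like a search-results page?
    // If so, return the apparent query term.
    const QUERY_PARAMS = ["q", "query", "search", "s"];

    function tryParse(url: string): URL | null {
      try { return new URL(url); } catch { return null; }
    }

    function looksLikeSearch(url: string): string | null {
      const u = tryParse(url);
      if (!u) return null;
      for (const name of QUERY_PARAMS) {
        const value = u.searchParams.get(name);
        if (value) return value;
      }
      return null;
    }

    // looksLikeSearch("http://search.twitter.com/search?q=Binggate")
    //   => "Binggate"

A parser like this would treat Google, Twitter, Stack Overflow and any Lucene/Solr-backed site identically.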

If this is what Bing is doing, then kudos to them for clever thinking, no foul play.

However, if the toolbar has any logic to specifically parse Google's URL syntax, or if they're filtering and correlating google.com URLs against their own algorithms, then it's copying and foul play.

The surprising thing here is how much publicity Google's giving the issue knowing that Microsoft could shoot them down in no time with a few code snippets (if there's no foul play going on).


> It wouldn't be hard to write a generic parser to detect URLs that look like search queries

vs

>However, if the toolbar has any logic to specifically parse Google's URL syntax

So they're either generically copying from ANY search engine, or they're specifically copying from Google. Either way, the perception and the actual outcome are the same. I'm not sure either is inherently good or bad, but I'm surprised that there is such a delineation in people's minds.


Interesting.

Take a site-specific search index like the one that reddit uses. Such an index can prioritize by votes, comments, users, fine-tuned spam filters, and so on. A search engine specifically optimized for reddit has a lot more information available when returning links than a generic web crawler ever could.

If Bing came up with a generic way to leverage site-specific search engines and help drive traffic to those sites, is that still perceived with negative connotations?


Excellent point. When I solve problems like this, I usually wind up with a general-purpose solution, then test it out and, if necessary, optimize for important cases.


This piece is so bogus. If, as he suggests, Google's query data is accidentally getting caught up in some larger Bing project, then why doesn't Bing just say that in their numerous posts and tweets about the scandal? Instead, they've just thrown mud at Google.


I think the biggest question here is: if a site's robots.txt file prohibits bots from gathering data, should that also prohibit bots that piggyback on real user sessions?


>if a site robots.txt file prohibits bots from gathering data,

Pages in a robots.txt file still get put into Google's index they just don't get parsed. Surprised me to find that out; http://www.seomoz.org/learn-seo/robotstxt (see "Why Meta Robots is better than robots.txt").


I think it's an interesting question, but I think it will be technically difficult to differentiate between a person using data (say, with a plugin that indexes everything they browse, or a separate program that indexes their browser cache) and something like the Bing toolbar.


Good question. robots.txt was designed for an age before ajax and the proliferation of plugins/toolbars/etc. It's a valid shortcoming you've identified, imho.


Agreed. That ambiguity is why Google's PR response makes sense.


Can we please, please, PLEASE not call this Binggate?

Not everything needs a pithy one-word term of endearment.


Bingapocalypse?

Oh wait, that's still one word.


Very good post. I'd like to suggest another explanation: they really *do* see what Microsoft is doing as cheating, and expect others to share their outrage. When you're "in the bubble", talking only with people who share your perspective, it's easy to believe everybody thinks the way you do. And Google's known for being smarter than everybody else when it comes to search, so at some level people there probably believe that the only way anybody could get results as good as theirs is by cheating.


A bubble like Microsoft Research for instance?

The comments on this subject seem very polarised and it is interesting to look at the background of the commenters.


Yes, although in some ways MSR is less in the bubble than the rest of the company -- people go to conferences and are aware of what's happening elsewhere. Microsoft's bubble applies just as much here as Google's: I'm sure a lot of folks there can't understand why Google or anybody else would think there's anything wrong with what they did here. When bubbles collide.

My charter back in 2006-7 was "game-changing strategies", which meant getting people to think outside the bubble. So we did a lot of work analyzing Google's, Yahoo's, and Microsoft's corporate culture. One of the things that's core to Google's identity is preferring algorithms to anything that has to do with people, and that's very much on display with Binggate.


My first thought was that it was click data, and not outright copying. If a bunch of Google employees with Bing toolbar start clicking on links to some made-up term, that should spike the data enough to change results within a few weeks. How can Google positively rule that out without internal knowledge of how Bing works?


With a control. Launch a fake search engine with similar spiked results but with no traceable connection to Google. If the results only show up on Bing from Google-clicks and not from clicks on the control, then it's a good indicator that Bing have Google-specific code. Then run the test a few more times with more than 100 queries (which in SE land is a minuscule test set).

From what they've said, it seems like they only tested against fake clicks on google.com. That tells us Bing are using click data but nothing more. This is a pretty simple debugging technique, which is why I'm shocked if Google didn't think to do that. I really wish I wasn't the one saying this, wish I didn't have these doubts. But I can come to no other conclusion than the ones I've outlined in the post.


If you can't come up with any other conclusion, I'm not sure you're really sorry about having those doubts. Here's a few alternatives off the top of my head:

0. There might or might not have been a control group, but its results didn't matter since:

0a) the whole purpose of revealing this was to make sure that blackhat SEOs could start abusing the system, and MS would thus be forced to stop doing it.

0b) the whole purpose of revealing this was to shame MS into stopping the practice, and for that purpose it didn't matter whether the system was specific or generic.

0c) describing the full experimental setup and the gazillion things that were tested would just have distracted from the core story

1. There was a control group and it suggested that the mechanism wasn't generic, it just wasn't mentioned because:

1a) it was held back as a gotcha in case Microsoft started lying about what the system actually did.

1b) positive evidence from the control group couldn't actually prove anything, it'd be suggestive at best.

2. There was no need for an explicit control group since:

2a) they actually observed the network traffic of the toolbar, and it only sent the relevant information for Google and not other sites.

2b) they disassembled the toolbar and found out it had Google-specific code related to this.

3. There was a control group and it suggested that the mechanism was generic, but:

3a) they thought that the mechanism being generic didn't matter, and what MS were doing was still equally dodgy.

3b) they thought that MS would not be keen on trying out a "oh no, it's not just Google whose algorithms we're leeching off when spying on users, it's every other site too" PR strategy.

I have no idea of what actually happened. In all likelihood it was something not listed here, since these were just random ideas. But at least I think many of them are way more plausible than the silly "maverick super-senior engineers botch the job, leak a flawed story, PR coverup follows" theory.


That sort of "control" wouldn't necessarily be conclusive; Bing could be using click data weighed by a prior calculation of how trustworthy the clicked-on site is (e.g. an analogue of PageRank score), whereupon clicks on Google would rank much higher than clicks on some newly-created fake search engine.

I guess my point is, unless you have deep knowledge into the secret workings of a search engine with hundreds of inputs, simple "tests" to figure out the nature of just one of those inputs are bound to end in multiple plausible explanations for your results, unless you rig a huge fraction of the WWW. And Google has better things to do with their time.


I like to imagine that the Bing algorithms are so smart that they realize the preciousness of the signal in clicktrack-logged visits to websites with a referrer of "http://www.google.com/search?q=%s". This single feature is given so much weight that Bing unintentionally gives the appearance of wholesale Google duplication.


The article was interesting, but using Google and Bing as plurals really disorients me. I had to stop and mentally substitute every time he said something like "Google are thinking".


That's standard practice in the UK when referring to companies.


Thanks, I was beginning to wonder if I was 'right' or not. I didn't know the standard was singular in the US.


I think it's just the British English way of referring to companies or groups with plural verbs: http://en.wikipedia.org/wiki/American_and_British_English_di...


Huh. I'm British and I would always use singular for a company (not for sports teams necessarily ref: that wiki link). Are we really supposed to say "Google are having an earnings call on Tuesday"? I would say "Google is having an earnings call on Tuesday".


They're using the company name as a collective noun or as shorthand for "people at company X or product Y", makes sense to me (a Brit).


Finally some sense in all this.


Really interesting comment at Reddit about what Google might be thinking:

http://www.reddit.com/r/programming/comments/fd3g9/google_bi...

Apparently by creating fake results they've published something "creative" and thus potentially able to be copyrighted. Just speculation, but interesting.


The Reddit commentator appears to be operating on the idea that trap streets and the like are copyrightable under US law, which is a fairly common urban legend.

In reality adding a fake "fact" to a collection of real facts does not allow you to sue someone if they copy the fake fact along with the real facts; from the decision in Nester's Map & Guide Corp. v. Hagstrom Map Co., treating "'false' facts interspersed among actual facts and represented as actual facts as fiction would mean that no one could ever reproduce or copy actual facts without risk of reproducing a false fact and thereby violating a copyright ... If such were the law, information could never be reproduced or widely disseminated"


In Europe there are specific database IP rights.

Re the decision you cite, it seems spurious. The notion of whether the data is factual or not is irrelevant. Was the presentation copied?

>would mean that no one could ever reproduce or copy actual facts without risk of reproducing a false fact

Unless they actually did some work and checked the facts rather than making a slavish, infringing copy of someone else's work.

>If such were the law, information could never be reproduced or widely disseminated"

For limited terms of never (not limited enough mind you but nonetheless limited).

This is interesting though - if you can copy facts with impunity then can't I copy the "fact" of the score for Katy Perry's Firework for example?


The guy's phone book example is wrong, though. Putting fake entries in a phone book does not make it subject to copyright. Google or Bing for "Feist vs. Rural Telephone" for the leading case in the US on this.


>Apparently by creating fake results they've published something "creative" and thus potentially able to be copywrited. Just speculation but interesting.

And then given Bing explicit permission to copy it by accepting the terms and conditions of the Bing toolbar while installing it.

If you repeated the experiment with two made-up results instead of just one, but clicked on only the second one, only the second one would show up and not the first. They are copying the user action (for which they got permission), not the results directly.


9/10 desktops run Windows and IE still has almost 60% market share, despite being much worse than Firefox or Chrome (yes IE9 is a huge improvement). Microsoft could do a lot of damage to Google by leveraging their desktop monopoly. I think Google has around 70% search market share, but they're easily replaceable. By the time the courts sort it out, the damage will be done.


Why does he think that Google were not aware of the clickstream data when they set up (several) experiments in which they specifically installed what they thought were the sources of the clicks and then purposefully clicked?


It's okay for MS to parse user input through its toolbar, fine. However, it's not okay for Microsoft (or anyone) to use that toolbar to figure out what response any given server sends in reply to any kind of request - unless it's part of a documented feature. You're only seeing the part where Microsoft is grabbing user input and not the part where Microsoft is grabbing Google data.


I think Google is right to be upset here. Google's biggest asset is their search results and Microsoft's response seems tepid at best. Effectively MS is claiming that they don't scrape google search results, instead they merely constructed an automated device which essentially does exactly that. If Bing doesn't see what's wrong with that then they are not terribly smart.


I've written a followup post addressing some of the common reactions to my first: http://www.puremango.co.uk/2011/02/what-are-google-thinking-...

On HN: http://news.ycombinator.com/item?id=2169690


I have a feeling there is more to this story than meets the eye. Either Google did more tests than they mention on the blog and proved irrefutably that Bing is literally scraping their results, or there's some underlying political stuff going on that we're not privy to.


Never mind about proving it - Google isn't even saying it.

It's not scraping, it's click tracking.


Can't someone just analyze the Bing toolbar binary and figure out what's actually going on here?


The Bing toolbar only sends information back to Microsoft, it doesn't decide what to do with it. For this kind of a task it would be simpler to use a packet sniffer anyway.


Thanks for the clarification. Does the toolbar send a notification to Microsoft on every click? Or only clicks on Google SERPs?


The data's sent back via SSL, from what I've read, and disassembling the toolbar is not going to be practical either.


On the other hand, if it uses the OS HTTPS infrastructure, someone could probably add a root cert to IE and "MitM" themselves. If it doesn't, it's just a matter of finding and replacing the key; while that's not completely straightforward, it has been done.

Darn, now it's really tempting to fire up some VMs. I don't have time for that.


I'm in Venezuela and it just didn't happen. I've parsed HTML results from Google a couple of times, and Google varies results by IP address, browser, and language. So I think it's pretty wrong to say something like this without being 100% sure.


Whether it's wrong or not, it does make you wonder where Google's priorities are. How much time (cumulatively) did they spend on this and could they have spent this time improving their algorithms instead? Maybe it took a few hours, but still.


Is the next push in SEO to pay users to navigate from search results or other sites to your site?


[deleted]


Do you seriously think that a copycat service beating out and killing off the actual source of innovation is somehow a good thing?


[deleted]


In case you missed it, innovation is expensive.

If a copycat service can immediately clone the innovator's work and provide the exact same service, then they are likely to outcompete them, because they have much smaller costs.

To suggest that such freeloading should be encouraged removes any incentive to try and innovate in such areas.


[deleted]


You make something innovative, truck along carving out your niche for a few years, then a big company rips off your system and uses its massive marketing budget to crush you. Does not getting to massive-multinational stage in a few years mean you've done something wrong?

Or are you suggesting that it's only okay because they're doing this to Google instead of a little guy?


It's not necessarily a bad thing. That's what happened to the original IBM PC and it made PCs cheaper and led to the PC revolution. Did it suck for IBM and Apple? Yes, but the benefit to the world was immense.


Really? Would it be OK for MS to send home a copy of every Google search results page that its browser [bar] sees and then use that as their results? Where do you draw the line?



It's tangentially interesting that the number 2 result on Bing for 'google' is a Washington Post semi-hit-piece regarding spam. Google's results for 'bing' contain marginally less transparent attacks.


I'm surprised Google didn't just use this to poison the results.


I just tried what the blog said and it's simply not happening.


I honestly don't get the controversy of this thing.

Let's say Microsoft, via its Bing toolbar, is checking what pages people are visiting, what search terms they are using on various sites (including Google) and what they deem the most useful results of those queries, and incorporating this into their search engine. Is this really so bad?

As it stands, Google is collecting this data about you almost everywhere on the internet, whether you use Google or not. Besides straight Google search, think of websites which use Google's CDN for stuff like jQuery, Google Analytics, etc. Google is everywhere and they are collecting data to incorporate into their search and ad platform from everyone. End of story.

Heck, with Google Instant and their new JS-based search GUI you can't even get the referrer information on your own website to see what search terms led users to your site. You now have to use Google Analytics to get that information, and in getting that information you are helping Google even further in tracking everything everyone is doing on the internet. WTH?

Relatively speaking, is really Microsoft the bad guy here?



