
Microsoft do NOT steal Google's search results. They use clickstream data from Bing toolbar users to improve their search results.

What happened is that Google engineers engineered a use case in which the signal from that clickstream looks like stealing of search results. In other words, Microsoft uses Google users' behaviour in its ranking algorithm. But that's not unethical, and Google is doing the same thing.




> They use clickstream data from Bing toolbar users to improve their search results.

If that leads to Bing absorbing Google's results, and eg. suggesting spelling corrections they would have never figured out except that Google thought of them first, then they are indeed stealing results, whether they meant to or not, and need to stop.

George Harrison didn't mean to copy the song "He's So Fine" when he wrote his song "My Sweet Lord," but he lost the lawsuit anyway and had to pay damages. Whether you call it "clickstream" or "user behavior," Bing is incorporating an association that could only have come from Google, and Google's robots.txt makes it very clear that robots are not allowed to mine search results.

> But that's not unethical, and Google is doing the same thing.

As has been mentioned many times, Google does not take user behavior from the toolbar as a signal for ranking.


> If that leads to Bing absorbing Google's results, and eg. suggesting spelling corrections they would have never figured out except that Google thought of them first, then they are indeed stealing results, whether they meant to or not, and need to stop.

Even if that's indirect information? For example, the Bing toolbar doesn't really say (in Gargamel's American voice for effect), "Nyhaahaha! I see these google search results! I will steal!" No, rather it says, "Ahh, after the result of a query to this site, we then leave the site to go to this page. If that query and this destination page appear in aggregate, Bing should take notice."
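
A rough sketch of the aggregation described above, not Bing's actual implementation. It assumes the toolbar reports (previous URL, next URL) pairs and that the query text sits in a `q` parameter; the function names, the `min_count` threshold and the example hosts are all made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs


def query_from_url(url, param="q"):
    """Return the search text carried in a URL's query string, or None."""
    values = parse_qs(urlparse(url).query).get(param)
    return values[0].strip().lower() if values else None


def aggregate_pairs(events, min_count=2):
    """Count (query, destination) pairs; keep those seen at least min_count times."""
    counts = Counter()
    for previous_url, next_url in events:
        query = query_from_url(previous_url)
        if query:  # pages without query text contribute nothing
            counts[(query, next_url)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Hypothetical clickstream: (page the user was on, page the user went to next).
events = [
    ("http://search.example/search?q=Foo+bar", "http://a.example/page"),
    ("http://search.example/search?q=foo+bar", "http://a.example/page"),
    ("http://search.example/search?q=foo+bar", "http://b.example/other"),
    ("http://news.example/front", "http://a.example/page"),
]
signals = aggregate_pairs(events)
```

Only the pair that shows up repeatedly survives the threshold. Nothing here reads the contents of the results page; the signal is the query text plus where the user went next.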

When I first learned how the Bing toolbar did this, I had two reactions: "Oh, that's clever. It's like engaging your users to help you be a directed crawler. It's 'querytext', which is not unlike google's innovation, 'linktext'." and then "But I would still never install the Bing toolbar. Man, that thing is a sad clown show." Honestly, it's one of Bing's smarter ideas.

There is only so far pure crawling and statistical analysis can go; there simply isn't enough data there and everyone knows how to use that data to great effect. Every search engine is incorporating new realtime communications and user behavior streams into their search results. Google certainly does this, albeit without a toolbar. Bing simply has the entire Microsoft software stack to lobby for help, so if someone opts into the Bing toolbar, they can opt into submitting additional information to improve the Bing index.

I'm not a big fan of MS or Bing, but in this they are only culpable for being clever. They are using querytext to improve relevance. I'm told you could make the same stunt work from a Wikipedia search box, leading to a wikipedia page.


Microsoft uses behaviour

I doubt they are only targeting Google. But I completely agree, the data is not being mined from Google, it's a relationship for search term a to page b created by the user, not stolen from Google.


Google is the one creating the association, not the user. The relationship between term a and page b was created by Google, and confirmed by a user. People don't click on search results that aren't there.


This is the most important comment I've read so far on this issue. What everything comes down to is whether user generated associations between searches and results via clicks are fair game as a signal.

As this comment points out, there is a decent argument that searchers on Google or other engines aren't the ones creating the associations, and therefore perhaps Bing has no right to use that data.


Google is not creating the association. Users are creating the associations, Google is discovering them and presenting them back to other users, and subsequent users are confirming the associations. In this case Google engineers, masquerading as users, created associations, then Google presented those associations back to the user, and Bing noticed the association between the term and the page.

The relationship between a term and a page was not created by Google, it was created by users. Google just indexes everything and makes note of these associations, but it does not create the link in the first place.


Users are creating the associations, Google is discovering them and presenting them back to other users, and subsequent users are confirming the associations.

Going down this road of debate, we'll be getting into the semantics of "inventing" vs. "discovering."

Turning data into ranking is the whole purpose of a search engine, just as turning data into theory is the purpose of science, and turning experience into art is the purpose of art. Essentially here, we're debating whether search ranking is more like science, in which there's a correct answer that you are uncovering, or like art, in which all of the product is subjective.

Having worked in the field for five years, I'd argue that it is far closer to art than science. Google's rankings are its subjective determination, and the courts have agreed that Google's rankings are its constitutionally protected speech.

Though the data may all ultimately come from the human-created internet, the transformation of that data is still important, and subjective. To claim otherwise is to miss the whole point of search technology in the first place.


My issue with it is this: Had Google not created a better algorithm would those users be clicking on those links? If Google shut down today would Microsoft be able to associate those sites with the search terms?

I should think the answer to the above is 'no' in both cases, which is why this is cheating. It's probably not illegal, but I find the practice to be unethical.

Ask yourself this: If Google shut down today, would Microsoft be providing more relevant, equally relevant or less relevant results for those searches? If the answer is less relevant, then I think it's clear there's been a lapse of ethics.


Ask yourself this: If Stackoverflow shut down today, would Google be providing more relevant, equally relevant or less relevant results for programming related searches? If the answer is less relevant, then I think it's clear there's been a lapse of ethics.


The difference is that Stackoverflow says "please index us" while Google says "please don't index us".


and they don't need to index or crawl google pages to do what they are doing.


I don't think it's so clear. Is drafting unethical?

In any case, for all you know, if everyone started using the Bing Toolbar, it may provide better clickstream data, causing more relevant Bing results.


But the effect is the same as intentional scraping and outright stealing: search results that could have only come from Google are appearing as Bing results.

Bing needs to blacklist Google from its clickstream. Simple.


...the effect is the same as intentional scraping and outright stealing: search results that could have only come from Google are appearing as Bing results...

That's one effect. While it's vivid, it might be a tiny side-effect only notable in contrived cases.

Overall, this kind of URL-after-URL signal, extracted from every participating user, and every trail through both search sites and non-search sites, might be discovering valuable terms-in-preceding-URL-to-later-visited-URL associations. These associations might result in many search improvements, other than the one-for-one result porting Google's experiment has found. We don't really know the relative magnitude of porting-results versus other-benefits, yet I think that's important to the analysis.

If a useful automated or user-driven process generates a little indirect infringement around the edges, is that enough to demand the process be stopped entirely? Note, that's not the standard Google wants applied to user uploads to YouTube, or excerpting of news and websites onto Google services. Google says: "defend yourself, by adding opt-outs (robots.txt) or sending takedown notices, and we'll undo the incidental infringements eventually".


the effect is the same as intentional scraping and outright stealing

The google engineers intentionally sent this click data to Bing, so is Bing really stealing? It's odd to act surprised when Bing uses the data that was intentionally sent to it. Bing could specifically ignore Google search results pages when it is tracking clicks, but is that legitimate? Google scrapes everything, why shouldn't Bing?


They were testing to see if it was true. Many users are doing it, bing is taking data for non-google engineers too.


The point of collecting click data is not to target google engineers, it is to collect data from masses of people doing regular searches and to improve them by seeing which links get clicked on, so obviously Bing is "taking data for non-google engineers". Furthermore, there's no indication that google search results pages are even distinguished from other pages in this.

In fact, the data that it takes from Google engineers for carefully engineered corner-case searches is the exception.


I don't even think they should just blacklist Google. They should just respect robots.txt.

edit: I should have clarified. I know that the Bing crawler likely respects robots.txt, but if they are using clickstream info to build their index, it seems right that they should respect robots.txt there as well, no?
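
For what it's worth, here is a minimal sketch of what "respect robots.txt in the clickstream" could look like, using Python's standard `urllib.robotparser`. The robots.txt text, the `toolbar-sketch` user agent and the example hosts are assumptions for illustration, and a real version would need one robots.txt per host rather than a single string.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt for one hypothetical search site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if robots_txt would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def filter_events(events, robots_txt, user_agent="toolbar-sketch"):
    """Drop clickstream events whose source page is off limits to robots."""
    return [
        (src, dst)
        for src, dst in events
        if allowed_by_robots(robots_txt, user_agent, src)
    ]


events = [
    ("http://search.example/search?q=foo", "http://a.example/page"),
    ("http://search.example/about", "http://a.example/page"),
]
kept = filter_events(events, ROBOTS_TXT)
```

With these rules the event that starts on a `/search` page is discarded and the other one is kept.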


I'm pretty sure the Bing Crawler does respect robots.txt. The data Bing collected didn't come from spidering Google.


You could strongly argue that collecting clickstream and other user browser session info via a toolbar is not a form of web robot (crawler, spider, etc.), and thus robots.txt does not apply.


I agree with your comments that toolbars should respect the robots.txt because even if a human is doing the crawling, it is still an automated system that is indexing information from that site. I would not want toolbars attempting to send data back to Bing based on my queries on a company Intranet or a site that would normally not be indexed. Personal data entered into what the toolbar thought was a query field could be sent onward as well even if the robots.txt on the site restricted it. I think they should respect robots.txt in this case even if they are only monitoring user behavior.


Would Bing then get called out by Google for the inevitable 'anticompetitive' lowering of Google's Bing search rankings?


Actually, Google denies that they do the same thing pretty emphatically. Basically, they claim that the same test in reverse would not work, because the data stream from their toolbar in a reversed test would not affect their algorithms in this manner.

But agreed that it seems like a very small nit to be magnified the way it has. Indeed, why they don't do this, and why we should care that Bing does, doesn't seem to be directly addressed other than by hand-wavery and PR speak about the research they've put into their algorithms and such.


I do not understand this.

* Are Microsoft saying end users are naturally clicking on nonsense phrases invented by Google?

* If not, what are they saying?


What's not to understand? Microsoft is collecting search phrases from searches people make through the IE search box. Then they are tying that data to the next page people click to after the search. Google intentionally polluted their data using odd phrases and asked users to click on their bogus results. Given that the only data Bing would have for those strange phrases is polluted crap from Googlers, it's not surprising that their engine, which uses user-supplied data, would attempt to return the only match it can find.


Since nailer's comment on my other post can't be directly replied to.

From Google's mouth

http://googleblog.blogspot.com/2011/02/microsofts-bing-uses-...

"We gave 20 of our engineers laptops with a fresh install of Microsoft Windows running Internet Explorer 8 with Bing Toolbar installed. As part of the install process, we opted in to the “Suggested Sites” feature of IE8, and we accepted the default options for the Bing Toolbar.

We asked these engineers to enter the synthetic queries into the search box on the Google home page, and click on the results"

In all honesty I'm a bit wrong as they did say they used the google box on the page.


What may be misleading about Google's statement is that it paints the picture of engineers hunched over laptops patiently typing in nonsense queries a keystroke at a time.

I guess I just have trouble seeing it taking more than about 3 minutes before 20 of the top industry engineers figured out a way to automate the process. Which is pretty long compared to the 2 seconds it would take for them to start thinking of ways to improve the SEO.

It's the human factor in Google's "experiment" that just doesn't fit. If they wanted a controlled approach, they would have written an application and run it and logged everything. Instead, it appears that they provided laptops so that the engineers could experiment and innovate their way to exploiting the Bing toolbar.


> In all honesty I'm a bit wrong as they did say they used the google box on the page.

Thanks - that's what I've been getting at. This isn't data entry into the Bing toolbar, it's into a non-Bing page when one has the toolbar installed.


> What's not to understand? Microsoft is collecting search phrases from searches people make through the IE search box.

This is not correct. From Google:

"We asked these engineers to enter the queries into the search box on the Google home page"

Google did not enter the data into the IE search box.

Edit: I see you replied to yourself acknowledging the mistake - please ignore this then!


> Google asked users to click on their bogus results.

Got a citation for that?


We asked these engineers to enter the synthetic queries into the search box on the Google home page, and click on the results, i.e., the results we inserted

http://googleblog.blogspot.com/2011/02/microsofts-bing-uses-...


Engineers are generally not considered end users.


The test wasn't natural. A number of Google people did click on the links artificially associated with the nonsense phrases, while they had Bing's toolbar active.

If the Bing toolbar is picking up on this sort of thing generically (i.e. picking keywords out of the query-string on any page and associating them with clicked links, though I'm not sure how it could with a useful degree of accuracy in a way that couldn't be "maliciously" gamed by underhand SEO activities) then I see nothing wrong in it as long as the users have knowingly opted in to their activity being analysed in this way. It would just be indexing keywords and content just as a web spider would.

If it is specifically detecting that it is on a Google page, and/or other search competitors, then the issue is much more cloudy.
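
To make the two cases concrete, a small sketch under stated assumptions: the "generic" version pulls words out of every query-string value on any page, while the "targeted" version only does so on hosts from a hard-coded list. `SEARCH_HOSTS`, the function names and the example URLs are hypothetical; nobody outside Microsoft knows which of these (if either) the toolbar resembles.

```python
from urllib.parse import urlparse, parse_qsl


def generic_keywords(url):
    """Collect words from every query-string value, without looking at the host."""
    words = []
    for _name, value in parse_qsl(urlparse(url).query):
        words.extend(value.lower().split())
    return words


SEARCH_HOSTS = {"search.example"}  # hypothetical list of known search engines


def targeted_keywords(url):
    """Collect words only when the page belongs to a listed search host."""
    if urlparse(url).hostname not in SEARCH_HOSTS:
        return []
    return generic_keywords(url)


wiki_url = "http://wiki.example/w/index.php?search=George+Harrison&go=Go"
search_url = "http://search.example/search?q=foo+bar"

generic_from_wiki = generic_keywords(wiki_url)
targeted_from_wiki = targeted_keywords(wiki_url)
targeted_from_search = targeted_keywords(search_url)
```

The generic version also picks up the stray `go` from the wiki URL, which is the accuracy problem mentioned above: without knowing which parameter holds the query, noise comes along for the ride.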


As it's been pointed out, Bing compares more than just the clickstream datapoints. You have to imagine they apply some PageRank/domain weighting to the relevance. So if you set up a dummy site with no existing PageRank/weight and performed the same experiment, you likely would not see the same results. However, since you can imagine Google is heavily weighted, those data points score high and can rapidly be reflected in Bing search results.
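
Purely as an illustration of that guess, a sketch in which each click is multiplied by a weight for the site it came from. The weights, the threshold and the host names are invented; this is speculation about how such weighting might work, not a description of Bing.

```python
from urllib.parse import urlparse

# Hypothetical per-domain weights; unknown sites get a low default.
SOURCE_WEIGHT = {"big-search.example": 10.0}
DEFAULT_WEIGHT = 0.1


def pair_score(source_url, clicks):
    """Score a (query, page) pair as click count times source-domain weight."""
    host = urlparse(source_url).hostname
    return clicks * SOURCE_WEIGHT.get(host, DEFAULT_WEIGHT)


def passes(source_url, clicks, threshold=20.0):
    """Return True if the weighted score is high enough to affect ranking."""
    return pair_score(source_url, clicks) >= threshold


from_big_site = passes("http://big-search.example/search?q=foo", clicks=20)
from_dummy_site = passes("http://dummy.example/search?q=foo", clicks=20)
```

Under these made-up numbers, twenty clicks from the heavily weighted site clear the threshold while the same twenty clicks from a dummy site do not.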


But it only worked 7% of the time despite 20 Google engineers' best efforts, the honeypots being ranked number 1, and the data repeatedly sent to Bing.


The Google engineers who planted the phrases were running the Bing toolbar themselves. That part isn't in question.


So MS are saying Google themselves uploaded the data to MS by using Bing toolbar?

Does Google agree their engineers did this?


Yes. From the original article yesterday:

This all happened in December. When the experiment was ready, about 20 Google engineers were told to run the test queries from laptops at home, using Internet Explorer, with Suggested Sites and the Bing Toolbar both enabled. They were also told to click on the top results. They started on December 17. By December 31, some of the results started appearing on Bing.


No. There's a difference between having the toolbar installed but typing something into the google.com directly (which actually occurred) and typing something into the Bing toolbar.


Even if you exclusively use Google's web interface to search, the Bing toolbar gathers anonymous data about your browsing (if you've got that option enabled). It's fairly clear about that when you install it. To my knowledge, all of these search engine toolbars gather statistics on a broader scale than specifically what's typed into their search fields, Google's included.


Dude, I've seen your comments on this thread. Several of them are basically asking people to confirm facts everybody agrees on. You should come better prepared for the next discussion.

Sorry to be an ass. But you're wasting my time on this site, since I have to wade through your questions to get to the interesting ones.


There's something most of HN are missing. You are indeed being rude by not letting me point it out:

* Google having the Bing toolbar installed and entering the search into Bing toolbar is one thing (and I'd expect MS to be using the data)

* Google having the Bing toolbar installed and entering the search into the Google home page is a very different thing (and I'd expect MS to be using the data)

Judging from the moderation in this thread, people seem to think the first happened.

According to Google, it did not. No other source contravenes this.

Sorry if you think me pointing this out is bad. Perhaps your efforts would be better reporting all the non-hacker stories on the front page?


I know this has been posited, but has it been confirmed anywhere that this is the case?



