Bing sets the record straight on recent accusations (bing.com)
169 points by thankuz on Feb 2, 2011 | 186 comments



From the Google blog post:

"This [torsorophy] example opened our eyes, and over the next few months we noticed that URLs from Google search results would later appear in Bing with increasing frequency for all kinds of queries: popular queries, rare or unusual queries and misspelled queries. Even search results that we would consider mistakes of our algorithms started showing up on Bing."

That's the key. Not the honeypot. The honeypot was just a test to see if they could catch them red-handed.

Bottom line is that Google looked at the statistics and found that Bing results were improbably similar to Google results. Maybe they're lying about this, but I doubt it.

There's not enough information to figure out exactly to what extent Google impacts Bing results, but I would bet a lot of money that in the Bing code base there is Google specific code that behaves along the lines of "if on google, do x" rather than some generic code that just targets all sites.


I don't think anything about this is clear-cut. I can imagine a very simple algorithm, as follows:

* When query Q gets made more than N times at bing.com,

* Mine clickstream data for the next M urls requested after searches for Q

* Any url that appears more than T times (possibly spread across some number of users) is presumed to have been found relevant to Q, and derived either from later searches (corrected spelling) or curated sites or other search engines. Add to mapping of valid responses to Q.

It's not a very good algorithm, of course, and if you have any other source of information about Q you're probably better off using that instead. But it or something like it could explain the torsorophy example and every other part of Google's narrative, and it's not particularly suspicious or questionable, and it certainly doesn't involve targeting Google.
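
A minimal sketch of the three steps above, assuming the clickstream has already been grouped into (query, following URLs) sessions. The function name, the session format, and the default thresholds are all made up for illustration; nothing here is known to match what Bing actually runs.

```python
from collections import Counter, defaultdict


def mine_clickstream(sessions, n_min_queries=3, m_next_urls=5, t_min_clicks=2):
    """Map queries to URLs that users repeatedly visited right after searching.

    sessions: iterable of (query, [urls requested after the search]) pairs.
    The N, M and T thresholds from the comment are the three keyword
    arguments; their default values are arbitrary placeholders.
    """
    query_counts = Counter()
    url_counts = defaultdict(Counter)
    for query, following_urls in sessions:
        query_counts[query] += 1
        # Only look at the next M URLs, and count each URL once per session.
        for url in set(following_urls[:m_next_urls]):
            url_counts[query][url] += 1

    mapping = {}
    for query, seen in query_counts.items():
        if seen <= n_min_queries:  # the query must be made more than N times
            continue
        relevant = [u for u, c in url_counts[query].items() if c > t_min_clicks]
        if relevant:
            mapping[query] = sorted(relevant)
    return mapping
```

Note that nothing in the sketch looks at which site the following URLs belong to.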


Except that in this case, the honeypot searches were not done at bing.com.


No, but they must have also been doing regular searches at bing.com in order to test their claim. Where do you think they got their bing results from?


It doesn't matter, because the results (from the honeypot) had nothing to do with the searches, so it would have been impossible to clue Bing in directly by searching on Bing a lot.


Are you claiming that they made the queries without actually making the queries? Reread the first line of my algorithm: once you identify a query Q as more than just a one-off mistake—whether it's an actual new item or a common misspelling or, maybe, a trap—then you decide it's worth looking into.

Put another way, it's impossible not to clue Bing in on at least the fact that you are making these searches.


That is exactly what's being claimed. The queries were not made on bing.com, they were made on google.com. The only way Bing can become aware of the results of these google.com queries is if they're "spying" on the user's activity via the Bing Toolbar and IE8 suggested search features.

From http://goo.gl/Bi0JH (Google blog):

"We gave 20 of our engineers laptops with a fresh install of Microsoft Windows running Internet Explorer 8 with Bing Toolbar installed. As part of the install process, we opted in to the “Suggested Sites” feature of IE8, and we accepted the default options for the Bing Toolbar.

We asked these engineers to enter the synthetic queries into the search box on the Google home page, and click on the results, i.e., the results we inserted. We were surprised that within a couple weeks of starting this experiment, our inserted results started appearing in Bing. Below is an example: a search for [hiybbprqag] on Bing returned a page about seating at a theater in Los Angeles. As far as we know, the only connection between the query and result is Google’s result page (shown above)."


The Bing toolbar tracked user clicks on google.com search results and added them to Bing. Of course the user had the option to opt out.


I can think of an even better algorithm:

1) When a user enters google.com/search? , scrape the page (pay special attention to their spell corrector)

2) Send all the data back to MSFT

3) Profit nicely!

It's simpler, and probably works as well as the best competitor out there ;)


All search engines look for keywords in URLs. That's why domain names with keywords in them are more expensive. If you're going to mine clickstreams from the toolbar, it's not far-fetched to think you would mine the entire source URL, including the query string, for keywords. Then there would be no need to postulate special treatment for Google search URLs, since the keyword is in plain sight (google.com?q=xxxxxx).

If that is what's happening, then the moral/ethical argument against Microsoft would have to be that they should treat Google specially, by explicitly ignoring clicks on Google's result pages. That seems to me to open up quite a can of worms.


I'm wondering, if this is really the case, why doesn't Microsoft just point that out instead of coming up with ad hominem arguments?


To crawl Google URLs of the form google.com?q=x would be to disregard http://www.google.com/robots.txt , which seems like bad netiquette to me.


They aren't crawling, just noticing what pages clients who visit google.com?q=xxx go to next.

If anybody's search toolbar checks a site's robots.txt before sending clickstream data, I would be very surprised.

A client-side robots.txt rule would also make anti-phishing features trivial to bypass...just put a robots.txt on your phishing site.


I'd play devil's advocate here and say that I give Bing a slight benefit of the doubt. Given they cannot perform a parallel Google search on every single user query (this would have been easiest to spot), the technique cannot be of the form "if on google, do x". Rather, it is likely they look at what users search and what results they get as one of the parameters, say normally influencing 0.2% of the search ordering on average (assuming 500 "signals"). In the case of "torsorophy", out of the 500 "signals" only one came up, the Google result (via Bing toolbar from a few users). So, their algorithm makes a note of that, like, hmm... I don't really have anything else so I'd just rehash what users have searched and subsequently visited. So the 0.2% becomes 100%, and internally Bing associates the word "torsorophy" with that website. Next time "torsorophy" is searched it returns that site.

Yes, Bing is likely using Google's data (if available, via Bing toolbar) as part of the ingredients. You can argue whether this is stealing or not, whether this is Google results or user search/visit behavior, but I would say it isn't exactly "if on google, do x".
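
To make the "0.2% becomes 100%" step concrete, here is a rough sketch, assuming the ranker simply renormalizes its weights over whichever signals have data for a query. The signal names and the equal base weights are hypothetical, and this is a guess at the kind of behavior described, not Bing's actual ranking code.

```python
def effective_weights(base_weights, available):
    """Renormalize base weights over the signals that have data for a query.

    base_weights: {signal_name: weight}; available: set of signal names
    that produced any data for this query. Both are illustrative inputs.
    """
    present = {name: w for name, w in base_weights.items() if name in available}
    total = sum(present.values())
    if total == 0:
        return {}
    return {name: w / total for name, w in present.items()}


# 500 equally weighted hypothetical signals, as in the comment.
BASE = {f"signal_{i}": 1.0 for i in range(500)}
```

With all 500 signals present each one gets 1/500 of the total, which is the 0.2% figure. If the only signal with any data is the toolbar clickstream, the same function gives it a weight of 1.0 without any Google-specific branch.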


> I would say it isn't exactly "if on google, do x".

I wish bing had taken this opportunity to answer that question in this blog post. Since they didn't, it makes me more likely to assume the worst. All they would have to say is something like "We use anonymized click data and don't special case Google" to avoid most of the controversy. Then we would be back to debating whether it is ethical to do something theoretically ok even if you know that it practically means you'll be cribbing from a competitor.


One thing is sure. This episode makes Microsoft look like a cheater, but also Google like a child. I rather Google spend more time on combating search spam and content farms than this concerted "caught-in-the-act" ambush. Instead, Google likes to cheapen themselves with endless arguments with everyone (how supporting WebM and Flash is open, how the iPhone world is "draconian", etc).

Just call it "imitation is the most sincere form of flattery" and be done with it, would you, Google? Spend your time improving UI, defend your cash cow against spam, focus on making Android better, be more coherent in your social/location strategies, lead the industry in privacy (not just talk of privacy). All of these will benefit users more.


It is all fun and games, and a bit childish. Obviously google are not offended, just trying to get some PR about it - just business.

But the argument "I rather Google spend.." can be applied to anything - it isn't as if they haven't been tackling other problems, and they are clearly worried about bing (at least in PR, bing spends big - lots of MS employees posting on forums, including here, which I am sure is encouraged). Bing is losing a lot of money for Microsoft, so by that token, I could say that they should stop it, and spend money on xbox (which is fantastic - no idea if it is losing money, but what a leader) - make it stupidly cheap, make it dominate the home etc ...


You can also argue that Bing fits in the "offense is the best defense" category. Hit Google where it hurts so Google can't cannibalize MSFT's core business as quickly. From a customer's perspective, it keeps Google honest.


Let's go one step further from your argument:

Why should Google improve their search spam algorithm? Would they gain anything if MSFT then compares their own results against Google's, to delete the spam farms thanks to Google's work?


I can't agree with that. You always want to improve, even if there is no "competition". Don't worry much about competitors that only imitate, but pay attention to those who innovate and exceed you. So, exceed yourself before others do.


> "the technique cannot be of the form 'if on google, do x' ... Rather, it is likely they look at what users search and what results they get"

How do they look at what a user searches and what result they get? Ok, the "click stream" can see what pages a user visits.

But to get the search term, they are parsing something, and unless they implemented a universal search recognizer that will rank up results from any old site's search (allowing 20 SEO guys to push private SERPs to the top), it seems more than probable the parser indeed would start with "if on google, do x".


"But to get the search term, they are parsing something"

referrer in the http request?


Yup, it's as simple as that :D


Most referrers do not contain search terms. You can't look at REFERER and infer it indicates the subsequent URL is correlated on a given keyword. If that's what you want, you write a parser to look at URLs, recognize human-determined search indicators (q=, s=, search=), and correlate the subsequent URL with the indicated keywords.
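
A short sketch of the kind of parser described above, using only the standard library. The parameter list comes from the examples in this comment (q=, s=, search=); the function names are invented, and this is an assumption about how such a parser could look rather than anything Bing has published.

```python
from urllib.parse import urlparse, parse_qs

# Human-chosen search indicators, taken from the examples in the comment.
SEARCH_PARAMS = ("q", "s", "search")


def extract_search_terms(url):
    """Return the search terms found in a URL's query string, or None."""
    query = parse_qs(urlparse(url).query)
    for param in SEARCH_PARAMS:
        values = query.get(param)
        if values:
            return values[0]
    return None


def correlate(referrer_url, clicked_url):
    """Pair the keywords in the referrer with the URL the user went to next."""
    terms = extract_search_terms(referrer_url)
    return (terms, clicked_url) if terms else None
```

A referrer without one of the listed parameters yields nothing, which is the point: somebody has to decide which parameters count as a search.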

Hand waving that this is "clickstream" data doesn't mean Bing is not looking specifically for Googled search terms and the resultant user selected URLs.

I personally don't mind them doing it. I just think their post is non-responsive in hopes of persuading the non-technical reader there's "no deliberate copying to see here" when they're clearly intentionally ingesting an extraordinary volume of Google-keyword-to-Google-result data and using that data to map keywords seen only on Google to results on Bing.

Flattery, etc.: http://www.wired.com/epicenter/2009/06/kayak-bing/


If Google analytics can scrape this data, so can bing. Google never answered if they are using clickstream from google toolbar / chrome or not. Check this simple script if you want the same data. http://forums.digitalpoint.com/showthread.php?t=1680579&...


a general tokenizer might do the job.
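
Something like this, as a rough sketch (the function name and the splitting rule are my own assumptions): split the whole URL on anything that isn't a letter or digit and treat every piece as a candidate keyword, with no knowledge of which site the URL belongs to.

```python
import re
from urllib.parse import unquote_plus


def tokenize_url(url):
    """Split a URL into lowercase alphanumeric tokens, query string included."""
    decoded = unquote_plus(url).lower()
    return [token for token in re.split(r"[^a-z0-9]+", decoded) if token]
```

The search term comes out as just another token next to "google" and "search", so it needs no special-casing, but it also picks up a lot of noise that a real system would have to filter.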


The problem is that the honeypot doesn't prove anything. These are rare terms by design so clickstream data is probably all Bing has to go on, and that clickstream data was voluntarily submitted to Microsoft by Google employees.

Google need to go to some extra steps to show that Bing is copying Google links for popular terms. If Bing is weighting clickstream data from Google searches very highly, that is more or less admitting that other search engines work better.


The most I can deduce from the experiment is that Bing look at what people click on Google and that plays some role in their search ranking. If you synthesize a term that doesn't exist in nature I can see how a search algorithm can weigh the only datapoint it has, data coming from the toolbar, heavily. This may not be the best approach, but it is a far cry from copying, which is what Bing are being accused of.


This seems to be mostly ad-hominem arguments in the classic definition of it: "You're wrong because there's something bad about you," rather than addressing any particular point. I don't see how questioning Google's motivations for uncovering this is doing anything to "set the record straight."

A useful post would have addressed at least 3 things:

- What are the specifics of the mechanism by which they ended up obviously copying google's results?

- Do they handle clicks from google differently than any other clicks?

- How different would their results be if they didn't use clicks from google's search as a signal?


I'm about 80% in the "Nothing fishy here" camp on this issue, but you have pointed out the 20% very well right here.

Obviously they won't explain the first one, just as Google won't explain the specifics of why StackOverflow scrapers sometimes outrank StackOverflow, and for good reason. Explaining the specifics of the mechanism would open it wide to spammers.

The second? Yeah, that's the million-dollar question. They should answer that. Definitely. It touches on the smelliest thing about the issue. If I had a search engine (I don't), and I wanted to copy Google's results (I wouldn't), and I had the ability to collect user click data, I could use that click data to create plausible deniability for the copying. This is exactly the sort of thing Microsoft has done in the past on other issues.

The third? It's certainly relevant, but I doubt Google would be willing to tell me how its results would be different if it didn't use a specific metric.


- What are the specifics of the mechanism by which they ended up obviously copying google's results? They said it was user click stream data. User types search into Google, gets results back and clicks link. Microsoft gets sent the search term and link clicked.

- Do they handle clicks from google differently than any other clicks? Hopefully they do if they are using user clickstream data. Each domain should have something like a page rank to determine how trustworthy it is.

- How different would their results be if they didn't use clicks from google's search as a signal? Why would they not use Google? Google is a site on the internet where users click links, Microsoft is collecting data from sites on the internet where users click links.


> Hopefully they do if they are using user clickstream data. Each domain should have something like a page rank to determine how trustworthy it is.

If there's some algorithmic reason to believe that Google gives more trustworthy results, that's one thing, and the pagerank weight can be empirically determined. If, on the other hand, there's some point where they explicitly treat google clickthrough data any differently (if site=='google.com' weight+=10), then that seems to have crossed a line.

I'm trying and failing not to have this sound like some kind of koan, but if you treat everyone differently in the same way, then that's (arguably) fine. It's only if Google gets some specific attention that it seems more malicious.


What if that weight was empirically determined by an algorithm instead of by a human, would it make it OK or should Bing hardcode in a demotion to that google-based data source?


Did we read the same article? I like Google and crew more than Microsoft, but this was pretty clear to me:

> Google engaged in a “honeypot” attack to trick Bing. In simple terms, Google’s “experiment” was rigged to manipulate Bing search results through a type of attack also known as “click fraud.” That’s right, the same type of attack employed by spammers on the web to trick consumers and produce bogus search results. What does all this cloak and dagger click fraud prove? Nothing anyone in the industry doesn’t already know. As we have said before and again in this post, we use click stream optionally provided by consumers in an anonymous fashion as one of 1,000 signals to try and determine whether a site might make sense to be in our index.

Beyond that, it's kind of crazy to think they'd open up on nitty-gritty details about their algorithm - nobody does that.

Anyways, I'm still with Google. I hope Google wins. But attacking Bing here was probably a tactical mistake - pretty much all marketing thought ever says "Don't attack-market against upstarts if you're the market leader!" You can't win if you're #1 and you do that. Google's #1. Attacking Bing was a really bad tactical move, though I'm still casually rooting for Google to win.


"- How different would their results be if they didn't use clicks from google's search as a signal?"

That's the important one. These Bing people are happy to run their mouths about 1k signals blah blah blah. Who gives a fuck? What is the weight on those signals.

Further, this case was made obvious because Bing couldn't create an answer to a query so they copied Google's answer. Dude admits as much. Why should we give Bing a pass on copying Google's results just because in this instance it was too hard to find their own results?


The weight on keywords that exist nowhere else on the internet other than the fake searches Google injected?

Well, if there's one source of data, that data is weighted at 100%.

The whole experiment is meaningless.


Have the Bing team expose their algorithms and criteria to the public so they can settle a spat with Google over 7/100 made up words appearing in a set of search results?

That sounds reasonable.


>over 7/100 made up words appearing in a set of search results

Again, that's not the allegation, that's the evidence. The allegation is that bing is using data that essentially amounts to a wholesale copying of google's results.


You missed the point I was making, which is that asking a company to fork over expensive information that a competitor like Google would kill for in the name of answering a question about 7 queries is disproportionate.


Forking over data is not required. Making a clear public statement would suffice.


Expensive information? Please. I'm not asking for all weights, I'm asking for 1/1000 weights. If it's tiny, then sharing it should be inconsequential. And no bullshit about competitors or spammers please -- it's not as if anyone is unaware that ranking well in Google organic results is good :rolleyes:


What would be the value in knowing that weight without knowing what it's relative to?


I thought it went without saying that the relevant coefficient should be taken from a unit normal vector or some other indication of relative magnitude should be given.


I prefer expectation (over typical queries) of weighted variance of the feature, divided by the total weighted variance of all features. (most likely sqrt(variance) instead, but whatever power you prefer).
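
One possible reading of that measure, sketched under the assumption of a linear ranker (score = sum of weight times feature) where the variance is taken over the candidate results for each query. The function name, the input layout, and the choice of population variance are all my own guesses.

```python
from statistics import pvariance


def relative_importance(weights, per_query_features, feature):
    """Average, over queries, of one feature's share of weighted variance.

    weights: {feature_name: weight}
    per_query_features: list with one dict per query, mapping each feature
    name to its values across that query's candidate results.
    """
    shares = []
    for features in per_query_features:
        weighted = {
            name: (weights[name] ** 2) * pvariance(values)
            for name, values in features.items()
        }
        total = sum(weighted.values())
        if total > 0:
            shares.append(weighted[feature] / total)
    return sum(shares) / len(shares) if shares else 0.0
```

A feature that is the only one varying across the candidates for a query gets a share of 1.0 for that query, however small its nominal weight is.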


>"What is the weight on those signals."

Low enough that it took Google nearly three engineers per successfully injected honeypot (7 honeypots per 20 engineers) and Google was only able to achieve a 7% success rate despite their extensive in-house knowledge of SEO.


I don't understand what you're trying to say here. The Google blog post says that they inserted 100 honeypots into Google. Then it says "within a couple weeks of starting this experiment, our inserted results started appearing in Bing."

Where are you getting the 7% number?


the more detailed searchengineland.com post mentioned that they only got 7-9 of the 100 nonsense terms they tried to inject to show up.



Three searches is nothing.


I've seen this 1k signals BS repeated over and over by HN commenters, as if somehow copying from 999 other guys makes this a nonproblem.


so maybe you should stop using all search engines, because they by definition do not create any data and just copy it from other people.


I wouldn't say it was obvious, mainly because this was something Google did specifically to cause this. I would imagine you could replicate the same results with other major search engines simply by doing the same thing with the same amount of volume as the Google dudes did.

Additionally, if things were reversed and Google was posed these questions, I would imagine would be lacking just as many answers as what Bing is supplying.


> I wouldn't say it was obvious, mainly because this was something Google did specifically to cause this

There is a subtle point that many posts like this are overlooking. Google didn't run this experiment to cause bing to show bogus results, they did it to confirm the rise in suspiciously similar results produced by bing.


> Google didn't run this experiment to cause bing to show bogus results, they did it to confirm...

There is a counter argument to that. If Bing's claims are true, then Google didn't run the experiment to confirm the results, they ran it to cause the results.

I think 99% of HN, including myself, will share your opinion, but it doesn't counter Bing's claims because it boils down to "which company do we believe?"


Google: we found that bing is copying us. To prove it, we ran this experiment.

Bing: doh, of course we use google results, but we told you that before in an obscure academic paper. Didn't you read that? And we are innovative - shiny pictures!


I don't understand which of Bing's claims would make it so that they were copying results only when Google did this.


>Additionally, if things were reversed and Google was posed these questions, I would imagine would be lacking just as many answers as what Bing is supplying.

That's like saying you'd be evasive like OJ Simpson if you murdered a few people and were on trial.

So if Google used Bing to rank searches it would also be under scrutiny? Since that's not the case we can all avoid the useless thought experiment can't we?


Google uses click data in their ranking algorithm. It's a fact. Pick a low volume, low competition niche, have 30+ of your friends search for the keyword and page through the results until you find a result (same for everyone), have everyone click through and not bounce. You will improve that site's rankings.

Chrome tracks clicks and traffic. Google Toolbar tracks clicks and traffic. The difference in the situation is that the majority of people use IE to search on Google, not the Google toolbar/Chrome to search on Bing. Bing has more data to work from than Google does in regards to this exact type of click tracking.


> Google’s “experiment” was rigged to manipulate Bing search results through a type of attack also known as “click fraud.”

hunh?

> So big and noticeable that we are told Google took notice and began to worry. Then a short time later, here come the honeypot attacks. Is the timing purely coincidence? Are industry discussions about search quality to be ignored? Is this simply a response to the fact that some people in the industry are beginning to ask whether Bing is as good or in some cases better than Google on core web relevance?

I can't believe this was written by someone with 'Vice President' in their title ... how childish.


Read google's "Bing Sting" experiment on Google's blog. Their test had no control variable (i.e., running the same test on another website that wasn't Google). If they had done their experiment on another site and Bing results didn't change then Google would have a very plausible case against Bing.

Google fell into the classic trap of confirmation bias with bad scientific method. And thus the test was 'rigged', they never had a control variable.


Their assertion was not "bing is using clicks on google search and only google search." The assertion was just "bing is using clicks on google search." They've demonstrated that pretty conclusively.


Bing has never denied using clicks on google search. They have, in fact, admitted to using clicks in general before Google even began their experiments. Given that information, why should it be surprising that Google Search, as one of the most-clicked sites on the Internet, has a big impact on that?


Surprising or no, it's clearly pretty controversial. If you are using click data like this, you have to know that you'll end up essentially copying Google. People only click on the results that are there, and Google puts them there.


Of course it's controversial. Someone made a deliberate decision to stir up controversy over this. It's easy to make some controversy if you just use the right words. Words like "Cheating" and "Copying" are great for that.

If you really want examples, just watch Fox News or MSNBC for fifteen minutes. You'll probably see at least one or two examples in there somewhere.


Their assertion is that they're imitating Google.

"Bing results increasingly look like an incomplete, stale version of Google results—a cheap imitation"

Which they have NOT demonstrated. Their results can easily be interpreted that they imitate user clicks.


You can't decide to reason through induction on one set of variables (all searches, not just "hzzxsqqdga", are copied by Bing) and leave it out on another (click data on all websites, not just Google, are copied by Bing).


I'm not doing that. How did you get that impression?


The phrasing of "bing is using clicks on google search" implies that google search is a single case (I could likewise claim "bing is using clicks on duckduckgo") that does not extend to others.

Wouldn't it be more accurate to say that "bing is using click data"? The fact that google.com is in a lot of that click data is a questionable decision and the root cause of all this drama.


Yes. Why can't people grasp this very simple idea before opening their mouths and spreading this FUD. I see even pg fell for this.

The only way the words "Bing copies google" would be justified is if MS were directly querying Google on certain keywords and ripping off search results. Google have provided no evidence to suggest this. I expected the commenters of Techcrunch to be unable to grasp this, but it seems that HN is often like this too.


To be fair, I think that claim would be justified if they were "merely" grabbing the Google SERPs their users happen to receive, turning them into ranked lists of URLs, and using that data some way.

It would even be justified if they were harvesting click data only from Google (or explicitly treating that click data differently), because then they're just doing the last one, but obfuscating it: it would be like refusing to bribe a politician directly, but instead making a large "investment" in a corporation they own.

It's not a justified claim if they built a mechanism that genuinely gathers interesting data, and would continue to do so in Google's absence. I think Microsoft is claiming this, but their responses have been so murky that it's not 100% clear. Google certainly hasn't produced evidence that renders this version implausible.


Your "control variable" only matters if the hypothesis they are trying to prove is "Bing special-cases Google."

That is not the hypothesis. The hypothesis is "Bing has results in its index that it could not have gotten in any other way than from Google search results." Their experiment does indeed confirm that hypothesis.


The problem is, the intentionally ambiguous and misleading wording Google has been using implies the test was "Bing special-cases Google". This is why I'm not buying any of this. Google is smart enough to use precise language when they want to, and apparently not to when they want to.


Incorrect. Once again, to come to that conclusion you would need to test on a site that WASN'T Google as well.


If your analysis is correct, you should be able to explain a scenario under which, given Google's experiment, Bing's result for "hiybbprqag" came from somewhere other than Google.

What is that scenario?


It came from the Bing toolbar tracking the user browsing. Yes, obviously "hiybbprqag" came from Google. But that's because they only tested it on Google.

They never tested the fact that it could have come from any other website as well. Thus, they can't conclude whether Bing is copying Google or whether it's copying the user's browsing behavior.


> Yes, obviously "hiybbprqag" came from Google.

That's all Google is saying.

> They never tested the fact that it could have come from any other website as well.

It doesn't matter, Google is only complaining about what Bing has copied from Google. What Bing copies from other sites is between them and the other site.


Exactly. They never did any testing designed to specifically not have the result show in Bing.


Which means they do not know the boundaries of their problem. What part of Google's argument is countered by it though?


It means that they have falsely arrived at a conclusion due to positive bias. They might be right, but they haven't proven it sufficiently.

The inverse of the conclusions of their experiment are also incorrectly assumed (Google results, using IE8/Bing toolbar, make Bing results != Bing results, when using the IE8/Bing toolbar, are from Google).


My understanding is that running the experiment on some random other sites would work as well as long as Bing users were actually clicking the honeypot links. This is probably the intended functionality of the click stream data. That doesn't change the fact that using this type of data from competitor search engines results in "stealing" (for lack of a better word) search results, particularly for rare terms such as "tarsorrhaphy".


What would be a suitable control variable? How would one know if Bing is using results from that website too or not?


So what exactly would a control that satisfies you be? How about I submit search terms on the computer terminal that connects to a server that isn't even on the internet. There, those searches do not end up affecting Bing.

The fact is that they used unique search terms that link to unique subjects. They ONLY submitted searches using what they described. That means the only places those searches passed through were the OS, toolbar, IE and Google. They know Google received those searches, did it propagate to Bing?

That's it. They didn't need to offer a placebo to anyone. It's like throwing a ball and hearing an echo. You don't need to NOT throw a ball just to make sure it doesn't echo.


> We have been clear about this for a couple of years (see Directions on Microsoft report, June 15, 2009).

That's about as unclear as anyone can be. Seriously Microsoft, this is the Web. When you want to point your readers to a document, don't fucking cite it like some printed academic paper! LINK TO IT.


It was published on Directions on Microsoft[1], which is supposed to be an "Independent Analysis of Microsoft Technology & Strategy".

The yearly subscription is $1500.

If [2] can be transposed, non-subscribers can also get the report for a whopping $750.

I couldn't find that report in their archive, though. The only article published on June 15, 2009 is [3], which is unrelated.

[1] https://www.directionsonmicrosoft.com/

[2] http://www.reuters.com/article/2009/04/21/idUS223966+21-Apr-...

[3] http://www.directionsonmicrosoft.com/update/50-june-2009/678...


Here's the problem in a nutshell:

The Bing toolbar uses clickstream data to extract information about the relatedness of urls, just like any search engine crawler does (page rank works this way). When a user clicks on url B from url A the Bing toolbar sends information back to MS about the relatedness of urls A and B including all the meta-data involved. If url A is ".../search?q=torsorophy" then Bing will make a note of that and will start showing results for B when you search for "torsorophy".
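
As a rough illustration of that A-to-B bookkeeping (a sketch only; the class, its methods, and the choice of the "q" parameter are hypothetical and not taken from the toolbar):

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse, parse_qs


class ClickIndex:
    """Toy store that links search terms seen in URL A to the clicked URL B."""

    def __init__(self):
        self._hits = defaultdict(Counter)

    def record(self, url_a, url_b):
        """Note that a user went from url_a to url_b."""
        params = parse_qs(urlparse(url_a).query)
        for term in params.get("q", []):
            self._hits[term.lower()][url_b] += 1

    def lookup(self, term):
        """Return URLs recorded for a term, most-clicked first."""
        return [url for url, _ in self._hits.get(term.lower(), Counter()).most_common()]
```

After a single `record(".../search?q=torsorophy", B)` call, `lookup("torsorophy")` returns B, and for a term that exists nowhere else that is the only candidate there is.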

In principle this isn't necessarily a bad thing, it allows Bing to index sites that it wouldn't otherwise, letting its search index grow more organically. However, when search engines come into play things get problematic, because now the Bing toolbar is little more than an automated method for scraping search results piecemeal. Given that a great deal of modern web surfing falls into the "search for X, click links for X" pattern, this should have been something that the Bing engineers anticipated (if they didn't anticipate it that's bad enough, if they did and ignored the problem that's much worse).

Worse yet, the Bing toolbar is effectively a search indexer which does not respect robots.txt. Let that sink in for a while. Google.com/robots.txt has this line: "Disallow: /search", and yet apparently the Bing toolbar has absolutely no compunction about effectively ignoring that.

tl;dr: MS has created a search indexer which ignores robots.txt, this is bad.


Yours is the first explanation I've seen that makes any case for Microsoft not intentionally cheating. All the responses I've seen from Microsoft are of the form "we're not copying Google, we're just using user click information", which is poor, since what Google showed is that they're associating click information with the search terms that were put into Google. But since those search terms are in the referrer URL, it could conceivably be an innocent general algorithm weighing the words in the URL.

As for ignoring robots.txt, that may not be the case. Conceivably you could get the url A->B link, save the metadata for both, signaling them as related, and then check both URLs against robots.txt to see if you should have them in the index. Then if url A is ".../search?q=torsorophy" Google's robots.txt disallows it from being indexed and only url B gets in but the link to "torsorophy" is still there from the metadata.


In fact, what Google users are really clicking in searches are "google.com/url?" URLs, which are also disallowed in robots.txt (while the URLs they redirect to aren't).


Indeed. Certainly it is technologically possible for clickstream based indexing to still abide by robots.txt rules. However, the Bing toolbar does not. That is the key issue here.


How do you know it does not? I'm assuming robots.txt is about preventing the page contents from being crawled and added to the index. If all they use the click info for is to associate referrers (google URLs in this case) to pages in the index and they don't crawl the google search itself I don't see how that breaks the robots.txt contract.


The page contents are being crawled and added to the index, but by Bing Toolbar users, not a computer program. I consider that to be an underhanded way to circumvent robots.txt, but others might not.


What makes you say that? We haven't seen anything to indicate Google search pages are in the Bing index.


"...the Bing toolbar is effectively a search indexer which does not respect robots.txt..."

maybe i don't understand, but your logic seems correct until the above statement.

suppose you had a system that only used clickstream data. so you store a big list of url pairs (A,B) and a probability that B will follow A. i believe that your argument relies on the fact that it's possible for this system to violate a robots.txt file and i don't yet see it.
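
for what it's worth, a toy version of that system might look like the sketch below. it only assumes a list of observed (A, B) pairs; the function name is made up, and a real system would obviously need smoothing, decay, spam filtering and so on.

```python
from collections import Counter, defaultdict


def transition_probabilities(pairs):
    """Estimate P(B follows A) from a list of observed (A, B) url pairs."""
    counts = defaultdict(Counter)
    for a, b in pairs:
        counts[a][b] += 1
    return {
        a: {b: n / sum(followers.values()) for b, n in followers.items()}
        for a, followers in counts.items()
    }


observed = [("A", "B"), ("A", "B"), ("A", "C"), ("D", "B")]
probs = transition_probabilities(observed)
```

note that nothing here ever fetches A or B, which is why i don't see where robots.txt would come in.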


The intention of robots.txt is to tell search systems specifically "do not use information from the following pages in building a search index". The Bing toolbar's use of clickstream data from google, and no doubt many other sites, clearly violates that spirit.

This could easily be fixed, by checking the clickstream data against robots.txt files and discarding data that shouldn't be used. Microsoft apparently has decided not to take that step.
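
A rough sketch of that check, using Python's standard `urllib.robotparser`. The `build_parser` and `filter_clickstream` helpers and the `robots_by_host` mapping are names I made up for illustration; I'm also assuming the robots.txt text has already been obtained somehow, and that the filter is applied to the referring URL only.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_clickstream(pairs, robots_by_host, user_agent="*"):
    """Drop (referrer, clicked) pairs whose referrer is disallowed by robots.txt."""
    kept = []
    for referrer, clicked in pairs:
        parser = robots_by_host.get(urlparse(referrer).netloc)
        if parser is not None and not parser.can_fetch(user_agent, referrer):
            continue  # the referring page is off limits, so discard the pair
        kept.append((referrer, clicked))
    return kept


robots_by_host = {
    "www.google.com": build_parser("User-agent: *\nDisallow: /search\n"),
}
pairs = [
    ("http://www.google.com/search?q=torsorophy",
     "http://en.wikipedia.org/wiki/Tarsorrhaphy"),
    ("http://example.com/a", "http://example.com/b"),
]
kept = filter_clickstream(pairs, robots_by_host)
```

With the `Disallow: /search` rule, the pair coming from the Google results page is discarded and only the example.com pair survives.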


your assumptions:

  - the "intention" of the robots.txt standard is as you state
  - the url is included in the information not allowed by that standard
  - if the url is not included it should be because of the "intention"
  - toolbars are subject to the same standards
I'm not disagreeing with you as much as just pointing out that I don't think EVERYONE agrees on these standards.


Err, robots.txt doesn't even enter into the picture. It would matter only if Bing's server was requesting pages off Google's site. That is not happening. The only things being sent are the Google URL and the clicked URL (not the other Google results that the user didn't click on).

The user gave permission to Microsoft to use this info by installing the toolbar. I don't see why or how the Bing tool bar should visit Google.com/robots.txt to see the blocked folders unless they were crawling Google pages.


The contention you're making, and apparently Microsoft as well, is that the Bing toolbar is not a search indexer and does not need to obey the conventions of robots.txt?

Personally I don't think that flies. Should I be able to create a toolbar that rips the data from hulu.com and hosts it elsewhere?


this is not my area of expertise but i really don't think that's a proper analogy. a more relevant phrasing would be, "should i be able to create a toolbar that inspects (previously/explicitly) requested documents from hulu for external links". and i would be inclined to say, "yes"...


But it's the perfect way to circumvent robots.txt! With enough users the law of large numbers applies (and how many hundreds of millions of users are using IE?) so you can almost surely copy Google's index.

In particular, you copy the most popular results, so if you weight the % of google's index that you copied, then it's going to be a huge number, and almost for free.

It's evil. But devilishly smart.


This is the usual FUD we are used to from Microsoft.

First just claim that the allegations are wrong (use 'Period.' and 'Full Stop.'). Don't try to explain!

Then redirect allegations as a personal attack (say it is 'insulting'). Don't try to explain why, either!

Refer to a document that requires paid membership, so only very few people if any can check (don't use a direct link). Gives you cheap credibility.

Claim that the experiment is fraud, although no one benefits financially. Don't back up the claim with data.

Counterclaim that your competitor is actually copying from you, ignoring the difference between ideas and data.

Now that the reader envisions you as the poor underdog that is insulted, is a fraud victim, and is copied, spread the doubt that Google is doing this because it starts to 'worry' (pose an open question like 'is it a coincidence'?).

End with the pose that you are not affected by this and are concentrating on your business.

Never, ever answer the allegations with a simple explanation on why Bing shows bogus results because some Google employees search while the Bing Toolbar was active.

{Overlook that you admitted that you are prone to click fraud.}


Thank you for summing up why posts like this one from MS make me queasy. Lots of spin and posturing and insinuation when a clear, technical answer is needed.

It's like, "Son, did you steal those cookies?" "Mom, Justin always accuses me of stealing cookies, and I've been getting good grades lately, and last week Justin hit me for no reason, and don't you think he should be in trouble too?"

Answer the question, son.


I have a much lower opinion of the Bing team after reading this non-denial denial. It's disingenuous to act surprised that using clickstream data from your main competitor might be called copying.


And I have much lower opinion of the Google team (and they had much further to fall). They intentionally use ambiguous and loaded words like "copying" specifically to stir up drama. Google is smart enough to use precise words when they want, and not to when they want. The ambiguity was intentional.


I don't think it's ambiguous. They are contending that bing is intentionally using this to recreate results similar to google's. I'm guessing you feel this is ambiguous because you don't think it is supported by the data, not because they are saying anything unclear.


Characterizing it as "copying" is what's ambiguous. This can easily be seen by reading most discussions about this. People are arguing past each other, armed with their own personal definition of "copying". The facts are that make-believe search results showed up on Bing after Google intentionally tried to train Bing to pick up the association. Declaring this to be "copying" is begging the question.


> The facts are that make-believe search results showed up on Bing after Google intentionally tried to train Bing to pick up the association. Declaring this to be "copying" is begging the question.

Well, it was trained by having the Bing toolbar see people clicking those results on Google's search engine.

But I do think you're right that people are talking past each other and what outrages one person could very well be something that another simply doesn't care about.


This is the record:

Searching for "torsorophy" on Google brings up the Wikipedia page for Tarsorrhaphy because of Google's advanced spelling and error correction algorithms.

Searching for "torsorophy" on Bing will (or used to) bring up the Wikipedia page for Tarsorrhaphy because of Google's advanced spelling and error correction algorithms, because Bing watches how people use Google.

Two different kinds of innovation at play, one much more legitimate and impressive than the other. But it's not like Microsoft is in unfamiliar territory here.


So does "fast forier twansform" -- did Google honeypot that term too? Or the other 20 terms I just tried with arbitrary misspellings?


When Bing adjusts the results it shows due to user click-ranking, it does so without any notification. That's because it's a ranking issue.

When Bing adjusts the results due to a spellcheck done by their engine, it notifies the user so they can correct it.

On your example, Bing search does exactly this: > We're including results for fast fourier transform. Do you want results only for fast forier twansform?


I completely agree, but to quote the person I was replying to: "Searching for "torsorophy" on Google brings up the Wikipedia page for Tarsorrhaphy because of Google's advanced spelling and error correction algorithms."

Both companies have good algorithms for spelling and error correction, but neither appears to be a superset of the other. (For example, Bing corrects "Kransas" to "Kansas", but Google does not).

I was addressing the fact that Bing has at its disposal some advanced techniques too. It's not just Google that does. And I illustrated this by pointing out that there are a lot of queries that do the spell correction on Bing that obviously weren't honeypotted -- as seemed to be implied by the original comment.


The difference there is probably a result of varying internationalization.

"Kransas" is Swedish for wreath, which is likely why Google's spellcheck doesn't automatically correct it.

Bing however won't even show me proper results for Kransas even if I tell it "+Kransas".

One example of where they both 'fail' on a misspelling is "munday", which isn't autocorrected to Monday because Munday is a proper noun existing in the English corpus---e.g. the City of Munday, the Munday family, etc.


Now that we have confirmation that Bing uses clickstream data on sites that they do not control, I'm surprised that no one mentioned this makes them sensitive to click-fraud.

Copying a leader isn’t as objectionable as denying it and I would argue that Google could split their search engine into independent entities—say: crawling, algorithm, interface. That way, competitors could try to outrun them and innovate on one or the other, separately: Firefox, Siri and many others on the interface; Cloudera or Amazon on the crawl, DuckDuckGo and physics labs on the algorithm (just spitballing there). What Bing did wasn’t that different from a grease-monkey script that would supplement bing.com results with Google‘s, it seems — or at least, they could have had the humility to present it that way.

Now, their pride made them confess to being easily gamed.


Do we have any information about whether Google uses clickstream data from users traveling between non-Google sites?

As I noted in a previous thread (http://news.ycombinator.com/item?id=2166256), Googler Amit Singhal's wording was a bit vague about whether URL trail data from things like the Google Toolbar, Analytics, Ads, or other systems ever affects rankings. (His careful wording, "put any results on Google’s results page", could mean merely that URL trail data never adds results to the set of all possible results. It could still affect rankings of pages found through other crawling.)

The use of such data is clearly allowed by Google's very broad privacy policy. I would love a clear answer from Google on this. It should be easy to give.



Interesting, thanks, but doesn't address my question. To be clear, I'm wondering: are clicktrails from Google Toolbar, Google Analytics, Ad programs, or other sources (beyond just Google Search outclicks) ever used to help calculate search rankings?

Robots.txt-sensitive crawling is irrelevant to this question, and whether their toolbar tracks clicks from other search engine result pages is only tangentially relevant, as one small example of the general idea.

And I'm not asking if they've ever done exactly the click inference Bing has done. Rather, I wonder if they're doing vaguely analogous indirect mining of revealed web user preferences via clicktrails. For example, noticing which sites were visited together or in certain order even without crawler-visible links between them. Or noticing which pages were viewed for the longest/shortest times. Or which pages seemed to 'end' a purposeful session. Or other deep-science stuff I can't even imagine.

I don't want them to reveal any proprietary secrets – just whether they have ever used (or would consider it legitimate to use) all the clicktrail data from all their many non-search tools to help with search quality.

Because I've long assumed that they do, and would be surprised if they didn't.


this has been answered many times already - check the other threads.


If so, please link to a definitive statement addressing this from a Googler. I've been looking and haven't found it.


> Now that we have confirmation that Bing uses clickstream data on sites that they do not control, I'm surprised that no one mentioned this makes them sensitive to click-fraud.

I would expect that they know this. The question is how susceptible does it make them? The clickstream is just one of something like a thousand factors that go into the ranking. Perhaps on any real page, it would take more fake clicks than a fraudster can generate to make a worthwhile difference.

Keep in mind that Google was experimenting with pages for which Bing had no information other than the clickstream, so the clickstream made a big difference there.


>> "As we have said before and again in this post, we use click stream optionally provided by consumers in an anonymous fashion as one of 1,000 signals to try and determine whether a site might make sense to be in our index."

Erm, the Google test took pretty nonsensical keywords and put at #1 some completely unrelated website.

And then Bing copied this unrelated website result from Google.

The above quote from Bing (especially the bit in italics) is therefore a little confusing. Since, again, the entire point here is that Bing showed completely unrelated websites taken straight from Google. The results weren't relevant. Thus making the quote fairly contradictory in my opinion.

Yet another 'reply' by Bing, and yet again they seem to be trying to insult their way out of it via ad hominem and strawman arguments.

And yet again, Bing still haven't addressed the key point here: why Google results were taken and used by Bing.


Bing sees that I searched for a term, 'iwejhoihfe' on google and ended up clicking a link to foobar.com/page there, this gives foobar.com/page a relevance with the term 'iwejhoihfe'. Since the 'iwejhoihfe' doesn't show up anywhere else on the web, that single relevant site gains a lot of weight for that term.

This is how Bing's algorithm may have worked, and I wouldn't call it intentionally copying, flawed maybe and easily manipulated just like Google Bombing in the old days.


One difference between Google's and MS's approaches: while Google's first post was much more rational (with proofs) and convincing to tech minds, this post seems more like a politician's attack on Google, with vague and scary-sounding words like "click fraud". If only some evidence for the above "click fraud" were supplied!!


This whole debacle is dragging both companies down. Overstatement and intentionally framing the story by Google, vague answers and changing the topic by Microsoft.

Though in the end, Google probably gets a net win cause they got "Microsoft copied us" headlines from some major sources with probably little follow up from the average reader. But the whole situation could have been handled better.


I can't speak for the masses but this is a net loss for Google in my eyes because it paints them as being extremely hypocritical IMO.

Keep in mind that we're talking about Google here. The Google of Google News, the web-corpus assisted translation engine, the Google book scanning project, wifi scanning, etc. Not that I'm negative on any of those projects, but based on the amazing degree to which their business piggybacks off the work of others it does make it difficult for me to take them seriously when they come out swinging at somebody else for piggybacking off of them, especially when the piggybacking going on here appears to be fairly minor and incidental.


The post indirectly confirms Google's claims. Otherwise, it would have been easy for Microsoft to point out where Google is wrong and everybody would blame Google for being stupid.

Instead, Microsoft chose to attack Google 'personally' with non-substantiated claims.


Just once it would be great if tech companies could lay off the hyperbole and posturing and just give it to us straight. Bing copying Google? Not really. Honeypot attack? Give me a break.

Here's what appears to be the truth:

The Bing toolbar keeps track of what people click on after doing a search. Even when it is from sites that aren't Bing. If it is a search for something obscure (or made up) that Bing has no other data for, then that clickstream data can affect Bing results for that term. Most of the time, however, that is just one of thousands of inputs. It is useful data, but not a defining part of the Bing engine.

This was obviously a huge PR move on Google's part. The information was released to a massive search engine blog right before a big search engine event with both Google and Bing in attendance. Bing could have come out of this shooting straight and looking like the more mature party; instead they come across like they are unsure of themselves and have been backed into a corner.

Score one for the bullshit artists at Google.


First, I have friends working for Microsoft as well as Google. I like and hate both companies, for different reasons. Now, to the meat of what I wanted to get to: "We have brought a number of things to market that we are very proud of -- our daily home page photos, infinite scroll in image search, great travel and shopping experiences, a new and more useful visual approach to search, and partnerships..." Seriously?? Are you going to put "daily home page photos" as the (first!) point to be proud of? Is it such a technological feat? Kudos for the rest of the items, but I just found it funny.


The first comment to the blog post is humorous.

  Bing, "Powered by Google". It does have a nice ring to it.


Another, quite tongue-in-cheek, humorous app found in comments:

http://createasearchengine.appspot.com/


That's a nice one!!! Over 1000 thousand!


"anonymous click stream data"

It can't be only that. The page where the clicks happen must be parsed for the click to be linked to the query and reused at their search engine.

They transformed: "from google.com/q=X, people tend to click on example.com"

Into: "when searching for X, people tend to click on example.com"


They not only parsed the search queries, they also copied Google's results and added them to their own results.


Monitoring the clicks and parsing the data should be enough to account for the results we see.


One really interesting thing about Bing watching google searches and changing their page ranking based on users clicking links... it implies one way to "SEO" (game) Bing page ranking would be to use Amazon Mechanical Turk to raise a page's rank on Bing by having lots of Mechanical Turks doing google searches and click-throughs on the target page.

It would be an interesting experiment to estimate how heavily weighted the clicking is vs. the other 999 inputs. I would think a very obscure page would be fairly easily manipulated (Google basically showed that with their "honeypot"), but a more popular page with lots of non-zero entries in the other 999 inputs would be a lot harder to game.
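
A back-of-the-envelope way to see that intuition, assuming (and this is purely an assumption, nobody outside Bing knows the real formula) that the score is a plain weighted sum of 1,000 signals with equal weights:

```python
def click_share(click_value, other_values, click_weight=1.0, other_weight=1.0):
    """Fraction of a weighted-sum score that comes from the click signal."""
    click_part = click_weight * click_value
    total = click_part + other_weight * sum(other_values)
    return click_part / total if total else 0.0


# Obscure page: the click signal is the only non-zero input.
obscure = click_share(1.0, [0.0] * 999)

# Popular page: all 999 other inputs are non-zero too (hypothetical values).
popular = click_share(1.0, [1.0] * 999)
```

Under those made-up numbers the click signal is the whole score for the obscure page but only one part in a thousand for the popular one, which would fit what the honeypot showed.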


Whether it was a type of attack also known as “click fraud,” or not, it does appear to be an attack. One can argue that the attack was justified - and one can argue that it's not - but it still was a deliberate effort to improve page rank in a way similar to one which Google actively discourages.

The fact that only 7/100 of the honeypots were successful may mean that Google's motives may have been more complex and slightly less akin to righteous indignation - such as determining how and to what degree Bing handles "click fraud" type attacks. I just have a hard time believing that Google would turn 20 engineers loose on a task that can be so easily automated.


>Whether it was a type of attack also known as “click fraud,” or not, it does appear to be an attack.

I disagree with describing this as an attack. Based on the kinds of search terms Google has said they used, they aren't attempting to make the Bing results any worse. There is simply no good search result for "hiybbprqag", so Google pointing to the Wiltern Seating Chart is just as good as pointing to nothing at all for the purposes of what any user who happened to Google hiybbprqag randomly would see.

At best, you could make the argument that by revealing that Bing is susceptible to these kinds of attacks, they are making it easier for others to attack Bing. It's certainly illegal to rob a bank, but what about walking down the street yelling, "The bank guards are all gone! The combination to the vault is 1-2-3-4-5!"


"what any user who happened to Google hiybbprqag randomly"

But it wasn't any user, it was a Google engineer and the circumstances were not ordinary. Those engineers were actively trying to get the results into Bing over the course of at least two weeks (12/17-12/31), according to searchengineland.com.


So Google, by doing this, has created up to a hundred or so bogus entries in the Bing database, to which I say, "big whoop". If Google had to pay damages for the extra infrastructure that Microsoft requires, the stamp to mail the check would be more than the check itself.

What I would classify as an attack is if, through a similar process, Google introduced bogus entries for real search terms, which they don't claim to have done.


>"which they don't claim to have done."

IANAL, but if they have attacked Microsoft, that is probably a prudent legal strategy. Keep in mind that Google claims to have discovered this episode by noticing similarities between their results and Bing's, which, although it suggests an active program to monitor Bing, is hardly surprising.

On the other hand, however plausible it appears, their claim about how they discovered it smells a bit of BS. It strains belief that Google never looked at the packets the Bing toolbar was sending home during browser compatibility testing. They lost their virginity a long time ago.


Not only that, there's too many hypothetical, open ended questions - they don't really address anything. Instead they leave it to the reader to imagine. Not impressed.


Bing is admitting their results are affected by click fraud then??


Not quite - they are admitting that click fraud on google.com affects their results!


And those of us with our own DNS servers know exactly where google.com is.

I wonder how smart the Bing Toolbar really is.


You probably don't need a DNS server. /etc/hosts should be plenty. Maybe they're smarter than that, and I think they are if you try to mask Microsoft's own domains, but I doubt they check to see if Google.com has been changed, though that could certainly change.


Yes, but only on hapax legomena :)


To me, it seems quite simple. Someone at Google used the Bing toolbar and IE8 in the following way:

a) Search from Bing toolbar for hiybbprqag.

b) Bing does not display anything.

c) Bing toolbar remembers that term internally.

d) Go to www.google.com and type hiybbprqag.

e) Google displays one intentionally seeded link.

f) Google engineer clicks on that link.

g) Bing toolbar notices that shortly after user searched for hiybbprqag, s/he clicked on seeded link.

h) Bing toolbar sends that piece of data to the mothership: there is a relationship between hiybbprqag and the seeded link.

...

i) After some time, Google engineer searches for hiybbprqag from Bing toolbar.

j) Bing looks up its index and there is only one piece of evidence regarding the term 'hiybbprqag'. It is not much, but it is all it has, so it presents it to the user.

Google's accusations strongly imply (using words like 'stealing') that Bing simply scrapes google.com for results, while reality is not so simple.

Now, the bigger problem for Microsoft is that it employs VPs like Yusuf, who cannot express simple facts and easily fall into corporate speak.

UPDATE: Here is good summary of what I wanted to say: http://directmatchmedia.com/google-proves-bing.php


Article appears to be down, fortunately you can still access it from Google's cache: http://webcache.googleusercontent.com/search?q=cache:www.bin...


From a logical debating point of view, this post is really weak. If I were an engineer at Google, and I caught Bing at this, I would be upset, to the point I couldn't sleep. This is one of the lowest things you can do in academia, and even though this is business, the accomplishments of the engineering team at Google are at the highest point of information technology. As a businessman I could have taken this, not as a hacker/problem solver.

"It was interesting to watch the level of protest and feigned outrage from Google. One wonders what brought them to a place where they would level these kinds of accusations." Feigned outrage... one wonders... If you really do not understand this, you are not a hacker, you are not an engineer. Maybe it was your actions that prompted two valued Google engineers (Singhal and Cutts) to make these accusations?

"Before we explore that (so you gonna prove this point later on?), let me clear up a few things once and for all.

We do not copy results from any of our competitors. Period. Full stop." The end result is a copy of the results of a competitor. If you borrow it, steal it from your users, copy it from a guy in a raincoat with binoculars, observing Google results and jotting them down -- It doesn't matter! OK, let's say you don't copy results. Your index still contains copied results for a fact.

"We have some of the best minds in the world at work on search quality and relevance, and for a competitor to accuse any one of these people of such activity is just insulting." Bing copied Google's fake results. Your engineers either deliberately or accidentally made Bing copy those results. If you want less insulting: Google claims your minds are the best at copying others.

" We do look at anonymous click stream data as one of more than a thousand inputs into our ranking algorithm. We learn from our customers as they traverse the web, a common practice in helping to improve a wide array of online services. We have been clear about this for a couple of years (see Directions on Microsoft report, June 15, 2009). " That will make me trust your index a lot less. If engineers clicking manually on some results can make a difference in ranking, imagine what a botnet or dedicated spammer can do.

" Google engaged in a “honeypot” attack to trick Bing. In simple terms, Google’s “experiment” was rigged to manipulate Bing search results through a type of attack also known as “click fraud.” That’s right, the same type of attack employed by spammers on the web to trick consumers and produce bogus search results. What does all this cloak and dagger click fraud prove? Nothing anyone in the industry doesn’t already know. As we have said before and again in this post, we use click stream optionally provided by consumers in an anonymous fashion as one of 1,000 signals to try and determine whether a site might make sense to be in our index. " Ad hominems: Honeypot, trick, quotes:experiment, rigged, manipulate, attack, click fraud, attack, spammers, trick consumers, bogus, cloak dagger, click fraud, prove? nothing. If you need those words in a paragraph to describe how you got caught with your hand in the cookie jar, then you already lost.

" Now let’s move the conversation to what might really be going on behind the scenes. " Yes, move the conversation away from your duplicate results, the key issue at hand.

" Bing was launched nearly two years ago to break new ground and help move the search industry in new directions. We have brought a number of things to market that we are very proud of -- our daily home page photos, infinite scroll in image search, great travel and shopping experiences, a new and more useful visual approach to search, and partnerships with key leaders like Facebook and Twitter. If you are keeping tabs, you will notice Google has “copied” a few of these. Whether they have done it well we leave to customers. But more importantly, we take no issue and are glad we could help move the industry to adopt some good ideas. " copies marketing paragraph. Rebuttal: Google didn't use its toolbar to "copy" your pictures and add those to their own background.

" At the same time, we have been making steady, quiet progress on core search relevance. In October 2010 we released a series of big, noticeable improvements to Bing’s relevance. So big and noticeable that we are told Google took notice and began to worry. Then a short time later, here come the honeypot attacks. Is the timing purely coincidence? Are industry discussions about search quality to be ignored? Is this simply a response to the fact that some people in the industry are beginning to ask whether Bing is as good or in some cases better than Google on core web relevance? " Bing is absolutely horrible for international searches. You serve up 40 different language Wikipedia pages for some terms. But you are saying that Google deliberately timed your outings? That is like saying your brother saw you stealing cookies, but waited till he got a bad report card, before telling mom. And I thought a bruised academic ego was childish. So your rebuttal amounts to: But mom! He got a bad report card! And that while you are struggling to make your grades yourself.

" Clearly that’s a question that will continue in heated debate as long as there is a search industry. Here at Bing we will continue to focus on our customers, and try to provide some great innovation for consumers and the industry. " That's not the question and certainly not core to this case, which is your duplicate index. This is unprecedented in 10 years of search engine competition, and you want to make it about the timing? Let's make it about the 1000 ranking factors vs. the 200 ranking factors from Google. Let's make it about the million other results this could be happening for, but are impossible to prove with a "honeypot attack" and specific keyword? Let's make it about search relevance, how we are spoiled with relevant results, yet complain about the most relevant index on the web, one you use to calibrate long tail terms and spelling corrections?


"It was interesting to watch the level of protest and feigned outrage from Google."

This was the phrase that almost made me want to do another round of posts. As the person on the panel from Google, I can assure you: it was not feigned outrage. It was real frustration.


>The end result is a copy of the results of a competitor. If you borrow it, steal it from your users, ...Your index still contains copied results for a fact.

This is pure nonsense, and so is the rest of your post. [edit: The business of] search is not about academia, or algorithms, or engineering, it's about providing users with relevant responses to their queries. The best way currently to do that is through crowdsourcing--which is exactly what Google does through Pagerank. Now Bing takes that a step further and creates an association between what a user searches for and what they end up finding relevant: This is exactly what we currently call "search". Whether that initial directory of sites came from Google or from some hand-culled list is inconsequential. This is not "copying", this is doing exactly what they should be doing.

The point is Google doesn't own the link, or the form input, or the fact that many users clicked on that link after posting that term. The only thing they could be reasonably construed to "own" are the relative rankings themselves. Bing did not "copy" this.


"This is pure nonsense, and so is the rest of your post." Thank you. I don't really know how to respond to this, but maybe say that your own statement is so weak it just contains its own counterargument.

"[The business of] search is not about academia, or algorithms, or engineering, its about providing users with relevant responses to their queries"

And the food industry is not about the ingredients, the cooks and the recipes, but about providing users with a dish suitable to their tastes. In fact, we could do without the cooks, recipes and ingredients. er...?

"The best way currently to do that is through crowdsourcing". These are fine words to say that Bing uses rank on Google as 1 in 1000 ranking factors. Bing uses GoogleRank. Or CrowdRank in less polemic terms.

To me: it isn't about if Microsoft is "copying" or not anymore. You can win that semantic game. What you can't win or explain away is the end result: the duplicate results and grammar corrections. The process by which this happened (Microsoft can hack my microphone and record keystrokes) is irrelevant. Bing uses rank on Google as a ranking factor and Bing contains duplicate results.


>"This is pure nonsense, and so is the rest of your post." Thank you. I don't really know how to respond to this

I apologize for the unnecessarily mean-spirited reply, it was uncalled for.

>And the food industry is not about the ingredients, the cooks and the recipes, but about providing users with a dish suitable to their tastes. In fact, we could do without the cooks, recipes and ingredients. er...?

You're wrongly combining "food" and "the food industry". Food is about the end result, regardless of how it came about. The food industry is about all those things you mentioned. The same goes for search.

>it isn't about if Microsoft is "copying" or not anymore. You can win that semantic game.

You're right about this, "copying" or not is purely a semantic game. The problem I see is that Google started it intentionally. They could have used more precise terms and actually started a conversation about the real issue here, which you correctly identified, as Google results showing up in Bing. But they chose to go the sensational route.

This is my reply to moultano that addresses your point:

The value of a search engine isn't any particular result, or any set of results. It's the quality of all the results over time. If Microsoft's algorithm picks up a tiny amount of signal (ahem, 1 of 1000) indirectly from Google's results, this does nothing to artificially inflate their position off of Google's back. There's nothing inherently wrong about using user signal for this.

There are many sites on the internet that generate a set of links based on form data. Google is one of many in that respect. This technique is effective in gathering search information on this "deep web". Special-casing Google positively or negatively is the wrong approach here.


>Bing did not "copy" this.

People only click on the results that are there. Google put them there. People tend to click on results in roughly the order they appear on the page, which Google also determined. Getting the data through an indirect means does not insulate them from culpability.


In my opinion it does. The whole purpose of this technique is to get better results than just an algorithmically generated list. If people are simply clicking on the first result, then it's not accomplishing what they want. Having a mirror image of Google's results is not what they want (otherwise why would anyone switch?). We all know how gamed the first few results on Google are these days anyways.


>otherwise why would anyone switch?

Marketing and flashy features, two areas where bing has been investing a lot of money.


The value of a search engine isn't any particular result, or any set of results. It's the quality of all the results over time. If Microsoft's algorithm picks up a tiny amount of signal (ahem, 1 of 1000) indirectly from Google's results, this does nothing to artificially inflate their position off of Google's back. There's nothing inherently wrong about using user signal for this.

There are many sites on the internet that generate a set of links based on form data. Google is one of many in that respect. This technique is effective in gathering search information on this "deep web". Special-casing Google positively or negatively is the wrong approach here.


The problem is that the small amount of signal that Bing picks up from Google carries more weight for the more rare associations that Google has worked so hard to help users find. Spelling mistakes stand out here quite a bit.


I don't really see this as a problem. This is in fact what search is all about. Some human makes an association between A and B, and a major search engine picks up on that. Google has algorithms for picking up on these associations, and so does Bing. It just so happens that some of those associations have passed through Google's servers before reaching Bing. The point is that Google is not creating these associations. They're algorithmically picking them up from others, just as Bing is doing.

Btw good job with the disagree downvotes guys.


It's not Ad Hominem because the Bing team isn't personally insulting Cutts or Singhai. If anything, it's just using malicious-sounding buzzwords. A honeypot isn't even a type of attack: it's a type of defense.


"It's not Ad Hominem because the Bing team isn't personally insulting Cutts or Singhai. If anything, it's just using malicious-sounding buzzwords. A honeypot isn't even a type of attack: it's a type of defense."

They did not directly insult Cutts and Singhal, but they offered a fake rebuttal of their actions/proof/views, peppered with malicious-sounding buzzwords.

...have nothing to do with the logical merits of the opponent's arguments or assertions... The proof was concocted by these engineers.

...but can also involve pointing out factual but ostensible character flaws or actions which are irrelevant to the opponent's argument...

The way his statement could be read is: Matt Cutts and Singhal are no more than spammers, tricking Microsoft with cloak and dagger tactics that prove nothing. Which constitutes a clear Ad Hominem.


The post said my feelings were feigned (definition: "to give a false appearance of"). My feelings weren't feigned.

(Also, it's spelled Singhal, not Singhai)


Apart from his flustered tone, Mehdi said nothing that Harry Shum didn't say in his post yesterday. I don't know why he bothered posting this.


Why are people suddenly using the word "honeypot"? A honeypot is something intended to be broken into. Nobody broke into anything in this story. There were no honeypots involved. You might call the search terms "tracers" or something.

Another thing that did not occur is click fraud. That term already means something, it involves pay-per-click advertising, and it doesn't have much to do with Google's experiment.

Don't let this guy distort the language we use to discuss the issue.


Regardless of whether or not one agrees with Yusuf Mehdi's conclusions, I can certainly understand his passion in defending his team. Google's statements yesterday were very accusatory and polemical (e.g. "We look forward to competing with genuinely new search algorithms out there—algorithms built on core innovation, and not on recycled search results from a competitor." ouch).


I mentioned this in another article, but I have yet to see evidence that demonstrates that bing is parsing google's queries either from a) the url of the referrer, or b) directly on the google results page, yet many people on HN have asserted as such, generally exhibiting a false dilemma fallacy.

An alternative is that the bing toolbar is collecting 2-tuples "<search_string_in_toolbar, next_href_clicked>" and sending these back to microsoft (regardless of the search provider selected). I would consider this "click stream data" and seems to agree with statements in the above article. Additionally, since it doesn't involve parsing hrefs it seems like the easier solution (and the one I'm going to tentatively assume by Occam's razor).
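A minimal sketch of that alternative, assuming the toolbar reports nothing more than (search string, next href clicked) pairs. The function name, the `min_clicks` threshold, and the sample data are all hypothetical; nothing here is known to be what Microsoft actually runs.

```python
from collections import Counter, defaultdict


def build_associations(click_pairs, min_clicks=2):
    """Map each query to the URLs clicked after it at least min_clicks times."""
    counts = defaultdict(Counter)
    for query, url in click_pairs:
        # Normalise the query so "Pancakes " and "pancakes" are counted together.
        counts[query.strip().lower()][url] += 1

    associations = {}
    for query, url_counts in counts.items():
        # most_common() yields (url, count) pairs, highest count first.
        kept = [url for url, n in url_counts.most_common() if n >= min_clicks]
        if kept:
            associations[query] = kept
    return associations


# Hypothetical click stream: (search_string_in_toolbar, next_href_clicked).
pairs = [
    ("torsorophy", "http://example.com/a"),
    ("torsorophy", "http://example.com/a"),
    ("torsorophy", "http://example.com/b"),
    ("pancakes", "http://example.org/recipes"),
]
result = build_associations(pairs)
print(result)
```

Note that nothing in this sketch looks at which site served the results page, so it needs no href parsing and no Google-specific branch.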

Just to be clear, it does not have to be the case that the bing toolbar is collecting data entered directly into the google search box. Indeed, google could have tested for this by having a control group where they entered search queries without using the toolbar search box, however from the details released so far we cannot conclude they did this (read: google hasn't released enough specifics about the test they conducted, and it's far from being reproducible with the current details).

Additionally, why do many in the HN community think this is google specific? I understand that google isn't claiming "bing copied just google" but that seems to be the consensus within HN and in the arguments of foul play (see the countless posts asking if there exists code that specifies google: "if string contains 'google' then..."). I'd like to see a test where users entered a pathological string in the toolbar with an alternative search provider specified (EDIT: not bing or google), clicked on a low ranking result (low ranking, or not appearing at all on bing for those keywords), and then checked whether it pops up higher on bing at a later time.


after reading the first few paragraphs, i closed the tab.. a waste of space. all PR crap.


It isn't clear whether Bing was taking special interest in click data off of Google searches, or giving it special weight in its own results.

Google seemed to think so, but even if Microsoft was treating the data they received as neutrally as possible, it doesn't change the request made of them. Google wants an exemption.


Going by what MS says, it's clear why Google's experiment worked. They came up with very unusual queries for which by definition there will be no data to generate search results. So out of the 1000 inputs only the one input from Google is present and hence the output.
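To illustrate the point, here is a rough sketch that assumes ranking is a plain weighted sum of signals, with absent signals contributing nothing. The signal names, weights, and values are invented for the example and are not Bing's actual factors.

```python
def score(signals, weights):
    """Weighted sum over whatever signals are present; missing ones add zero."""
    return sum(weights[name] * value for name, value in signals.items())


def rank(candidates, weights):
    """Return page names ordered from highest to lowest score."""
    return sorted(candidates, key=lambda page: score(candidates[page], weights),
                  reverse=True)


# Hypothetical weights: the click stream signal is deliberately tiny.
weights = {"anchor_text": 5.0, "page_content": 3.0, "clickstream": 0.01}

# A common query: every signal has data, so the click stream barely matters.
common = {
    "page_one": {"anchor_text": 0.9, "page_content": 0.8, "clickstream": 0.2},
    "page_two": {"anchor_text": 0.1, "page_content": 0.3, "clickstream": 1.0},
}

# A made-up query: no page has any data except one click stream observation.
rare = {
    "page_one": {},
    "page_two": {"clickstream": 1.0},
}

print(rank(common, weights))
print(rank(rare, weights))
```

Under these assumptions even a very small weight decides the ordering once every other signal is empty, which is the situation the synthetic queries created.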


Yes, but then this also means that for every term in Bing's index, "how it ranks on Google" is 1 of the 1000 factors. It takes a very obscure term to show it, but we shouldn't conclude from that that it happens only for obscure terms.

BING!! uses GoogleRank as a ranking factor. Perhaps they even adjust the weights of this factor, according to what the press/blogosphere is writing about Google's index quality...


What I'm interested in, does the Google Toolbar send back the click stream when you use Bing?


Feel like experimenting?


All I can say is, that google engineers must be really jealous for not coming up with the same idea of ranking sites based on what people enter into search boxes and what links they click on. Sounds like Pagerank 2.0 to me.


Yes, but from what other relevant search engine would they copy that?

ps: FYI, Google does record the clickstream within their own search results.


There were AI algorithms in 1999 that described doing exactly this. Pagerank is perfect for the web. Agreed, tracking people like ants, the links they click on as pheromone trails, makes for a great program: 90% of users visiting website A about pancakes went to site B about pancake recipes. The problem lies in tracking your users to this extent. On the web this isn't possible, unless you control vast web estate and are unscrupulous about using this data. I bet most Google engineers know about swarm intelligence computing, and aren't jealous of engineers like: http://www.youtube.com/watch?v=7GM4Lt5k24s who don't Bing! loud enough.


Great response. I wouldn't have used the term "click fraud," but the analogy is easy to grasp (ie, that the "test" purposely attempts to strengthen an association between things that are not associated.)


How is that a great response? It does not address the allegations at all, just claims the allegations are insulting.

The only great thing I see is that the blog post actually confirms what Google said. Because otherwise, it would have been easy for Microsoft to put the finger at the point where Google is wrong.


I think this response is convincing. Period. Full stop.


Why would anyone upvote a completely ambiguous comment (what is "this response")? You thought the ambiguity was intentional?


Were you also confused by the previous comment when it asked "How is that a great response?"


"How is that" clearly refers to Bing's response. I'll admit that I interpreted your post "I think this response is convincing" as referring to lysium's response to your post. It may have been clearer if you had written 'the response', or more directly, 'Microsoft/Bing's response'.


Maybe jpwagner was just joking from the very beginning and we didn't get it?


Bing seems to want me to sign in to read this. Am I doing something wrong?


It would be interesting to compare the search results of a million or so keywords between Google and Bing using their APIs. I would like to see some numbers behind the overlap.
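One way such a comparison could be scored, as a sketch only: it assumes you already have ranked URL lists per query from each engine (fetching them is left out, and no particular API is assumed). The function names and sample data are hypothetical.

```python
def top_k_overlap(results_a, results_b, k=10):
    """Jaccard overlap of the top-k URLs from two engines for one query."""
    a, b = set(results_a[:k]), set(results_b[:k])
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def mean_overlap(by_query_a, by_query_b, k=10):
    """Average top-k overlap across the queries both engines returned."""
    shared = by_query_a.keys() & by_query_b.keys()
    if not shared:
        return 0.0
    total = sum(top_k_overlap(by_query_a[q], by_query_b[q], k) for q in shared)
    return total / len(shared)


# Hypothetical ranked results keyed by query.
engine_a = {"q1": ["u1", "u2", "u3"], "q2": ["x"]}
engine_b = {"q1": ["u2", "u3", "u4"], "q2": ["y"]}

print(mean_overlap(engine_a, engine_b))
```

Jaccard overlap ignores ordering within the top k, so a rank-aware measure would be needed to say anything about whether the relative rankings match as well.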


This doesn't seem like a terribly convincing response from Bing. The "honeypot" test seems like a pretty conclusive result to me.


'Imitation is the sincerest form of flattery.' But, 'Flattery is the worst and falsest way of showing our esteem.' Jonathan Swift


The article is titled "Setting the record straight", but it is not fair to say that "Bing sets the record straight"


Google, Google. Stop crying. Just continue copying Bing. Infinite scroll images. And large image backgrounds on the search page. Copy the Maps features, too.


I boggle that people think infinite scroll came from (X). People have been hacking infinite scroll onto pages that don't support it for years, frequently through Greasemonkey scripts and there are a few very powerful extensions for essentially every browser out there. If you want to claim Google copied it from Bing, Bing copied it from creative users. I highly doubt there has been any significant amount (if any at all) of UI/UX original-creativity on any large website before someone else made a quick demo on their personal site or hacked it on top of existing sites through browsing tools.



