Hacker News
Facebook employee responds on robots.txt controversy (petewarden.typepad.com)
54 points by petewarden on June 17, 2010 | hide | past | favorite | 39 comments



This is Bret Taylor, CTO of Facebook.

There are a couple of things I want to clarify. First, we genuinely support data portability: we want users to be able to use their data in other applications without restriction. Our new data policies, which we deployed at f8, clearly reflect this (http://developers.facebook.com/policy/):

    "Users give you their basic account information when they connect with your application. For all other data, you must obtain explicit consent from the user who provided the data to us before using it for any purpose other than displaying it back to the user on your application."
Basically, users have complete control over their data, and as long as a user gives an application explicit consent, Facebook doesn't get in the way of the user using their data in your applications, beyond basic protections like prohibiting the sale of data to ad networks and other sleazy data collectors.

Crawling is a bit of a special case. We have a privacy control enabling users to decide whether they want their profile page to show up in search engines. Many of the other "crawlers" don't really meet user expectations. As Blake mentioned in his response on Pete's blog post, some sleazy crawlers simply aggregate user data en masse and then sell it, which we view as a threat to user privacy.

Pete's post did bring up some real issues with the way we were handling things. In particular, I think it was bad for us to stray from Internet standards and conventions by having a robots.txt that was open and a separate agreement with additional restrictions. This was just a lapse of judgment.

We are updating our robots.txt to explicitly allow the crawlers of the search engines we currently let index Facebook content and to disallow all other crawlers. We will whitelist crawlers when legitimate companies contact us who want to crawl us (presumably search engines). For other purposes, we really want people using our API, because it has explicit controls around privacy and imposes additional requirements we feel are important when a company is using users' data from Facebook (e.g., we require that you have a privacy policy and offer users the ability to delete their data from your service).

This robots.txt change should be deployed today. The change will make our robots.txt abide by conventions and standards, which I think is the main legitimate complaint in Pete's post.


Thanks Bret, I appreciate you taking the time to dig into this. You're right, my complaint is the disconnect between the terms-of-service and robots.txt, so I'm very happy you're addressing this.

I just wish we could have had this conversation a few months ago, before you guys threw your lawyers at me.


And all of this is why Bret Taylor is a huge asset for Facebook. He believes in openness and trust with users and will push for it everywhere.


It would actually be interesting to have a browser plug-in that shows you if a page you are on is excluded in the robots.txt file or which agents are specifically excluded/allowed to scan the page. Might make for some interesting viewing on sites like Facebook.


Yes, I've been thinking for a while about building an extension that shows you stuff like:

IP, location, and host of the site you are viewing; whois details for the site; caching rules; robots.txt rules for the page; PageRank; etc.

Probably as a dropdown info panel.

I'm surprised if this doesn't already exist, though.
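For what it's worth, the robots.txt half of that check is easy to prototype: Python's standard library ships a parser. A minimal sketch (the function name and URLs are illustrative, not from any existing extension):

```python
from urllib.robotparser import RobotFileParser

def crawl_permissions(robots_txt, page_url, agents=("Googlebot", "*")):
    """Given the text of a site's robots.txt, report which user agents
    may fetch page_url under those rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, page_url) for agent in agents}

# A couple of lines quoted from Facebook's robots.txt elsewhere in this thread:
rules = "User-agent: *\nDisallow: /photos.php\nDisallow: /feeds/\n"
print(crawl_permissions(rules, "http://www.facebook.com/photos.php"))
```

A real extension would fetch /robots.txt for the current tab and feed it through something like this to drive the info panel.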


Getting a response back from an executive is super-awesome.

I'm a little concerned about how we (80legs) fit into Facebook's permission form, though. We're not using or selling the data ourselves, but we can sell access to an already-setup web crawl - so how does that work? We're like a special case of a special case. Never an enviable spot to be in.


Please post contact information for bots wanting to be whitelisted. (Either here or in the robots.txt file itself).

For example, our ChangeDetection.com site has a couple hundred users monitoring pages on facebook. (User-Agent: ChangeDetection). We always honor robots.txt so if we are not whitelisted all these monitors will be disabled.

Edit: looks like you already have posted contact info in robots.txt


Thanks for taking the time to respond to this. I definitely understand where you're coming from and am glad to hear the robots.txt is being updated.


Privacy is important, but data control (in a restrictive sense) is not the same as data portability.


Bret wrote: "I think it was bad for us to stray from Internet standards and conventions by having a robots.txt that was open and a separate agreement with additional restrictions."

You don't have an "agreement" at all. An agreement requires that two parties actually, um, agree. You've published a statement where you assert certain rights and imply that you will sue anyone who accesses data on your site in a way you don't like. You may get away with that, regardless of the legal merits of your position, because you have more money for lawyers than most people you're likely to sue. But don't try to dignify what you're doing by calling it an "agreement". It's like an extortionist telling me that we have an agreement that he won't break my windows if I pay him protection money.


If you feel so strongly about Facebook's "agreement", perhaps you shouldn't use Facebook.


In my personal opinion, the Facebook employee's response shows that he doesn't understand the company's high-level business goals. Their developers may think they are building an open web, but it's very clear the company is only interested in a closed web they control.

Based on my interaction with website API developers, most of them honestly believe that they are building an open web. But let's face it, the API is a benefit to them and creates a critical dependency for the user of the API. It's basically vendor lock-in.


As far as I know, any data that people post to Facebook without restricted visibility can be pulled through the API. If a user has logged into your site and given you access to their data, you can access restricted data as well. You can also write to Facebook via the API.

I'm not sure what you want Facebook to change.


Pete's right-on about this: "You've chosen to leave all that information out in the open so you can benefit from the search traffic, and instead try to change the established rules of the web so you can selectively sue anyone you decide is a threat. "

Speaking as someone who's working on leveraging the Facebook API in a commercial product, this leaves me feeling like I'm opening myself to a lot of legal exposure if Facebook subjectively decides that my service poses even a minor threat to them. Given that I'm bootstrapping, there's no way I'd be able to put up any sort of legal fight whatsoever against a company as well-funded as Facebook.


I think that Facebook has a right to protect their data, but leaving robots.txt wide open and trying to enforce some random TOS that a developer might not even know about is the wrong way to do it. A much better approach would be to disallow all bots in robots.txt (and block crawling) except the major search engines, and to provide data to others through APIs or other controlled means, so the TOS can apply to the API rather than to ad-hoc crawling. Maybe not what a developer wants to hear, but the truth is that the internet is infested with bots that do nefarious things, and I think it's better for the privacy and security of their userbase to control the spread of their data as much as possible.


And I absolutely agree with you - it's the subjectivity in their choice of whether or not something actually violates their policies which gives me pause. If you can't figure out in good faith if you're actually complying with their rules or not then that's a problem for developers, myself and Pete included.


Robots.txt governs automated retrieval of a page's contents. I don't think it makes any comments about what the retriever does with that data. Those uses are governed by other sets of rules. For instance, just because a site's robots.txt invites automated retrieval doesn't make it a legitimate use to just re-host that content elsewhere as your own.

I think Facebook's ToS is an appropriate way for them to send messages about what they'll do if you make certain uses of the content you retrieve from their site. However, these ToS documents aren't omnipotent: they can't restrict some fair uses to which you might put the data, for instance, or bind you to silly terms. IANAL, but I think the ToS is probably best understood as an intent to use still other sets of rules (perhaps selectively) if you do certain things with retrieved data. Disagreeing with that policy is certainly possible, but I don't think robots.txt has much to do with it.


They could also make the pages NOT indexable and then have a side agreement with Google to give Google API access to index the pages they want.


What? So you want to just make Google more powerful? What about all the other search engines, even lesser-known ones like Duck Duck Go?


I am unfamiliar with the details of a robots.txt file. Is it possible to specify "I only want bots x, y and z to crawl my site"?
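Yes. robots.txt rules are grouped by User-agent, and a compliant crawler obeys only the most specific group that matches it, so a whitelist is just a set of named groups with an empty Disallow plus a catch-all group that blocks everything else (bot names illustrative):

```
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /
```

An empty Disallow value means "nothing is disallowed" for that agent, while `Disallow: /` blocks the entire site for everyone else. Compliance is voluntary, of course.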



How difficult is it for a bot to lie?

Though if you've made it clear that only x, y and z can crawl your site, and someone spoofs, say, y, then it would be easy to demonstrate that someone has done something they know they shouldn't.


Incredibly easy.

And not only can the bot lie, it can disregard the robots.txt file altogether. Just like the terms of service document for humans, you can choose to disregard it and deal with the consequences (blocked IPs, lawsuits, etc.).

robots.txt is just a version of the TOS that computers can read.
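To make the parent concrete: the User-Agent is just a request header the client fills in, so "lying" is one line of code. A sketch with Python's stdlib (the spoofed string is Googlebot's published identity; nothing else about the request is special):

```python
import urllib.request

# Build a request that claims to be Google's crawler. Nothing on the
# client side verifies this; servers that care have to check the
# requester's IP (e.g., via reverse DNS) rather than trust the header.
req = urllib.request.Request(
    "http://www.example.com/",
    headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
)
print(req.get_header("User-agent"))  # the identity that would be sent
```

This is exactly why whitelisting in robots.txt only deters honest crawlers.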


"Facebook has always been a closed system where developers are expected to live in a culture of asking permission before doing anything, and existing at the whim of the company's management. [...] The web I love is an open world where you are free to innovate as long as you stick to the mutually agreed rules."

Facebook's policies aside, it's interesting to note that this criticism was written in response to a letter from Blake Ross, the co-creator of Firefox.


This is a really important issue.

This new model that Facebook is trying to push isn't scalable. It favors the big guys and it's bad for the open web.

You shouldn't have a TOS that contradicts your robots.txt. Period.


Sorry, as implied in my comment on the first story about this posted to Hacker News, I'm siding with Facebook on this one.

Of course they tune their pages so that search crawlers can best index the information; so does everyone else on the web.

However, they also provide an API, with clearly defined terms of use, which you may use to get information. Basically your complaint boils down to you don't want to use the methods they've set up for you to access their data, and you're complaining about it.

As for comments about fairness, the true spirit of the internet, and what have you... I don't think the true spirit of the internet has ever been that everyone has to give everything away in every imaginable way. And, in the end, it is Facebook's data; they make that clear before you ever start adding data to their servers. So does just about everyone else who allows you to submit data to their servers.

This shouldn't be confused with how difficult it is to remove data and accounts from their system, which is a giant pain in the butt, or the fact that they've made drastic changes to the public nature of the data after keeping it private for so long. That's all just a giant mess.


If I request content without circumventing any access controls and you give that content to me, I've done nothing wrong. I didn't accept any terms of service to do that, and even if I did, what is the legal recourse for someone breaking the terms of service? I always assumed it was termination of service, but that's probably wrong.

Suing people because you're too lazy to only give your content to people who you want to receive it is wrong.


What will happen is that they will send a cease and desist letter to you (like they did to Pete). At that point, you have basically given notice and it becomes much riskier to continue doing what you're doing.

That's not to say their C&D is valid or enforceable. It just enters into another legal bin that's a bit more hairy.

Unfortunately, the legal precedent for all this is murky at best. Past cases have been very specific to the details of those cases. No line has ever been drawn about what is ok and what is not.


> That's not to say their C&D is valid or enforceable. It just enters into another legal bin that's a bit more hairy.

Exactly. It will be interesting to see where things end up playing out in the "leaving yourself wide open" department.

It's very much parallel with the "using open wifi points" debate.


I think it's valid to consider robots.txt to be a part of a site's API.


It's not Facebook's data. It's the users' data.


I am just arguing logically here, not in support of anyone. But hasn't Facebook worked "hard" to set up the infrastructure to "collect" users' data? If you want users' data, go ask the users, not scrape Facebook.


but their robots.txt says that you're welcome to scrape it:

  User-agent: *
  Disallow: /ac.php
  Disallow: /ae.php
  Disallow: /album.php
  Disallow: /ap.php
  Disallow: /feeds/
  Disallow: /o.php
  Disallow: /p.php
  Disallow: /photo_comments.php
  Disallow: /photo_search.php
  Disallow: /photos.php


> it is Facebook's data, they make that clear before you ever start adding data to their servers.

Not really true, their policy clearly says:

"You own all of the content and information you post on Facebook"

http://www.facebook.com/terms.php


"However, they also provide an API, with clearly defined terms of use, ..."

But are those terms an enforceable contract? Probably not. Merely making available a document that purports to be a contract does not make it so.


Facebook isn't the only site with that language. In fact I think you'll have to look hard to find any major site that doesn't have it. Not that it doesn't suck.


This is essentially the same argument people use against the iOS platform.

But clearly, from Apple's example, it is possible to build a profitable product on a closed system.

Admittedly it doesn't seem "fair" or true to the spirit of the internet, but it shouldn't be any surprise that a company is going to do what is best for that company, forsaking all others.


I'm tired of hearing people complain about private companies' rules for using their information/platforms. Facebook and Apple can do whatever they want on their own platforms and if people don't like it they can leave/not use the service. That's how a free market and free web works. People complained about Microsoft's dominance years ago and now their ascendancy is ending due to this idea called the "market"...


The issue that a lot of people have is that they would like to see change happen in their lifetime. Microsoft is in a decline now, but that doesn't mean that: 1) it will continue to decline or 2) it will completely fail within our lifetimes. Once companies hit a certain level, they take a long time to completely fail due to the accumulated money and power during their 'good years.'

Change, in general, doesn't come about by people sitting around doing nothing but saying, 'the market will decide.' If you take the time to become an advocate for a certain 'side,' you can affect the market through public perception.

[aside: People are not always logical agents with access to full information, as most economic theories seem to assume. To me this comes off like a lot of physics, where things are assumed to be on a 'perfectly frictionless surface,' 'in a vacuum,' etc.]



