What one may find in robots.txt (xn--thibaud-dya.fr)
393 points by cellover on May 18, 2015 | 122 comments



I needed some real-world MS Word & Excel documents years ago to test some parsing code against. So I started crawling Google results for 'filetype:.doc .xls' style queries. Left it running for a weekend, then the whole Monday was wasted as I was sucked into looking through the results - some stuff in there was certainly not meant for public disclosure...


Did you really crawl Google? That must have been a long time ago. But speaking of searching on Google as a user:

Google's Advanced Search used to be a great tool, until around 2007/08. For some reason it never received an upgrade, and several things are broken, don't work any more, or were removed (e.g. '+', which is now a keyword for Google+, and '"' no longer means the same thing; e.g. some filetypes are blocked, some show only a few results).


Google never had an advanced search that was as useful as altavista where I regularly (daily!) searched for things like:

("term1" and "term2") not ("term3" or "term4")

and whatever tiny "power user" features that google had, like "allinsite:term term term" or '+' don't seem to work at all now.

Google is not optimized for finding things. Google is optimized for ad views and clicks.


so you mean like "term1" "term2" -"term3" -"term4"? or if I wanted to do this without returning results from hackernews "term1" ... -"term4" -site:news.ycombinator.com ?

I don't see how altavista is superior here


The problem is "whatever tiny 'power user' features that google had... don't seem to work at all now."

I think I know what they were talking about. A lot of times it appears that adding advanced terms to a query will change the estimated number of results, yet all the top hits will be exactly the same. Also, punctuation seems to be largely ignored, e.g. searching "etc apt sources list" and "/etc/apt/sources.list" both give me the exact same results. Putting the filename in quotes also gives the same results as before.

Searching for specific error messages with more than a few key words or a filename is usually a nightmare.


This is all true. I do wish there were a flag you could set like searchp:"/etc/apt/sources.list"


The original Google has gone forever.

But then, so has the www that the original Google worked so well for.

I suspect original Google would be horrible on today's web.

I miss 1998 and I mourn for what could have been.


Is there any truth to my suspicion that the web of hyperlinks (on which the famed algorithm relied) is significantly weaker and reaches fewer corners these days?

Certainly feels like content is migrating to the walled gardens and there are fewer and fewer personal websites injecting edges into the open graph.


Last November I speculated why Google would let HTTP/2 get standardized without specifying the use of SRV records:

This is going to bite them big time in the end, because Google got large by indexing the Geocities-style web, where everybody did have their own web page on a very distributed set of web hosts. What Google is doing is only contributing to the centralization of the Web, the conversion of the Web into Facebook, which will, in turn, kill Google, since they then will have nothing to index.

They sort of saw this coming, but their idea of a fix was Google+ – trying to make sure that they were the ones on top. I think they are still hoping for this, which is why they won’t allow a decentralized web by using SRV records in HTTP/2.

https://news.ycombinator.com/item?id=8550133


It feels the same way to me. To a large percentage of users the internet is Facebook rather than the largest compendium of human knowledge in existence, but luckily for those of us who use it for the latter reason, the value of such a thing will always be evident.


Come to think of it, we've moved offices 3 times since then, so it must've been 8-10 years ago. I don't think I had to do any special trickery, and I spent only an afternoon or so writing and testing the code. I didn't realize such a thing would be impossible now - what a shame. I downloaded several gigabytes iirc - a big amount at the time.


Though nowadays you could use Common Crawl to get the dataset and use existing tools to extract such files, right? (I've no idea whether that's a practical thing to do or not.)


I guess so, if they "look" at the web the same way Google does (respecting robots.txt, nofollow etc - which Wikipedia says they do). But the interesting things are found in nooks and crannies where nobody else has thought of looking before - so relying on someone else to do the heavy lifting is probably the wrong way to go about it...


Common crawl gives you the data, not the results for the keywords that you're interested in.


Pardon me for asking, but how did you crawl Google for a whole weekend? From what I know, Google blocks you if you request too many search queries in a short period of time. Did you use proxies?


Maybe now; I don't remember having to do anything sneaky back then (8-10 years ago).


In 2004 I attempted to do some automatic crawling of Google for my master's thesis and was astonished to get an unfriendly server response saying it was disallowed and "don't even bother asking for an exception for research, it won't be granted."

So at least 11 years ago it was blocked.

(I didn't know about spoofing a user agent back then, so it might not have been as easy as that to get around it.)


> some stuff in there was certainly not meant for public disclosure...

Care to say more? I'm not asking about specifics (but won't mind reading about them either), I'm just curious about the type of information. Internal documents of corporations? .mil? Something else?


Meh, nothing that would make the newspapers, and nothing good enough to remember the details of about 10 years later - sorry to get your hopes up :) Internal company meeting notes, reports on legal proceedings, I remember some correspondence on a divorce, checklists and procedures for business processes, test grades and student notes from schools, that sort of stuff. Excel sheets with budgets and timekeeping sheets, just typical office stuff - no .mil, or at least nothing interesting; I would have remembered if it had been anything juicy. Maybe Google results didn't show those, I don't know.

It's not rocket science - go to Google.com, type "[some keyword] filetype:doc" (or pdf, or xls, whatever), skip to page 10 or so, then start looking. I had to click just twice before I found (5 minutes ago) meeting minutes from some British council meeting, marked 'confidential and not for distribution', complete with names and dates/times. Go to your 'private files' folder on your machine (you know, where you store your job applications, that invitation you made once for your sister's birthday party, that sort of junk), look at the documents you'd least want somebody else to see, identify some keywords that are in those documents but not usually in others, and use those as keywords in Google.

Here's a fun one: "site:gallery.mailchimp.com filetype:pdf coupon". Nothing shocking, but I still don't think Mailchimp's customers expect their email list attachments to show up like this... (disclaimer: pure speculation, maybe these are meant to be publicly available, or maybe they're not even real customers and just test data, I read it for the articles, etc.)


Yes OP, make us laugh with ancient stories of the ".doc" web! You've said too much when you mentioned "some documents"...


Great. There goes any chance of me actually doing work this week. ;p


I find this submission interesting because it underscores the inconsistent handling of I18N/punycode domains. The domain is "thiébaud.fr". Should submission sites (like HN) show the sites in ASCII? Is there a fraud risk? Should the web browser show the domain in ASCII?

For me, at no point was I shown the decoded Unicode domain (either on HN or in the browser). I recognized the pattern and decoded it manually. For users who are not technical, this is a failed experience because the domain looks suspicious and at no point was it decoded.

I wonder when punycode decoding will begin to get attention from developers. Last year's Google IO had a great talk about how Google realized the inconsistency of their domain handling with regard to I18N:

https://www.google.com/events/io/schedule/session/22ce27dc-7...


It's a vector for putting in disruptive utf-8 characters, such as a huge stack of accents, or spoofing a reputable domain. It's not clear yet that the benefits to HN outweigh the risks. But if we start seeing a lot of quality content from domains that look better with punycode decoded, it'll be considered.


I'm on firefox, and it showed me the decoded domain.


Also on FF, 37.0.2 Linux x64 build, and I see `xn--thibaud-dya.fr` on HN but `thiébaud.fr` in the status/address bars.


38.0.1 Linux x64


It actually differs quite a bit how the various browsers handle Punycode domains in different places: http://blog.dubbelboer.com/2015/05/10/unicode-domain-support...


User-agent: *

Allow: /

# A robot may not injure a human being or through inaction allow a human being to come to harm.

# A robot must obey the orders given it by human beings, except where such orders would conflict with the First Law

# A robot must protect its own existence, as long as such protection does not conflict with the First or Second Laws.


Hm, the article says several times "required not to be indexed", while robots.txt is more like a request not to be crawled. An important distinction, because a page may well be indexed without being crawled, typically when there is a link to it. Better to use the noindex meta tag (at least if you are only concerned with search indexes, not access control).


Yes. This needs to be emphasized: disallowing URLs in robots.txt will not necessarily exclude them from search results. Search engines will still find those pages if they are linked to or mentioned somewhere else.

The search result will consist only of the URL, and the snippet will say "A description for this result is not available because of this site's robots.txt"

Use the noindex tag, folks. Also, Google Webmaster Tools allows you to remove URLs from Google's index.

Ref.: https://yoast.com/prevent-site-being-indexed/
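
For reference, the noindex mechanism recommended above looks roughly like this (illustrative snippets only). Note that a crawler has to be allowed to fetch the page in order to see a noindex at all, so don't also Disallow that URL in robots.txt:

  <!-- in the page's <head> -->
  <meta name="robots" content="noindex">

  # or, for non-HTML files such as PDFs, sent as an HTTP response header
  X-Robots-Tag: noindex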


> "At worse, only the internet etiquette has been breached."

Proceeds to announce the name of a stalking victim. Classy.


And he's also wrong. I assume he's French, and French law specifically says that any unwanted access to restricted parts of a "data processing system" is punishable by a hefty fine and/or prison time. For example, I think bug/vuln bounty programs are illegal in France (but IANAL).

A relatively famous French blogger was once convicted at trial for having publicly reported that a governmental agency was letting GoogleBot index confidential/restricted documents: http://bluetouff.com/2013/04/25/la-non-affaire-bluetouff-vs-... [FR] http://arstechnica.com/tech-policy/2014/02/french-journalist... [EN].


Robots.txt is supposed to be public. Its purpose is to recommend not visiting some links (it's only a recommendation). And he did not visit the link itself, he just read the comments (made for humans to read) in a public file. I don't think there is a problem here.


How exactly is robots.txt a "restricted part"?


I believe the intent was that if robots.txt claims to disallow a section of a website, then French law would say that section is off-limits to visitors and visiting disallowed sections is punishable by law.

I doubt that this is a valid interpretation of the law. The robots.txt file simply mentions parts of the site that shouldn't be indexed, not which parts of a site shouldn't be viewed by the general public. You use a robots.txt file to prevent a search engine from following links that would enumerate all the possible dynamically-generated content on your site to conserve resources and to prevent junk results appearing for a search user.

Tangentially relevant: when you do have something you want indexed, it should probably be a static page that lives at a permanent URL. But never attempt to use a robots.txt file to "hide" sensitive data.


Off limits to robots, hence the name.

edit: I think you edited to clarify since I commented. Thanks.


I believe that only services hosted on French soil are subject to French law.

Also, showing the link but not the content is a smart move, because it doesn't prove that the author looked at the content of these documents, whereas the robots.txt is obviously a file you should be able to consult.


French law would then be different from German law, where what matters is the location of the intended audience. Of course, it is much harder to prosecute someone when the server is in a country that does not cooperate.


Strongly agree. I emailed the school to let them know about the error on their part. How hard would it have been for the author to do the same, and to change the name of the victim and institution to protect her?


It's a name that's in a plain-text document that anybody can personally look up in their browser. What's the point of hiding it, when it makes a very good point for his article?


He could have made exactly the same point with a fake name and institution. The information became public due to incompetence; he's made it much more visible… I don't care to speculate on what personal failing might be the reason.


It won't be exactly the same point.

With a fake name there is no proof that the information in question ever existed.

I would've probably masked the name, but the institution URL should stay there, so anyone could check that the point is valid.


Okay, that's fair, but it would be the same point for practical purposes, in my opinion. I don't personally believe that it's necessary for people to get hand-holding to independently verify his individual claims. Certainly other people could copy his methods to reproduce the results, but I'm not sure what the benefit is of links into the individual leaks of sensitive information.


The exact same point can be made by redacting or anonymizing the individual's name. Perhaps the organization rethinks its use of the comment and removes it. However, this blog post will live on through the archives of the internet in perpetuity, forever labeling that individual and, unfortunately, helps out their stalker.


The point would be just as well made if he did not list the actual name.


Now it's indexed by search engines?


https://www.google.com/search?q=inurl%3Arobots.txt

Not that I agree it should be further divulged, mind you.


Google for "Lisa Wilberg esc" (just like what a stalker might do to narrow it down to the correct uni) and you get the robots.txt file itself in 3rd place. The author isn't doing any more harm than Google already is. It's already easily searchable.


The author is bringing more unwanted attention to a victim. Why not contact the school and let them know of their error, or change the woman's name? Reprinting her name and the institution name accomplishes nothing good.


According to the date it would be almost 2 years ago, not exactly a current event...

There are plenty of others with the same name as her (58k according to Google), that file is now 404, and archive.org doesn't have it (due to its mention in the robots.txt), so whatever information in it could've been useful for a stalker is long gone.


Signed up real quick to post this...

Her name shows up in a newsletter, including her middle name, published and indexed by the school in 2012. It lists her occupation (a very public one) and even where you can find her works. This alone would be enough to get in touch with her.

It was pure morbid curiosity that led me to search for it, but it's totally still relevant if you were looking for her. Unfortunately, given her occupation, it's unlikely that anything short of the stalker giving up or getting arrested will grant her any sort of reprieve.


Does it matter that the event was 2 years ago? I'm sure the victim wants to move on and put this well behind her.


I've had a stalker track me for well over 2 years.

That said, he only reveals that she was stalked, which I presume the stalker and she are quite aware of, but no more than that. A better way to report the finding would have been to keep her name and the name of the institution out of it.


You also get this one and other online discussions about the article. The author has shown his lack of empathy and judgement by putting somebody in the spotlight for no good reason whatsoever.


I can't comment on the particulars, because the document doesn't mention this name (maybe it changed since it was originally posted?), but, whatever you feel about the justification for the original article, it seems clear that nothing is served by your re-posting the information.

At the moment, for me, your post is the only hit on Google for the query you suggest.


When I do that I get four relevant results. First is the robots.txt. Esc could presumably fix that were they told of their error. The second is the article, the third is your comment, and the fourth is a reddit post. The reddit one has recently been redacted (but is still in google cache).


I doubt the stalker has forgotten it.


Is it possible that has been removed (in which case, kudos to the author)? I didn't see a name in the document, and the name that Lorento mentions doesn't appear.


I like to add a Disallow rule for a randomly named directory with an index.php file that blocks any IP that accesses it. Then, for a bit of added fun, I put another Disallow rule for a directory named "spamtrap" which does an .htaccess redirection to the block script in the randomly named directory.
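
(For anyone who wants to play with this pattern, here's a minimal sketch of the idea in Python/Flask rather than the commenter's actual index.php/.htaccess setup; the trap path and the in-memory ban set are placeholders invented for the example, not their real configuration.)

  from flask import Flask, request, abort

  app = Flask(__name__)
  banned = set()  # stand-in for a real, expiring ban list (file, DB, firewall, ...)

  @app.route('/robots.txt')
  def robots():
      # advertise the trap; well-behaved crawlers will never request it
      body = "User-agent: *\nDisallow: /trap-3f9c2a/\n"
      return body, 200, {'Content-Type': 'text/plain'}

  @app.route('/trap-3f9c2a/')
  def trap():
      # only something that ignores robots.txt ends up here
      banned.add(request.remote_addr)
      abort(403)

  @app.before_request
  def check_ban():
      if request.remote_addr in banned:
          abort(403)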


You understand that all you are doing with that is giving people possibly looking for attack vectors an attack vector, right? All an evildoer has to do is embed the blacklist directory as an image somewhere, send it to someone they want to lock out of the service, etc.


It wouldn't be too hard to prevent that. E.g. the honeypot directory name could be a hash of the originating IP. You'd then need a dynamic robots.txt, but that's easily done.

The destination directory doesn't even need to exist. Worst-case scenario, you could handle the hash via your 404 handler or via an .htaccess file if all of your hashes are prefixed. Those are only examples though - there's a multitude of ways you could handle the incoming request.
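
A rough sketch of that per-IP variant (with a secret salt folded in, as a reply further down suggests, so the path can't be recomputed by an attacker); the names and the truncation to 16 hex characters are arbitrary choices for illustration:

  import hashlib

  SECRET_SALT = "change-me"  # assumed to be kept secret; without it the path is guessable

  def honeypot_path(ip):
      # trap directory derived from the requesting IP, so a link to it
      # can't be used to get arbitrary other visitors banned
      digest = hashlib.sha256((SECRET_SALT + ip).encode()).hexdigest()[:16]
      return "/trap-%s/" % digest

  def robots_txt_for(ip):
      # served dynamically as /robots.txt for each requester
      return "User-agent: *\nDisallow: %s\n" % honeypot_path(ip)

  def is_honeypot_request(path, ip):
      # e.g. called from the 404 handler: the directory never has to exist
      return path.rstrip("/") + "/" == honeypot_path(ip)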


And then our hypothetical attacker can figure out how you generate the honeypot URL and embed an image with that URL for their visitors. Of course you can easily make it impossible to guess, but you need to make sure that they can't obtain robots.txt via GET requests from a visitor's browser (no Access-Control-Allow headers). Also, don't forget that a single visit from a network sharing an IP would still ban the whole network. And there is the rule that a GET request should not change any state. Banning an IP address is a change of state.

In short, it's just too much pain for little gain.

Maybe a login form can be served from that URL, and any attempts to log in would then get the visitor banned via a session cookie / browser fingerprint combo (easy to get around, but at least then you're not blocking IP addresses).


> And then our hypothetical attacker can figure out how you generate the honeypot URL and embed an image with that URL for their visitors.

So add a salt, just like you would when hashing passwords. You then make it more time-consuming to crack the hash than it would be to perform a more typical denial-of-service attack.

> you need to make sure that they can't obtain robots.txt via GET requests from a visitors browser (No Access-Control-Allow headers).

If an attacker already has control over the victim's browser (to pull the robots.txt file) then they really don't need to bother with this attack.

> Also don't forget a single visit from a network, sharing an IP would still ban all the network.

Multiple users of the same site behind the same NAT really isn't that common unless you're Google / Facebook / etc. And when you're talking about those kinds of volumes then you'd have intrusion detection systems and possibly other, more sophisticated, honeypots in place to capture this kind of stuff. Also, some busier sites will have to comply with PCI data security standards (and similar, such as the Gambling Commission audits) which will require regular vulnerability scans (and possibly pen tests as well - depending on the strictness of the standards / audit) which will hopefully highlight weaknesses without the need to blanket ban via entrapment. And in the extremely rare instances where someone is innocently caught out, it's only a temporary ban anyway.

> Maybe a login form can be served from that URL, and any attempts to login would then get the visitor banned via a session cookie / browser fingerprint combo (Easy to get around but at least then you're not blocking IP addresses).

You can do this same method of banning with the honeypot you're arguing against!

While I don't disagree with any of your points per se, I do think you're being a little over dramatic. :)


Well, I was trying to point out that a small and not-so-useful feature can cause some unnecessary headaches. I know that it can work; it's just not worth it in my opinion. However, on second thought, I guess this could be implemented in less time than we used to write these comments :)

The login-form method would be a bit less silly, I thought, because it can be a POST. But, well...


The real benefit of a login form is you can easily separate the casual nosy surfer from someone more interested in hacking your site.


I'm with you on this - not for huge sites, but if even 1% of sites used this behavior, malicious bots would be forced to honor robots.txt.


Just check the Referer header.


You can't trust the referrer header for anything security related.



Your citation does reiterate my point. I quote: "However, checking the referer is considered to be a weaker form of CSRF protection."

The referrer header can be subject to all sorts of subtle edge cases, such as switching between secure and insecure content (or is it the other way around? I can't recall offhand), in which case many browsers will refuse to send a referrer header. So while checking the referrer might work most of the time, it's really not robust enough to be considered trustworthy for anything security related.


The Referer header does not get set on copied-and-pasted URLs.


Since a CSRF will come from an <img> tag which is not on your site, you can check the header. So the only problem is if the user copy-pastes the link from a website or clicks from some other application, which decreases the chance of a substantial attack a lot. If you are using http, as laumars stated, this doesn't work. But I think this should be a pretty decent and easy-to-implement solution if you're using https.
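
A minimal, framework-agnostic sketch of such a Referer check (the host name is a placeholder; as the comments above point out, browsers legitimately omit the header in several cases, so a failed check is a weak signal at best):

  from urllib.parse import urlparse

  def referer_is_own_site(headers, own_host="example.org"):
      # True only when the Referer header parses and points at our own host.
      # Missing headers (pasted URLs, HTTPS->HTTP transitions, privacy settings)
      # will fail this check even for legitimate users.
      referer = headers.get("Referer", "")
      return urlparse(referer).hostname == own_host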


While you have a very valid point, in my use case I just ignore it. I don't see why anyone would want to block anyone else from accessing a small independent label website. Or the personal blog of a friend of mine. Or the portfolio site of another friend who is a designer. (edit)Or what real harm could come from it if it happened.


I agree with you (that the risk is acceptable), but I also believe you underestimate the pettiness of people. I'm sure there are people who believe the indie label or your friends are enemies (for whatever real or imagined reason). But the set of those people who also have the technical ability, or access to the technical ability, to enact this "revenge" is almost certainly empty.


What if I put a link to this URL and Googlebot follows it? Will you block Googlebot?


Googlebot won't follow the link because it's listed in robots.txt.


Googlebot caches robots.txt for a very, very long time. If you disallow a directory it may take months for the entire googlebot fleet to start ignoring it. Google's official stance is that you should manage disallow directives through webmaster tools.


Yes, but it will index it.


Wonderful. So your users may occasionally be blocked due to you having a bit of fun. IP blocking is a terribly over broad way of stopping intruders.


If my users want to go where I ask them not to go, when they have no reason at all to go there, that's their problem. There are no links anywhere to those dirs except for the robots.txt. Also, the blocks lift after a couple of days. Honestly, nobody has ever complained, and I'm sure it stops or hampers some attacks dead in their tracks.


Users share IPs.


What do you mean 'with bots'? We're talking about anyone at all who hits a link getting the IP they're on blocked.

Some mobile networks have every person on the network originating traffic from the same IP. Some large institutions, universities, government departments, large companies have all their traffic coming from one IP.

This person has effectively created a feature that will perform a denial of service attack on their own website.


The websites I manage aren't Facebook or Google sized. Or even HN sized. I don't see that as a real problem at all.


You don't see it as a problem because the affected users don't see you (your sites). As IPv4 addresses are getting scarce, many people share the same IP. So it's really bad practice to ban an IP for a longer period.


With bots? If an IP does not respect the host's rules, it deserves to be blocked.


For a start, the entire nation of Qatar shares 82.148.97.69.


And it's definitely not a good idea to run a bot from there.


That's actually not true. A bunch of people in Qatar use that IP though.


Good thing a whopping 0% of my expected userbase originates from Qatar.


robots.txt tells robots where not to go.


I don't understand why a user would access a randomly created directory that is only mentioned in the robots.txt file. Can you explain as to why you'd think they'd stumble across this?


Curiosity?


This is silly. If you have a user of your site going through your robots.txt file and specifically going to directories listed as Disallow, then you deal with that user. Blocking based on the robots.txt for a directory that doesn't exist anywhere but that file is fine. I did a two-bad-directory ban; it seemed to work fine.


Curiosity killed the cat, or so they say.


As long as the block lasts not more than a few days, as he says in a nearby comment, I don't think it's much of a problem.


Oh, believe me, they only last a few days at most. I don't use an infinite IP blocklist. It has like 100 IPs or so. When one goes in, one must come out. And let's just say there are enough badly behaved bots around that it doesn't take much time for 100 IPs to rotate.


"But satisfaction brought him back."


Ditto. My favorite name is /porn and then see who visits it. Mostly bots, though.


What's with the domain name? It says thiébaud.fr, but when copied/pasted it becomes xn--thibaud-dya.fr.

Edit: Thanks for the links! :)


The domain name is being converted to PunyCode (https://en.wikipedia.org/wiki/Punycode). This is a defence developed to prevent spoofing of domain names using indistinguishable characters from other character sets than ASCII. Before this was fixed by browser developers, it was possible to execute an IDN homograph attack (https://en.wikipedia.org/wiki/IDN_homograph_attack) and spoof sites using characters that looked like e.g., google.com/paypal.com when they rendered in your browser address bar, but were actually a completely different domain as far as automated systems were concerned.

Edit: typo.
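
If you want to see the round trip yourself, Python's built-in 'idna' codec will do it (just an illustration; this is not how any particular browser implements it):

  >>> 'thiébaud.fr'.encode('idna')
  b'xn--thibaud-dya.fr'
  >>> b'xn--thibaud-dya.fr'.decode('idna')
  'thiébaud.fr'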



That's how internationalized domains work.

Wikipedia: https://en.wikipedia.org/wiki/Uniform_Resource_Locator#Inter...


Interestingly, Hacker News doesn't support this standard, and the link on the front page shows the unfriendly version.


It's because IDN domains became a standard after tables stopped being used in page layouts </sarcasm>


I like:

  [...]

  User-agent: nsa
  Disallow: /
From slack.com/robots.txt


Wow, all the US Department of State files have just gone missing from archive.org. The servers hosting those files are conveniently down.


Check out the Internet Archive FAQ on how to remove a document from their archives. https://archive.org/about/exclude.php

It looks like they used robots.txt to do that.


Huh, so the wild-card user-agent will block not just searchbots, but also archivebots. Wonder how OP managed to get screenshots of archive.org having archives available for those documents.


They're there, at least the two I looked at.

https://web.archive.org/web/20130413152316/http://www.state....

Each line is missing `/documents` in the snippet of the `robots.txt`


I have been able to view multiple pdfs and view the page screenshotted by the author.


Regarding the Knesset website, it is actually just boring recordings of the parliament discussions. Nothing to see here, move along... :-)


Now if someone could do an analysis of humans.txt, that would be cool.


You'd have to get a bot to do that, since us humans can't access the content it points to.


> xn--thibaud-dya.fr

Looks like HN needs to learn how to decode Punycode…


It's very difficult to render decoded punycode domains in a way that does not facilitate spoofing.


If anyone is curious what this punycode decodes to:

> http://thiébaud.fr


I wonder how you could gather a huge _domain name_ list.

I guess using DNS, or by querying some engine, like Google, or archive.org.

Is there a service somewhere?



Or ... you could use the methods listed in the article.


See also: https://www.premiumdrops.com/zones.html

They have some pretty good zone files for major TLDs.


Maybe the DNS ANY or reverse DNS datasets on scans.io? (the former, I presume, ultimately covering more domain names but containing more extraneous information)


You can apply for Zone File access at some Internet registries.



