I needed some real-world MS Word & Excel documents years ago to test some parsing code against. So I started crawling Google results for 'filetype:doc' / 'filetype:xls' style queries. Left it running for a weekend, then the whole Monday was wasted as I was sucked into looking through the results - some stuff in there was certainly not meant for public disclosure...
Did you really crawl Google? That has to be a long time ago. But speaking about searching on Google as a user:
Google's Advanced search used to be a great tool, until around 2007/08. For some reason it never received an upgrade, and several things are broken, no longer work, or were removed (e.g. '+', which is now a keyword for Google+; the '"' doesn't mean the same thing any more; some filetypes are blocked, some show only a few results).
So you mean like "term1" "term2" -"term3" -"term4"? Or, if I wanted to do this without returning results from Hacker News, "term1" ... -"term4" -site:news.ycombinator.com?
The problem is "whatever tiny 'power user' features that google had... don't seem to work at all now."
I think I know what they were talking about. A lot of times it appears that adding advanced terms to a query will change the estimated number of results yet all the top hits will be exactly the same. Also, punctuation seems to be largely ignored, e.g. searching "etc apt sources list" and "/etc/apt/sources.list" both give me the exact same results. Putting the filename in quotes also gives the same results as before.
Searching for specific error messages with more than a few key words or a filename is usually a nightmare.
Is there any truth to my suspicion that the web of hyperlinks (on which the famed algorithm relied) is significantly weaker and reaches fewer corners these days?
Certainly feels like content is migrating to the walled gardens and there are fewer and fewer personal websites injecting edges into the open graph.
Last November I speculated why Google would let HTTP/2 get standardized without specifying the use of SRV records:
“This is going to bite them big time in the end, because Google got large by indexing the Geocities-style web, where everybody did have their own web page on a very distributed set of web hosts. What Google is doing is only contributing to the centralization of the Web, the conversion of the Web into Facebook, which will, in turn, kill Google, since they then will have nothing to index.
They sort of saw this coming, but their idea of a fix was Google+ – trying to make sure that they were the ones on top. I think they are still hoping for this, which is why they won’t allow a decentralized web by using SRV records in HTTP/2.”
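For the curious: an SRV record is just a DNS entry that maps a service at a domain to an arbitrary host and port, which is what would let a site live anywhere without its own dedicated server. A hypothetical example (the _http._tcp label is purely illustrative, since no such use was ever standardized for HTTP):

    ; priority=10, weight=5, port=8080, target host actually serving example.com's site
    _http._tcp.example.com. 3600 IN SRV 10 5 8080 some-box.cheap-host.net.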
It feels the same way to me. To a large percentage of users the internet is Facebook rather than the largest compendium of human knowledge in existence, but luckily for those of us who use it for the latter reason, the value of such a thing will always be evident.
Come to think of it, we've moved offices 3 times since then, so it must've been 8-10 years ago. I don't think I had to do any special trickery; I spent only an afternoon or so writing and testing the code. I didn't realize such a thing would be impossible now - what a shame. I downloaded several gigabytes iirc - a big amount at the time.
Though nowadays you could use Common Crawl to get the dataset and use existing tools to extract such files, right? (I've no idea if that's a practical thing to do or not.)
I guess so, if they "look" at the web the same way Google does (respecting robots.txt, nofollow etc - which Wikipedia says they do). But the interesting things are found in nooks and crannies where nobody else has thought of looking before - so relying on someone else to do the heavy lifting is probably the wrong way to go about it...
Pardon me for asking, but how did you crawl Google for a whole weekend? From what I know, Google blocks you if you send too many search queries in a short period of time. Did you use proxies?
In 2004 I attempted to do some automatic crawling of Google for my masters thesis and was astonished to get an unfriendly server response saying it was disallowed and "don't even bother asking for an exception for research, it won't be granted."
So at least 11 years ago it was blocked.
(I didn't know about spoofing a user agent back then, so it might not have been as easy as that to get around it.)
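These days the spoofing part itself is trivial. A minimal sketch with Python requests (the UA string and query are just placeholders, and Google will still throttle or CAPTCHA sustained use regardless):

    import requests

    # pretend to be a regular browser; sustained scraping still gets rate-limited
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    r = requests.get("https://www.google.com/search",
                     params={"q": "filetype:doc budget"},
                     headers=headers)
    print(r.status_code)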
> some stuff in there was certainly not meant for public disclosure...
Care to say more? I'm not asking about specifics (but won't mind reading about them either), I'm just curious about the type of information. Internal documents of corporations? .mil? Something else?
Meh, nothing that would make the newspapers, and I don't remember the details well enough some 10 years later - sorry to get your hopes up :) Internal company meeting notes, reports on legal proceedings, I remember some correspondence on a divorce, checklists and procedures for business processes, test grades and student notes from schools, that sort of stuff. Excel sheets with budgets and timekeeping sheets, just typical office stuff - no .mil, or at least nothing interesting; I would have remembered if it had been anything juicy. Maybe Google results didn't show those, I don't know.
It's not rocket science - go to Google.com, type "[some keyword] filetype:doc" (or pdf, or xls, whatever), skip to page 10 or so, then start looking. I had to click just twice before I found (5 minutes ago) meeting minutes from some British council meeting, marked 'confidential and not for distribution', complete with names and dates/times. Go to the 'private files' folder on your machine (you know, where you store your job applications, that invitation you made once for your sister's birthday party, that sort of junk), look at the documents you'd least want somebody else to see, identify some keywords that are in those documents but not usually in others, and use those as keywords in Google.
Here's a fun one: "site:gallery.mailchimp.com filetype:pdf coupon". Nothing shocking, but I still don't think Mailchimp's customers expect their email list attachments to show up like this... (disclaimer: pure speculation, maybe these are meant to be publicly available, or maybe they're not even real customers and just test data, I read it for the articles, etc.)
I find the submission of this article interesting because it underscores the inconsistent handling of I18N/punycode domains. The domain is "thiébaud.fr". Should submission sites (like HN) show the domain in ASCII? Is there a fraud risk? Should the web browser show the domain in ASCII?
For me, at no point was I shown the decoded (Unicode) domain - either on HN or in the browser. I recognized the punycode pattern and decoded it manually. For users who are not technical, this is a failed experience because the domain looks suspicious and at no point was it decoded.
I wonder when punycode decoding will begin to get attention from developers. Last year's Google IO had a great talk about how Google realized the inconsistency of their domain handling with regard to I18N:
It's a vector for putting in disruptive utf-8 characters, such as a huge stack of accents, or spoofing a reputable domain. It's not clear yet that the benefits to HN outweigh the risks. But if we start seeing a lot of quality content from domains that look better with punycode decoded, it'll be considered.
Hm, the article says "required not to be indexed" several times, while robots.txt is more like "request not to be crawled".
An important distinction, because a page may well be indexed without being crawled, typically when there is a link to it. Better to use the noindex meta tag (at least if you are only concerned with search indexes, not access control).
Yes. This needs to be emphasized:
Disallowing URLs in robots.txt will not necessarily exclude them from search results.
Search engines will still find those pages if they are linked to or mentioned somewhere else.
The search result will consist only of the URL, and the snippet will say "A description for this result is not available because of this site's robots.txt"
Use the noindex tag, folks. Also, Google Webmaster Tools allows you to remove URLs from Google's index.
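For reference, the two mechanisms look like this (paths are placeholders), and note the catch: the noindex tag only works if crawlers are actually allowed to fetch the page and see it, so don't combine it with a Disallow for the same URL.

    # robots.txt - a crawl request; the URL can still appear in results via links
    User-agent: *
    Disallow: /private/

    <!-- on the page itself - keeps it out of the index, provided it can be crawled -->
    <meta name="robots" content="noindex">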
And he's also wrong. I assume he's French, and French law specifically says that any unwanted access to restricted parts of a "data processing system" is punishable by a hefty fine and/or prison time. For example, I think bug/vuln bounty programs are illegal in France (but IANAL).
Robots.txt is supposed to be public. Its purpose is to recommend not to visit some links (it's only a recommendation). And he did not visit the link itself, he just read the comments (made for humans to read) in a public file, I don't think there is a problem here.
I believe the intent was that if robots.txt claims to disallow a section of a website, then French law would say that section is off-limits to visitors and visiting disallowed sections is punishable by law.
I doubt that this is a valid interpretation of the law. The robots.txt file simply mentions parts of the site that shouldn't be indexed, not which parts of a site shouldn't be viewed by the general public. You use a robots.txt file to prevent a search engine from following links that would enumerate all the possible dynamically-generated content on your site to conserve resources and to prevent junk results appearing for a search user.
Tangentially relevant: when you do have something you want indexed, it should probably be a static page that lives at a permanent URL. But never attempt to use a robots.txt file to "hide" sensitive data.
I believe that only services hosted on French soil are subject to French law.
Also, showing the link but not the content is a smart move because it doesn't prove that the author looked for the content of these documents, when the robots.txt is obviously a file you should be able to consult.
French law would then be different from German law, where what matters is the country the intended audience is in. Of course it is much harder to prosecute someone when the server is in a country that does not cooperate.
Strongly agree. I emailed the school to let them know about the error on their part. How hard would it have been for the author to do the same, and to change the name of the victim and institution to protect her?
It's a name that's in a plain-text document that anybody can personally look up in their browser. What's the point of hiding it, if it's a very good point for his article?
He could have made exactly the same point with a fake name and institution. The information became public due to incompetence; he's made it much more visible… I don't care to speculate on what personal failing might be the reason.
Okay, that's fair, but it would be the same point for practical purposes, in my opinion. I don't personally believe that it's necessary for people to get hand-holding to independently verify his individual claims. Certainly other people could copy his methods to reproduce the results, but I'm not sure what the benefit is of links into the individual leaks of sensitive information.
The exact same point can be made by redacting or anonymizing the individual's name. Perhaps the organization rethinks its use of the comment and removes it. However, this blog post will live on through the archives of the internet in perpetuity, forever labeling that individual and, unfortunately, helping out their stalker.
Google for "Lisa Wilberg esc" (just like what a stalker might do to narrow it down to the correct uni) and you get the robots.txt file itself in 3rd place. The author isn't doing any more harm than Google already is. It's already easily searchable.
The author is bringing more unwanted attention to a victim. Why not contact the school and let them know of their error, or change the woman's name? Reprinting her name and the institution name accomplishes nothing good.
According to the date it would be almost 2 years ago, not exactly a current event...
There are plenty of others with the same name as her (58k according to Google), that file is now a 404, and archive.org doesn't have it (due to its mention in the robots.txt), so whatever information in it that could've been useful to a stalker is long gone.
Her name shows up in a newsletter, including her middle name, published and indexed by the school in 2012. It lists her occupation (a very public one) and even where you can find her works. This alone would be enough to get in touch with her.
It was pure morbid curiosity that led me to search for it, but it's totally still relevant if you were looking for her. Unfortunately, given her occupation, it's unlikely that anything short of the stalker giving up or getting arrested will grant her any sort of reprieve.
I've had a stalker track me for well over 2 years.
That said, he only reveals that she was stalked, which I presume she and the stalker are both quite aware of, but no more than that. A better way to report the finding would have been to keep her name and the name of the institution out of it.
You also get this one and other online discussions about the article. The author has shown his lack of empathy and judgement by putting somebody in the spotlight for no good reason whatsoever.
I can't comment on the particulars, because the document doesn't mention this name (maybe it changed since it was originally posted?), but, whatever you feel about the justification for the original article, it seems clear that nothing is served by your re-posting the information.
At the moment, for me, your post is the only hit on Google for the query you suggest.
When I do that I get four relevant results. First is the robots.txt. Esc could presumably fix that were they told of their error. The second is the article, the third is your comment, and the fourth is a reddit post. The reddit one has recently been redacted (but is still in google cache).
Is it possible that has been removed (in which case, kudos to the author)? I didn't see a name in the document, and the name that Lorento mentions doesn't appear.
I like to put a Disallow rule for a randomly named directory with an index.php file that blocks any IP that accesses it.
Then, for a bit of added fun, I put another Disallow rule for a directory named "spamtrap" which does an .htaccess redirect to the block script in the randomly named directory.
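Roughly, the setup looks like this (directory names invented for illustration; the block script itself is left out):

    # robots.txt
    User-agent: *
    Disallow: /d41d8cd9/    # the real trap: any IP requesting this gets blocked
    Disallow: /spamtrap/    # the obvious-looking decoy

    # .htaccess - send the decoy into the trap
    Redirect 302 /spamtrap /d41d8cd9/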
You understand that all you are doing with that is giving people possibly looking for attack vectors an attack vector, right? All an evildoer has to do is embed the blacklist directory as an image somewhere, send it to someone they want to lock out of the service, etc.
It wouldn't be too hard to prevent that, e.g. the honeypot directory name could be a hash of the originating IP. You'd then need a dynamic robots.txt, but that's easily done.
The destination directory doesn't even need to exist. Worst-case scenario, you could handle the hash via your 404 handler, or via an .htaccess file if all of your hashes are prefixed. Those are only examples though - there's a multitude of ways you could handle the incoming request.
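Something like this, sketched with Flask purely for illustration (the in-memory set stands in for a real, expiring blocklist):

    import hashlib
    from flask import Flask, request, abort

    app = Flask(__name__)
    BLOCKED = set()   # stand-in for a real, expiring blocklist

    def trap_path(ip):
        # per-visitor honeypot path derived from the requesting IP
        return "/" + hashlib.sha256(ip.encode()).hexdigest()[:16] + "/"

    @app.before_request
    def refuse_blocked():
        if request.remote_addr in BLOCKED:
            abort(403)

    @app.route("/robots.txt")
    def robots():
        body = "User-agent: *\nDisallow: %s\n" % trap_path(request.remote_addr)
        return body, 200, {"Content-Type": "text/plain"}

    @app.errorhandler(404)
    def not_found(err):
        # the trap directory never actually exists; a hit on this visitor's own trap blocks them
        if request.path.startswith(trap_path(request.remote_addr).rstrip("/")):
            BLOCKED.add(request.remote_addr)
        return "Not Found", 404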
And then our hypothetical attacker can figure out how you generate the honeypot URL and embed an image with that URL for their visitors. Of course you can easily make it impossible to guess, but you need to make sure that they can't obtain robots.txt via GET requests from a visitor's browser (no Access-Control-Allow headers). Also don't forget that a single visit from a network sharing an IP would still ban the whole network. And there is the rule that a GET request should not change any state. Banning an IP address is a change of state.
In short, it's just too much pain for little gain.
Maybe a login form can be served from that URL, and any attempts to login would then get the visitor banned via a session cookie / browser fingerprint combo (Easy to get around but at least then you're not blocking IP addresses).
> And then our hypothetical attacker can figure out how you generate the honeypot URL and embed an image with that URL for their visitors.
So add a salt, just like you would when hashing passwords. You then make it more time-consuming to crack the hash than it would be to perform a more typical denial-of-service attack.
> you need to make sure that they can't obtain robots.txt via GET requests from a visitor's browser (no Access-Control-Allow headers).
If an attacker already has control over the victim's browser (to pull the robots.txt file) then they really don't need to bother with this attack.
> Also don't forget that a single visit from a network sharing an IP would still ban the whole network.
Multiple users of the same site behind the same NAT really isn't that common unless you're Google / Facebook / etc. And when you're talking about those kinds of volumes then you'd have intrusion detection systems and possibly other, more sophisticated, honeypots in place to capture this kind of stuff. Also, some busier sites have to comply with PCI data security standards (and similar, such as the Gambling Commission audits) which require regular vulnerability scans (and possibly pen tests as well, depending on the strictness of the standards / audit), which will hopefully highlight weaknesses without the need to blanket-ban via entrapment. And in the extremely rare instances where someone is innocently caught out, it's only a temporary ban anyway.
> Maybe a login form can be served from that URL, and any attempts to login would then get the visitor banned via a session cookie / browser fingerprint combo (Easy to get around but at least then you're not blocking IP addresses).
You can do this same method of banning with the honeypot you're arguing against!
While I don't disagree with any of your points per se, I do think you're being a little over dramatic. :)
Well, I was trying to point out how a small and not-so-useful feature can cause some unnecessary headaches. I know that it can work, it's just not worth it in my opinion. However, on second thought, I guess this could be implemented in less time than we've spent writing these comments :)
The login-form method would be a bit less silly, I thought, because it can be a POST. But, well...
Your citation does reiterate my point. I quote: "However, checking the referer is considered to be a weaker form of CSRF protection."
The referrer header can be subject to all sorts of subtle edge cases, such as switching between secure and insecure content (or is it the other way around? I can't recall offhand), where many browsers will refuse to send a referrer header at all. So while checking the referrer might work most of the time, it's really not robust enough to be considered trustworthy for anything security related.
Since a CSRF will come from an <img> tag which is not on your site, you can check the header. So the only problem is if the user copy-pastes the link from a website or clicks from some other application, which decreases the chance of a substantial attack a lot. If you are using http, as laumars stated, this doesn't work. But I think this should be a pretty decent and easy-to-implement solution if you're using https.
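Roughly what such a check looks like (a sketch only; the host name is a placeholder, and as noted above the header may simply be missing):

    from urllib.parse import urlparse

    ALLOWED_HOST = "example.com"   # placeholder for your own domain

    def referer_ok(referer_header):
        # reject state-changing requests whose Referer isn't our own origin;
        # browsers sometimes omit the header entirely, so decide whether to fail open or closed
        if not referer_header:
            return False
        return urlparse(referer_header).hostname == ALLOWED_HOST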
While you have a very valid point, in my use case I just ignore it. I don't see why anyone would want to block anyone else from accessing a small independent label website. Or the personal blog of a friend of mine. Or the portfolio site of another friend who is a designer. (Edit: or what real harm could come from it if it happened.)
I agree with you (that the risk is acceptable) but I also believe you underestimate the pettiness of people. I'm sure there are people who believe the indie label or your friends are enemies (for whatever real or imagined reason). But the set of those people who also have the technical ability, or access to the technical ability, to enact this "revenge" is almost certainly empty.
Googlebot caches robots.txt for a very, very long time. If you disallow a directory it may take months for the entire googlebot fleet to start ignoring it. Google's official stance is that you should manage disallow directives through webmaster tools.
If my users want to go where I ask them not to go, when they have no reason at all to go there, that's their problem. There are no links anywhere to those dirs, except for the robots.txt. Also, the blocks lift after a couple of days. Honestly, nobody ever complained, and I'm sure it stops or at least hampers some attacks.
What do you mean 'with bots'? We're talking about anyone at all who hits the link getting their IP blocked.
Some mobile networks have every person on the network originating traffic from the same IP. Some large institutions, universities, government departments, large companies have all their traffic coming from one IP.
This person has effectively created a feature that will perform a denial of service attack on their own website.
You don't see it as a problem because the affected users can't see you (your sites) any more. As IPv4 addresses are getting scarce, many users share the same IP. So it's really bad practice to ban an IP for a longer period.
I don't understand why a user would access a randomly created directory that is only mentioned in the robots.txt file. Can you explain as to why you'd think they'd stumble across this?
This is silly. If you have a user of your site going through your robots file and specifically visiting directories listed as Disallow, then you deal with that user. Blocking based on a robots.txt entry for a directory that doesn't exist anywhere but that file is fine. I set up a ban with two bad directories and it seemed to work fine.
Oh, believe me, they only last a few days at most. I don't use an infinite IP blocklist. It has like 100 IPs or so. When one goes in, one must come out. And let's just say there are enough badly behaved bots around that it doesn't take much time for 100 IPs to rotate.
The domain name is being converted to Punycode (https://en.wikipedia.org/wiki/Punycode). This is a defence developed to prevent spoofing of domain names with characters from non-ASCII character sets that are indistinguishable from ASCII ones. Before this was fixed by browser developers, it was possible to execute an IDN homograph attack (https://en.wikipedia.org/wiki/IDN_homograph_attack) and spoof sites using characters that looked like, e.g., google.com/paypal.com when rendered in your browser's address bar, but were actually a completely different domain as far as automated systems were concerned.
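You can see the mapping with a couple of lines of Python (the built-in idna codec implements the older IDNA 2003 rules, which is fine for this particular name):

    # round-trip between the Unicode name and its ASCII/punycode form
    name = "thiébaud.fr"
    ascii_form = name.encode("idna").decode("ascii")
    print(ascii_form)                                  # xn--thibaud-dya.fr
    print(ascii_form.encode("ascii").decode("idna"))   # thiébaud.fr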
Huh, so the wild-card user-agent will block not just searchbots, but also archivebots. Wonder how OP managed to get screenshots of archive.org having archives available for those documents.
Maybe the DNS ANY or reverse DNS datasets on scans.io? (the former, I presume, ultimately covering more domain names but containing more extraneous information)