Google’s robots.txt parser is now open source (googleblog.com)
777 points by dankohn1 on July 1, 2019 | 194 comments



Where was this 10 years ago when I was reverse engineering the Google robots.txt parser by feeding example robots.txt files and URLs into the Google webmaster tool? I actually went so far as to build a convoluted honeypot website and robots.txt to see what the Google crawler would do in the wild.

Having written the robots.txt parser at Blekko, I can tell you that what standards there are, are incomplete and inconsistent.

Robots.txt files are usually written by hand using random text editors ("/n" vs "/r/n" vs a mix of both!) by people who have no idea what a programming language grammar is, let alone how to follow the BNF from the RFC. There are situations where adding a newline completely negates all your rules. Specifically, there may be no blank lines between user-agent lines, nor between user-agent lines and rules.
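To make the line-ending and blank-line problem concrete, here is a minimal sketch of a forgiving grouping pass. It is not the Blekko or Google parser; the function name `parse_groups` and the grouping policy (blank lines never end a group, a user-agent line after a rule starts a new one) are assumptions chosen for illustration.

```python
import re

def parse_groups(text):
    """Group rules under user-agents, tolerating blank lines and mixed line endings.

    Hypothetical helper: returns a list of (agents, rules) tuples.
    """
    groups = []
    agents, rules = [], []
    # Split on \r\n, \n, or a lone \r so files with mixed endings still work.
    for raw in re.split(r"\r\n|\n|\r", text):
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or ":" not in line:
            continue                           # blank lines never end a group
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if rules:                          # a rule was already seen: new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value)
        elif key in ("allow", "disallow") and agents:
            rules.append((key, value))
    if agents:
        groups.append((agents, rules))
    return groups
```

With this policy, a stray blank line between two user-agent lines, or between a user-agent line and its first rule, no longer detaches the rules from the agents they were meant for.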

My first inclination was to build an RFC compliant parser and point to the standard if anyone complained. However, if you start looking at a cross section of robots.txt files, you see that very few are well formed.

With the addition of sitemaps, crawl-delay, and other non-standard syntax adopted by Google, Bing, and Yahoo (RIP), the RFC is clearly just a starting point, and what ends up on a website can be so broken that it is hard to interpret the author's meaning. For example, the Google parser allows for five possible spellings of DISALLOW, including DISALLAW.
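A rough sketch of what typo-tolerant key matching can look like. The released C++ uses a case-insensitive prefix check (a line quoted further down the thread shows `absl::StartsWithIgnoreCase(key, "disallaw")`); the Python below only imitates that idea, and the spelling tuple is illustrative rather than Google's actual list.

```python
# Illustrative spellings only: the thread confirms "disallaw"; the rest is an assumption.
DISALLOW_SPELLINGS = ("disallow", "disallaw", "disalow")

def key_is_disallow(key, allow_typos=True):
    """Case-insensitive prefix match against accepted spellings of 'disallow'."""
    key = key.strip().lower()
    accepted = DISALLOW_SPELLINGS if allow_typos else DISALLOW_SPELLINGS[:1]
    # str.startswith accepts a tuple and returns True if any prefix matches.
    return key.startswith(accepted)
```

Turning `allow_typos` off gives the strict behavior a spec-only parser would have, which makes it easy to compare the two on a corpus of real files.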

If you read a few webmaster boards, you see that many website owners don't want a lesson in Backus–Naur form and are quick to get the torches and pitchforks if they feel some crawler is wasting their precious CPU cycles or cluttering up their log files. Having a robots.txt parser that "does what the webmaster intends" is critical. Sometimes, I couldn't figure out what some particular webmaster intended, let alone write a program that could. The only solution was to draft off of Google's de facto standard.

(To the webmaster with the broken robots.txt and links on every product page with a CGI arg with "&action=DELETE" in it, we're so sorry! but... why???)

Here's the Perl for the Blekko robots.txt parser. https://github.com/randomstring/ParseRobotsTXT


Accidentally deleting someone's entire website because they don't understand the difference between GET and POST requests is virtually a rite of passage when writing a web crawler.


That's why it's good to crawl twice, so if the site got deleted by the first crawl, you can check and then discard the results. Saves a bit of disk space.


There's a similar problem with AV automatically unsubscribing all of your customers from their spam (and newsletters) the first time you start scanning emails for malicious links. It's a little less than completely solvable, in fact.


I'd say this is a happy solution, but that's as a target for these emails - not as a sender.


if their 'delete website' link is reachable anonymously and performs using a GET, they deserve the result.
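For anyone unsure what the server side of that looks like, a minimal sketch: destructive actions are refused on safe methods and for anonymous callers. The `handle` function and its parameters are hypothetical and not tied to any framework.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

def handle(method, action, authenticated):
    """Return an HTTP status code for a hypothetical admin endpoint."""
    destructive = action.lower() == "delete"
    if destructive and method.upper() in SAFE_METHODS:
        return 405   # Method Not Allowed: a crawler following plain links ends up here
    if destructive and not authenticated:
        return 403   # Forbidden: anonymous callers cannot delete anything
    return 200
```

A crawler only ever follows links with GET, so under this rule an "&action=DELETE" link is harmless no matter how many bots find it.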


I’m trying to find a link to it but there was an incident based on this issue somewhere around 1999-2001 where Microsoft added a sort of prefetching thing to IE (or was it Netscape?!) and it would effectively click all the links on the page in order to get all the content in the cache.

Lots of us really didn’t know what we were doing and we’d made all the action buttons in the listing screens regular links. As you can imagine, pandemonium ensued.

Hey, at least we’d figured out that sql injection was a thing.

It was a simpler time.


1997 we had this crazy notion of web "channels" (like RSS feeds), and offline viewing, where a client on a painfully slow dial-up connection could download and cache the resources required to display complex web pages.

Microsoft did this via an explicit web manifest; the web page author needed to list all of the resources they wanted to use in offline or pre-cache mode.

Netscape tried to do this by urging web authors to Be Very Careful with the links on a page, which usually required a specially-crafted offline-crawler-only version of the site. Predictably, hilarity ensued.

The term of art at the time was "push technology" or "web push", the irony of which was not lost upon those tasked with making it work.

1997. Good times.


While the timeframe and company are a little off, it sounds like you mean Google Web Accelerator (for IE and Firefox) in 2005.

https://web.archive.org/web/20050505061702/http://webacceler...

The issue you described seems to match this: https://signalvnoise.com/archives2/google_web_accelerator_he...


Does Google promise that its bot will never submit a POST even if it’s triggered by an <a onclick>?


Googlebot has no javascript



It's a masterpiece that Google convinced everyone to use their crawler engine as a browser.


It's an easy fix if Google cared. Have an online tool that validates if the robots.txt is correct, and send out an announcement that files that don't meet spec will be penalized in terms of SEO.


IIRC Google do automated checks on robots.txt and report in webmaster tools if you did something that looks crazy.

https://qph.fs.quoracdn.net/main-qimg-002d1f819e1bfbbd14fa2d... shows some of the interface, but I seem to recall getting notified in webmaster tools when I messed up the robots.txt on a particular site.


That just punishes users by sinking relevant results for reasons users couldn’t possibly care about.


I enjoy the hypocrisy of Google punishing sites that aren't "mobile-friendly," and then deliberately disabling ZOOMING on their own mobile sites.


Could also make the spec easier to be compliant.


hmm, you mean "\n" vs "\r\n"? ;-)


I've been in disagreements with SEO people quite frequently about a "Noindex" directive for robots.txt. There seem to be a bunch of articles that are sent to me every time I question its existence[0][1]. Google's own documentation says that noindex should be in the meta HTML but the SEO people seem to trust these shady sites more.

I haven't read through all of the code, but assuming this is actually what's running on Google's scrapers, this section [2] seems to be pretty conclusive evidence to me that this Noindex thing is bullshit.

[0] https://www.deepcrawl.com/blog/best-practice/robots-txt-noin...

[1] https://www.stonetemple.com/does-google-respect-robots-txt-n...

[2] https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613...


Google is also really generous with how they will let you spell "disallow": https://github.com/google/robotstxt/blob/59f3643d3a3ac88f613...

:D


I'm not surprised. Some people think humans read robots.txt and get super angry when the crawler doesn't understand.


I read robots.txt, but I'm not a massive corporation.


> (absl::StartsWithIgnoreCase(key, "disallaw")))));

Ah, the southern version. :)


This is great. #kAllowFrequentTypos


Ive never made a tpyo in my life!!!


Whats a tpyo? My typo array for "typo" only has tipo, typpo and thai-pho


If you're gonna include thai-pho you also need to include thai-fur...


Makes sense because not everyone speaks English as a native language. Disalow is pretty close to disallow, phonetically.


Yuck though! Imagine if you were writing a compiler. Would you make it accept “unsinged” “unnsigned” “unssined” and “unsined” as keywords, just to catch spelling mistakes? Not sure I like that pattern.


It's a little different in that case, since the person using the parser is also the person writing the input to the parser. So if the input fails the parser, the author of the code can simply correct it. As I understand it, there's no single standard that captures how all robots.txt files are formatted, so there's no "standard parser" that the authors of these files could be expected to pass.


That is not an excuse. Non-native speakers can learn to spell.


Google has been very clear lately (via John Mueller) regarding getting pages indexed or removed from the index.

If you want to make sure a URL is not in their index then you have to 'allow' them to crawl the page in robots.txt and use a noindex meta tag on the page to stop indexing. Simply disallowing the page from being crawled in robots.txt will not keep it out of the index.

In fact, I've seen plenty of pages still rank well despite the page being disallowed in robots.txt. A great example of this is the keyword "backpack" in Google. You'll see the site doesn't want it indexed (it's disallowed in robots.txt) but the site still ranks well for a popular keyword.
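The crawl-versus-index distinction can be sketched with Python's standard `urllib.robotparser`. The `index_outcome` helper, its return strings, and the example.com URLs in the usage note are assumptions for illustration; this models the behavior described above and is not Google's pipeline.

```python
from urllib.robotparser import RobotFileParser

def index_outcome(robots_txt, url, page_has_noindex, agent="Googlebot"):
    """Describe what a well-behaved crawler could conclude about one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(agent, url):
        # The page is never downloaded, so a noindex tag on it is never seen.
        return "not crawled; may still be indexed from inbound links"
    if page_has_noindex:
        return "crawled; kept out of the index"
    return "crawled; indexed"
```

Calling it with `"User-agent: *\nDisallow: /private/\n"` and a URL such as `https://example.com/private/page` takes the first branch regardless of `page_has_noindex`, which is exactly why the page has to be crawlable for the noindex tag to have any effect.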


That's correct. If a URL is blocked using robots.txt, Google will never be able to see the "noindex" tag on the page.

URLs blocked in robots.txt can get discovered through other links and they will get displayed in the search results.

However, you will not see any information like the meta description on these blocked URLs.

There's a good explanation about this here, including a video from former Googler, Matt Cutts: https://yoast.com/prevent-site-being-indexed/


> However, you will not see any information like the meta description on these blocked URLs.

True, but that's not the only thing. If it ever was in the index, it takes forever to be removed, if it gets removed at all. Send 404 or 410, Disallow it or set it to noindex - you may get lucky or you may not. You can of course "hide it from search results", but that only works for 90 days (iirc, may be 120, something in that range). Those leftovers will typically lose rankings, but they often stay indexed, easy to spot with a site: query.


Reindexing a page is dynamic based on noteworthiness and volatility iirc, but individual links can be reindexed on the fly since the Percolator index. The 90d number was from an old system when indexes were broken into shards that had to be swapped out wholesale.

Percolator white paper: https://ai.google/research/pubs/pub36726


I don't mean reindexing, I mean "hiding from the index" ("Remove URLs" in GSC). It works instantly, but only for a limited time, after which it will re-appear in the index if you haven't gotten it out of the index (via 410, noindex or disallow). Since these other ways don't always work, if you're unlucky and want it to stay gone, you need to hide it again (and again and again). I've had clients that were hacked and had spammy content injected into their site and it took (literally!) years for that to get removed (we tried combinations of 404, 410, noindex and disallow).


Yeah, the URL removal tool is not meant for permanent removals, but for temporary, 90-day removals:

https://support.google.com/webmasters/answer/1663419?hl=en


Exactly, there is no guaranteed way to remove anything, HTTP status, meta-tags, headers, and robots.txt only have advisory status. They are usually followed when a resource is hit first, but once it's in the index, "keeping the result available" seems to be a top priority. I do understand the idea - it might still be a useful result for a user, but otoh if it's 410 (or continuously 404), it won't be of any use because the content that was indexed is no longer available (especially in case of 410).

Granted, these are edge cases, in most circumstances, 410 + 90 day hiding means they are hidden instantly and don't resurface. These edge cases do make me take Google's official statements on how to deal with things with a grain of salt though: bugs exist, and unless you happen to know somebody at Google there's no way to report them.


Send 410 Gone with a noindex meta tag in html and X-Robots-Tag?

https://www.searchenginejournal.com/google-404-status/254429 "How Google Handles 404/410 Status Codes" -- "If we see a 410, they immediately convert that 410 into an error rather than protecting it for 24 hours"


> You'll see the site doesn't want it indexed (it's disallowed in robots.txt) but the site still ranks well for a popular keyword).

Which site? [Edit: I have now found https://www.gcsbackpack.com/ on page 6 of the results, and this was presumably the intended site.]


Doesn't that indicate that Google doesn't respect robots.txt then?


No, disallow means that you are not allowed to crawl the page. You have to crawl the page to know you cannot index it. But how do you index a page you do not crawl? Well, if another page that you can crawl and index points to the uncrawlable page as authoritative on a keyword, then it will be in the index with that keyword, even if you do not have the actual crawled content of the page.


It really feels like they need to allow a list of `noindex` pages in the robots.txt then...


The whole point is they don’t want you to easily opt out of it.


Keep walking down that path to find v. heavy regulation.


Not if you buy the government first. I think they’re already a victim of their own greed though.

It’s bad enough I started using DDG for search because the results are now more relevant. Google’s advertising algorithms are designed to subtly nudge sites into paying for placement — which means there’s a “non-content” element to the search results that makes it into the user experience. I feel like there was a tipping point a year or two ago where the results just stopped being useful — The best analogy I can find is how search engines used to be in the days before AltaVista. Then AltaVista came out and the results were far more relevant (if not perfect). Google -> DDG feels like that in 2019.

That “non-content” element will only grow over time as Google seeks revenue growth — growth across all of Google’s non-advertising revenue streams combined are not enough to move the needle compared to the scale their ad business has — of which search ads are by far the most profitable. So they will further try to monetize search; it’s their cash cow but I think a small player like DDG could easily overtake them as the quality of Google’s search results (to the end user) continue to decline.


Agreed re: DDG search quality. It's my own default and preferred choice. Google remains useful for Scholar and Books, but with the deceptive ads on SERPs its relevance is rapidly declining.


Right, but how do you index a page you weren't supposed to crawl in the first place?


It's like recommending a book you haven't read, and newspapers do that every day.

Basically Google finds the link in other places -> oh that must be interesting, I'm indexing it, without even reading it. So they don't have the actual content, and just use the texts from the sites that link to it.


But they do have the actual content since they show the meta title and description on top of what I assume is heavy NLP to drive the search engine itself.


Robots.txt stops a page from being crawled; the noindex tag stops it getting into the index.

Google is also slow to honour 404 and drop pages which can hang around for ages, Bing is much faster to remove 404 pages.


Usually, creating "410 Gone"[0] response for the URL and running the URL through the URL Inspection Tool [1] can help make things a bit faster. But yeah, it does take a while to get these 404s removed.

[0] https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/410 [1] https://support.google.com/webmasters/answer/6065812?hl=en


That distinction exists in many systems. E.g. for cloud events, 404 is considered with skepticism because it could be a race condition in provisioning or transient issue whereas 410 requires data streams to be cut off.
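One way to encode that 404-versus-410 distinction in a crawler or consumer, as a sketch only: the function name, the return labels, and the default threshold of three misses are all hypothetical, and none of this reflects how any particular search engine actually schedules removals.

```python
def removal_decision(status, consecutive_misses, miss_threshold=3):
    """Decide what to do with a stored URL after a fetch.

    status: HTTP status code of the latest fetch.
    consecutive_misses: how many 404s in a row, including this one.
    """
    if status == 410:
        return "drop"      # explicitly gone: no reason to wait
    if status == 404:
        # Could be a transient or provisioning issue, so require a streak.
        return "drop" if consecutive_misses >= miss_threshold else "retry"
    if 500 <= status < 600:
        return "retry"     # server-side trouble says nothing about the resource
    return "keep"
```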


Then you should serve 503


5xx means that the server made a mistake. 4xx means that the caller made a mistake. Sending a request to a GONE url is canonically classified as a “user” or sender error.


> but the SEO people seem to trust these shady sites more.

It makes more sense when you realize that the SEO people (with a few exceptions) are usually pretty shady as well. You rarely hear them recommending that you write better content to get better results, it's always nonsense like "put nofollow on everything so your score doesn't leak".


You need to find better SEO people then :)

But I understand, there's a lot of snake-oil and "one weird trick to rank first" that brings a bad name to the SEO world.

I've seen people go on Fiverr and expect to find top-notch SEOs there.

There's more to SEO than just writing good content. There's a lot of technical stuff that can bite you and your awesome content will never rank.

Stuff like improving site structure, canonicals, learning to deal with multi-language versions of your content, implementing proper redirects, etc,etc is something that a good SEO should be able to fix and improve.


This really has not been my experience with SEO people in the 2010s. They have focused on page load, no errors, good redirect schemes etc


I mean if you're hiring a SEO person isn't this literally what you're paying for -- tricks to increase your search ranking without changing your content?


Not at all. SEOs are more likely to find all the ways you're currently hanging yourself. Some common examples I see are:

- Putting important text inside of images

- Duplicate content out the wazoo

- Not making use of canonicals

- No sitemaps, html or xml

- Page performance issues

- Broken mobile support

And of course, poor content. You can't rank if you don't have content.


I'd argue most of those are necessary for good content (if we don't view content separately from presentation)

> Putting important text inside of images

I'm sure the reason for this is that it's hard to parse text from images, and while Google could use their AI to figure it out, they don't bother. But it also prevents blind people from being able to read the text, so it does worsen the experience.

> Duplicate content

This makes the site harder to navigate for users as well.

> Page performance issues

Quite obviously makes the experience worse.

> Broken mobile support.

-..-


SEO should be a bridge between technical and non-technical people that build out sites.

No site's output is 100% because of the tech team - content writers can put in weird code, marketers can add all sorts of stuff to say Tag manager, the robots.txt is likely from 2008. And a site built with code as the primary goal is likely lacking in some marketing oomph somewhere.

Someone whose job it is to find the right balance, and aim to maximise the returns from the single largest source of traffic, is pretty valuable.


Except... they were correct.

Google has now clarified that they're removing the code behind the undocumented items, with noindex called out explicitly.

https://webmasters.googleblog.com/2019/07/a-note-on-unsuppor...

It wasn't officially supported / the recommended way - but it worked (in many cases.)


> evidence to me that this Noindex thing is bullshit

For those who (like me) don't know a lot about this, which side of the argument is bullshit? Have you just been proved right or wrong?


It looks like it's too late for me to edit my comment, but I've been proved right. Putting a Noindex directive directly in robots.txt is frequently suggested, but this seems like definitive proof that that does nothing (at least with Google).

As far as I can tell the inception of this idea was that it was briefly mentioned by some Google employee in an interview. Maybe it was supported in the past or maybe he just misspoke, but I bet even now we'll see people still using this tag.


I recommend finishing the whole comment


read the whole comment. Still confused as well.


While that's great, there should be instances where crawlers should ignore noindex directives. For example, all .gov sites.


I'm not sure I understand your reasoning, why should Google honor noindex everywhere but on .gov websites? What about other countries' government TLDs? What about publicly traded companies? What about personal websites of elected officials? What about accounts of elected officials on 3rd party websites?

That seems like a can of worms not really worth opening.


This might be controversial but everything is fair game everywhere. If you can crawl it, tough luck. It's there and everyone can get to it anyways, why not a crawler?


Because the rules a well-functioning society runs by are more nuanced than "Is it technically possible to do this?"

If you'd like a specific example of why people might seek this courtesy, someone might have a page or group of pages on their site that works fine when used by the humans who would normally use it, but which would keel over if bots started crawling it, because bot usage patterns don't look like normal human patterns.


A society is composed of humans. But there are (very stupid) AIs loose on the Internet that aren't going to respect human etiquette.

By analogy: humans drive cars and cars can respond to human problems at human time-scales, and so humans (e.g. pedestrians) expect cars to react to them the way humans would. But there are other things on, and crossing, the road, besides cars. Everyone knows that a train won't stop for you. It's your job to get out of the way of the train, because the train is a dumb machine with a lot of momentum behind it, no matter whether its operator pulls the emergency brake or not.

There are dumb machines on the Internet with a lot of momentum behind them, but, unlike trains, they don't follow known paths. They just go wherever. There's no way to predict where they'll go; no rule to follow to avoid them. So, essentially, you have to build websites so that they can survive being hit by a train at any time. And, for some websites, you have to build them to survive being hit by trains once per day or more.

Sure, on a political level, it's the fault of whoever built these machines to be so stupid, and you can and should go after them. But on a technical, operational level—they're there. You can't pre-emptively catch every one of them. The Internet is not a civilized place where "a bolt from the blue" is a freak accident no one could have predicted, and everyone will forgive your web service if it has to go to the hospital from one; instead, the Internet is a (cyber-)war-zone where stray bullets are just flying constantly through the air in every direction. Customers of a web service are about the same as shareholders in a private security contractor—they'd just think you irresponsible if you deployed to this war-zone without properly equipping yourself with layers and layers of armor.


Honestly that is the site owner's problem. If it can be found by a person it's fair. I genuinely respect the concept of courtesy but I don't expect it. People can seek courtesy but they should have expectations of whether or not it will happen.


So in your view is DoS attack not actually an attack and site owners should just have to handle the traffic?


Techies forget the rule of laws. A dos has intent. A bot crawling a poorly designed website accidentally causing the site owners problems does not have malicious intent. They can choose to block the offender just like a restaurant can refuse service. But intent still matters.


This thread is about what behavior we should design crawlers to have. One person said crawlers should disregard noindex directives on government sites, and you replied that they should ignore all robots.txt directives and just crawl whatever they can. If you intentionally ignore robots.txt, that has intent, by definition.


Not intentionally ignore it by going out of their way to override it, just not be required to implement a feature in their crawler. Apparently parsing those files is tricky, with edge cases. Ignoring that file is absolutely on the table. People of course can adhere to it, but it's not required and in my opinion shouldn't even be paid attention to.

In my younger years the only time I ever dealt with robots.txt was to find stuff I wasn't supposed to crawl.


If you don’t want something public, don’t allow a crawler to find it or access it. The people you want to hide stuff from are just going to use search engines that ignore robots.txt


If you don't want someone or a bot to find it, don't put it online.


The interesting thing about robots.txt is that there really isn't a standard for it. This [0] is the closest thing to one and almost every modern website deviates from it.

For instance it explicitly says "To exclude all files except one: This is currently a bit awkward, as there is no "Allow" field."

And the behavior is so different between different parsers and website implementations that, for instance, the default parser in Python can't even successfully parse twitter.com's robots.txt file because of the newlines.
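A small experiment for checking this against your own interpreter, using only the standard `urllib.robotparser` module. The `allowed` helper, the `examplebot` agent name, and the two sample files are made up for the demonstration; the second file differs from the first only by a blank line after the user-agent line.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, url, agent="examplebot"):
    """Ask the standard-library parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

compact = "User-agent: *\nDisallow: /private/\n"
spaced = "User-agent: *\n\nDisallow: /private/\n"   # blank line after the user-agent line

print("compact:", allowed(compact, "/private/page"))
print("spaced: ", allowed(spaced, "/private/page"))
```

If the two lines print different answers, the parser is treating the blank line as the end of a record and discarding the rule that follows it, which would explain trouble with real-world files that are formatted that way.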

Most search engines obey it as a matter of principle but not all crawlers or archivers [1] do.

It's a good example of missing standards in the wild.

[0] https://www.robotstxt.org/robotstxt.html

[1] https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...


> The interesting thing about robots.txt is that there really isn't a standard for it. This [0] is the closest thing to one and almost every modern website deviates from it.

That is changing, and was announced today: https://news.ycombinator.com/item?id=20326067


Yeah, my first reaction to Google heading yet another standard was to cringe but this is one of the situations where I think it makes a lot of sense. They're dominant in the search industry and most other engines tend to take their cue so having them spearhead it seems like a good move.


https://datatracker.ietf.org/doc/draft-rep-wg-topic/ looks to be the current draft; that post doesn’t have a link to it I can find.


Google used to have (must check the code) problems with robots.txt files that had a BOM. If the first line was

Disallow

The D got mangled and that disallow directive got ignored.
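For anyone writing their own parser, the usual Python way around this is to decode with the `utf-8-sig` codec, which drops a leading UTF-8 byte order mark when one is present. A minimal sketch; `decode_robots` is a hypothetical helper name, and replacing undecodable bytes rather than failing is a design choice, not a requirement.

```python
import codecs

def decode_robots(raw: bytes) -> str:
    """Decode robots.txt bytes, discarding a leading UTF-8 BOM if there is one."""
    # "utf-8-sig" strips the BOM when present and behaves like "utf-8" otherwise.
    return raw.decode("utf-8-sig", errors="replace")

raw = codecs.BOM_UTF8 + b"Disallow: /tmp/\n"
naive = raw.decode("utf-8")        # first character is U+FEFF, so the key is not "Disallow"
clean = decode_robots(raw)         # BOM removed, the key parses normally
```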


I absolutely understand why they did this, but I have to say I was disappointed to see only 7 commits at https://github.com/google/robotstxt/commits/master dating back to June 25th.

When I read "This library has been around for 20 years and it contains pieces of code that were written in the 90's" my first thought was "that commit history must be FASCINATING".


I think it's pretty rare for a company to make internal commits public once making something open source.


yeah, otherwise you might see commit messages like "wtf is this shit" :)


The first commit has the extra comment `PiperOrigin-RevId: 254932939`

This may[0] be because it is exported from their monorepo

[0] https://news.ycombinator.com/item?id=14811937


> This library has been around for 20 years and it contains pieces of code that were written in the 90's.

Whilst I am sure there are good reasons for the omission, it would have been interesting to see the entirety of the commit history for this library.


From an archaeological perspective, very much.

From Google's perspective it's probably too much work. I would assume this was part of the crawler code and was extracted over time into a library while still part of the monorepo, so changesets probably didn't touch only this code but also other parts, and this code probably depended on internal libraries (now it depends on Google's public Abseil library). Publishing all that needs lots of review (also considering names and other personal information in commit logs, TODO comments and the like).


Not only that, code libraries that weren’t designed to be open source often have things in them that Google might not want to show: codenames, profanity, calling out specific companies…


Also, even if it is authoritatively managed in git now, the whole 20 year history certainly wasn't (since git is only 14 years old, and Google probably didn't adopt it on day one), and it's quite likely commit history wasn't converted, so it's quite possible Google couldn't easily make the whole history available when publishing it to GitHub even if they wanted to.


I assume the authoritative version is still in Google's Piper-based repo and previously was in Perforce, and I assume that was for a while ... so if there were interest Google could dig deep. But I assume there are other projects where this is even more interesting. (how ranking changed over time; how storage formats for the index changed; ...)


I can attest to this. I work in a very large monorepo with tens of thousands of commits. Even files that aren't changed often have regular updates - usually repo-wide CodeMods. This makes the blame less useful and the history quite noisy. I figure the robots.txt parser's history would be in a similar state - not very useful or interesting to read.



It's also the 5th link "open sourced" in the article.


Note that this is quite strict on what characters may be contained in a bots user agent. This is due to strictness in the REP standard.

https://github.com/google/robotstxt/blob/master/robots_test....

    // A user-agent line is expected to contain only [a-zA-Z_-] characters and must
    // not be empty. See REP I-D section "The user-agent line".
    // https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1

So you may need to adjust your bot’s UA for proper matching.
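To see what that character restriction means for a given name, here is a small sketch that keeps only the leading run of `[a-zA-Z_-]` characters. The `product_token` helper is hypothetical and simply applies the character class from the quoted comment; it is not a port of the library's own function.

```python
import re

# Leading run of characters permitted by the quoted comment: letters, "_" and "-".
_TOKEN = re.compile(r"[A-Za-z_-]+")

def product_token(user_agent):
    """Return the leading matchable part of a user-agent name, or "" if none."""
    match = _TOKEN.match(user_agent)
    return match.group(0) if match else ""
```

A name like `my-bot_v2` is cut at the first digit, so a robots.txt group written for the full string would not line up with the token used for matching.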

(Disclosure, I work at Google, though not on anything related to this.)


The strictness is in what may be listed in the robots.txt, not the User-Agent header as sent by bots. The example given in the linked draft standard[0] makes it abundantly clear that it's on the bot to understand how to interpret the corresponding lines of robots.txt.

Of course, in practice robots.txt tend to look less like [1] and more like [2].

[0]: https://tools.ietf.org/html/draft-rep-wg-topic#section-2.2.1

[1]: https://github.com/robots.txt

[2]: https://wpengine.com/robots.txt


Sorry, I mean for matching, and I did try to imply it was a limitation of the standard and not the library. Though to avoid confusion, I do personally think keeping the user agent minimal is wise, since users might have difficulty guessing what value to use if it differs sufficiently from the real user agent that's sent.


I wonder how much noindex contributes to lax security practices like storing sensitive user data on public pages and relying on not linking to the page to keep it private. I wonder how much is in the gap between "should be indexed" and "really ought to restrict access to authorized users only".


I am hoping not much, because that is beyond a horrible "security practice". I have seen some lazy shit out there, but this would take the cake.


If I recall correctly there was a large company several years ago who tried to prosecute a whitehat who discovered their user account pages included the users' e-mail addresses and that changing the address to that of a different user would drop you right into that user's page with all their personal information listed.


> how should they deal with robots.txt files that are hundreds of megabytes large?

What do huge robots.txt files like that contain? I tried a couple domains just now and the longest one I could find was GitHub's - https://github.com/robots.txt - which is only about 30 kilobytes.


They enumerate every page on the site sometimes specifically for different crawlers.

Or they have a ton of auto generated pages they don’t want crawled and call them out individually because they don’t realize robots.txt supports globbing.
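For reference, the globbing in question is the `*` wildcard (any run of characters) plus a trailing `$` (end of URL) that the major crawlers accept. A minimal sketch of how one might translate such a pattern into a regular expression; the helper names are hypothetical and longest-match precedence between rules is deliberately left out.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece and let every "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def path_matches(pattern, path):
    """True if `path` is covered by the robots.txt `pattern` (prefix semantics)."""
    return pattern_to_regex(pattern).match(path) is not None
```

One line such as `Disallow: /tag/*/page` then covers every generated URL of that shape, instead of thousands of individual entries.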


Can you give an example in the wild?


I was actually trying to find an example when I made my initial comment, but was unable to. It's been a long time since I did web scraping. Since then there are a lot more frameworks that help you build a website (and a correspondingly sane robots.txt), so there may not be as many as before.


fun & useless little bit of trivia: Sci-Fi author [1] Charles Stross (who hangs around here) is the cause of the first robots.txt being invented.

http://www.antipope.org/charlie/blog-static/2009/06/how_i_go...

(reminds me how Y Combinator's co-founder Robert Morris has a bit of youthful notoriety from a less innocent program)

[1] and former code monkey from the dot-com era


I guess lots of people misspell ~disalow~ disallow[1]

1. https://github.com/google/robotstxt/blob/master/robots.cc#L6...
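For readers who don't want to click through, a rough sketch of what typo-tolerant directive parsing can look like. The spelling table below is illustrative only (it uses the misspellings mentioned in this thread) and is not copied from Google's source; `parse_line` is a hypothetical name.

```python
# Illustrative typo table; an assumption for this sketch, NOT Google's list.
DISALLOW_SPELLINGS = {"disallow", "disalow", "disallaw"}

def parse_line(line: str):
    """Return (key, value) with the key normalized, or None if there is no directive."""
    # Drop trailing comments and surrounding whitespace.
    line = line.split("#", 1)[0].strip()
    if ":" not in line:
        return None
    key, value = line.split(":", 1)
    key = key.strip().lower()
    # Fold known misspellings into the canonical directive name.
    if key in DISALLOW_SPELLINGS:
        key = "disallow"
    return key, value.strip()

print(parse_line("Disalow: /tmp"))
```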


including yourself! Must be easy to do. :) https://www.merriam-webster.com/dictionary/disallow


Oh snap!


I doubt there's any vulns in the code seeing as its job for the last 20 years has been to parse input from the wild west that is the internet, and survive.

But I'm sure someone out there will fuzz it...


I’d be surprised if google isn’t fuzzing it with their (also open sourced) fuzzing tool.


I'd never heard of this, so looked into it:

https://opensource.googleblog.com/2019/02/open-sourcing-clus...

Fascinating.


Can this be seen as an initiative to make Google's robots.txt parser the internet standard? Every webmaster will want to be compliant with Google's corner cases...


That's probably the hidden agenda.


It's not even hidden, Google explicitly says that in the blog post.



There is a difference between robots.txt blocking a page and noindexing a page.

Blocking in robots.txt will stop Googlebot downloading that page and looking at the contents, but the page may still make it into the index on the basis of links to that page making it seem relevant (it will appear in the search results without a description snippet and will include a note about why).

To have a page not appear in the index you need to use a 'noindex' directive [1] either in the file itself or in the HTTP headers. However, if the file is blocked in robots.txt then note Google cannot read that noindex directive.
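As a rough illustration of the two places a noindex directive can live, here is a sketch that checks an `X-Robots-Tag` response header and a `<meta name="robots">` tag. `has_noindex` is a hypothetical helper, and it deliberately ignores refinements such as crawler-specific prefixes in the header value.

```python
from html.parser import HTMLParser

class _RobotsMetaFinder(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]

def has_noindex(headers: dict, html: str) -> bool:
    """Hypothetical helper: True if the header or the page itself says noindex."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]
    finder = _RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in header_directives or "noindex" in finder.directives

print(has_noindex({"X-Robots-Tag": "noindex"}, ""))
```

Either signal is only visible to a crawler that is allowed to fetch the page, which is why a robots.txt block hides it.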

Also, in the StackOverflow response you linked to, the user agent is listed just as 'Google', but it should be 'Googlebot' as per the 'User agent token (product token)' table column listed in [2].

Good luck! :)

[1] https://support.google.com/webmasters/answer/93710?hl=en [2] https://support.google.com/webmasters/answer/1061943


That's actually nice and straight forward and relatively simple. I had expected something over engineered with at least parts of the code dedicated on demonstrating how much smarter the code writer is than you. But it's not. Just a simple parser.


I expected the same (complex project structure, too many files, difficult to read, etc), but I love everything about this library. Easy to read, concise code, in two simple files. Very well tested, both by automated tests and the real world. Sticks to the Unix philosophy: does one thing and does it well.

Can you imagine how many billions of times this code has been executed? I love software like this.


Looks like standard Google C++ coding style to me.

Honestly, excessive cleverness does not generally pass code review @ Google. Especially something that would get this many eyes.


Honestly I'm most surprised they haven't replaced all the c-isms already. Seeing raw char pointers and strpbrk is...weird.


Yes as a Googler, I'd probably flag that in review.

But it being old and critical, I'd also be wary of major changes.


I wonder if anyone ever found/accidentally triggered a buffer overflow in this or another part of googlebot.

I imagine these days it’s been incredibly hardened and is additionally sandboxed. But back in the day?


> Honestly, excessive cleverness does not generally pass code review @ Google.

And yet the Google style guide literally says: "Assume the person reading the code knows Python better than you do."

https://github.com/google/styleguide/blob/gh-pages/pyguide.m...


You're very much taking that out of context. Read the entire section. There's a grand total of 7 sentences in there. It literally says:

If you're going to have to explain it at the next code review, you should comment it now. Complicated operations get a few lines of comments before the operations commence. Non-obvious ones get comments at the end of the line.

The section you're quoting says:

On the other hand, never describe the code. Assume the person reading the code knows Python (though not what you're trying to do) better than you do.


I think you might be misreading the meaning of this? If the person reviewing knows the language better than you, then they are hopefully less likely to tolerate "clever code", not more.

By "clever code" we're talking about weird unidiomatic tricks and hacks that maybe write things in a slightly shorter or fractionally more (unnecessarily) optimised way, and make you feel clever, but make it harder and more time consuming for anyone else to understand what your code is doing, or verify that it's actually doing what it's supposed to.


Why would you expect that?


Guessing it was a jab at Google's interview process?


I'm not sure how you could over-engineer code on a whiteboard? There isn't enough space. Instead you'd expect people to be extremely good at writing short and concise code which is still correct and simple enough for an interviewer to understand.


Seems strange to get excited about a robots.txt parser, but I feel oddly elated that Google decided to open source this. Would it be too much to hope that additional modules related to Search get released in the future? Google seems all too happy to play the "open" card except where it directly impacts their core business, so this is a good step in the right direction.


Looking forward to the robots.txt linters created as wrappers around this (especially for VSCode).


I find it really cool the code for this is so simple and clean.


I don't understand the entire architecture behind search engines, but this seems like a pretty decent chunk of it.

What are the chances that Google is releasing this as a preemptive response to the likely impending antitrust action against them? It would allow them to respond to those allegations with something like, "all the technology we used to build a good search engine is out there. We can't help it if we're the most popular." (And they could say the same about most of their services: gmail, drive, etc.)


So, is it premature to expect a Go package by Google as well?

There's already https://github.com/temoto/robotstxt


This is the sort of code you write a binding to and call it a day, since the entire point is to absolutely precisely match the behavior of this code, which is basically a specification-by-code. You can never be sure a re-implementation would be absolutely precisely the same in behavior, so it's not worth doing.


The c++ implementation is <1000 lines. Doesn't seem like a correct port would be particularly difficult, especially with a reasonably large test corpus.


Famous last words.

I mean, I get it; it feels that way to me intuitively too. But I'd still recommend against trying it, because I've learned the hard way the intuition here is, if not wrong, at the very least very badly underestimating the cost, especially in the "unknown unknown" department.


I'm not saying that isn't true for some things. I don't think it's true here given that this is a nice narrowly scoped library that does a single thing and has well defined semantics.

Adding a cgo dependency is generally something that isn't done lightly by teams. Having a port to Go instead of a wrapper around the C++ code would be much more likely to see widespread adoption.


Do you even need to match Google's robots.txt parsing behavior? With less than 1000 lines you can be pretty sure they are not doing it right and are breaking plenty of people's assumptions about it. Either way you have to test it on real world data.


The point of this code release seems to be to release Google's precise logic. That you may incorporate it into something else is, IMHO, less interesting; we've got plenty of other solutions that "do robots.txt" well enough. If it was just about that, Google's release of this would not be worth anything. The point is so that non-Google parties can see exactly what Google is seeing in your robots.txt.

That's why I'm saying there's no point trying to re-implement this. If you were going to re-implement this, there's probably already a library that will work well enough for you. The value here is solely in being exactly what Google uses; anything that is a "re-implementation" of this code but isn't exactly what Google uses is missing the point.

If they formalize it into a spec, others may then implement the spec, but they can and should do that by implementing the spec, not porting this code.


As I understand it, the point of the Go question is to parse actual real-world robots.txt files, for which you don't need to behave exactly as this library does.


> Do you even need to match Google's robots.txt parsing behavior? With less than 1000 lines you can be pretty sure they are not doing it right and are breaking plenty of people's assumptions about it.

This seems like a weird assertion. The specification isn't particularly complex (ignoring the implicit complexities of unicode). There are ~5 keywords and like 3 control characters. Why would you expect to need all that much?


Very few people follow the specification or even know it exists.


I'm not talking about the formal specification, but the implicit specification of what people have been using for decades. That only has 5 keywords and a couple control characters. The formal spec is based on that informal spec, which again, isn't that complicated.

To be more direct: what are all of these assumptions you assume google's parser is mishandling?


Top comment [1] talks about noindex directive for example. Some people definitely expect it to work.

[1] https://news.ycombinator.com/item?id=20326098


It definitely feels excessively risky for a third party to port it, but Google can either canary it or run both parsers in production and compare results to accurately assess confidence in the port's correctness.


I imagine it'd be fairly easy to fuzz as well. Throw strings at it until the C++ and <other language> tools disagree
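A minimal sketch of that idea (differential fuzzing), using two toy stand-ins rather than the real C++ parser and a real port. `reference_allows`, `port_allows`, and `find_disagreement` are all hypothetical names, and the "port" contains a deliberate case-sensitivity bug so the harness has something to find.

```python
import random

def reference_allows(rule: str, path: str) -> bool:
    """Toy stand-in for the original parser: a plain, case-sensitive prefix rule."""
    return not path.startswith(rule)

def port_allows(rule: str, path: str) -> bool:
    """Toy stand-in for a port, with a deliberate bug: it ignores case."""
    return not path.lower().startswith(rule.lower())

def find_disagreement(impl_a, impl_b, trials=10_000, seed=0):
    """Feed random (rule, path) pairs to both implementations.

    Returns the first input on which they disagree, or None if none is found.
    """
    rng = random.Random(seed)
    alphabet = "/aA"  # a tiny alphabet makes collisions (and bugs) likely
    for _ in range(trials):
        rule = "/" + "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 3)))
        path = "/" + "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 5)))
        if impl_a(rule, path) != impl_b(rule, path):
            return rule, path
    return None

print(find_disagreement(reference_allows, port_allows))
```

With the real thing, one side would be a binding to the C++ library and the inputs would be whole robots.txt files plus URLs, ideally seeded from a corpus of files found in the wild.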


Cool, thanks


Is Golang significantly slower than c++ ? I thought Google had invented Golang to solve precisely these kinds of code for their internal use.

I had thought most of the systems code inside Google would be golang by now. is that not the case ? the code doesnt look too big - I dont think porting is the big issue.


Golang didn't exist in the 90's though.

Rewriting decades of core business logic would be a tremendous effort and amount of risk.


Quite the contrary, three Google employees that are very vocal against C++, and well known personalities, got fed up using it and created Go, eventually they got support from upper management.

Java and C++ still run most of Google.


>I dont think porting is the big issue.

Why do it in the first place? Just because you can? The code works and it's written in a popular language which plenty of people know. What's the upside?


I would much prefer a library such as this be done in C/C++ so it could be packaged up as a library that could be called from other languages. Pretty much every major language has some form of FFI to call out to C/C++ code. This way, you can get consistent behavior if you need to parse robots.txt in python vs ruby vs java vs etc.


I know it's a meme to say "C is not C++" but in this context C is really not C++. Calling into C through FFI is significantly easier than calling into C++. Very few languages have decent FFI with C++, while many have great support for C.


> so it could be packaged up as a library that could be called from other languages.

What would this mean for C++, if not an `extern "C"` interface?


COM is an example


> I had thought most of the systems code inside Google would be golang by now.

Google has gazillions of lines of system code already built. Why rewrite everything in go? There is so much other stuff to do. All rewriting achieves is add additional risk because the new code isn't battle tested.


Why port something that's working fine?


> Is Golang significantly slower than c++ ?

Depends on the context, but in general, yes. C++ is very close to C on this aspect, trading memory safety for performance.

Concerning google, as far as I know the codebase is mostly C++, Java, and python. Go will surely eat a bit of the Java and Python projects but it’s unlikely to see C++ being replaced any time soon.


> Is Golang significantly slower than c++ ?

> Depends on the context, but in general, yes.

I don't believe this is the case. Most optimized, natively compiled languages all perform similarly. Go, C, CPP, Rust, Nim, etc. I'm sure there are edge-cases where this isn't the case, but they all perform roughly the same.

The performance rift only starts when you introduce some form of a VM, and/or use an interpreted language. Even then, under certain workloads their optimizations can put them close to their native counter parts, but otherwise are generally slower.

The real reason Google didn't re-write this in Go is likely because the library is already finished, it works, a re-write would require more extensive testing, etc. Why spend precious man-hours on a needless re-write?


Golang and Java have similar performance characteristics. I would not put it in the same class as C/CPP/Rust.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


I didn't realize Go was so far behind C in terms of performance.

Like REALLY behind, this surprises me a lot actually. Thanks for showing me that awesome benchmark page :D


Presumably less startup time and JIT.


The run times are seconds not micro-seconds.

How many seconds do you suppose startup time is for those tiny programs?

How many seconds do you suppose JIT is for those tiny programs?


Depends on if the Java code is ahead of time compiled or not. The AOT compiler is included in the latest JDKs.



Nice, good to know the JIT is doing something. Will be interesting to track the performance of this over time.


And that's graalvm-ce not the enterprise edition.


> they all perform roughly the same.

This argument, to me, usually comes from people who have not done projects of significant scale or that required high performance, which is fine; not everyone works on that level of a project. But a small difference of 10ms per operation, when you have to do a million operations, is nearly 2.8 hours of extra time. Even 1ms is an extra 0.28 hours. These things start adding up when you are talking about doing millions of operations. And there is nothing wrong with Go or Rust or Python, just they aren't always the right tool in the toolbox when you need raw performance. Neither is C/C++ the right tool if you don't need that level of control/performance.
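The arithmetic is easy to check: for one million operations, 10ms each comes to roughly 2.8 hours and 1ms each to roughly 0.28 hours. A small sketch, with `extra_hours` as a hypothetical helper name:

```python
def extra_hours(per_op_ms: float, operations: int) -> float:
    """Total added wall-clock time, in hours, for a fixed per-operation overhead."""
    total_seconds = per_op_ms * operations / 1000
    return total_seconds / 3600

print(f"{extra_hours(10, 1_000_000):.2f} h")  # 10 ms overhead per operation
print(f"{extra_hours(1, 1_000_000):.2f} h")   # 1 ms overhead per operation
```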

When doing distributed systems or embedded work you generally learn these rules quickly, as one "ok" performing system can wreck a really well planned system, or start costing a ton of money to spin up 10x the number of instances just because one software component isn't performant.


Rust was created by/for systems programmers who are in exactly that situation - where performance and control are not optional - and thus have been stuck writing C++ for decades. Although C++ has evolved over the years, there are pain points, particularly regarding modularity, that persist and may require a clean break.

It's still somewhat early but I do already see software being written in Rust with best in class performance (take ripgrep for a prominent example), so lumping it in with Go and Python is really a category error in my opinion.

Personally, I'm still writing C++ for the platform support, etc. but not pretending to like it.


Yea, I see your point, lumping Rust in with Python or Go isn't really fair nor accurate.

Totally agree C++ definitely has pain points still, but I do love the fact C++ is getting pretty regular updates so it is getting better and less painful generally. Rust is something I want to use in production but haven't seen the right opportunity to do it where the risk to reward ratio was right, yet.


I concede that I mostly work on web applications that don't require that level of performance.

You're certainly right, there IS a performance difference, and it matters in high-performance workloads, such as the one this parser is used for.

From a "regular" web developer perspective (ie. where you only have a few servers/VPS's MAX) a lot of newcomers often worry about performance, and usually for most web development the answer with performance is "Yes language [here] is faster than Python/Javascript/Ruby/etc. But those languages/frameworks allow us to develop our application far faster, and ~10ms isn't an issue." Only after performance bottlenecks are discovered would we consider breaking out pieces into a lower level language.

You're completely right though, in HPC it is totally worth worrying about every millisecond, I took the wrong perspective with the implications of the performance differences.


To be fair, most people do not work on applications that need that level of performance.

Most of the time, and to your point, that level of performance isn't necessary so using a language that is less likely to let you take your foot off is generally the best & most correct choice. I only resort back to C/C++ when I need the pure raw performance like this parser would, or when doing embedded work. Otherwise I reach for other tools in the tool-bag that are less likely to let me maim myself unintentionally.


“Most code” is debatable, but one of golang’s goals is ”Go compiles quickly to machine code” (https://golang.org/doc/). Because of that, I can see it being slower than C++ on code that benefits a lot from optimization. That makes it not the right choice for code that runs a lot, as this code likely does (I expect this code runs for many CPU-years each day)


I didn't realize how far off Go's performance was from C/CPP, I have a feeling a lot of it is because of the 25+ years of optimization the c/CPP compilers have gotten.


Go has a “stop the world” garbage collector, and some language features also have a performance penalty (defer is well known for being slow). Just to say that’s not only a question of time: even if you wait and invest a huge amount of time and money you will see differences in performance because of language design choices.


Go has a concurrent GC not a stop the world GC.


Right, thanks for correcting. That doesn’t change my point though.


> I don't believe this is the case.

Please show why you don't believe this is the case.


I don't mean to be rude, but the rest of my comment explains why. If there's something about it that's confusing I'd love to clarify it.


> the rest of my comment explains why

The rest of the comment makes some claims about performance, but does not show why we should believe those claims.


> Depends on the context, but in general, yes.

For example?

Otherwise we just have: yes it is! no it isn't!

https://en.wikipedia.org/wiki/Argument_Clinic


Golang, as a garbage-collected execution environment, suffers from lousy tail latency


Actually Go has the reputation of having solved many runtime problems including the GC tail latency problem.


“Because the REP was only a de-facto standard for the past 25 years, different implementers implement parsing of robots.txt slightly differently, leading to confusion. This project aims to fix that by releasing the parser that Google uses.”

The amount of arrogance in this sentence is insane.

Because Google way is the only one true way?


In terms of “what should a robots.txt file look like to be parsed correctly,” yes, because they’re the ones who are going to be doing most of that parsing. Yes, ideally it would be an entirely independent standardization process, but it’s not arrogant of them.


Because google is the only search engine most people care about....


Never before has a company stood on such a mountain of open source code, achieved so much money with it and contributed so little

No really. Microsoft? BSD TCP/IP stack for win95 maybe saved them but there was trumpet winsock and probably would have survived to writing their own on the next release.

Google doesn't get off the ground and has literally no products and no services without the GPL code that they fork, provide remote access to a process running their fork and contribute nothing back. Good end run around the spirit of the GPL there and that has made them a fortune (they have many fortunes, that's just one of them).

New projects from google? They're only open source if google really need them to be, like Go which would get nowhere if it wasn't and be very expensive for google to have to train all their engineers rather than pushing that cost back on their employees.

At least they don't go in for software patents, right? Oh, wait...

At least they have a motto of "Don't be evil" Which we pretty much all have personally but it's great a corporation backs it. Corporate restructurings happen, sure, oh wait, the motto is now gone. "Do the right thing" Well this is fine and google do it, for all values of right that equal "profitable to google and career enhancing for senior execs".

But this is great a robots.txt parser that's open source. Someone other than google could do something useful for the web with that like writing a validator, because google won't. Seemingly because it's not their definition of "do the right thing."

"Better than facebook, better than facebook, any criticism of google is by people who don't like google so invalid." Only with more words. Or none just one button. Go.


So you aren't wrong that google is built on the shoulders of giants, but I will point out that every single company today running their SaaS offering on top of linux/BSD is doing the exact same thing.

The only reason Linux is as mainstream as it is today, is exactly because of this freedom to leverage the code. You even point out that the cause for Golang's success is for precisely the same reason. Overall opensource isn't about making money, it has never been about making money. It's been about making an impact, and bettering the world around us all by giving a piece of technology to be freely used by everyone. There are a variety of opensource licenses that can/will protect your code from any/all closed source uses, for example AGPL explicitly states if your application so much as interacts with the code over a TCP connection or furthermore a single UDP packet it must be opensource as well. However you will rarely see libraries/applications using this license. Why you might ask? The answer is simple, it reduces the impact that code can have.

Really at the end of the day, it comes down to a choice of the developer(s), do you want to make money? i.e. go the Microsoft/Apple route? or do you want to make an impact? i.e. go the Linux/BSD route?

Let me ask one final question, which of the above operating systems do you think are more widely used, or have changed the world in a more dramatic manner?


I could care less about other companies that have existed for 5 minutes in the SaaS space in my comment that nobody has ever derived more value and given that, contributed less back.

Google is built on an end run around the spirit and intent of the GPL. "Don't distribute software, distribute thin client access to it! No GPL! Hurrah! Money!"

Decide for yourself what you think of that but it happened. Without it, no google.

But hey, list anyone you think derived more value and contributed less back. It's a reasonable thing to do. Doesn't affect criticism of google.


Google does not contribute to OSS? https://opensource.google.com/


Who said they "don't contribute"? Not me, so given you're replying to me that's a straw man argument.

Oh and your link? That's a propaganda site heavy on aesthetic design and basically devoid of fact.

Apart from that it's a really strong response. Do you love google? Work for them? Reflexively stick up for big business?



