To be fair, while the halting problem could come into play here, in practice it shouldn't. When Google is running the code, there's no functional difference between a script that takes 20 seconds to run and one that takes forever -- kill the process either way.
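For what it's worth, that budget is easy to enforce. Here is a sketch (not a claim about what Google actually does) using Node's vm module, whose timeout option kills a script after a fixed wall-clock budget, treating a 20-second script and a non-terminating one exactly alike:

```javascript
// Sketch: run a script with a wall-clock budget; the vm module throws
// once the synchronous script exceeds the timeout.
const vm = require("vm");

function runWithBudget(code, ms) {
  try {
    vm.runInNewContext(code, {}, { timeout: ms });
    return "finished";
  } catch (err) {
    return "killed: " + err.message;
  }
}

console.log(runWithBudget("let x = 0; for (let i = 0; i < 1e6; i++) x += i;", 1000)); // finished
console.log(runWithBudget("while (1) {}", 1000)); // killed: Script execution timed out ...
```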
A Turing machine has an infinite tape. A typical computer has finite memory, so its tape is bounded above by a constant. That makes it a kind of linear bounded automaton:
http://en.wikipedia.org/wiki/Linear_bounded_automata
Linear bounded automata don't have a halting problem: halting is perfectly decidable for them. That doesn't mean it can't take a very long time to decide, but it is a far easier problem than halting on a TM.
The JavaScript spec probably doesn't specify a fixed memory limit, so limiting the space arbitrarily to X MB is not that different from limiting the time arbitrarily to Y seconds.
Except that verifying termination under a bound of n seconds of computation can be done in O(n) time, where n is the maximum number of seconds allowed: trivially, just wait n seconds and kill the program if it hasn't terminated. Limiting a program's memory instead is far harder: it can take up to O(2^n) time to decide whether a program with a budget of n bits terminates. It might seem easier than that, but the program can go into an infinite loop without ever overflowing its memory, like "while (1) {}". Such a program can never have more than one state per possible configuration of its allowed memory space, which gives 2^n configurations, and you can only tell it is in an infinite loop once it repeats a previously seen state (see the sketch below). Doing that correctly can therefore take a very long time.
Incidentally, this is why PSPACE (the class of decision problems decidable by a Turing machine using a polynomial amount of space) seems a lot more powerful than P (the class of decision problems decidable using a polynomial amount of time).
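To make the state-repetition argument concrete, here is a minimal sketch; the machine model and the names (step, encode) are made up for illustration. The interpreter decides halting for a fixed-memory machine by remembering every state it has seen, and in the worst case that set grows to all 2^n configurations before a repeat is guaranteed.

```javascript
// A machine with n bits of memory has at most 2^n distinct states, so an
// interpreter can decide halting by recording every state it visits: a
// repeated state means the machine is looping forever.
function haltsWithBoundedMemory(step, initialState, encode) {
  const seen = new Set();
  let state = initialState;
  while (state !== null) {            // convention here: null means "halted"
    const key = encode(state);
    if (seen.has(key)) return false;  // repeated state => infinite loop
    seen.add(key);                    // worst case: up to 2^n entries
    state = step(state);
  }
  return true;                        // reached a halting state
}

// A 3-bit counter that wraps around forever never halts...
console.log(haltsWithBoundedMemory(s => (s + 1) % 8, 0, String));            // false
// ...while one that stops at 5 does.
console.log(haltsWithBoundedMemory(s => (s < 5 ? s + 1 : null), 0, String)); // true
```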
You know when programmers write blog posts about business and finance? I wonder if finance people cringe at those as much as we did while reading that.
V8, Chrome, and the Developer Tools do run my code, but only when I ask them to. I guess it makes sense that Google is asking them to as well.
So they must be able to query and manipulate the DOM, but only after the JavaScript has run.
jsdom -- DOM in JavaScript (V8 via Node.js) -- is a nice little hack and mind bender. Why can't the document be a JavaScript object that you manipulate?
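In case it isn't obvious how literally that works, here is a small sketch against jsdom's current API (the JSDOM class; older releases exposed a different entry point):

```javascript
// The document really is just a JavaScript object you can poke at.
const { JSDOM } = require("jsdom");

const dom = new JSDOM(`<body><p id="greeting">Hello</p></body>`);
const doc = dom.window.document;

doc.getElementById("greeting").textContent = "Hello from Node.js";
console.log(dom.serialize()); // prints the mutated HTML document
```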
Wouldn't it be interesting if Google tried to run all the code snippets and full applications found on the web and to figure out automatically what they do?
Maybe that would be the first step to automated programming?
Putting it in robots.txt is like advertising to bad bots exactly where to go.
On some sites, what I do is create a fake virtual directory, ban that directory in robots.txt, and prefix the links I don't want crawled with the fake directory path (a rough sketch is below).
Oh, and I swear Google sometimes follows stuff listed in robots.txt; it just doesn't index it. Then there are the "alternate" user agents they use, like pretending to be IE (from a real Google IP).
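A rough sketch of the decoy-prefix idea; the directory name is made up, and it assumes the server rewrites (or simply ignores) the prefix so the links still resolve for people:

```
# robots.txt -- /hidden/ is a decoy prefix, not a real directory on disk
User-agent: *
Disallow: /hidden/
```

Links you don't want crawled are then written as, say, /hidden/print/page.html, and a rewrite rule strips the /hidden/ prefix on the server, so well-behaved bots skip them while robots.txt never reveals a real path.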
The solution is to password-protect the parts of your site that you don't want bots to see. JavaScript obfuscation is silly. Robots.txt is advisory. Passwords are simple and fail-safe.
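The simplest version of that is plain HTTP basic auth; for instance, an Apache .htaccess along these lines (the paths and names are placeholders):

```
# .htaccess in the directory you don't want bots to see
AuthType Basic
AuthName "Private"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

The password file itself comes from something like htpasswd -c /home/example/.htpasswd someuser; anything behind it is invisible to crawlers and harvesters alike.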
Who said anything about using robots.txt for security? I just use it to keep Google from indexing my print pages, for example. Anything that I don't want the world to see can't be seen without authorization.
I just hope they don't start caching and displaying the de-obfuscated links, since JavaScript obfuscation is how I currently manage to display a working mailto: link on my personal web page while simultaneously giving myself nominal protection from address harvesters. Is this no longer a viable strategy for protecting one's email address on the web?
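For anyone wondering what that looks like, here is one common form of the trick, not necessarily this commenter's exact method; the element id and the address are placeholders. The address never appears verbatim in the served HTML, and a script assembles the mailto: link at load time.

```javascript
// Assemble the address from pieces so a harvester that doesn't execute
// JavaScript never sees it in the page source.
const user = ["s", "o", "m", "e", "o", "n", "e"].join("");
const domain = "example" + "." + "org";
const address = user + "@" + domain;

const link = document.getElementById("email-link"); // assumed <a id="email-link">
link.href = "mailto:" + address;
link.textContent = address;
```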
This was never more than a minor delaying tactic: spammers have been capable of running JS for a long time (harvesters embedding IE via ActiveX have been around for at least a decade), but there hasn't been much pressure to do so since they still get billions of addresses that either weren't obfuscated or were obtained by other means.
I'll also note that I've never obfuscated my email address (chris@improbable.org), and spam arrived in roughly equal volumes there and at less public addresses that were guessable (except for test accounts created with MD5 hashes[1]). People do harvest addresses from web pages, but there are enough other ways spammers get addresses that you need a robust filter in any case; once you have something like SpamAssassin or Google's filters, the value of keeping your address private disappears.
[1] That also yielded the interesting data point that Hotmail used to sell addresses to spammers early last decade: the first spam arrived within 20 minutes at a long, random address created with the "don't publish my info" option.
I do this too, and despite what other replies here suggest, anecdotally it does seem to work.
My approach is a little more convoluted than a simple document.write or something like that, so maybe that matters (one common flavor of the trick is sketched at the end of this comment). I get crawled by bad bots all the time, but these addresses never seem to get spammed. The same is not true for addresses displayed somewhere in plain text.
My explanation has always been that there are so many plain-text addresses on the web that it isn't necessary to add JavaScript parsing logic to most harvesters. Much like car alarms or door locks -- you don't need to make them unbreakable, just harder to get at than the one next door. Someday we may need to start displaying email addresses as (or behind) captchas.
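The parent doesn't say what the more convoluted approach actually is, but one common flavor is to keep the address only as character codes and decode it when a human clicks, so it never appears assembled anywhere in the page:

```javascript
// Decode the address only on click; the char codes spell "me@example.org"
// (a placeholder, not anyone's real address), and the "contact" id is assumed.
const codes = [109, 101, 64, 101, 120, 97, 109, 112, 108, 101, 46, 111, 114, 103];
document.getElementById("contact").addEventListener("click", function (event) {
  event.preventDefault();
  window.location.href = "mailto:" + String.fromCharCode.apply(null, codes);
});
```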
I don't know whether or not they are in Google's data center somewhere; but to the best of my knowledge, they cannot be extracted from that data center, which means it's largely irrelevant as far as spam is concerned.