Googlebot’s JavaScript random() function is deterministic (tomanthony.co.uk)
311 points by TomAnthony on Feb 7, 2018 | 108 comments



I wrote about something like this previously[1]

Ads often have something like this attached to the end:

    load("https://adserver/track.gif?{stuff}&_=" + Math.random())
My ad server collected multiple tracking pixels -- for various events -- and after five pixels I could fingerprint the browser (Firefox, Chrome, MSIE, etc.) and identify someone fiddling with the user-agent string, or using a proxy server to mask this information.

[1]: http://geocar.sdf1.org/browser-verification.html


As an aside, the reason for that random string on the end is to bypass caching mechanisms. A more or less random generator is good enough.
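
A rough sketch of the pattern, just for illustration (a timestamp works as well as Math.random() for cache busting; the function name here is hypothetical):

    // Hypothetical tracking-pixel loader: the ever-changing query parameter
    // keeps caches from serving a stale copy of the pixel.
    function firePixel(baseUrl) {
      const img = new Image();
      img.src = baseUrl + "&_=" + Date.now(); // or Math.random(), as in the snippet above
    }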


How would a person defend against this fingerprinting?


Just to throw out a different approach: buy the most popular computer and use the most popular browser. Don't change the user agent or make any other advanced user changes.

Going to extreme measures to be untraceable is like wearing a Ghillie suit to the airport.


IMO, one of the most difficult fingerprinting attacks to defend against is the installed fonts list. I wish incognito mode only made a standardized list of fonts available.


Firefox is experimenting with a font whitelist as part of upstreaming Tor features into Firefox trunk. These features are controlled by the privacy.resistFingerprinting pref. resistFingerprinting has some webcompat issues, but Mozilla is considering a subset of the protections for Private Browsing mode, similar to Tracking Protection. Here is the work-in-progress bug:

https://bugzilla.mozilla.org/show_bug.cgi?id=1336208


I wonder if the fontSettings section of the Chrome extensions API[1] can be used to defend against that, by randomly and deliberately poisoning that info when in Incognito mode (or even regular browsing).

[1] https://developer.chrome.com/extensions/fontSettings


Why does that feature even exist?


So designers can make adjustments depending on what fonts are available.


Is there any website actually using it for that purpose? Usual CI policies would be "actively disinterested" in it.


Any website with a rich text editor would use it.


AFAIK it's not really a feature, it's more of an exploit.

Ultimately, the sizing of a font can affect element widths, and element widths can be queried.
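
For illustration, a minimal sketch of that width-measurement trick (the test string and styling here are arbitrary, not any particular tracker's code):

    // Measure a string in a generic fallback font, then in the candidate font
    // with the same fallback; a width difference means the font is installed.
    function hasFont(fontName) {
      const span = document.createElement("span");
      span.style.cssText = "position:absolute;visibility:hidden;font-size:72px;font-family:monospace";
      span.textContent = "mmmmmmmmmmlli";
      document.body.appendChild(span);
      const baseline = span.offsetWidth;
      span.style.fontFamily = "'" + fontName + "', monospace";
      const candidate = span.offsetWidth;
      document.body.removeChild(span);
      return candidate !== baseline;
    }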


Buy an iPhone.

• Did proper random before anybody else

• Active countermeasures against cookie-based retargeting

• Popular enough market that merely having an iPhone in a geographic area doesn't single you out

There's a guy who downloads every page with curl. I see him in the web logs. I think he must have some script that parses the AMP pages out and does something with them, but because he's the only person in that geographic region who browses the web with curl, he's very easy to spot from a tracking perspective. On the other hand, because he's using curl, I don't think anyone wants to bother trying to show him an ad.


That would be Stallman.


Stallman would probably use GNU wget, which is licensed under the GPL, while curl is licensed under a license derived from the MIT/X Consortium license.


He would probably have someone print it for him, similar to his email queue. I email him from time to time and his disclaimer says this.


I've never heard him say anything bad about running MIT licensed software.

On the other hand, I think I've read a FAQ where he says he views web pages by having a script fetch the pages with wget and then email them to him, but I'm pretty sure that's simply because wget is free and does the job (being able to recursively download the necessary resources), not because it's GPL licensed.


Obviously Stallman is modifying his wget user agent to show curl just to prevent fingerprinting him accurately... ;)


I love the idea of someone individually targeting Stallman with ads.


"I am careful in how I use the Internet.

I generally do not connect to web sites from my own machine, aside from a few sites I have some special relationship with. I usually fetch web pages from other sites by sending mail to a program (see https://git.savannah.gnu.org/git/womb/hacks.git) that fetches them, much like wget, and then mails them back to me. Then I look at them using a web browser, unless it is easy to see the text in the HTML page directly. I usually try lynx first, then a graphical browser if the page needs it (using konqueror, which won't fetch from other sites in such a situation).

I occasionally also browse unrelated sites using IceCat via Tor. Except for rare cases, I do not identify myself to them. I think that is enough to prevent my browsing from being connected with me. IceCat blocks tracking tags and most fingerprinting methods."

https://stallman.org/stallman-computing.html


For the record, Stallman uses GNU IceCat these days.


I don't believe it's RMS because this user appears to "browse" the web interactively. A lot of the sites he hits are on HN so I think he's a user here.


I used to browse with lynx until a few years ago. I am not sure if it is practical to do that anymore. Most of the web is no longer friendly for text browsing.


Somewhat surprisingly, YouTube is not too bad. I've used elinks to get a URL and then put that into youtube-dl. I suspect that any site that is designed to be friendly to a (vision-impaired) screen reader is going to be friendly to text-based browsing.


I'll second that. As someone who once worked on fingerprinting scripts I can tell you that iPhones of the same model are basically indistinguishable to Javascript.


Generally speaking, you _can't_ prevent a page from being able to tell what browser you're using. Even with JavaScript completely disabled, there are probably still some quirks in the way different browsers handle CSS or image loading that would give away that information. Even curl can be "fingerprinted" this way, because curl is one of the only "browsers" that doesn't process CSS or images at all.

If you just want to prevent yourself from being identified as an individual, that's a different problem. Tor browser does a pretty good job of solving that.


Use curl. Disable javascript. Replace Math.random with your own function via extension, etc.


Lawyer up, delete Facebook, hit the gym?


Disable javascript and use the web as a hypertext document-store.


Use distinct browsers (or profiles) for browsing the web-of-documents and for using apps hosted on the web-as-application-distribution-platform. They're each (for many of us) legitimate uses, but they have different requirements and threat models.

Cons: interactive infographics and courseware don't fit neatly in either.


Well, depending on whether each script tag gets its own context and whether each context gets the same seed (in which case you're boned?), you could use an extension of some sort that runs a random number of Math.random() calls (the number chosen by your own pseudo-random function, with time and/or page location as seed inputs) to mix up the results.

Or just use an extension that replaces Math.random() with something more random, but it's possible that could cause weird performance problems on certain pages and it would be hard to debug.
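
A minimal sketch of that kind of override, assuming the extension can inject a script before any page scripts run (the wrapper itself is plain JavaScript, not any specific extension API):

    // Replace Math.random() with values drawn from the browser's CSPRNG.
    // Slower than the built-in PRNG, which is the performance risk mentioned above.
    (function () {
      const buf = new Uint32Array(1);
      Math.random = function () {
        crypto.getRandomValues(buf);
        return buf[0] / 4294967296; // map a 32-bit integer to [0, 1)
      };
    })();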


It's an arms race... there is so much to fingerprint ... usually people that deploy fingerprinting have more of an issue deciding which tasty morsel of bits to go for first ...


Looks like it's no longer possible on recent browsers


Seems reasonable to me. My guess is that it's not performance, but rather predictability, that matters here. Being able to detect when a page meaningfully changes is probably useful for Google, and a good implementation of Math.random() would potentially thwart that. Especially seeing how many pages have the magic constant in them...

Also, probably useful for determining two pages are the same, which may be needed to help prevent the crawler from crawling a million paths into a SPA that don't actually exist, for example.


I wonder if they also reimplement or adjust the crypto API, since it offers better random numbers.


If I had to guess, my guess would be that Googlebot simply disables the API. It's new enough that this would be reasonable, and executing real crypto in the context of Googlebot is probably rarely desirable.


Seems like the expectation on PRNGs being expressed in this thread is a bit unrealistic. They're always deterministic. The fact that this PRNG is also always seeded the same makes it easier to fingerprint, but that has no bearing on whether the PRNG is deterministic.


Random should be undetermined, while a PRNG should only be for cryptography. They serve two different purposes.


That would be terrible! I rely on the deterministic behavior of PRNGs all the time. For instance, I often generate random test vectors. If I have a failure, I want it to be reproducible so I can fix it. And it is, as long as I supply the same seed.


So you found another use for PRNGs. Random should be close to a real-world multi-sided die, with seed entropy drawn from multiple sources such as the video card, NIC, and storage buffers.

PRNG != Random.


There are lots more examples: heuristic optimization, discrete event simulation, sampling... I could go on. Deterministic RNGs are much better in all of these applications, where reproducibility of results is important. I'm sure nondeterministic RNGs have important uses too. Perhaps you'd care to describe some of them.


I know you're being facetious because the OP isn't correct that deterministic PRNGs aren't useful, but any cryptographic application of a PRNG should be non-deterministic.


Well, crypto isn't my specialty, so I always tread lightly when commenting about it. My hand-wavey understanding is that CSPRNGs are deterministic but not predictable, and that actual entropy is used to seed them. Going even further out on a limb, I think this is supposed to be the difference between /dev/urandom (CSPRNG) and /dev/random (actual entropy.) If I have that wrong, I'd appreciate correction by somebody in the know.


You're correct that CSPRNGs themselves are deterministic. I probably just misread what you were saying. As for /dev/urandom vs /dev/random, that's not really true. On Linux there's kind of a historical artifact of why they're different (blocking vs non-blocking API), but on OSX /dev/urandom is a symlink to /dev/random.


> but any cryptographic application of a PRNG should be non-deterministic

You've just ruined most stream ciphers. That's not true at all.


You're right, I was just referring to the ability to control the seed value. In simulations you have a deterministic seed to be able to reproduce issues. For a CSPRNG you want a non-deterministic random seed, and so CSPRNGs don't even offer an API to set the seed.


Maybe Googlebot forks its JS-engine from a pre-initialized image. That would explain the unchanging seed.


Regardless of other technical limitations I'm guessing that it's actually done on purpose, as point #3 in TFA states:

>Predictable – Googlebot can trust a page will render the same on each visit

It's probably important for Google's crawler to identify whether a page changed or not, if some elements in a page are randomly generated they may want to limit the impact.

I mean, after all they seem to use a real, changing value for their Date, so if they wanted they could just seed their RNG with that.


This has been known for a while and used to cause huge problems for our tracking: https://github.com/snowplow/snowplow-javascript-tracker/issu...


Under the assumption that the same event fired from the same IP at the same time, with the same environment, etc. would be considered a duplicate in your system, I'd have designed this system to be idempotent-insert-only and use content-addressing instead of nonces for identity (event ID = suitably large hash of event data to avoid collision). If that assumption doesn't hold, then add your nonce to the event data (and thereby modify the hash).
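
A sketch of the content-addressing idea, assuming events are plain JSON objects with a stable key order (the function name is illustrative, not Snowplow's API):

    // Derive the event ID from the event data itself; re-inserting the same
    // event is then a harmless no-op rather than a duplicate.
    async function eventId(event) {
      const bytes = new TextEncoder().encode(JSON.stringify(event));
      const digest = await crypto.subtle.digest("SHA-256", bytes);
      return Array.from(new Uint8Array(digest), b => b.toString(16).padStart(2, "0")).join("");
    }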


https://developer.mozilla.org/en-US/docs/Web/API/RandomSourc... can be used to generate cryptographically strong random numbers. It's not as cheap as Math.random though.
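
Basic usage looks something like this:

    // Fill a typed array with cryptographically strong random bytes.
    const bytes = new Uint8Array(16);
    crypto.getRandomValues(bytes);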


Nice to see fingerprinting used against Google (instead of by Google). But this is easy to fix, so I don't expect this to work for much longer.


I think how easy it is for Google to fix this, to make detecting Googlebot this way harder, depends on why Googlebot is doing it in the first place, which we don't really know. If it's done for performance reasons, or for predictability reasons (rendering the same page twice being guaranteed, or at least more likely, to produce the same result), it might be difficult to change without cost.

But I believe Googlebot always faithfully sends its user-agent. Is there a reason Google would care about 'fixing' this to make Googlebot harder to detect via random() predictability, when you can always just detect it via user-agent anyway? I'm not sure, curious if others have thoughts!


Google has checks in place to see if someone is serving things to Googlebot differently than to the rest of the users, so it almost definitely has bots that double-check pages without the Googlebot user-agent.

If the "disguised" googlebot is the same as the actual one, chances are it is since it would want to be as close as possible to not flag false positives, and use the same seed for consistency then you might be able to use that to avoid detection on the fact that you are serving google something different than normal users.

Newspapers used to do that (and some still do) to have their full article content indexed while serving a paywall to everyone else.


It is much easier to detect Googlebot by IP address or User-Agent though.


I’d assume that they occasionally send requests with different IPs and user-agents to detect cloaking.


Maybe they change the behavior of their random() function as well. The deterministic random() function can very well be a red herring.


I'd guess if they're crawling purely for the purpose of detecting cloaking (which would be a much smaller-scale job than the standard Googlebot indexing), they'd just use Chrome Headless[0][1]

[0] https://intoli.com/blog/not-possible-to-block-chrome-headles...

[1] http://antoinevastel.github.io/bot%20detection/2018/01/17/de...


I doubt that they use Chrome at all. They probably just fetch the page over HTTP and let it run in a JS sandbox, hence the deterministic random function (which I assume is not there in Chrome?).


2018/01/18: Chrome 41: The Key to Successful Website Rendering: "Google updated their Search Guides and announced that they use Chrome 41 for rendering."

https://www.elephate.com/blog/chrome-41-key-to-website-rende...


It is interesting that when I use Chromium 46 (which is newer than Googlebot's browser) I get warnings from GitHub and Google Docs about an outdated or unsupported browser, and I get plain HTML GitHub pages without JS. But Google uses an even older browser than I do. So even Google cannot keep up with the browser version race and keep their browsers updated, although they require this from users.


According to that article, Googlebot is "based on Chromium 41"; it doesn't actually use Chromium 41 directly, and is still a distinct, separately developed browser.

The warning you're seeing is

a) probably shown based on UA, and Googlebot has a very different UA to the Chromium browser

b) warning users who will need to browse/actively use the site. Googlebot simply parses content, so has no use for quite a lot of the active functionality on the site. As such, it will generally not need to support all of the features used by that active functionality, just barely enough to be able to read content.


I'm sure they don't use Chrome for Googlebot, that would be quite inefficient. My comment was about cloaking detection.


Maybe that's just what they want you to think! /tinfoil


Yes, but their IPs are registered to Google Inc.


Are they?


Yes. Are you implying they are not?


Yes? I mean, scanning for cloaking only from their own IP range sounds ridiculous.


I think that cloaking is very difficult to detect automatically without using humans. You have to distinguish for example between a paywall and real cloaking.


I don't think it's particularly easy to fix without fixing a lot of other things. Let's see how much time it takes them. :)


Perhaps for performance or testing reasons they compiled V8 with a predefined seed; keywords are "V8", "mksnapshot", "--random-seed". It should not be too difficult to fix if there are no artificial restrictions from their use cases.


Every random() function is kind of deterministic, but it's still a very interesting discovery!


Totally down nitpick alley yeah, but: Not if fed by an external source of randomness (eg cosmic rays).


> (eg cosmic rays).

YouTube copyright violations?

The chance of getting a response from Google support?


I love those. A bit easy to game if you have the right job at Google, but still this is cool.

I'd also assume that the Twitter firehose could be a great source of randomness.


Since we're nitpicking: deterministic effectively means the same input produces the same output. Input includes the seed. All PRNGs are deterministic.


Javascript does not allow you to set the seed for the default PRNG. So for our purposes it's essentially non-deterministic. You need to use a custom library if you want to be able to specify your seed.
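
One widely circulated tiny seedable PRNG, shown only to illustrate what such a "custom library" function looks like (this is mulberry32, not anything V8 or Googlebot uses):

    // mulberry32: the same 32-bit seed always yields the same sequence,
    // which is exactly what Math.random() doesn't let you control.
    function mulberry32(seed) {
      return function () {
        let t = (seed += 0x6D2B79F5);
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
      };
    }
    const rand = mulberry32(42); // same seed, same sequence, every run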


Continuing the nitpicking: "JavaScript" isn't actually what prevents setting the seed; you're talking about a common implementation detail of specific JavaScript runtimes (Mozilla's documentation says their seed can't be changed, for instance). A runtime developer could, as Googlebot apparently does, seed the PRNG with a constant value. You could also use some user-provided constant as the seed. All three are compliant with the letter of the spec.

Bottom line: the PRNG is, as all PRNGs are, deterministic. The user-facing Math.random API using the underlying PRNG may or may not be. Those are distinct things.

From the spec's description of Math.random with regard to random or pseudo-random: "...using an implementation-dependent algorithm or strategy."


Yeah, as was pointed out in /r/programming, it is perhaps an imperfect title in hindsight!


Technically true, but pointless pedantry when I'm sure that most people are on the same page wrt pseudorandomness. Typically what I'd expect of /r/programming.


What's a plausible real world use case for this if one wanted to exploit this to game SEO? Can it even be exploited in any way?


OP here. I made a silly function that identifies Googlebot, but has 'plausible deniability' built in:

http://www.tomanthony.co.uk/fun/googlebot_puzzle.html

The idea being that you could use a function to always send Googlebot down one execution path and users down another, and make it look like you are doing an A/B test on a small set of traffic. You could then do something nefarious, such as add spammy content to the page for Googlebot.
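
A hedged sketch of the shape such a function might take; the constant below is a placeholder, not the value Googlebot's seed actually produces (you would have to measure that yourself, and Google could change it at any time):

    // If Math.random() is seeded identically on every Googlebot render, its first
    // call returns a known constant, so a "traffic split" can quietly put
    // Googlebot in the same bucket every time.
    const GOOGLEBOT_FIRST_RANDOM = 0.123456789; // hypothetical placeholder
    function abBucket() {
      return Math.abs(Math.random() - GOOGLEBOT_FIRST_RANDOM) < 1e-12 ? "B" : "A";
    }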

However, in reality it is not likely to be a viable tactic for any decent quality site, and unlikely to have much of an impact on lower quality sites that may be willing to risk it.

It is an interesting idea though, and we research this sort of thing such that we can better identify such behaviours in competitors of our clients.


You can detect Googlebot by IP address or User-Agent which is much easier.


Yes you can - but it lacks the 'plausible deniability'. Especially if you have it in your frontend code. :)


After doing SEO for 10 years, I assure you that plausible deniability will not give you an advantage. Google will punish your site and not listen to your arguments.


Well, presumably some people want to pose as Google to view content which is reserved for logged-in members or bots, in which case relying on the user agent is not advisable... but indeed, the IP address.


Quick thought: Your pagerank score is higher if your page loads faster. If you're 100% confident it's Googlebot accessing your site, you could remove all sorts of content from the page presented to the bot, so that the page appears quicker to the bot than it actually will to real users.

On another note, this might be a neat way to avoid or deliberately screw up Google's "knowledge scraping" thing that they're doing, which is pissing people off mightily (where, when you search for a question, Google gives you the direct answer that they've scraped from another site).


For those (like me) who are not that familiar with Javascript, the Javascript spec for Math.random says: "....The implementation selects the initial seed to the random number generation algorithm; it cannot be chosen or reset by the user." Furthermore, the seed usually changes. It seems that Google has modified their Javascript library, perhaps by allowing an explicit Math.Seed function.


I employed a seed-based deterministic random function in a WebWorker once, to noisily but predictably drive an animation. I suppose one could use the same approach to have a decent non-deterministic source alongside Google's patched seedrandom.


I guess they used the XKCD method of random number generation: https://xkcd.com/221/

It is actually truly random if you have a fair die.



I still remember this comic strip 16 years after I saw it in the paper.


... assuming that this function is called at most once.


In an infinite string of random integers, 4 appears consecutively an arbitrary number of times.

There's no way to prove, of which I am aware, that a string of 24 fours is not random.


But given a pseudorandom number generator with a known amount of internal state (eg a 32 bit seed) there are hard upper limits on the lengths of sequences like that that can plausibly be produced.

Of course, to get 4 out of a PRNG you probably have to ask it for a random number in a range - if you’re requesting numbers in the range 0-1000 then you would expect fewer long sequences of fours than if you request in the range 0-5. And you can get arbitrarily many 4s by requesting numbers in the range 4-4...


Randomness describes a process, not its results.

You can't prove that a string of 24 fours is not random, and the fact that some process returned 24 fours once is not proof (but is some evidence) that it's not random - however, a process that always returns 4 is not random.

A process that returns a single fixed number (which was chosen by a fair dice roll) once is a random process. Using that same process twice or more is not.


Tangential but related: Brownian motion (i.e. a continuous-time random walk) is recurrent in one and two dimensions -- it returns to zero infinitely many times -- but in three dimensions it is not.

The way my professor explained: a drunkard can always find his way back to his building, but not to his apartment.


> The way my professor explained: a drunkard can always find his way back to his building, but not to his apartment.

That might be true for a drunk Kitty Pryde, but is it true for a drunk ordinary human who is constrained to only move between floors via the stairs or elevator?


Welp, a 1-to-1 correspondence can be made between Brownian motion on a 1-dimensional manifold and Brownian motion in plain \mathbb R.


def roll(): return 4


Python version of: https://xkcd.com/221/ ?


This makes me wonder if Chrome Headless has enough entropy for random() if running on a Linux server.


Entropy is not a problem, surely! The choice to return deterministic values for random() is a feature, not a bug. When you are crawling pages looking for meaningful content changes and new links, you probably don't want your random() creating noise.


More than enough. Chromium takes only 128 bits from /dev/urandom to seed a V8 isolate. You can also use a fixed seed like Googlebot by passing the command line flag

    --js-flags="--random_seed=12345" 
to Chromium.


PRNGs are deterministic. Even the PRNG shipped in processors and available through the RDSEED/RDRAND instructions is deterministic.

Unless you are using some form of entropy, e.g. dedicated hardware, that will be the case.


The article doesn't simply say it's deterministic, but that the seed doesn't change.


The Google queries for "roll a dice" and "flip a coin" aren't actually random either. They seem to be based on the current time.


Could you add a lot more detail than just a blanket statement like this?



