My ad server collected multiple tracking pixels for various events, and after five pixels I could fingerprint the browser (Firefox, Chrome, MSIE, etc.) and identify someone fiddling with the user agent string, or using a proxy server to mask this information.[1]
Just to throw out a different approach: buy the most popular computer and use the most popular browser. Don't change the user agent or make any other advanced user changes.
Going to extreme measures to be untraceable is like wearing a Ghillie suit to the airport.
IMO, one of the most difficult fingerprinting attacks to defend against is the installed fonts list. I wish incognito mode only made a standardized list of fonts available.
Firefox is experimenting with a font whitelist as part of the effort to upstream Tor Browser features into Firefox trunk. These features are controlled by the privacy.resistFingerprinting pref. resistFingerprinting has some webcompat issues, but Mozilla is considering a subset of the protections for Private Browsing mode, similar to Tracking Protection. Here is the work-in-progress bug:
I wonder if the fontSettings section of the Chrome extensions API[1] could be used to defend against that, by randomly and deliberately poisoning that info when in Incognito mode (or even in regular browsing).
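A rough sketch of what that might look like (assuming an extension with the "fontSettings" permission that is allowed to run in incognito; whether this actually disturbs font-based fingerprinting is untested, and it will visibly change how pages render):

    // Sketch only: re-pick a random installed font for each generic family.
    // chrome.fontSettings.getFontList/setFont are real APIs; the incognito
    // trigger and the overall effectiveness are assumptions.
    const FAMILIES = ['standard', 'sansserif', 'serif', 'fixed', 'cursive', 'fantasy'];

    function poisonFonts() {
      chrome.fontSettings.getFontList(fonts => {
        for (const genericFamily of FAMILIES) {
          const pick = fonts[Math.floor(Math.random() * fonts.length)];
          chrome.fontSettings.setFont({ genericFamily, fontId: pick.fontId });
        }
      });
    }

    // Re-roll whenever an incognito window is opened.
    chrome.windows.onCreated.addListener(win => {
      if (win.incognito) poisonFonts();
    });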
• Active countermeasures against cookie-based retargeting
• Popular enough market that merely having an iPhone in a geographic area doesn't single you out
There's a guy who downloads every page with curl. I see him on web logs. I think he must have some script that parses the amp pages out and does something with it, but because he's the only person in that geographic region who browses the web with curl, he's very easy to spot from a tracking perspective. On the other hand, because he's using curl, I don't think anyone wants to bother trying to show him an ad.
Stallman would probably use GNU wget, which is licensed under the terms of the GPL, while curl is licensed under a license derived from the MIT/X Consortium license.
I've never heard him say anything bad about running MIT licensed software.
On the other hand, I think I've read a FAQ where he says he views web pages by having a script fetch the pages with wget and then email them to him, but I'm pretty sure that's simply because wget is free and does the job (being able to recursively download the necessary resources), not because it's GPL licensed.
"I generally do not connect to web sites from my own machine, aside from a few sites I have some special relationship with. I usually fetch web pages from other sites by sending mail to a program (see https://git.savannah.gnu.org/git/womb/hacks.git) that fetches them, much like wget, and then mails them back to me. Then I look at them using a web browser, unless it is easy to see the text in the HTML page directly. I usually try lynx first, then a graphical browser if the page needs it (using konqueror, which won't fetch from other sites in such a situation).
I occasionally also browse unrelated sites using IceCat via Tor. Except for rare cases, I do not identify myself to them. I think that is enough to prevent my browsing from being connected with me. IceCat blocks tracking tags and most fingerprinting methods."
I don't believe it's RMS because this user appears to "browse" the web interactively. A lot of the sites he hits are on HN so I think he's a user here.
I used to browse with lynx until a few years ago. I am not sure if it is practical to do that anymore. Most of the web is no longer friendly to text browsing.
Somewhat surprisingly, youtube is not too bad. I've used elinks to get a URL and then put that into youtube-dl. I suspect that any site that is designed to be friendly to a (vision impaired) screen reader is going to be friendly to text-based browsing.
I'll second that. As someone who once worked on fingerprinting scripts I can tell you that iPhones of the same model are basically indistinguishable to Javascript.
Generally speaking, you _can't_ prevent a page from being able to tell what browser you're using. Even with JavaScript completely disabled, there are probably still some quirks in the way different browsers handle CSS or image loading that would give away that information. Even curl can be "fingerprinted" this way, because curl is one of the only "browsers" that doesn't process CSS or images at all.
If you just want to prevent yourself from being identified as an individual, that's a different problem. Tor browser does a pretty good job of solving that.
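To make the CSS/image-loading point concrete, here is a minimal server-side sketch (not anyone's actual implementation; the endpoint names and the EXPECTED table are invented): serve pixels that fire through different mechanisms, then compare which ones actually arrived for a session against what the claimed User-Agent predicts.

    // Node sketch: /pixel.gif?sid=...&kind=img|css|js records which pixel
    // mechanisms fired; /verdict compares that set against the claimed UA.
    const http = require('http');

    const EXPECTED = {               // hypothetical expectations per client
      chrome:  ['img', 'css', 'js'],
      firefox: ['img', 'css', 'js'],
      curl:    [],                   // curl fetches the HTML only
    };

    const fired = new Map();         // session id -> Set of pixel kinds seen

    function family(ua) {
      if (/firefox\//i.test(ua)) return 'firefox';
      if (/chrome\//i.test(ua))  return 'chrome';
      if (/^curl\//i.test(ua))   return 'curl';
      return 'unknown';
    }

    http.createServer((req, res) => {
      const u = new URL(req.url, 'http://localhost');
      const sid = u.searchParams.get('sid');
      if (u.pathname === '/pixel.gif') {
        const seen = fired.get(sid) || new Set();
        seen.add(u.searchParams.get('kind'));
        fired.set(sid, seen);
        res.writeHead(200, { 'Content-Type': 'image/gif' });
        return res.end();
      }
      if (u.pathname === '/verdict') {
        const claimed = family(req.headers['user-agent'] || '');
        const seen = fired.get(sid) || new Set();
        const expected = EXPECTED[claimed] || [];
        // Claimed a full browser but the matching pixels never arrived? Suspicious.
        const suspicious = expected.some(k => !seen.has(k));
        return res.end(JSON.stringify({ claimed, seen: [...seen], suspicious }));
      }
      res.writeHead(404);
      res.end();
    }).listen(8080);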
Use distinct browsers (or profiles) for browsing the web-of-documents and for using apps hosted on the web-as-application-distribution-platform. They're each (for many of us) legitimate uses, but they have different requirements and threat models.
Cons: interactive infographics and courseware don't fit neatly in either.
Well, depending on whether each script tag gets its own context and each context gets the same seed (in which case you're boned?), you could use an extension of some sort that makes a random number of Math.random() calls (the count drawn from your own pseudo-random function, seeded with the time and/or page location) to mix up the results.
Or just use an extension that replaces Math.random() with something more random, but it's possible that could cause weird performance problems on certain pages and it would be hard to debug.
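A sketch of that second option (swapping Math.random for a crypto-backed source). One caveat not to gloss over: Chrome content scripts run in an isolated world, so to affect page scripts this would have to be injected into the page's main world before any page script runs; that wiring is omitted here.

    // Sketch: replace Math.random with values drawn from the browser CSPRNG.
    // Must run in the page's main world before any page code calls Math.random.
    (function () {
      const buf = new Uint32Array(1);
      Math.random = function () {
        crypto.getRandomValues(buf);   // 32 bits from the CSPRNG
        return buf[0] / 4294967296;    // map to [0, 1)
      };
    })();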
It's an arms race... there is so much to fingerprint ... usually people that deploy fingerprinting have more of an issue deciding which tasty morsel of bits to go for first ...
Seems reasonable to me. My guess is that it's not performance, but rather predictability, that matters here. Being able to detect when a page meaningfully changes is probably useful for Google, and a good implementation of Math.random() would potentially thwart that. Especially seeing how many pages have the magic constant in them...
Also, it's probably useful for determining that two pages are the same, which may be needed to help prevent the crawler from crawling a million paths into an SPA that don't actually exist, for example.
If I had to guess, my guess would be that Googlebot simply disables the API. It's new enough that this would be reasonable, and executing real crypto in the context of Googlebot is probably rarely desirable.
Seems like the expectations about PRNGs being expressed in this thread are a bit unrealistic. They're always deterministic. The fact that this PRNG is also always seeded the same makes it easier to fingerprint, but that has no bearing on whether the PRNG is deterministic.
That would be terrible! I rely on the deterministic behavior of PRNGs all the time. For instance, I often generate random test vectors. If I have a failure, I want it to be reproducible so I can fix it. And it is, as long as I supply the same seed.
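For example (a generic sketch, not the parent's actual setup), with a tiny seeded PRNG like mulberry32 you can log the seed with each run and replay a failing test exactly:

    // mulberry32: a small, well-known 32-bit seeded PRNG.
    function mulberry32(a) {
      return function () {
        let t = (a += 0x6D2B79F5);
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
      };
    }

    const seed = Date.now() >>> 0;      // record this alongside any test failure
    console.log('test seed:', seed);
    const rand = mulberry32(seed);
    const vector = Array.from({ length: 8 }, () => Math.floor(rand() * 256));
    // To reproduce a failure, re-run with mulberry32(<the logged seed>):
    // the same seed always yields the same vector.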
So you found another use for PRNGs. Randomness should be close to real-world multi-sided dice, with seed entropy drawn from multiple sources such as the video card, NIC, and storage buffers.
There are lots more examples: heuristic optimization, discrete event simulation, sampling... I could go on. Deterministic RNGs are much better in all of these applications, where reproducibility of results is important. I'm sure nondeterministic RNGs have important uses too. Perhaps you'd care to describe some of them.
I know you're being facetious because the OP is wrong that deterministic PRNGs aren't useful, but any cryptographic application of a PRNG should be non-deterministic.
Well, crypto isn't my specialty, so I always tread lightly when commenting about it. My hand-wavy understanding is that CSPRNGs are deterministic but not predictable, and that actual entropy is used to seed them. Going even further out on a limb, I think this is supposed to be the difference between /dev/urandom (CSPRNG) and /dev/random (actual entropy). If I have that wrong, I'd appreciate correction by somebody in the know.
You're correct that CSPRNGs themselves are deterministic. I probably just misread what you were saying. As for /dev/urandom vs /dev/random, that's not really true. On Linux there's kind of a historical artifact of why they're different (blocking vs non-blocking API), but on OS X /dev/urandom is a symlink to /dev/random.
You're right, I was just referring to the ability to control the seed value. In simulations you use a deterministic seed to be able to reproduce issues. For a CSPRNG you want a non-deterministic seed, and so CSPRNGs don't even offer an API to set the seed.
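To illustrate the "deterministic but not predictable" point from a couple of comments up, here is a toy sketch (not a drop-in replacement for crypto.randomBytes): treat a 256-bit key as the seed and expand it with AES-CTR. Anyone holding the seed can reproduce the stream; anyone without it can't predict it.

    const crypto = require('crypto');

    // Expand a 32-byte seed into a keystream with AES-256-CTR
    // (encrypting zeros yields the raw keystream).
    function makeStream(seed) {
      const iv = Buffer.alloc(16, 0);                      // fixed counter start
      const cipher = crypto.createCipheriv('aes-256-ctr', seed, iv);
      return n => cipher.update(Buffer.alloc(n));
    }

    const seed = crypto.randomBytes(32);  // real entropy, e.g. from /dev/urandom
    const a = makeStream(seed);
    const b = makeStream(seed);
    console.log(a(8).equals(b(8)));       // true: same seed, same "random" bytes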
Regardless of other technical limitations I'm guessing that it's actually done on purpose, as point #3 in TFA states:
>Predictable – Googlebot can trust a page will render the same on each visit
It's probably important for Google's crawler to identify whether a page changed or not; if some elements in a page are randomly generated, they may want to limit the impact.
I mean, after all they seem to use a real, changing value for their Date, so if they wanted they could just seed their RNG with that.
Under the assumption that the same event fired from the same IP at the same time, with the same environment, etc. would be considered a duplicate in your system, I'd have designed this system to be idempotent-insert-only and use content-addressing instead of nonces for identity (event ID = suitably large hash of event data to avoid collision). If that assumption doesn't hold, then add your nonce to the event data (and thereby modify the hash).
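A sketch of the content-addressing part (field names are invented, and this canonicalization only handles flat objects):

    const crypto = require('crypto');

    // Event ID = hash of a canonical serialization of the event itself,
    // so re-inserting the same event is naturally a no-op.
    function eventId(event) {
      const canonical = JSON.stringify(event, Object.keys(event).sort());
      return crypto.createHash('sha256').update(canonical).digest('hex');
    }

    const e = { type: 'click', ip: '203.0.113.7', ts: 1700000000, ua: 'Mozilla/5.0 ...' };
    console.log(eventId(e));   // stable ID -> idempotent INSERT ... ON CONFLICT DO NOTHING
    // If two otherwise-identical events must stay distinct, add a nonce
    // field to the event before hashing, as the parent suggests.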
I think how easy it would be for Google to make detecting Googlebot this way harder depends on why Googlebot is doing it in the first place, which we don't really know. If it's done for performance reasons, or for predictability reasons (rendering the same page twice is guaranteed, or at least more likely, to produce the same result), it might be difficult to change without cost.
But I believe Googlebot always faithfully sends its user-agent. Is there a reason Google would care about "fixing" this to make Googlebot harder to detect via random() predictability, when you can always just detect it via the user-agent anyway? I'm not sure, curious if others have thoughts!
Google has checks in place to see if someone is serving things to Googlebot differently than to the rest of the users. So it almost certainly has bots that double-check pages without the Googlebot user-agent.
If the "disguised" googlebot is the same as the actual one, chances are it is since it would want to be as close as possible to not flag false positives, and use the same seed for consistency then you might be able to use that to avoid detection on the fact that you are serving google something different than normal users.
Newspapers used to do that (and some still do) to have their full article content indexed while serving a paywall to everyone else.
I'd guess if they're crawling purely for the purpose of detecting cloaking (which would be a much smaller-scale job than the standard Googlebot indexing), they'd just use Chrome Headless[0][1]
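For what it's worth, a cloaking spot-check with headless Chrome could be as simple as the Puppeteer sketch below (generic boilerplate, not anything Google has documented): render the page with an ordinary UA and diff it against the copy served to the declared Googlebot crawl.

    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.setUserAgent('Mozilla/5.0 ...');   // ordinary browser UA, not Googlebot
      await page.goto('https://example.com/article', { waitUntil: 'networkidle0' });
      const renderedHtml = await page.content();    // post-JS DOM snapshot
      await browser.close();
      // Diff renderedHtml against the version served to the Googlebot UA;
      // large systematic differences would suggest cloaking.
    })();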
I doubt that they use Chrome at all. They probably just fetch the page over HTTP and let it run in a JS sandbox, hence the deterministic random function (which I assume is not there in Chrome?).
2018/01/18: Chrome 41: The Key to Successful Website Rendering: "Google updated their Search Guides and announced that they use Chrome 41 for rendering."
It is interesting that when I use Chromium 46 (which is newer than Googlebot's renderer) I get warnings from GitHub and Google Docs about an outdated or unsupported browser, and I get plain HTML GitHub pages without JS. But Google uses an even older browser than I do. So even Google cannot keep up with the browser version race and keep its own browsers updated, although it requires this from users.
According to that article, Googlebot is "based on Chromium 41"; it doesn't actually use Chromium 41 directly, it's still a distinct, separately developed browser.
The warning you're seeing is
a) probably shown based on the UA, and Googlebot has a very different UA from the Chromium browser
b) warning users who will need to browse/actively use the site. Googlebot simply parses content, so has no use for quite a lot of the active functionality on the site. As such, it will generally not need to support all of the features used by that active functionality, just barely enough to be able to read content.
I think that cloaking is very difficult to detect automatically without using humans. You have to distinguish for example between a paywall and real cloaking.
Perhaps for performance or testing reasons they compiled V8 with a predefined seed (keywords: "V8", "mksnapshot", "--random-seed"). It should not be too difficult to fix if there are no artificial restrictions from their use cases.
Javascript does not allow you to set the seed for the default PRNG. So for our purposes it's essentially non-deterministic. You need to use a custom library if you want to be able to specify your seed.
Continuing the nitpicking: "JavaScript" isn't actually what prevents setting the seed, you're talking about a common implementation detail of specific JavaScript runtimes (Mozilla’s documentation says their seed can't be changed, for instance). A runtime developer could, as Googlebot apparently does, seed the prng with a constant value. You could also use some user-provided constant as the seed. All three are compliant with the letter of the spec.
Bottom line: the PRNG is, as all PRNGs are, deterministic. The user-facing math.random API using the underlying PRNG may or may not be. Those are distinct things.
From the spec's description of Math.random with regard to random or pseudo-random: "...using an implementation-dependent algorithm or strategy."
Technically true, but pointless pedantry when I'm sure that most people are on the same page wrt pseudorandomness. Typically what I'd expect of /r/programming.
The idea being you could use a function to always send Googlebot down one execution path and users down another, and make it look like you are doing an AB test on a small set of traffic. You could then do something nefarious, such as add spammy content to the page for Googlebot.
However, in reality it is not likely to be a viable tactic for any decent quality site, and unlikely to have much of an impact on lower quality sites that may be willing to risk it.
It is an interesting idea though, and we research this sort of thing so that we can better identify such behaviours in our clients' competitors.
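The detection primitive this would hinge on is the one TFA describes: with a fixed seed, the very first Math.random() values in a fresh context are always the same. A sketch, where the constant is a placeholder and not the real value Googlebot produces:

    // Placeholder constant -- substitute the value observed from real
    // Googlebot renders; this number is NOT it.
    const GOOGLEBOT_FIRST_VALUE = 0.123456789;

    // Only meaningful as the very first Math.random() call on the page.
    function looksLikeFixedSeedBot() {
      return Math.abs(Math.random() - GOOGLEBOT_FIRST_VALUE) < 1e-12;
    }

    if (looksLikeFixedSeedBot()) {
      // route this visitor down the "A" branch of the fake A/B test
    }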
After doing SEO for 10 years, I assure you that plausible deniability will not give you an advantage. Google will punish your site and not listen to your arguments.
Well, presumably some people want to pose as Google to view content which is reserved for logged-in members or bots, in which case relying on the user agent is not advisable... but indeed, the IP address is.
Quick thought: your search ranking is higher if your page loads faster. If you're 100% confident it's Googlebot accessing your site, you could remove all sorts of content from the page being presented to the bot, so that the page appears quicker than it will actually be for real users.
On another note, this might be a neat way to avoid, or deliberately screw up, Google's "knowledge scraping" thing that they're doing, which is pissing people off mightily (where, when you search for a question, Google gives you the direct answer that they've scraped from another site).
For those (like me) who are not that familiar with JavaScript, the JavaScript spec for Math.random says: "...The implementation selects the initial seed to the random number generation algorithm; it cannot be chosen or reset by the user." Furthermore, the seed usually changes. It seems that Google has modified their JavaScript engine, perhaps by allowing an explicit Math.Seed function.
I employed a seed-based deterministic random function in a WebWorker once, to noisily but predictably drive an animation. I suppose one could use the same approach to have a decent non-deterministic source alongside Google's patched seedrandom.
But given a pseudorandom number generator with a known amount of internal state (eg a 32 bit seed) there are hard upper limits on the lengths of sequences like that that can plausibly be produced.
Of course, to get 4 out of a PRNG you probably have to ask it for a random number in a range - if you’re requesting numbers in the range 0-1000 then you would expect fewer long sequences of fours than if you request in the range 0-5. And you can get arbitrarily many 4s by requesting numbers in the range 4-4...
You can't prove that a string of 24 fours is not random, and the fact that some process returned 24 fours once is not proof (but is some evidence) that it's not random - however, a process that always returns 4 is not random.
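To put rough numbers on the range point above (back-of-the-envelope, assuming independent uniform draws): drawing from 0-5, any particular run of 24 fours has probability (1/6)^24, roughly 2x10^-19, while drawing from 0-1000 it's (1/1001)^24, roughly 10^-72 -- so widening the range makes long repeats astronomically rarer.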
A process that returns a single fixed number (which was chosen by a fair dice roll) once is a random process. Using that same process twice or more is not.
Tangential but related: Brownian motion (i.e. a continuous-time random walk) is recurrent in one and two dimensions -- it hits zero infinitely often -- but in three dimensions it is not.
The way my professor explained: a drunkard can always find his way back to his building, but not to his apartment.
> The way my professor explained: a drunkard can always find his way back to his building, but not to his apartment.
That might be true for a drunk Kitty Pryde, but is it true for a drunk ordinary human who is constrained to only move between floors via the stairs or elevator?
Entropy is not a problem, surely! The choice to return deterministic values for random() is a feature, not a bug. When you are crawling pages, looking for meaningful content changes and new links you probably don't want your random() creating noise.
More than enough. Chromium takes only 128 bits from /dev/urandom to seed a V8 isolate. You can also use a fixed seed like Googlebot by passing the --random-seed command line flag.
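For example (assuming your Node build exposes V8's flag directly; check node --v8-options), running node --random-seed=42 -e 'console.log(Math.random())' twice should print the same value, while omitting the flag gives a fresh system-random seed each run.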
Ads often have something like this attached to the end:
[1]: http://geocar.sdf1.org/browser-verification.html