Hacker News
Some analysis of the 1M most popular sites on the web (jacquesmattheij.com)
136 points by jacquesm on July 24, 2015 | 123 comments



Likely culprits are "performance analyzers" that grade a website and report an "F" (failing) grade for not using CDN-hosted common libraries.

This is a red herring: this idea that the user will already have a cached copy of CDN-hosted jQuery is bogus. Even for a common library like jQuery: the number of versions of jQuery that are in use is likely above 50, and the number of popular CDNs that host jQuery is surely above 10. So we are hoping that the user will have a cached copy of that exact jQuery version from that exact CDN.

This is somewhat similar to the situation we have with operating systems: we created shared libraries to save disk space and memory. These days using them is pretty much pointless and incurs a performance penalty, yet everybody still uses them.

For JavaScript, a much better approach is to a) make code Google Closure-compatible, b) compile everything using advanced mode into a single JavaScript file. That way you get an optimized subset of all the code that the site actually uses (this works wonders for ClojureScript apps). Most sites probably use less than 10% of jQuery, so why include all of it?


> These days using them is pretty much pointless and incurs a performance penalty, yet everybody still uses them.

Would you rather that, when (e.g.) there is a security patch for OpenSSL, you have to wait for all software using OpenSSL to deploy updates? Or would you rather that one update to OpenSSL (likely from your OS vendor) fixes all of the software depending on it?

Edit: People seem to be commenting on this through the lens of CDNs and JavaScript, but the sentence previous to the one I quoted was:

> This is somewhat similar to the situation we have with operating systems: we created shared libraries to save disk space and memory.

Which is not talking about CDNs and JavaScript, but about shared libraries on your desktop. I'm not saying that all usage of shared libraries is valid. I'm just saying that tossing out the concept as entirely useless (and having no redeeming value) in a modern setting strays from the truth.


Google doesn't back-port fixes to jQuery.

You can link without specifying the version number, but then you don't get full caching, so it's not common in practice.


You get pretty close to full caching... often better than using your own copy. The reason it isn't a common practice is more about potential bugs caused by newer versions.


> This is a red herring: this idea that the user will already have a cached copy of [open-ssl] is bogus

He says this is because of the many different versions in use.

While this isn't true for a managed repository of software, it is still true for most software releases so the mismatch just might happen further down the line.


> Would you rather that, when (e.g.) there is a security patch for OpenSSL, you have to wait for all software using OpenSSL to deploy updates?

Would you rather that a compromised jquery.js at <insert CDN provider> affect a huge number of sites? ;)


I did a little searching to expand on your numbers. What follows is not scientific.

According to https://www.datanyze.com/market-share/cdn/ CloudFront, Akamai, MaxCDN, CloudFlare, EdgeCast and CDNetworks account for ~75% of CDN usage by the Alexa top 1M (with 28 others listed).

Data from http://trends.builtwith.com/javascript/jQuery suggests that jQuery versions 1.4.2, 1.7.1, 1.7.2 and 1.8.3 cover 53% of the 23M sites they have version data for (with 23 others listed).

That puts a lower bound of 690 on the number of CDN-version pairs in the wild.

If we make the (totally unsupported!) assumption that the distribution of versions is the same across all CDNs, then 20 of these CDN-version pairs account for ~25% of the versions.

This could suggest that there is a cache advantage to using jQuery 1.4.2 (21%) served by Akamai (37.5%).
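
Spelling that multiplication out (using the rough shares quoted above; the independence assumption is, as noted, totally unsupported):

  // Under the independence assumption, the share of CDN-hosted jQuery
  // inclusions that are this exact CDN + version pair is the product:
  var akamaiShare = 0.375;   // rough figure from above
  var v142Share = 0.21;      // rough figure from above
  console.log(akamaiShare * v142Share);  // ~0.079, i.e. ~8% of CDN-hosted jQuery inclusions would be this exact pair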

Seems like jacquesm should now have the data to test this and give us an actual answer.


>This is a red herring: this idea that the user will already have a cached copy of CDN-hosted jQuery is bogus. Even for a common library like jQuery: the number of versions of jQuery that are in use is likely above 50, and the number of popular CDNs that host jQuery is surely above 10. So we are hoping that the user will have a cached copy of that exact jQuery version from that exact CDN.

I wonder if it might be a good idea to have a hash attribute for external resources. For example, I might include jquery by including

  <script src="//code.jquery.com/jquery-1.11.3.min.js" sha256="ecb916133a9376911f10bc5c659952eb0031e457f5df367cde560edbfba38fb8"></script>
I calculate the hash on my end. This ensures that if code.jquery.com/jquery-1.11.3.min.js is changed, the browser can know that the resource was tampered with or the developer made a mistake, and not load that resource. Also, if the browser sees a hash for a resource it has cached, it can load that cached resource, even if it is hosted at a different location. This seems better for both security and performance, but does put a slightly higher burden on the developers.

EDIT: This seems to cover what I am talking about: http://www.w3.org/TR/SRI/
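
For reference, the syntax the SRI draft specifies uses an "integrity" attribute carrying a base64-encoded digest, plus a "crossorigin" attribute for cross-origin fetches, rather than a bare hex sha256 attribute. Roughly like this (the digest value below is a placeholder, not the real hash of that file):

  <script src="https://code.jquery.com/jquery-1.11.3.min.js"
          integrity="sha256-BASE64_DIGEST_OF_THE_FILE_GOES_HERE"
          crossorigin="anonymous"></script>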


I wrote this a few years ago

Wish it was implemented

https://news.ycombinator.com/item?id=2023475


Can you please elaborate on how using shared libraries is "pretty much pointless and incurs a performance penalty"? That goes against my intuition of how they work.


Think it means that most machines are not memory or disk constrained these days, but there is extra processing to perform the dynamic linking. Sort of a cost-benefit argument it seems. Doesn't address the issue of security etc and the benefit of just having 1 instance of a library to update when maintaining a complete system however.


[I should not have brought shared libs into this, I now regret it, because it sidetracked the entire discussion, but...]

The extra processing is more significant than most people think. The library has to be compiled as relocatable code, which incurs a runtime performance penalty. You also lose a register, which especially on register-constrained architectures is really bad (it was a tragedy on IA-32, it's less of an issue now).


Eh, I can fit one copy of libc in L2 cache, but not 30 copies.


What if we add kernel same-page merging to the mix? Might still be a little less efficient at run time than the optimal use of shared libraries. But shared libraries make packaging more complex, especially if one does it Debian-style, with each shared library in its own package, a separate -dev package, etc.


> What if we add kernel same-page merging to the mix? Might still be a little less efficient at run time than the optimal use of shared libraries.

Might be a whole lot less efficient than even sub-optimal use of shared libraries. An optimizing linker pulling together static libraries is going to make page-merging the executable almost impossible.


I remember reading that with HTTP/2, using a single JavaScript file is an anti-pattern, since HTTP/2 has smarter management of requests and can deal with them in a more granular manner.


That'll matter when anyone's actually requesting or serving pages using that protocol.


Chrome, Firefox, and Opera all support it, as do Google, Twitter, Akamai, Jetty, Apache, and several others:

https://github.com/http2/http2-spec/wiki/Implementations

https://en.wikipedia.org/wiki/HTTP/2

That's a big chunk of the Internet right there. IE 11 and Safari 9 both support it, so once their respective betas go public that's the rest of the client-side support. Nginx is supposed to support it by the end of the year; once that happens most sites will get it just by tweaking a config file:

https://www.nginx.com/blog/how-nginx-plans-to-support-http2/


Only for TLS or only in beta versions. It's still going to be a while before it's worth it to sabotage older browser performance, even once sites update their servers.

And in the end, that'll just embolden sites to crust their pages with more analytics and trackers until the performance isn't any better.


It is only for TLS, and of course your server needs to support it.

However it's definitely not just beta versions. Check it: http://caniuse.com/#search=http2

If you broaden that out to HTTP/2 and its very similar predecessor SPDY then the browser support graph looks even better, including the latest versions of Safari, Mobile Safari, and IE: http://caniuse.com/#feat=spdy

>82% of US traffic supports SPDY or better.


> Only for TLS or only in beta versions.

Actually no. If you advertise in your headers that you support SPDY/HTTP2, they'll use it even if they are not using encrypted http in the first request. Anyone who hasn't updated their servers to support it can't honestly claim they care a lot about performance.

> It's still going to be awhile before it's worth it to sabotage older browser performance, even once sites update their servers.

It is already worth it really, particularly when you factor in mobile, where performance is a bigger concern.

> And in the end, that'll just embolden sites to crust their pages with more analytics and trackers until the performance isn't any better.

Well, there is a natural equilibrium that we tend to arrive at, but at least 1st party trackers are so lightweight with SPDY as to be irrelevant. If you have one 16K image somewhere on the page, the overhead of loading 100 trackers will seem negligible (JavaScript might be another matter though ;-).


How would the closure compiler figure out what bits and pieces of the library are triggered from the html portion of the site? (I can see how it can track the javascript bits but unless your site is entirely generated from js you'd have to start with the html)


You annotate methods in Google-JS-Closure with @public, @protected, and @private in comments.

Public methods get unmangled symbols. Everything else gets renamed to a short name to save bandwidth.

Dependencies are specified with goog.require.

Anything that doesn't get required with goog.require or isn't called by a public function gets culled.
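
A minimal sketch of what that looks like (namespace and function names here are made up; @export, together with the compiler's export generation, is one way to keep a symbol reachable from hand-written HTML when compiling with advanced optimizations):

  goog.provide('myapp.widget');   // illustrative namespace
  goog.require('goog.dom');       // only what you goog.require gets pulled in

  /**
   * Renamed to a short symbol by the compiler.
   * @private
   */
  myapp.widget.double_ = function(x) {
    return x * 2;
  };

  /**
   * Exported, so the name survives renaming and can be called from HTML.
   * @export
   */
  myapp.widget.init = function() {
    goog.dom.setTextContent(goog.dom.getElement('out'), myapp.widget.double_(21));
  };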


While you're integrating Closure Compiler's Advanced Mode, you might as well re-write your entire client side code... Because you'll likely have to.


Well, it's something we should look at, especially the library makers.

People who write ClojureScript regularly encounter this: we get this optimized and trimmed down app, and then something needs jQuery, so we have to pull all of it in. After you've used advanced compilation for a while, it feels downright dirty and wasteful to pull in entire blobs of code, not just the function trees you actually need.


You can also define a separate externs file with a list of symbols that shouldn't get mangled; this is useful if e.g. you're using a third-party library that you don't want to modify.
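
A tiny externs file for that case might look something like this (symbol names are made up; the file is passed to the compiler with its --externs flag and contains declarations only, no implementations):

  // externs.js - these names exist elsewhere and must not be renamed
  /** @constructor */
  function ThirdPartyWidget() {}

  /** @param {string} selector */
  ThirdPartyWidget.prototype.attach = function(selector) {};

  /** @type {!Object} */
  var thirdPartyConfig;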


Google Closure compiler looks at all of your JavaScript at once and throws out whatever isn't actually used. So, if you call your JavaScript from HTML somewhere else, you need to explicitly list the functions you use.


Using common CDN-hosted jQuery is a ridiculous idea security-wise anyway.


Check your browser cache sometime. You no doubt have most of the CDN-hosted jQuery versions in your cache. You also probably have hundreds of copies of Closure-compiled jQuery. ;-)


Minifying your JS and CSS files is a very good practice, as it's not only secure but also compact. Grunt is a very powerful tool that does this.


> Minifying your JS and CSS files is a very good practice, as it's not only secure but also compact.

If you think you gain much in terms of compactness, you might not understand how the subsequent gzip compression works. ;-) There may well still be a gain, but it won't be significant.


I wish more people understood how Google Closure advanced compilation works. It's not just minification. See https://developers.google.com/closure/compiler/docs/compilat...


As an example, on a previous project all of the JavaScript libraries plus the app concatenated together came to 3MB. Minification with uglify reduced the size to 1.5MB, and gzip compression further reduced the transfer size to ~800K.


Have you compared that to the transfer size of gzip without a minification step? I haven't seen a real difference, myself, and I've been considering taking the minification step out.
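
For anyone who wants to measure it on their own bundle, a quick Node sketch of that comparison (file names are placeholders for your own concatenated and minified builds):

  var fs = require('fs');
  var zlib = require('zlib');

  var raw = fs.readFileSync('bundle.js');        // placeholder: concatenated, unminified
  var min = fs.readFileSync('bundle.min.js');    // placeholder: uglify/closure output

  console.log('raw      :', raw.length);
  console.log('raw+gzip :', zlib.gzipSync(raw).length);
  console.log('min      :', min.length);
  console.log('min+gzip :', zlib.gzipSync(min).length);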


I wonder what's considered external though. If I compile / minify my javascript and CSS and then use a CDN to cache or host it -- is this considered external?? If so, how is it different from trusting my hosting provider to host my site in the first place, or my domain provider for resolving for me? How can this analysis know whether or not the resource is external? based on dns records alone? because I can still use *.my-domain.com but point it to an external resource...

The concerns raised are valid, but I'd like to see the methodology for analyzing the data because it can definitely skew results.


If the url used to fetch the file is not related to the domain the original html comes from then that would be counted as external.

You can point *.my-domain.com to an external resource but it would see that resource as still under your control.

I will post the code soon.


It is pretty standard practice to host assets on a "cookieless" domain you control, but not on the same domain as the original site. For example, www.example.com has all the html, but all of the images are hosted at www.images-example.com. That would skew the results considerably.


Why use another domain and not a sub-domain? I assume something to do with the cookie-less comment - but not clear what?


The main reason is that sometimes you have *.domain.com authentication cookies for single sign-on across a suite of sites; however, you do not want those authentication cookies sent to domains that do not need authentication.


I understand that it isn't possible to check if the external assets are hosted on a CDN bucket which is under the control of the website (but under a different domain name), but without the ability to discriminate such cases it makes your statistics on externally hosted content pretty meaningless.


I don't agree with that. From the point of view of the user that content might as well have come from a third party since - just like the headless browser used in the testing - they have absolutely no way to verify that short of doing a bunch of whois lookups. And if there is one thing that a user should be able to verify then it is that the entity sending them the main page is the same entity as the one that sends them the rest of the stuff on that page and to refuse all or part of the transaction if that isn't the case.

After all: there is only one slot in the URL bar, which strongly suggests to the user that that is the entity they are transacting with.

What technical tricks are pulled behind the scenes have no bearing on that.


> From the point of view of the user that content might as well have come from a third party since - just like the headless browser used in the testing - they have absolutely no way to verify that short of doing a bunch of whois lookups.

How many users, even among the extremely security conscious, do you expect to actually verify such things? And among these, how many do you expect will turn their noses up at a CDN url that is serving a number of obviously website-specific assets in addition to jQuery and friends? The real issue (and the one you focus on in your article) is the actual dangers you expose yourself to by using uncontrolled external asset sources. If I set up a CloudFront bucket that mirrors the static asset directory on my web server (and use it via HTTPS), what is the vulnerability? CloudFront could be hacked, but so could my web server (and the latter is a more likely culprit since it is executing more dynamic code). The difference makes even less sense if my website is also hosted by Amazon.

If I want to keep my static assets on a CDN (and save myself a lot of server load) while keeping my asset URLs on the same domain, the only two things I can think of are:

1. Use a CNAME record, which precludes using HTTPS (introducing a real, not perceived, security vulnerability).

2. Use a local URL that gives a 301 redirect to my CDN URL, which means extra requests to my website for no real security benefit.


What about the common trick of serving static assets via a separate domain (one that's still owned and hosted by you) to avoid the overhead of cookies from the main site being included in asset requests?


That's a good point. That would require some more work to rule out, for instance by doing a whois lookup to see if the domains have the same controlling entity.


Yep, it's a tricky one to sort out. Harder still if the domains are using a whois privacy protection service.

Checking if both domains resolve to the same IP could help establish a link in certain cases (e.g. where the same haproxy load balancer is terminating traffic for both sites).

Maybe also checking if the asset domain contains the string of the main site domain, e.g. company.com and companystatic.com, or company.com and companyassets.com.

Edit: if both sites are served over SSL you could also sniff the cert to see if it's the same one in both cases (i.e. they're using one cert and multiple subjectAltName entries).
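
A rough sketch of those two checks in Node (the host names are the example ones from above; this is a heuristic only and will miss plenty of legitimate setups):

  var dns = require('dns');
  var tls = require('tls');

  // 1. Do both domains resolve to the same address?
  dns.lookup('company.com', function (e1, ip1) {
    dns.lookup('companystatic.com', function (e2, ip2) {
      console.log('same IP:', !e1 && !e2 && ip1 === ip2);
    });
  });

  // 2. Does the asset domain's certificate list the main domain as a subjectAltName?
  var socket = tls.connect(443, 'companystatic.com', { servername: 'companystatic.com' }, function () {
    var cert = socket.getPeerCertificate();
    console.log('altNames:', cert.subjectaltname);  // e.g. "DNS:company.com, DNS:companystatic.com"
    socket.end();
  });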


I have always been irked by the fact that my bank and most websites behind an HTTPS connection (especially after login) have ANY external resources. This is a major flaw on numerous levels; how such a thing is still allowed by browser vendors is IMO borderline voluntary negligence. All such sites, if they must host external resources, should do so only within an iframe w/sandbox [1]. The fact that the external resource is also "https" [on some foreign property] is completely and utterly meaningless.

[1] http://www.w3schools.com/tags/att_iframe_sandbox.asp


My bank (ABN/Amro in NL) is even worse, they not only include external resources, they include external resources that are critical for their site to function, in other words, if I disable the various trackers and analytics elements on their page the site simply no longer works. You'd expect the opposite!


Wow! I think there should be a place on the internet for publicly naming and shaming such practices. Like a Darwin Award or Razzies [1] of webdev.

[1] https://en.wikipedia.org/wiki/Golden_Raspberry_Awards


That's one of the things I'm considering right now: re-doing the top 1000 or so with annotations, sorting them by category, and giving an example of a site that is 'clean' in the same category.

There are a ton of offenders and some of them are very well known.

One of the interesting things you find when you look at this data is that the bigger sites really do have their stuff set up better (for instance, by using in-house analytics) but there are lots and lots of exceptions and some of them are quite shocking. For instance, I've found two major car brands that include evercookies on their corporate websites in Eastern Europe, a thing that no respectable company should ever do. I suspect an ad agency is the cause of this so I'm still digging away at that.


If you want a never ending list of shocking offenders, be sure to check out most airline sites. They really are abysmal.


Frankly, I am much less bothered by the cross-domain visit tracking aspects than by the JS injection. Cookies are not going anywhere in a thriving, behavioral-analysis-driven ad industry. I don't know how logged impressions/clicks could be trusted for accuracy if they had to be proxied by or entrusted to their customers' servers.

I will live with cookies, but I absolutely will not live with injected JS behind HTTPS.



Indeed. I can't wait until that is implemented across the board. It still leaves the privacy issues.



One thing in this context is that it is basically impossible for a website to check the integrity of an external (js) resource without loading it. This is a consequence of the web security model.

It's basically impossible to get the contents of a .js file without executing it, say for checksum verification (at least without CORS, and even with, you might trigger an additional download, I haven't tested it). But it's trivially easy to include an external .js in the page, with the same access rights as directly embedded script (including access to credentials).

That's what we're used to, but it seems completely backwards to me. It would be much better IMO if a script could make arbitrary HTTP requests to other sites - but without having access to those sites' credentials. (Remember in the 2000s when "mashups" were all the rage? I spent a weekend parsing some data source in javascript to display it on a map, just to realize that what worked locally didn't work over http. Imagine the disappointment.)

What's also missing is a way to run an external script sandboxed, or in a sub-interpreter. There ought to be a way to restrict what banner ads or font loaders can do to my page.


Look SRI up on this very page.

Web pages can make requests to other origins (GET image, script, XHR, POST to iframe, XHR). CORS allows you to read the response. But what you're asking would probably be hard to transition the whole web to without too much spam and DOS'ing.

The sandboxing for an external script you want already is feasible with an iframe with a different origin.


Subresource Integrity hashes [1] should let sites get the caching and CDN benefit of using shared resources like jquery without letting 3rd parties have the ability to XSS them. Basically, you can specify a hash of what the url should point to, and if it doesn't match then the load is blocked.

This isn't quite out yet: it's in Chrome trunk [2] and still under review in Firefox [3].

[1] https://w3c.github.io/webappsec/specs/subresourceintegrity/

[2] https://code.google.com/p/chromium/issues/detail?id=355467

[3] https://bugzilla.mozilla.org/show_bug.cgi?id=992096


We offload a ton of our scripts to S3 buckets on random unrelated domains and it's a pretty common practice. Did this take that into account?


No, it did not. It would have to tie in the whois data to make that match (and even then it might not). The analysis is URL-based; I don't think changing that to account for those sites that use random domains to store chunks of their site would make a huge difference, but it's a valid criticism.


It seems like you're marking sites down for using a cookieless domain for resources, even though that's faster and no less secure? For example, you'd mark Google down for referencing gstatic.com or Facebook down for referencing fbcdn.com.

I realize there's no publicly available way to tell that yahoo.com and yimg.com are the same entity, but it would be good to at least note this as an issue with the analysis.


I'll do so.

Edit: done.


thanks!


I agree re not using externally-hosted Javascript. In fact I seem to remember a year ago Google Code having connectivity issues and jQuery all over the place failing to load. I was glad on that day that I always host my own jQuery.

Re tracking, I ran into this embedded in some webfonts CSS a project was using (downloaded from one of those font websites):

    /* @import must be at top of file, otherwise CSS will not work */
    @import url("//elided.example.com/count/35d82f");

    @font-face {font-family: 'Foo'; font-weight: 300; src: url('/webfonts/foo.eot');.....}

That @import returns nothing. It is just part of their tracking/licensing. And it was really slow! And I love the lying comment they included.


Well, the @import itself only works if it precedes other statements.


Evercookies sound terrifying. Not that I'm doing anything that I really worry about hiding, but I can't stand invasion of privacy like this.

Are there effective protections against them? If not, I wonder why the EFF hasn't taken up the charge to fight them?


EFF is interested in this issue. It's one of the motivations behind our work on Panopticlick and Privacy Badger, for instance.

https://panopticlick.eff.org/

https://www.eff.org/privacybadger

We have also participated in meetings and discussions on tracking protection and lobbied browser developers about it.

The Panopticlick research shows that it's potentially difficult to impossible to detect and prevent persistent cross-site tracking in the current web platform by technical means. Even if we fixed every cookie-like mechanism so that no site can set and query state except through official HTTP cookies and subject to the user's cookie preferences, the sites might still be able to recognize the browsers by querying other navigator (and OS and plugin) properties. Tor Browser has been able to do a great job on that issue -- but at the cost of disabling a lot of web platform features that sites might expect.


Too many battles to wage for one fairly small organization


Then should the rest of us not step up? Or is it that there's no effective way to combat it without making major changes to browsers we have little control over? At the very least, someone could have a website which listed steps you can take to protect yourself.


You can get some benefits from EFF's Privacy Badger and from Mozilla's Tracking Protection feature.

https://support.mozilla.org/en-US/kb/tracking-protection-fir...

https://www.eff.org/privacybadger

Both of these tools are focused on cross-site ("third-party") tracking, rather than cross-session tracking by an individual site ("first-party"). Third-party tracking is technically easier to try to detect, and some people regard it as more intrusive.

As I mentioned upthread, EFF's own research on browser fingerprinting shows that it's hard to stop all user tracking (because your browser and OS and device might be different enough from others to be unique in a population in ways that could be observable by a remote site). Tor Browser is doing great work on this

https://www.torproject.org/projects/torbrowser/design/#finge...

and I think they've made concrete progress. (I think the Tor Browser developers might say that the privacy benefits of using their changes without Tor are unclear because you could also so easily be tracked by IP address. But it's possible that some of their changes will find their way into mainline Firefox, at least as options.)


So, as someone who has never really bothered with blockers of any sort, what would be the ideal blocker to install / write? (I am thinking iOS as sadly that is my primary medium these days)

- able to prevent download of any third-party hosted assets

- able to hash the above assets and allow the user to approve their use (i.e. can approve jQuery v1.5 from cdn.google.com)

Is this whitelist approach going to work? Does Ghostery or similar already do this?

I vastly prefer a whitelist approach - but if 2/3 of the web will break I am at a loss ...


What about services like npm that distribute code? Are these analogous or do they have additional security in place?


Isn't that server side?


Yes, but the same attack could happen if an attacker gains control of an npm module. Users without tight control over their modules could unwittingly pull in malicious code.


With dependency resolution and node_modules folders dozens of levels deep, it's pretty difficult to verify untrusted code hasn't been injected somewhere.


Not really. NPM is also used with a tool called browserify to enable frontend web developers to use NPM modules in the browser.


Some relatively serious questions on the methodology:

- how did you define third party assets vs domain-managed assets? Is anything not hosted under example.com automatically third party? What about Twitter.com and t.co? I know this one is picky but would like a feel for the figures.

- how deep did you scrape the (million!) sites? If it's front page or similar I would not be surprised to see figures revised upwards significantly - once off the beaten track of even major sites the number of "let this one slide" decisions spikes a lot.

- how long did polling a million sites take?! What was the setup you used - very interested even if it has nothing to do with methodology :-)

Thank you - you have at least made me rethink my lack of blockers


I will release code + data for bootstrapping but until then here are my answers to your questions:

> how did you define third party assets vs domain-managed assets? Is anything not hosted under example.com automatically third party? What about Twitter.com and t.co? I know this one is picky but would like a feel for the figures.

That's based on the hosting domain being the same or a superset of the domain that the page originally came from.

> how deep did you scrape the (million!) sites?

Just the homepage.

> If it's front page or similar I would not be surprised to see figures revised upwards significantly - once off the beaten track of even major sites the number of "let this one slide" decisions spikes a lot.

That's true.

> how long did polling a million sites take?!

20 days. About 50K sites per day, which significantly cramped my ability to do other work here.

> What was the setup you used - very interested even if it has nothing to do with methodology :-)

A simple laptop with 16G of RAM and a regular (spinning) drive on a 200/20 cable connection. 40 worker threads running concurrently, with a simple PHP script to supervise the crawler and another script to do the analysis.

Most of the data was discarded right after crawling a page; only the URLs that were loaded as a result of loading the homepage were kept, as well as the mime type of the result.


Thanks!

Two things leap out. Firstly I love the way you chose to do 1 million sites. I would have gone, hmm, maybe top thousand, and called it a representative sample :-) The scale of the modern world is still something I am grappling with.

Secondly, is that 200 Mbps down / 20 Mbps up? I think the UK has some broadband access lessons to learn if that's true. My wet piece of string is getting threadbare.


It's maybe overkill to do it on the whole set instead of just a sample; probably the numbers would not change all that much.

The 200/20 is indeed 200 Mbps down and 20 up; this little trick saturated the line pretty well, though. I probably could have saved some time and bandwidth by letting phantomjs abort on image content but I was lazy.


I'm slap bang in the commuter belt round London - and broadband availability is having an actual effect on house prices and decisions to move out of the area.

It's surprisingly low on the political agenda nationwide.

I'm about to get all English middle class over this so I will stop now :-)


Code has been released at https://github.com/jacquesmattheij/remoteresources - have fun.


Are modern updated browsers resilient against those "evercookies" or not so much?

To be more specific: I have my Firefox configured to delete cookies on exit. Does that deal with "evercookies"? I must admit, never heard about them before...


Evercookies go a lot further than the regular cookies that you can delete per session.


> The request for the code contains a referring url which tells the entity hosting the script who is visiting your pages and which pages they are visiting (this goes for all externally hosted content (fonts, images etc), not just javascript)

This can now be mitigated thanks to Referrer Policy [0]:

"The simplest policy is No Referrer, which specifies that no referrer information is to be sent along with requests made from a particular settings object to any origin. The header will be omitted entirely."

Voilà:

  <meta name="referrer" content="no-referrer">
It's a W3C draft, but it's supported by latest FF/Chrome/Safari, and Microsoft Edge [1], although currently, with Edge, you'll want to use the legacy keyword "never" instead. (AFAIK "never" works with all the aforementioned browsers.)

> Google analytics junkies in particular will have to weigh whether they feel their users privacy is more important to them than their ability to analyze their users movements on the site.

There's a nice alternative - Piwik [2]. It's very much like GA, but GPL and self-hosted, and with various options for privacy [3]. You can even use it without cookies, if you don't mind the somewhat reduced accuracy and functionality.

Regarding fonts from Google Fonts, it's super-easy to host them yourself. There's a nice bash script [4] that downloads the font you want in all its formats/weights and generates the proper CSS. There's also the google-webfonts-helper service [5], and Font Squirrel has a webfont generator [6].

[0] https://w3c.github.io/webappsec/specs/referrer-policy/

[1] https://msdn.microsoft.com/en-us/library/dn904194%28v=vs.85%...

[2] https://piwik.org/

[3] https://piwik.org/docs/privacy/

[4] https://github.com/neverpanic/google-font-download

[5] https://github.com/majodev/google-webfonts-helper

[6] http://www.fontsquirrel.com/tools/webfont-generator


I'm impressed with the amount of browser support this has already. Thanks for the info.


jacquesm, maybe you didn't want to wade into the details too much, but you didn't mention a major attack vector on third party scripts, namely the transparent caches run by nearly all ISPs. Also, unless a third party script is served over HTTPS to users, regularly verifying the scripts is useless since _your_ ISP will give you _their_ cached copy, and similar is true for all site users. Transparent CDNs are another consideration for the related caching problem.


The examples given were just examples, I can see a lot more possibilities beyond the ones mentioned in the article but to be honest I had not thought about the ISP caches.


> Flash seems to be very rapidly on the way out, less than 1% of the domains I looked at still contained flash content

What exactly did you look at? Homepages?


Yes, homepages and all the content subsequently loaded (directly or indirectly through multiple layers of scripting or iframes). Essentially what you'd get if you were to visit each and every homepage on the top list and logged the urls that were loaded as a consequence of that.


As much as I hate Flash, I don't think you can infer that then. Does your analytics consider https://www.youtube.com/ using Flash? I see no Flash on it here.


If it's not on the homepage then it would not consider it using flash. The analysis was run on the homepages, not on all the pages in those websites. (And that would require a lot more work on my part and likely would not change the results all that much).

I believe overall flash usage on the web is now about 10%, but larger sites are generally much better at keeping their sites up-to-date and to follow trends.

Advertising is another good indicator. The typical trick nowadays is to check if flash is installed using some javascript or header inspection and only to serve it up if support has been detected.

Websites that categorically include flash are the ones that were detected.

That's a good point though, I should update the text to that effect.

edit: ok, updated the text to be much more precise about flash usage and the conditions of the crawl which will lead to under-representation of flash.


Maybe a random page of content would have been a better indicator.


How deep did you crawl? I would have guessed the flash usage to be higher.

How big is the dataset? How long did it take? Which tools did you use besides phantomjs?

Nice job!


> How deep did you crawl?

Front pages only.

> I would have guessed the flash usage to be higher.

When adding all the pages in a site it no doubt will be. I'll update the article to clarify this.

> How big is the dataset?

In flight: huge, but after culling and keeping only the bits that I needed it was a lot smaller, about 20G.

> How long did it take?

About 10 days.

> Which tools did you use besides phantomjs?

Just some php glue scripts, nothing fancy, about 500 lines.


This is very interesting. Will you release the data and code at some point?


Yes, I will definitely release the code and the dataset required to bootstrap the rest. It takes a long long time to run and you'll need a good bit of bandwidth. I won't be releasing the raw data because there is simply too much of it.


If bandwidth is a worry there are open data projects by Amazon[1] and I think Google[2] that could host your data.

[1] Amazon Public Data Sets, http://aws.amazon.com/public-data-sets/

[2] Google Public Data, http://www.google.com/publicdata/directory


Sorry for bugging you. Did you store results from the response¹ metadata object for every domain and process it later, or use regexes to parse the HTML content?

I crawl large-ish websites (most recently https://code.google.com with 1.8MM repos) often and am really looking forward to your dataset & code.

[1] http://phantomjs.org/api/webpage/handler/on-resource-receive...


> Did you store results from the response¹ metadata object for every domain and process it later or use Regex to parse the HTML content?

That would have constrained throughput too much so I opted for culling it during the crawl to just content-type and url, this was then processed to extract the various bits of information. I did use the 'resource received' trick you linked above. Very useful.
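
For anyone curious, the shape of that trick in PhantomJS is roughly this (the URL is a placeholder; the real crawler obviously does more bookkeeping):

  // save as crawl.js and run with: phantomjs crawl.js
  var page = require('webpage').create();

  page.onResourceReceived = function (response) {
    if (response.stage === 'end') {
      // keep only what the analysis needs: content type and URL
      console.log(response.contentType + '\t' + response.url);
    }
  };

  page.open('http://example.com/', function (status) {
    phantom.exit(status === 'success' ? 0 : 1);
  });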



Run it against Coinkite.com, especially on signed-in pages.


>> "50% of the domains contained advertising of some form."

That's much lower than I would have expected


[flagged]


As opposed to all that sterling, highly-secure software built by REAL MEN with COMPUTER DEGREES, right?


What a decidedly over-pessimistic outlook.


This article should be read as ranty research, not practical advice. It'd be fine to fix these issues, but not at the website developer level.

"If you have to use externally hosted resources such as javascript libraries then at a minimum you should verify regularly that the code has not changed "

No, you shouldn't. You should focus on stuff that matters to users, not existential internet security holes. Let someone else fix this problem for you, when it stops being existential.

By far the safest approach for website owners that care about their users and their users' privacy is to simply not include anything at all from other people's servers.

FTFY "A safe, but impractical and productivity-destroying aproach..."


> This article should be read as ranty research, not practical advice. It'd be fine to fix these issues, but not at the website developer level.

That's fine, it's only your users after all.

> No, you shouldn't. You should focus on stuff that matters to users, not existential internet security holes.

If you can fix a hole until that hole is plugged in a more permanent fashion then I think that you should.

> FTFY "A safe, but impractical and productivity-destroying aproach..."

Entirely practical and with minimal impact on your productivity. If you can find the time and expend the effort to build a product to begin with then you can certainly find the time and expend the effort to make a conscious decision about things like this.

It matters enough to me that it affects my choices with respect to which parties I deal with on the internet and I sure hope I'm not the only person that thinks like that.

And every time some user gets their day ruined by a former JavaScript-widget-hosting domain that got transferred to a new and malicious owner, there is a bit more evidence that that approach is the right one.

From the end user's point of view, as long as it works the two solutions are identical, and if you're serving up megabytes of content anyway, what's the big deal about serving up some scripts directly as well?


No, the solutions are far from identical to the user.

In Jacques' universe, websites are more buttoned up, and less feature-ful. Every non-banking, non-critical website is taking worthless security steps in secur-e-verse, and hurting their product.

In the real universe, developers of new fluff websites focus on features and user experience, grow successful, and attract many users. Developers who takes Jacques approach build websites that fail and never get used by anyone.

And also no, no, no.... no one is like you. No one cares if developers do what you propose, no one avoids websites that don't.


Explain to me like I'm five what features a website that hosts its own JavaScript can't have versus one that loads those same scripts from a remote source?


It can't have the features that would have been built, in the time spent learning about and implementing security.

I regard nearly all security for startup-class, low-user, and low-value companies to be premature optimization, which is deadly to a new project's potential.


> I regard nearly all security for startup-class, user-less, and low-value companies to be premature optimization.

I can't see anybody working on user-less websites anyway but I sincerely hope that you'll make it plain which start-ups you work for so I can avoid them. Security and abuse potential are very important for start-ups because you have only one reputation and if you lose that you're pretty much done for.

I can point you to several pretty harsh reminders of how start-ups that don't take end-user security seriously can end up.


One thing that I have seen over many many years in business is the need to decide where and when to cut corners and take chances. I have definitely seen that after the fact you could feel "geez, I should have done that, how stupid", but before something actually happens things aren't so clear that resources and time and money should be spent preventing something from happening.

I have noted that if I did everything perfectly (and I am not talking specifically about money) I would never have made any money at all. (Exaggeration for effect.) There are always things that you can think of that seem like a good idea, and you can't do all of them. I take the time and put in the money and effort to rotate and store offsite backups. However, in all of these years that has never been needed. But God knows we are all aware of a ton of small businesses living on the edge who probably don't have good onsite backups, let alone offsite, and they do have information that they need.


I see. Copying the javascript you use into somewhere on your own site does take too much time.


Then let them die.


Agree the advice is well intentioned and is correct (in theory according to what I read) but not entirely practical. For example:

"then at a minimum you should verify regularly that the code has not changed (you have to hope that you are looking at the same code that your users see)"

Who exactly is the "you" in the above statement and who pays the "you" money to fix this and keep on top of it on an ongoing basis? And for how long?

In the physical world, the difference between ideal and practical can be described by my experience with production machinery. The machinery came with guards to protect the operators from getting their hands caught or cut off. The guards also came with switches to prevent the machines from running when the covers were taken off. But what would happen is the operators would want to oil or tweak the machines, so they would take the covers off and disable the sensors so that the machine would run bare. Of course you would tell them not to do this, but they would still forget to put the covers back on or be lazy quite often, and there was little you could do about it. You had production to get done under deadline and weren't likely to fire someone even though you knew there was a small safety risk in doing this type of thing. (Older machines of course came with no safety guards at all; operators just had to be careful at their own peril. And good operators were impossible to find anyway, so once you had someone they became a prima donna...)


Regularly pulling a hash for the libraries you include and alerting you when a hash changes unexpectedly is no work at all.

And if you need to be paid money to fix it then you have a problem anyway so one would assume that you'd be paid just as much to fix it when you're being alerted to it by a cron job as you would be paid to when you're alerted by a horde of users.
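
A minimal sketch of such a check in Node (the URL and the pinned hash are placeholders; run it from cron and record the expected hash once, by hand):

  var https = require('https');
  var crypto = require('crypto');

  var url = 'https://code.jquery.com/jquery-1.11.3.min.js';  // the external script you include
  var expected = 'PINNED_SHA256_HEX_GOES_HERE';              // placeholder: the known-good digest

  https.get(url, function (res) {
    var hash = crypto.createHash('sha256');
    res.on('data', function (chunk) { hash.update(chunk); });
    res.on('end', function () {
      var actual = hash.digest('hex');
      if (actual !== expected) {
        // cron mails any output to you by default, so this doubles as the alert
        console.error('ALERT: ' + url + ' changed, sha256 is now ' + actual);
        process.exit(1);
      }
    });
  });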

As for machines without guards: I've worked (extensively) in the metal working industry and the number of people missing digits and limbs has decreased steadily ever since tampering with guards, safety-interlocks and lock-outs became a firing offense so I don't think that's a very good example.


Machines: Yes, this was the 80s (sorry, I didn't point that out, my mistake) and things have changed. However, to that point: if you have your golden machine operator turning out good work (and he is only 1 of 2 on a particular line) and it's not easy to hire a replacement, let alone a good replacement, you tend to get a bit lax.

Security: I am primarily a business guy (who does some light programming and has known Unix since the 80s) so I hire others to do work for me. I am just thinking that for the people that I have hired in the past, how would anyone know if any of this is happening (other than code audits), and what is the mechanism to make sure the right thing happens even if you know what the right thing is? It's kind of a version of the advice "make backups but make sure that you test your backups as well".


The motto is 'trust but verify', and indeed that goes for your backups as well. And incidentally that's one of the most failed items during the DDs I've done; after verification several companies turned out to have lived without backups at all.

It usually takes two things to go wrong for a disaster to happen: some $0.05 part that fails and a procedural error.

And the consequences can be just about anything.


One of the first books that I read talked about the story of the backup tapes on the car seat that were erased when someone in Sweden (?) with heated seats drove home. (urban legend iirc).


IIRC Saab pioneered heated seats because one of their engineers had colon cancer, and Saabs are pretty common in Sweden, but I'd still wager that's an urban legend: the heating is done with DC current, and to reliably alter the contents of a tape you'd need a much stronger magnetic field to overcome the resistance of the magnetic particles to changing direction (coercivity), and you'd want that field to alternate.


>> Who exactly is the "you" in the above statement and who pays the "you" money to fix this and keep on top of it on an ongoing basis? And for how long?

This is not a negative! This is an upselling opportunity for a retainer. I know someone whose business is warranting other agencies' sites - he patches holes in a site they built but don't see the percentage in maintaining - and he is doing well enough to hire in new folks.

I was doing hashes of jQuery certainly five years ago and I suspect earlier - it's one of those things that seems obvious in the build script.


Users assume the sites are safe and not malicious or open to attack.

There are an enormous number of implicit contracts between us as developers and our users that are not written down but are still "stuff that matters to users".

Ask a bank auditor if they think security holes do not count as "stuff that matters". Then try selling your services to said banks. It matters.


Your view on this screams "Tragedy of the Commons."


I hope you don't develop any websites I use.



