
I hear this argument a lot, and I very much disagree. Now you have browser vendors having to decide which libraries are "popular" and shipping them in the initial download of the browser.

It turns out that this technology already exists in a much better form: it's called the cache. The problem is that almost everyone hosts their own version of jQuery. If everyone simply linked the "canonical" version of jQuery (the CDN link is right on their site) then requiring jQuery will be effectively free because it will be in everyone's cache.
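For illustration, that would mean every site including something like the tag below instead of a self-hosted copy (the version and URL being whatever the jQuery site currently recommends):

    <script src="https://code.jquery.com/jquery-1.12.2.min.js"></script>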

Also, the cache is supported by all browsers, with an elegant fallback. Instead of having to manually check whether your user's browser has the resource you want preloaded, you just link the URL and the best option is used automatically.

TL;DR: Rather than turning this into a political issue, stop bundling resources; modern protocols and intelligent parallel loading allow the cache to solve this problem.




> If everyone simply linked the "canonical" version of jQuery (the CDN link is right on their site) then requiring jQuery will be effectively free because it will be in everyone's cache.

It's not, though. I ran this experiment when I tried to get Google Search to adopt JQuery (back in 2010). About 13% of visits (then) hit Google with a clean cache. This is Google Search, which at the time was the most visited website in the world, and it was using the Google CDN version of JQuery, which at the time was what the JQuery homepage recommended.

The situation is likely worse now, with the rise of mobile. When I did some testing on mobile browsing performance in early 2014, there were some instances where two pageviews was enough to make a page fall out of cache.

I'd encourage you to go to chrome://view-http-cache/ and take a look at what's actually in your cache. Mine has about 18 hours worth of pages. The vast majority is filled up with ad-tracking garbage and Facebook videos. It also doesn't help that every Wordpress blog has its own copy of JQuery (WordPress is a significant fraction of the web), or for that matter that DoubleClick has a cache-busting parameter on all their JS so they can include the referer. There's sort of a cache-poisoning effect where every site that chooses not to use a CDN for JQuery etc. makes the CDN less effective for sites that do choose to.

[On a side note, when I look at my cache entries I just wanna say "Doubleclick: Breaking the web since 2000". It was DoubleClick that finally got/forced me to switch from Netscape to Internet Explorer, because they served broken Javascript in an ad that hung Netscape for about 40% of the web. Grrrr....]


> The vast majority is filled up with ad-tracking garbage and Facebook videos.

There is the problem then, and the solution? I for one don't make bloated sites willy-nilly. I suck at what I do, but at least I love to fiddle and tweak for the sake of it, not because anyone else might even notice; and I like that in websites and prefer to visit those, too. Clean, no-BS, no-hype "actual websites". So I'd be rather annoyed if my browser brought along some more stuff I don't need just because the web is now a marketing machine and people need to deploy their oh-so-important landing pages with stock photos, stock text, and stock product in 3 seconds. It was fine before that, and I think a web with hardly any money to be made in it would still work fine; it would still develop. The main difference is that it would be mostly developed by people you'd have to pay to stay away, instead of the other way around. I genuinely feel we're cheating ourselves out of the information age we could have, that is, one with informed humans.


Interesting data, thanks for sharing.

On top of that, while everyone uses jQuery, everyone uses a different version of it (say, 1.5.1, 1.5.2, ...; probably hundreds of different versions in total).


The problem with caching is that you're sharing the referer with the canonical URL. Another problem is that you're using someone else's bandwidth. And if you combine the two, you can be sure that info about your visitors will be sold, which is why quite a lot of people would prefer to host their own versions of jQuery...


For the referrer problem, you can apply a referrer policy to prevent this, but unfortunately the referrer policy isn't very granular.
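For example, something like the following (the per-element attribute is newer and support varies, so treat this as a sketch of the idea rather than a guaranteed recipe):

    <!-- Document-wide: suppress the Referer header for requests made by this page -->
    <meta name="referrer" content="no-referrer">

    <!-- Per-element variant, where supported -->
    <script src="https://code.jquery.com/jquery-1.12.2.min.js" referrerpolicy="no-referrer"></script>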

Also, for my sites I have a fallback to a local copy of the script. This allows me to do completely local development and to remain up if the public CDN goes down (or gets compromised), with a (usually) small performance impact.
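The usual pattern looks something like this (the local path is just a placeholder for wherever you keep your own copy):

    <!-- Try the public CDN first -->
    <script src="https://code.jquery.com/jquery-1.12.2.min.js"></script>
    <script>
      // If the CDN copy failed to load (offline development, outage, blocked),
      // window.jQuery is undefined, so fall back to a copy on our own origin.
      window.jQuery || document.write('<script src="/js/jquery-1.12.2.min.js"><\/script>');
    </script>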


Couldn't the HTTP cache of your ISP be a good in-betweener in that case? Would it send the referer to the canonical URL?


Not for sites using TLS. The only option for secure sites would be a CDN. That is, an HTTP cache in a relationship with the content publisher rather than the subscriber.


The problem with hosting JS libraries on CDNs is that the cache has a network effect.

You only gain performance if the browser already has this specific version from this specific CDN in its cache. If it doesn't, you end up losing performance, because now an additional DNS lookup needs to be performed and an additional TCP connection needs to be opened.

Here are a few reasons people choose to avoid CDNized versions of JS libraries. http://www.sitepoint.com/7-reasons-not-to-use-a-cdn/

This is a 6 year old post, but it raises some valid concerns: https://zoompf.com/blog/2010/01/should-you-use-javascript-li...


The reason I prefer to use a CDN is that it's a game theory example come to life: if everyone used the CDN version, then any user coming to your site would most likely have it in their cache and performance would go up; but if you use the CDN version and your competitors don't, their performance is slightly better than yours, and so on and so forth. Game theory indicates that in most games of this sort cooperation is better than non-cooperation.

And really, if you are using one of the major libraries and a major CDN (Google, jQuery, etc.), over time your users will end up having the stuff in their cache, either from you or from other sites having used the same library version and CDN.

I suppose someone has done a study on the spread of libraries and CDNs among users, so that you could figure out the chance that a user coming to your site will have a specific library cached. There's this: http://www.stevesouders.com/blog/2013/03/18/http-archive-jqu... but it's 3 years old; really, this information would need to be maintained at least annually to tell you which CDN would be the top choice for a given library.


But there isn't one CDN, or one version... if you need two libraries, and the canonical CDN for jQuery is one host while your required extension is on another, that's two DNS lookups, connections, request cycles, etc.

So you use the one that has both, but then at least one of them is not canonical, which means more cache misses. That doesn't even count the fact that there are different versions of each library, each with its own uses and distribution, and the common CDN approach becomes far less valuable.

In the end, you're better off composing micro-frameworks and building it yourself, though this takes effort... React + Redux with max compression in a simple webpack project comes to about 65 KB for me, before actually adding much to the project. Which isn't bad at all... if I can keep the rest of the project under 250 KB, that's less than the CSS + webfonts. It's still half a MB though... just the same, it's way better than a lot of sites manage, even with CDNs.


That's 2 DNS lookups only for a user who hasn't already done them somewhere in the past and had the results cached.

The question then is how likely are they to have done that in regards to your particular cdn and version of the library.

I agree that having lots of possible CDNs, versions and so forth decreases the value of the common CDN approach, but there are at least some libraries that have a canonical CDN (jQuery, for example), and not using it is essentially being the selfish player in a game-theory-style game.

Since I don't know of any long-running tracking of CDN usage that lets you predict how many of your visitors are likely to have a popular library in their cache, it's really difficult to talk about this meaningfully (I know there are one-off evaluations done at a single point in time, but that's not really helpful).

Anyway it's my belief that widespread refusal to use CDN versions of popular libraries is of course beneficial in the short run for the individual site but detrimental in the long run for a large number of sites.


The latency of a new request, as mentioned in one of those articles, is the main reason why I self-host everything.

Since HTTPS needs an extra round trip to start up, it's now even more important not to CDN your libraries. The average bandwidth of a user is only going to go up, while their connection latency will remain the same.

If you are making a SaaS product that businesses want, using CDNs also makes it hard to offer an enterprise on-site version, as those customers want the software to have no external dependencies.


This might make sense if all of your users are located near your web servers and you can comfortably handle the load of all the requests hitting them.

If the user making the request is in Australia, for example, and your web server is in the US, the user is going to be able to complete many round trip requests to the local CDN pop in Australia in the time it takes to make a single request to your server in the US.

Latency is one of the main reasons TO use a CDN. A CDN's entire business model depends on making sure they have reliable and low latency connections to end users. They peer with multiple providers in multiple regions, to make sure links aren't congested and requests are routed efficiently.

Unless you are going to run datacenters all around the world, you aren't going to beat a CDN in latency.


If the only thing you have on the CDN is libraries, it's faster to have your site host them, even if it's on the other side of the world. Once HTTP/2 server push is widely supported, the balance tips even further toward hosting locally, as you can start sending your libraries right after the initial page without waiting for the browser to request them.
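As a sketch of what that looks like in practice: several HTTP/2 servers will initiate a push for resources advertised in a preload Link header on the page response (support and configuration vary, and the path here is just a placeholder), so the library is already on the wire before the browser finishes parsing the HTML:

    Link: </js/jquery-1.12.2.min.js>; rel=preload; as=script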

If you are using a CDN for images/video, then yes, you would have savings from using a CDN since your users will have to nail up a connection to your CDN anyways.

Then again a fair number of the users for the site I'm currently working on have high latency connections (800ms+), so it might be distorting my view somewhat.


Or you use a CDN in front of your site, caching the content under your domain. But certainly something to be aware of.


This is why I recommended using the CDN recommended by the project; most do recommend one, for example jQuery has its own CDN.

As for adoption, that is very much a chicken and egg problem.


Even then, different versions will have their own misses... not to mention that 3rd-party libraries on another CDN mean another DNS hit.

DNS resolution time is a pretty significant impact for a lot of sites.


Ya, I would have to agree with you, tracker. Every 3rd-party dependency introduces another DNS lookup. The whole point behind using a CDN effectively, besides lowering latency, is to reduce your DNS lookups to a bare minimum. For example, I use https://www.keycdn. They support HTTP/2 and HPACK header compression with Huffman encoding, which reduces the size of your headers.

The benefit of hosting, say, Google Fonts, Font Awesome, jQuery, etc. all with KeyCDN is that I can take better advantage of parallelism over one single HTTP/2 connection. Not to mention I have full control over my assets to implement caching (Cache-Control), Expires headers, ETags, easier purging, and the ability to host my own scripts.


What if you accepted the cache hit when the checksum agrees and fetched your own copy when it doesn't? Maybe the application should get to declare a canonical URL for the JS file instead of the browser? So, something like:

    <script src="jQuery-1.12.2.min.js" authoritative-cache-provider="https://ajax.googleapis.com/ajax/libs/jquery/1.12.2/jquery.m... sha-256="31be012d5df7152ae6495decff603040b3cfb949f1d5cf0bf5498e9fc117d546"></script>

Would this cause more problems than it would solve? I'm assuming disk access is faster than network access.

I'm concerned about people like me who use noscript selectively. How easy is it to create a malicious file that matches the checksum of a known file?


>How easy is it to create a malicious file that matches the checksum of a known file?

I'd say not easy at all, practically impossible.

https://en.wikipedia.org/wiki/Preimage_attack


> I'm concerned about people like me who use noscript selectively. How easy is it to create a malicious file that matches the checksum of a known file?

SHA-256? Very, very, very, very hard. I don't believe there are any known attacks for collisions for SHA-256.


I think even a single collision (any collision) has yet to be found.


People make too big a deal of this collision stuff; a lot of these attacks are very theoretical and would require tremendous computation. Anyway, for this use case, even with MD5, how likely is it really that someone could make a useful malicious file that collides with a particular known and widely used one? I dunno, seems pretty unlikely.


And if you worry about that, you can always use SHA-384. A side benefit is that SHA-384 is faster on a 64-bit processor.


It would be interesting if browsers started implementing a content-addressable cache: as well as caching resources by URI, also cache them by hash. Then SRI-tagged requests could be served from cache even if the URL was different.

Of course this would need a proposal or something but it would be interesting to consider.
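A rough JavaScript sketch of the idea (hashCache and matchesIntegrity are hypothetical helpers standing in for browser internals, not real APIs):

    // Hypothetical: consult a cache keyed by SRI hash before touching the network,
    // so any earlier download with the same hash satisfies the request regardless of URL.
    async function loadSubresource(url, integrity) {
      const cached = await hashCache.get(integrity);      // e.g. "sha256-31be01..."
      if (cached) return cached;                           // hit: no request needed
      const body = await (await fetch(url)).arrayBuffer();
      if (!(await matchesIntegrity(body, integrity))) {    // verify before trusting or caching
        throw new Error('subresource integrity check failed');
      }
      await hashCache.put(integrity, body);
      return body;
    }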


Plan9's Venti file storage system is content addressable.

http://plan9.bell-labs.com/sys/doc/venti/venti.html

Also available on *nix


> How easy is it to create a malicious file that matches the checksum of a known file?

As others have pointed out, it's quite difficult. But here's another way to think about it: if hash collisions become easy in popular libraries, the whole internet will be broken and nobody will be thinking about this particular exploit.

Servers won't be able to reliably update. Keys won't be able to be checked against fingerprints. Trivial hash collisions will be chaos. Fortunately, we seem to have hit a stride of fairly sound hash methods in terms of collision freedom.


This vaguely reminds me of the Content Centric Networking developed by PARC. There's a 1.0 implementation of the protocol on GitHub (https://github.com/PARC/CCNx_Distillery). A CCNx-enabled browser could potentially get the script from a CCN by referring to its signature alone (be it a SHA-256 checksum or otherwise).


This seems a little redundant - why not just the following?

    <script src="jQuery-1.12.2.min.js" sha-256="31be012d5df7152ae6495decff603040b3cfb949f1d5cf0bf5498e9fc117d546"></script>
If you wanted to explicitly fetch from Google when the client doesn't have a cached copy, then instead do:

    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.2/jquery.min.js" sha-256="31be012d5df7152ae6495decff603040b3cfb949f1d5cf0bf5498e9fc117d546"></script>
The first would seem preferable though, as loading from an external source would expose the user to cross-site tracking.


> The first would seem preferable though, as loading from an external source would expose the user to cross-site tracking.

You're right that the first one you had, with just the sha-256, would be pretty much equivalent to what I had, especially given that HN readers have resoundingly supported the idea that it is non-trivial to create a malicious file with the same hash as our script file. I was simply trying to be cautious and retain some control for the web application (even if the extra sense of security is misplaced).

This is the use case I'm trying to protect by adding a new "canonical" reference that the web application decides. As others in this thread have said, it is very unlikely that someone will be able to craft a malicious script with the same hash as what I already have. The reason I still stand by including both is, firstly, compatibility (I hope browsers can simply ignore the sha-256 hash and the authoritative cache links if they don't know what to do with them).

As a NoScript user, I do not want to trust y0l0swagg3r CDN (just giving an example, please forgive me if this is your company name). NoScript blocks everything other than a select whitelist. If the CDN happens to be blocked, my website should still continue to function, loading the script from my server.

My motivation here was to allow perhaps even smaller companies to sort of pool their common files into their own CDN: <script src="jimaca.js" authoritative-cache-provider="https://cdn.jimacajs.example.com/v1/12/34/jimaca.js""></scri... I also want to avoid a situation where Microsoft can come to me and tell me that I can't name my JS files microsoft.js or something. The chance of an accidental collision is apparently very close to zero, so I agree with you that there is room for improvement. (:

This is definitely not an RFC or anything formal. I am just a student and in no position to actually effect any change or even make a formal proposal.


If accompanied by the exact same sha-256 hash idea, loading from any external source cannot expose the user to any additional risk.

SHA + CDN url list (for whitelisting/reliability purposes - public/trusted, and then private for reliability) would be ideal.


> The problem is that almost everyone hosts their own version of jQuery.

Any site that expects users to trust it with sensitive data should not be pulling in any off-site JavaScript.

As for checksumming, browser vendors don't need to pre-load popular JavaScript libraries (though they might choose to do so, especially if they already ship those libraries for use by their extensions and UI). But checksum-based caching means the cost still only gets paid by the first site, after which the browser has the file cached no matter who references it.


Yes but no.

    - jquery.com becomes a central point of failure and attack;
    - jQuery gets to be the biggest tracker of all time;
    - the cache does not last forever. With pages weighing 3 MB every time they load, after 100 clicks (not much) I've invalidated 300 MB of cache. If Firefox allows 2 GB of cache (a lot for a single app), then by the end of the day all of my cache has been evicted.


> If everyone simply linked the "canonical" version of jQuery (the CDN link is right on their site) then requiring jQuery will be effectively free because it will be in everyone's cache.

So create one massive target that needs to be breached to access massive numbers of websites around the world?

Imagine if every Windows PC ran code from a single web page every time it started up. Now imagine if anything could be put in that code and it would be run. How big of a target would that be?

While there are cases where the performance is worth using a CDN, there are plenty of reasons to not want to run foreign code.

(Now maybe we could add some security, like generating a hash of the code on the CDN and matching it with a value provided by the website and only running the code if the hashes matched. But there are still business risks even with that.)


The solution to this is including a checksum with the link to the file, and if the checksum doesn't match, don't load the file.

See https://developer.mozilla.org/en-US/docs/Web/Security/Subres... though it isn't universally supported yet.
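The tag ends up looking roughly like this (the integrity value below is a placeholder, not the real hash; crossorigin is required for cross-origin files):

    <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.12.2/jquery.min.js"
            integrity="sha384-[base64 hash of the exact file goes here]"
            crossorigin="anonymous"></script>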


Just so I understand: I pull the file and make a checksum, then hardcode it into the link to the resource in my own code? Then, when the client pulls my code and follows the link, it checks the file against the checksum I included in the link?


Yes. It's very simple. I don't know why more library providers don't have it in their copyable <script> snippets.
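For example, with Node's built-in crypto module you can generate the value along these lines (the file name is whatever copy you're pinning):

    // Compute a Subresource Integrity value for a local copy of the library.
    const crypto = require('crypto');
    const fs = require('fs');

    const file = fs.readFileSync('jquery-1.12.2.min.js');
    const digest = crypto.createHash('sha384').update(file).digest('base64');
    console.log('sha384-' + digest);  // paste into the script tag's integrity attribute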


I agree with you that putting jQuery in the browser is a bad idea, but the second part of the argument, that the browser will have the libraries in cache, is not really that reliable. Here are some notes I collected on the subject: http://justinblank.com/notebooks/BrowserCacheEffectiveness.h....


I was definitely speaking whimsically, because it is a huge chicken-and-egg problem. But theoretically, if every site that used jQuery referenced a version at https://code.jquery.com/, there would be a very good hit ratio, even considering different versions. However, we are very, very far away from that.


That may seem like a good solution for some sites, but the name of the page, the requestor's IP address, and other information are also 'leaked' to jquery.com. This is not always welcome. For example, a company has an acquisition-tracking site (or other legal-related site) and the names of the targets are part of the page name (goobler, foxulus, etc.), which gets sent as the referrer, along with the IP address, to jquery.com or other third-party sites/CDNs. While not a security threat, you may unwittingly be recommending an unwanted information leak.


But it's only leaked when you don't have it cached. Otherwise the client doesn't even make the request.


There's the Firefox addon Decentraleyes (https://github.com/Synzvato/decentraleyes), which is trying to solve those problems. Currently only the most common versions of each library are bundled, but there are plans to create more flexible bundles.

There's no reason to hit the web server with an If-Modified-Since request when the libraries already include their version in the path.


If everyone linked the "canonical" version of jQuery, then that location would be worth a considerable amount of money to any sort of malicious actor. Just getting your version there for a few minutes could rope in many thousands of machines. Look at China's attack on GitHub earlier in the year for an example of the damage that sort of decision could do.



