That upgrader using binary differences (courgette) is impressive. From 10 megabytes to 78 kilobytes. I wonder why Linux distributions such as Ubuntu still download the entire new packages on an upgrade. A lot of upgrade time and bandwidth could be saved by only sending the differences. And it would reduce load on the mirror sites.
Edit: did a bit of looking around and it seems to be planned for Oneiric Ocelot
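For a feel of how delta updates work, here's a minimal copy/insert sketch in Python using the standard library's SequenceMatcher on raw bytes. Real tools like courgette and bsdiff are vastly more sophisticated (courgette even disassembles the binary first), but the core idea is the same: ship only literal new bytes plus instructions to copy ranges from the old file.

```python
from difflib import SequenceMatcher

def make_delta(old: bytes, new: bytes):
    """Encode `new` as copy-from-old ranges plus literal inserts."""
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, old, new).get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))     # reuse bytes already on disk
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))    # ship only the new bytes
        # "delete": nothing to emit; those old bytes are simply not copied
    return ops

def apply_delta(old: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += old[start:start + length]
        else:
            out += op[1]
    return bytes(out)

old = b"the quick brown fox jumps over the lazy dog"
new = b"the quick red fox leaps over the lazy dog"
delta = make_delta(old, new)
assert apply_delta(old, delta) == new
```

The literal bytes shipped here are far smaller than the new file, which is where the 10 MB to 78 KB savings comes from.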
> I wonder why Linux distributions such as Ubuntu still download the entire new packages on an upgrade. A lot of upgrade time and bandwidth could be saved by only sending the differences. And it would reduce load on the mirror sites.
Speaking as someone who has worked on this problem for my own projects [1], I think I can answer this.
Fedora/Yum already supports downloading a binary diff between RPM packages to reduce download size, but this also requires keeping a cache of previous RPMs to run the patch against. There are multiple reasons why you can't rely on binary diffs against the files actually stored on the system, most notably files like /etc/* that have more than likely been modified since installation.
But the real problem with binary diffs is that unless you're doing what Google does to ensure that people stay up to date, the number of binaries you need to diff against grows very quickly, and there are a lot of edge cases to take care of.
For example, let's assume some package A has been released as version 1, 2, and 3. When A has a new release 4, you obviously want to build a diff against release 3, but then you also most likely need or want to build a diff against 2 and maybe even 1 to take care of people who haven't already upgraded to 3. And even if you build a diff against every single version ever released, you will still always need to provide a full version of the package as well for two cases:
1. New installations, or reinstallations, of the package.
2. When the user has cleared their package cache to save room.
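As a rough sketch, the publishing policy described above -- diff against the last few releases, but always ship a full package too -- might look like this (the function and artifact names are mine, not any real packaging tool's):

```python
def plan_artifacts(versions, new_version, diff_depth=2):
    """Given prior versions (oldest first), decide what to publish for
    `new_version`: a full package, plus diffs against the most recent
    `diff_depth` releases. Anyone outside that window (fresh installs,
    cleared caches, very old versions) falls back to the full download."""
    artifacts = [("full", new_version)]
    for old in versions[-diff_depth:]:
        artifacts.append(("diff", old, new_version))
    return artifacts

# Releasing version 4 when 1, 2 and 3 already exist:
print(plan_artifacts([1, 2, 3], 4))
# [('full', 4), ('diff', 2, 4), ('diff', 3, 4)]
```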
And even beyond that, creating diffs involves a lot more effort and knowledge on the part of the packaging team because they not only need to know how to build those diffs, but they also need to keep track of old package versions to build those diffs against.
The end result is that you trade download bandwidth and time on the part of the server and end users for a lot of effort, time, and storage space on the part of the packagers and distro mirrors. For mirrors that are already encroaching on 50GB for a single release of Ubuntu and/or Fedora, adding a whole bunch of binary diff packages will most likely grow the repository size by at least 30-50%, if not more, depending on how many old versions you diff against.
The question then becomes: does this trade off actually make sense, or does it present further roadblocks for contribution from packagers and donated mirrors?
[1]: If you would like to see how I handled this sort of task, I have a Python library I wrote to handle the client side updating. I know it's not the entire piece of the puzzle because it doesn't cover generating the updates, but it might be useful for someone else. http://github.com/overwatchmod/combine
> When A has a new release 4, you obviously want to build a diff against release 3, but then you also most likely need or want to build a diff against 2 and maybe even 1 to take care of people who haven't already upgraded to 3.
You wouldn't need a diff for every combination (1->2, 1->3, 1->4, 2->3, 2->4, 3->4 in the four-version case), just the three diffs 1->2, 2->3 and 3->4. Then if someone has v2 you send out the diffs for 3 and 4 and have the client apply both in order to produce the updated package. The saving in space and number of diffs stored will grow as the number of versions grows.
This will be a little less efficient in terms of bandwidth use on average when people are skipping a couple of versions, but will make little or no difference if people are upgrading in a timely manner (so are only moving in single version steps most of the time) and will save space over storing diffs between all versions.
As well as the diffs I would store checksums for each version and have the client send the checksum for the version they have just-in-case, to avoid sending a diff (sending the file package instead) if the reference file seems corrupt.
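A sketch of that client-side flow, combining the chained single-step diffs with the checksum guard (the fetch_*/apply_diff callables and the checksums map are placeholders, not any real package manager's API):

```python
import hashlib

def upgrade(local_blob: bytes, local_version: int, target_version: int,
            checksums: dict, fetch_diff, fetch_full, apply_diff):
    """Apply single-step diffs local_version -> ... -> target_version.
    `checksums` maps version -> expected sha256 hex digest. If the local
    copy (or any intermediate result) doesn't match, fall back to a
    full download rather than patching a corrupt reference file."""
    if hashlib.sha256(local_blob).hexdigest() != checksums[local_version]:
        return fetch_full(target_version)      # local copy corrupt: full download
    blob = local_blob
    for v in range(local_version + 1, target_version + 1):
        blob = apply_diff(blob, fetch_diff(v - 1, v))
        if hashlib.sha256(blob).hexdigest() != checksums[v]:
            return fetch_full(target_version)  # bad patch result: bail out
    return blob
```

The per-step verification also catches a broken diff in the middle of the chain, not just a corrupt starting file.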
Also if you store diffs for both directions you can serve old versions of packages (in case people need to roll back due to some unexpected incompatibility, or they are developers needing to build a test environment with older library versions) without storing every version completely. This increases the diffs per package, but not nearly as much as storing one diff between every pair of versions (for 11 versions, 1 diff per change is 10, diffs in both directions is 20, and diffs between all pairs total 55, or 110 for both directions; for 21 versions those numbers are 20, 40, 210 and 420).
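The combinatorics are easy to check: for n stored versions, a single-step chain needs n-1 diffs, both directions doubles that, and one diff per pair of versions is n(n-1)/2 (again doubled for both directions):

```python
def diff_counts(n_versions: int):
    """Diffs needed for n stored versions of one package, under
    each storage strategy discussed above."""
    chain = n_versions - 1                            # 1->2, 2->3, ...
    both_ways = 2 * chain                             # each step, both directions
    all_pairs = n_versions * (n_versions - 1) // 2    # one diff per version pair
    return chain, both_ways, all_pairs, 2 * all_pairs

print(diff_counts(11))  # (10, 20, 55, 110)
print(diff_counts(21))  # (20, 40, 210, 420)
```

The gap between the chain and the all-pairs strategies grows quadratically, which is the whole argument for chaining.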
In my experience, diff + diff + diff = super-long update process. It was done for many game updates up to several years ago, and an update from the box version through 3 patches could take up to an hour. The install, meanwhile, would take 20 minutes or so at most.
edit: not that I don't think this can be improved, nor that the updating software they used was any good. Just sayin'.
Often with game updates, to get around the fact that patching the compressed multi-asset files directly would be inefficient (a change early in the file means everything after it needs patching, unless the developers had the foresight to use something like gzip's rsync-friendly option), the patcher would unpack the compressed resource files, patch them, then recompress. Depending on how granularly the assets are distributed among the installed files and how many of them the patch touched, this could be a lot slower than just reading the compressed file from CD to hard drive, which is what the installer would do.
Of course this means that if you use the multi-diff method of update distribution you would need to be careful about your choice of compression arrangement to avoid the same inefficiency (unless the saving in bandwidth for the client and the package storage servers is far more important than a bit of extra time spent on updates client-side).
I expect those game updates didn't use bindiffs; they had to repackage the entirety of each updated file. When you consider that many developers compress all game content files into a few large archives, it's easy to see how that adds up: you end up re-downloading the same content repeatedly when they repeatedly make small changes to a few scattered files in one large archive.
That makes a good case for some of the existing mirrors to provide 'diffserv'-like facilities. A small daemon could do that server-side, and the client may just need a hash of the prior package's binary file.
Then it downloads the diff and the new file hash :-)
It'd be nice to have diffs available for all earlier versions, but it doesn't need to be mandatory. A simple way to get started would be, for example, to make diffs available for the last version released with Ubuntu 11.04 as well as the last security update on top of 11.04 -- which I guess should take care of the majority. If people are running older versions of the packages, or if they haven't applied security updates regularly, they download the complete binary.
That's exactly how I handled it for my project, but it's still a lot more complicated than just always downloading a full package. Which is exactly what I was trying to get across. It's not impossible, just a lot more complicated on every part of the equation.
I think you can use the files stored on the system itself in many cases, at least for binaries. /etc/ and other configuration is an exception, which you could special-case. As config files are generally small files, this is no problem.
Of course you should check whether the file you are going to patch is the file you assume it is, but this is easily built-in to binary diff using a hash.
If a file doesn't match, err on the safe side and simply fetch the entire package.
You mentioned storing diffs against every previous version. In principle, couldn't you, when a new package is pushed to the repository:
1. Diff against the previous version and store the diff.
2. Delete the previous version.
You could upgrade from any previous version by applying all the diffs in sequence, and you only need to keep one full version around. You could also discard diffs after a certain date because, as you point out, the worst case is that the full version is used instead.
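A toy version of that push-and-upgrade cycle might look like this, where make_diff/apply_diff stand in for a real binary-diff tool such as bsdiff (the class and its methods are hypothetical, for illustration only):

```python
class DiffRepo:
    """Toy repository keeping one full copy (the newest release) plus a
    chain of forward diffs, per the scheme sketched above."""

    def __init__(self, make_diff, apply_diff):
        self.make_diff, self.apply_diff = make_diff, apply_diff
        self.full = None    # (version, blob) of the latest release only
        self.diffs = {}     # (old_version, new_version) -> diff blob

    def push(self, version, blob):
        if self.full is not None:
            prev_version, prev_blob = self.full
            # Store only the forward diff; the old full package is dropped.
            self.diffs[(prev_version, version)] = self.make_diff(prev_blob, blob)
        self.full = (version, blob)

    def upgrade(self, have_version, have_blob):
        """Rebuild the latest release by applying the diffs in sequence."""
        latest, _ = self.full
        blob = have_blob
        for v in range(have_version, latest):
            blob = self.apply_diff(blob, self.diffs[(v, v + 1)])
        return blob
```

Clients too far behind (or with no cached copy at all) just take `repo.full`, which is the fallback you mention.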
I guess this would increase disk space for the mirrors, but even 100GB wouldn't be a lot of disk space, and the savings for (presumably more expensive) internet data transfer would be a lot bigger?
> You mentioned storing diffs against every previous version
I didn't say you need to. E.g., with my software project, I wrote an update system that supported diff upgrading against the two latest versions, and anyone still running an older version had to download a full update.
> You could upgrade from any previous version by applying all the diffs in sequence
At that point you also need to make sure that you aren't downloading more in the process of applying a series of patches than you would need to download for a full update, which also means you need to start being aware of multiple update options, which balloons the complexity of your update code.
> At that point you also need to make sure that you aren't downloading more in the process of applying a series of patches than you would need to download for a full update, which also means you need to start being aware of multiple update options, which balloons the complexity of your update code.
The obvious way to handle this is to store full package "snapshots" on "major" version releases (probably upstream releases for packages with lots of local patching, or "one level up from the bottom" releases) and diffs in between. That is not a lot of code if you already have a sane way of managing release numbers within your package manager, which you hopefully do.
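Under that scheme, the client-side decision stays small: total up the diff chain you'd need and compare it to the nearest full snapshot (the sizes below are made up for illustration):

```python
def choose_download(diff_sizes, full_size):
    """Pick the cheaper plan: apply the chain of diffs from the client's
    version to the latest, or grab the full snapshot. `diff_sizes` are
    the byte sizes of each single-step diff the client would need."""
    chain_total = sum(diff_sizes)
    if chain_total < full_size:
        return ("chain", chain_total)
    return ("full", full_size)

# A client three versions behind; per-step diffs vs. an 80 MB snapshot:
print(choose_download([5_000_000, 12_000_000, 7_000_000], 80_000_000))
# ('chain', 24000000)
```

This addresses the earlier worry about a patch series costing more than a full download: the comparison is one sum and one branch, not a ballooning set of update options.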
> At that point you also need to make sure that you aren't downloading more in the process of applying a series of patches than you would need to download for a full update
Well you wouldn't really need to but it would be a good sanity check on the client side.
The common case could be that people are simply keeping up with the stable edge w/o patching binaries themselves out of band -- that's the case with some desktop Linux variants.
I agree that this is a major failing of the mac app store as it currently stands.
Given that for updates there is a high likelihood that the previous version is on-disk, sending some type of diff would be very beneficial to end users. Even if it is a diff for only the previous to the current version, with a full download required outside that window.
For the record, pacman already supports diffs like this, but Arch has not set up official mirrors that host binary diffs instead of full packages. There has been at least one third-party repository that hosted diffs for Arch Linux packages.
I too think this should be a much higher priority than it is for many. Fedora has had (non-default) support for this for years, but that's about it. You shouldn't worry so much about diffing against previous versions -- if you diff against the last two versions, it won't use much extra disk space, and the worst case scenario is that someone has to download the full package as a fallback, which everyone has to do now.
Lately, I've been less bothered by the download sizes of updates, and more bothered by the update-install times. It takes under a minute to download the usual 100-1000MB update, but then it's 5-15min to install it, be it Ubuntu, PS3 firmware or Xcode 4. Providing bigger stuff on readily usable squashfs images instead of tarballs, even if vastly bigger, might actually make my update times shorter.
Maybe people are looking wrong at Chrome version numbering. Take GNU Emacs for instance. At some point the developers realized that their software would never be the subject of a change in nature big enough to change the major version number, so they ditched it. Now we have Emacs 23 but it's actually Emacs 1.23, and nobody complains.
I think it's really a non-issue and it's not really worth talking about: Chrome just doesn't display the '1.' (or '0.' depending on your view point ^^) in front of its version number :-).
Yeah, it's more a release number. It makes sense to use a single positive integer and simply increment it every release.
Having said that, though, I quite like Semantic Versioning[1]. The advantage it has over a single incrementing counter is that you know when API compatibility changes.
Right. Think of a new version of some software that drops support for a deprecated part of an API in a minor release. I did get screwed by that happening with a Latex package, xy-pic, some years back.
Yeah. Painful. That's the kind of thing semantic versioning aims to solve: within a major release, all minor and patch releases must maintain backwards compatibility. Patch releases may not change the API at all; minor releases may add to it but not remove things[1]. Major releases may drop backwards compatibility by removing or modifying API items. Seems like a reasonable system to me and would avoid the problem you mentioned.
[1] I guess it can deprecate things, as long as they are still available for use.
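That contract is exactly what makes the version number machine-checkable. A minimal sketch (my own helper, not any real semver library, and it ignores pre-release/build suffixes):

```python
def parse(v: str):
    return tuple(int(x) for x in v.split("."))

def compatible(installed: str, candidate: str) -> bool:
    """Per Semantic Versioning: within the same major version the public
    API may only have grown, so upgrading is safe; a major bump may
    remove or change API and needs review. Major version 0 is excluded,
    since semver makes no stability promises before 1.0.0."""
    i, c = parse(installed), parse(candidate)
    return i[0] == c[0] and i[0] > 0 and c >= i

assert compatible("1.2.3", "1.4.0")      # minor bump: additive only
assert not compatible("1.9.9", "2.0.0")  # major bump: may break callers
assert not compatible("0.3.0", "0.4.0")  # pre-1.0: anything may change
```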
Semantic Versioning looks like a really nice solution. At the moment our internal versioning philosophy is A.C.F - adding.changing.fixing but I can see that something more robust has its uses.
This is exactly the sort of visionary engineering needed to break the field into the next stage. This isn't just a quantitative difference, it's a revolutionary qualitative difference!
Our online infrastructure is broken in ways we're dimly aware of, because it has always been that way. In the same way that people trying to do business demand network, electric, and roadway infrastructure that once didn't exist, we will someday demand software infrastructure with features that do not exist today.
Chief among these will be security features. If Google plays their cards correctly, they can create an ecosystem that stays ahead of the black-hat hackers. By correctly incentivizing white-hat hackers, they could expose and patch security holes fast enough to ruin the economics of the black-hats. This infrastructure will enable Google to make more money, resulting in a virtuous cycle.
If the infrastructure can be extended to the server-side, with web app frameworks that receive security updates with equal rapidity, then Google can establish a secure, smoothly running "toll road" -- an infrastructure subset relatively free from problems faced by the rest of the net. That could be worth billions.
(We'll know this strategy is winning if/when Microsoft starts doing it too. Once that happens, we'll be in a new era of computing.)
There's a bicycle shop in the area called "virtuous cycles" and this is the first time I've realized that they mean the antonym of "vicious cycle". I always thought they were just vaguely religious.
They already do it with Automatic Updates. Turn the update dial to 11 and let your machine apply them at night. I don't believe they provide binary diffs for updates, but I believe it's for logistical reasons rather than technological (e.g., title updates over XBL are surprisingly small).
Of course, MS also hasn't figured out how to update components in-place while they're being used, so expect your machine to be restarted in the morning. :-/
They already do it with Automatic Updates...Of course, MS also hasn't figured out how to update components in-place while they're being used, so expect your machine to be restarted in the morning. :-/
The part before the ellipsis is contradicted by the part afterwards. Also, Microsoft will have to get the patch to the update mechanism very quickly as well. Their record with that has also been poor. Otherwise, they will not meet a goal of keeping ahead of the black hats.
Just because your military has tanks, jet fighters, and assault rifles, it doesn't mean they're on the same level as everyone with the same equipment. There are significant organizational factors at play.
> Somehow, we have to be able to automatically update software while it is running without interrupting the user at all. Not if -- but when -- the infinite version arrives, our users probably won't even know.
For what it's worth, this is already available in Erlang (although it was built in for different reasons, closer to getting the fluidity of web applications updates on just about any server software): two versions of the same code can live in parallel in the VM, and there are procedures for processes to update to "their" new version without having to restart anything (basically, you switch functions mid-flight and the next time an updated function is called the right way, the process just switches to the new code path).
You need to follow a few procedures and may have to migrate some state, but by and large it's pretty impressive. And it could certainly be used for client-side software. The sole issue I'd see would be the updating of a main GUI window in-flight (how do you do that without closing and re-opening it?). But I doubt that changes much in e.g. Chrome these days.
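Erlang's mechanism has no direct equivalent in most runtimes, but Python's importlib.reload gives a rough (and much weaker) analogue of the "switch on next call" behavior described above. The `handlers` module in the usage sketch is hypothetical:

```python
import importlib
import types

def hot_swap(module: types.ModuleType):
    """Rough analogue of Erlang's code swap: re-execute the module's
    source in place. Code already running keeps its old bound functions,
    but the next lookup through the module object picks up the new
    version -- i.e. the next time an updated function is called the
    right way, execution switches to the new code path."""
    return importlib.reload(module)

# Sketch of use (assumes a real module named `handlers` on sys.path):
# import handlers
# hot_swap(handlers)
# handlers.handle_request(...)   # now runs the freshly loaded code
```

Unlike Erlang, Python gives you no two-versions-in-parallel guarantee and no supervised state migration; long-lived objects holding references to old functions keep running the old code indefinitely.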
I've caught some flack on HN here recently for saying it's past time to move beyond C, but this is a great example from the real world. Yes, you can update C code live with the proper magic invocations, but you have to be a wizard and even then probably still a bit lucky for it all to work. Or you can build an infrastructure that at the most primitive layer contains this ability, and then create systems where the programmer is encouraged to maintain the invariants that permit this update, and it may still not be trivial and still requires thought but no longer requires a wizard, so maybe it will actually happen.
And there are just so many features like this that we need to get all the way down to the OS level before we can fully harness them, and we aren't going to get them in C. We need an upgrade of our fundamental programming primitives.
Following ideas from both Erlang and Android, each GUI screen within the application would be attached to a single process. When a new screen is opened the old process is killed and a new process starts. If new code has been loaded, the new process will use it.
Screenwise is not really a problem, you can just delete and reload the controls the next time the thing becomes visible. To get back to chrome, you could even swap the JS engine and DOM: leave the old one running in existing tabs, switch new or reloaded tabs to the new one, that's pretty much seamless.
The issue I have is for the "static" chrome around the mobile parts: title bar, URL bar, that kind of stuff. There isn't much opportunity to switch that to a new process, I think. Especially if there are changes to make to the UI.
Fortunately, those don't actually need to be changed very often (no matter what the Chrome team says). As long as the rendering, JavaScript, and other parts that touch untrusted data stay up to date, you're probably fine with a URL bar that hasn't had today's unfashionable components removed.
I did note that in my original comment, but you need to note the following: if the software never needs to be restarted, you will find somebody who never restarts it. The result is that sections of the code base may get out of sync, resulting in nonsensical behavior. 6 months from now, somebody will push code which does not work anymore with right now's awesomebar (or whatever), and the results may be minor or may lead to significant loss of state and information.
Chrome already automatically adds/removes browserAction buttons for syncing on the fly. I don't see any reason Chrome couldn't wait to be inactive for 5 minutes and just change it. Most people would never notice the faint flicker.
There are disadvantages to constant, automatic updates.
I had a call from someone who'd been using Chrome to regularly print a web page, and one day it just stopped working. The site hadn't changed, but for whatever reason the latest version of Chrome just didn't render it.
And of course trying to install an older version of Chrome was quite difficult.
(In Google's case they do now have a way to disable the updates, but not all software is so good about it)
I experienced this yesterday. My website renders differently now than it used to on Chrome last month. Firefox and IE still look "correct". I used to think the auto-update feature on Chrome was great, but now I'm not sure. I can see why some companies still stick to IE 6 internally. It's stable.
I stopped looking at Chrome's version numbers (unless I have a specific issue or question about Chrome) back around 9. That's because 9 was the last development version I used... The features I need are all in the stable release now. When 10 came out, my 9-dev turned into 10-stable and I didn't pay attention from there.
At this point, I don't even bother 'updating' (read: close the browser and open it again) for up to a week or 2 after an update comes out, unless I need to close my browser for some other reason.
The Canary build goes out automatically without being looked at by a human, so there's a very real possibility that it will be unstable. Currently there's some strange crasher in the PDF viewer, for example.
You can install Canary "side-by-side" with another channel, so you can switch back to something more stable if canary goes pear-shaped.
Apple's App Store, for both Mac and iOS, could learn a thing or two from this; their software update experience is awful, sometimes requiring you to re-download whole multi-gigabyte apps for minimal updates.
Microsoft actually went to great lengths to build an update mechanism that doesn't require reloading. It seems this is not so useful after all, and it's not being used: http://jpassing.com/2011/05/01/windows-hotpatching/
There are software systems which do get updated while running though, but perhaps it requires a change in software architecture more than just (very clever) diff tools. Erlang systems, for instance, can have the concept of hot code swapping baked in to them in a more predictable way because that requirement is part of the base system - application life cycle is built in to the platform, not on top of it. Of course, for systems such as telecoms switching, the complexity and cost of this was worthwhile. For browsers... perhaps not. Cost/Benefit analysis is probably the usual trusted friend. What would we hope to gain (and how would we measure it) by letting browsers never restart?
So how do you roll back with Chrome when it breaks a plugin for example?
I guess this means for ignorant users this is good but for power-users we are having more and more control taken away from us.
Personally, I disable all of Chrome's phoning home because it's impolite, it happens too many times per day, and I have no easy way to verify exactly what it's sending all those times.
It seems as though Google is trying to eliminate this by supporting their own plugins. Flash has been shipping with Chrome for a while, and a PDF reader has been shipping for months.
Those two plugins have 90% of people's plugin needs covered.
That's a great improvement over a generic binary diff. I remember Symantec was doing something similar for their AV definitions updates. In fact they got some patents: http://www.symantec.com/press/2001/n010207b.html
This is slightly offtopic, but Wordpress's built-in update feature only works if you have FTP on your server. If you've disabled FTP for security reasons, updating becomes a manual process. I wish the WP devs would use patch or some other CLI-friendly solution.
"If you've disabled FTP for security reasons, updating becomes a manual process."
The automatic update process will also work if the webserver has write permissions on the files (which is a bad idea in the first place). But if it doesn't and you can't/won't give it FTP credentials for a user that does, you do need to go through a somewhat manual process.
Personally, I use vendor branching (http://svnbook.red-bean.com/en/1.1/ch07s05.html) in both SVN and Git in cases like this. I don't have to rely on the developers to generate a patch: I get all of the changes pulled directly from my local repository.
Agreed. I normally find the list of changed files from a blog post, download the .tar.gz and add them to my local git repo manually. I guess I could get the current SVN tag but doubt that would be simple with my setup.
Made a similar observation yesterday. Only times that the Chrome version has mattered in my recent experience have been with regards to the recent WebGL security hole and with Native Client.
1. Updates should not only be applied in sequence.
It is better to produce a binary diff between any two versions, and apply only that (one) binary diff. The reason for this isn't efficiency, but semantics. Updates not only fix things, but break things. Meaning, updates corrupt application state (data), both in-memory and on-disk. It can be disastrous to apply an intermediate update that removes state, only to realize that a future version reversed the semantics and needs to use that state (which was available, but is now gone).
Preserving backward compatibility is important, which means the ability to skip some version updates is necessary. To the extent possible, reversing updates is important too.
2. The ideal update system should apply updates live, not offline.
With a model that accounts for updating the entire state of an application, updating live is possible. The reason most updates are not applied live yet is that the model is not descriptive enough to change the entire state of the running application.
Notable state that should be updated, but often isn't, is continuations and the stack. This is why GUI applications need to be shut down to update.
Scheme's call/cc (call-with-current-continuation) solved making changes to continuations and stack state decades ago, better than Erlang does. Erlang cannot force stacks to unwind or continue from arbitrary points.
3. Updates must be produced with source code and programmer input.
Updates should not be produced with binaries as input.
The reason is the need to account for application semantics, which binaries do not expose in the detail source code does. Although automated, sophisticated semantic-diffing based on control-flow can be developed, it is sometimes inconclusive whether an update will break things.
4. It is necessary for programmers to provide live update guidance.
In the cases where producing provably safe dynamic updates is not possible, it is input from the programmer that can clear any conservatism of the safety certification process.
Tools are needed for programmers to reason about the semantic safety of their live updates, integrated in the development process. Including tools that help transform application state between versions.
The screenshot you're referencing is from OS X! If you don't quit Chrome for a while, then open "About Google Chrome", that dialog or something very similar will display.
That hasn't been my experience. I run Chrome on 10.6 and when a new version comes out, I have to manually close/reopen Chrome to install it (usually by clicking the "Update and Restart" button shown in that screenshot).
It seems to me that the update process is slightly more jarring on OS X than it is on Windows (though I haven't used the Windows version so I am just assuming). When I finish using Chrome on OS X I will close the last tab with Cmd + W, so Chrome is still open and therefore it won't be able to install the update next time I open a tab. On Windows I would close the last tab and the application exits completely – though it might take slightly longer to open Chrome each time, I will never really have to manually restart Chrome to get the update installed.
Edit: did a bit of looking around and it seems to be planned for Oneiric Ocelot
https://blueprints.launchpad.net/ubuntu/+spec/foundations-o-...