I think the ironic thing about Spolsky's example is that, in the end, it worked. It eventually gave birth to Firefox. So I think there are many lessons to be learned about how to do a rewrite, how to message it, how to do it in a way that it doesn't stop other forward progress. But if a rewrite fails, that doesn't necessarily mean that not doing a rewrite would have been any more successful in the end, and that's the trap I think people fall into.
Joel's article was written in 2000 and Firefox 1.0 was eventually released in November 2004. In the meantime they lost almost all their market share. They eventually won some of it back, but this was probably only because Microsoft had stopped all development of IE and disbanded the IE team.
Usually you can't count on your competitors stopping all development for half a decade.
Obviously you can rewrite from scratch and end up with a fine product, since the original version of the product was developed from scratch. The question is whether you can accept losing all customers and wasting years of effort.
There is still more to the story though. When Netscape went under and open sourced their code, they had to rip out all of the proprietary bits. The result? What they had didn't actually run!
It wasn't a question of rewriting working code. It was a question about whether the better path forward was to try to fix what they had and didn't understand, or whether to rewrite into something that they understood.
To write "It was their own code. They could just read it." misses a great deal about code bases. The first and foremost is that the vast majority of working code bases do NOT have sufficient internal documentation to actually understand the whys and wherefores of that code base.
The specific reasoning for doing something, for using a specific technique, or for the choice of magic numbers, etc., is often not put into any form of accessible documentation. This is just a fact of life. Documenting such things can be very tedious and is often thought of as not being needed because the code is "obvious". Yet to someone coming in to do work on it in six months, let alone six years, it is vital missing information.
{Edits spelling mistakes - should have checked before submitting}
Code often sucks, but a rewrite from scratch doesn't change anything. The rewritten code will suck just as much, but now you just lost a lot of time and have new bugs.
I'm not sure what you mean by "didn't understand". It was their own code. They could just read it.
I meant exactly what I said.
The missing proprietary bits were not their own code. There was nothing to read and no documentation. They could see the call, and that something important happened, but what it did and how it should work was guesswork.
Furthermore, key people were gone. How the code that they had was supposed to work is harder to dig out than you might think. Read http://pages.cs.wisc.edu/~remzi/Naur.pdf for an explanation of why.
Just to add to the fun, just getting the browser running wasn't enough. Netscape and IE were incompatible all over the place, on everything from how styling worked to what order event handlers ran. (In IE, the thing you clicked on gets the event first. In Netscape, the window did. That's a pretty fundamental difference!) They needed to not just have a running browser, but have a browser with enough IE compatibility that it could be usable on webpages designed for IE. It is easier to design something to be compatible from scratch than to find out what deep assumptions need to be changed in existing code to fix incompatibilities.
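To make the event-ordering difference concrete, here is a toy sketch. It is not real browser code: the `Node` class and the three-node tree (`window`, `document`, `button`) are hypothetical, made up only to show the two dispatch orders described above.

```python
# Toy model of a DOM-like tree. Hypothetical names, not any browser's API.
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def path_to_root(target):
    """Return the node names from the clicked target up to the root."""
    path = []
    node = target
    while node is not None:
        path.append(node.name)
        node = node.parent
    return path


def ie_style_order(target):
    """Bubbling: the clicked element sees the event first, then its ancestors."""
    return path_to_root(target)


def netscape_style_order(target):
    """Capturing: the window sees the event first, then down to the target."""
    return list(reversed(path_to_root(target)))


window = Node("window")
document = Node("document", window)
button = Node("button", document)

print(ie_style_order(button))        # ['button', 'document', 'window']
print(netscape_style_order(button))  # ['window', 'document', 'button']
```

A page whose handlers assume one order (say, an outer handler that expects the inner one to have already run) can misbehave under the other, which is the kind of deep assumption that is hard to retrofit.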
As fun as it was for Joel to beat up on them for their choice, their choice was arguably the best way forward given the actual situation.
> The missing proprietary bits were not their own code. There was nothing to read and no documentation. They could see the call, and that something important happened, but what it did and how it should work was guesswork.
To your point, the following link is another famous example of smart programmers not being able to easily understand someone else's source code. Raymond Chen at Microsoft couldn't easily figure out why Pinball didn't work in 64-bit Windows and gave up:
I'm sure they could have figured it out if it was important enough. They just decided it was not worth it and scrapped the product instead.
"we couldn’t afford to spend days studying the code trying..."
So the estimated time to fix the bug was on the scale of days. It would certainly have taken longer to rewrite the game from scratch if that was the alternative.
I’ve never understood why they had to port it at all. The 32-bit version of Pinball would have kept working just fine. And apparently, the binary from Windows XP still runs on Windows 10.
Back when Joel's article was written there was also a response to it written by someone who was part of the decision. And that was one of the big points that stuck with me.
There were dozens of proprietary components that had to be removed. Java being the biggest (because it had been integrated into every part of the browser). They succeeded in removing them, but after 2 years of trying to fix the result, everything from print preview to text selection was broken. That was when they decided to scrap and rewrite.
Now consider. You've had 100 engineers working for 2 years and you don't have a functioning product or a realistic timeline for getting one. How many more years are you planning to throw into it before deciding that you're going down a dead end?
Now let's throw a monkey wrench into this. The main reason that the product was being funded was as a bargaining chip. Specifically, AOL had a deal with Microsoft to bundle a specially branded version of IE, and had every reason to believe that when the contract expired, Microsoft was not going to renew. AOL didn't care about browser market share; they wanted to have a viable alternative if Microsoft refused to negotiate. (As it turns out, Microsoft renewed the deal but part of the price was that AOL had to stop supporting Mozilla...)
So they had to change the Java integration to some kind of plug-in interface. Perhaps not trivial, but it would not require you to rewrite everything from the ground up. And it was well-understood what Java does. If it wasn't you still would have to figure it out in order to do the rewrite.
> Now consider. You've had 100 engineers working for 2 years and you don't have a functioning product or a realistic timeline for getting one.
You are definitely in serious trouble when it gets to that point. But if you can't make a realistic estimate for fixing text selection, how can you make a realistic estimate for the much larger and more complex task of rewriting everything from scratch? What if they rewrite everything, but then after four years of work still can't get text selection to work?
What if the development team is hit by a bus?
What if they're suddenly replaced with chimpanzees?
Literally collecting all the problems with the current code base and designing around them is the only thing anyone can reasonably do.
If you have a poor development team, you're just screwed - hire a better one. The problem there is figuring out whom you can keep and in which position. Since nobody knows how to grade programmers (and stupid algorithmic tests do not work for design and engineering, which is critical), you get to try until you run out of money, and perhaps rely on peer or expert opinion.
As a point of reference, rewriting a GUI with a layout engine (not a web browser but a music editor, which shares a lot of the design) took a small team half a year and solved all of the problems reported and a bunch of others, including performance. At 4 months we had an almost matching result, with the exception of a few issues that took additional redesign. (We asked the client whether they wanted to keep it as it was; they said to finish it.)
And the testing was limited to manual checklists.
A web browser is bigger, and its testing is somewhat easier to automate.
The truly important thing we did was dig into the requirements, including the fuzzy ones, and design using good practices that make change easy.
(Including a few times when the client just changed the requirements.)
> So they had to change the Java integration to some kind of plug-in interface.
No. They had to remove it entirely because back in 1998 there was no open sourced version of Java that was compatible with their license.
And it wasn't just that they had to rip out Java. The browser had been partly written in Java, and THAT was ripped out as well. And the parts that had been in Java needed to be rewritten.
> But if you can't make a realistic estimate for fixing text selection, how can you make a realistic estimate for the much larger and more complex task of rewriting everything from scratch?
Because the second task is inherently easier to estimate.
Fixing a broken large code base involves a lot of chasing down rabbit holes to figure out what isn't working, what depends on what, etc. When you start down a rabbit hole you have no idea how deep it goes, or what else you are impacting. This is why maintenance is harder than writing fresh code. Furthermore entropy says that code will not improve through this process. Costs only go up.
Estimating a rewrite is a question of coming up with a clean design for the rewrite, and then estimating how long that design will take. This is still a notoriously hard problem to solve. But it is much easier in principle than estimating the depth of rabbit holes. And if your design is clean, there is little chance of not being able to get a feature like text selection to work.
But you are assuming the result of the rewrite will be cleaner and have fewer rabbit holes. There is no reason to think that. If it is developed by the same organization it will probably end up with the same level of quality. Perhaps worse, since it will suffer from the second-system effect.
Indeed Netscape 6 was buggier than any of the previous versions.
Doing emergency surgery on a code base and not knowing how many holes are left is likely to leave more rabbit holes than a rewrite.
Otherwise I agree with you. Rewrites are tricky. And the second system is particularly dangerous. But then it gets better - the third system tends to be a lot better than either of the first two.
I think this is the siren song of putting a clever architectural decision front-and-center in the code.
"Everything is an X."
What happens when it turns out not everything should be an X, or worse, X ends up being harmful? How do you refactor such a system when the biggest mistake is pervasive?
[Edit: I have done it 2-3 times, but I'm not entirely sure I could adequately explain how I did it, which means I should still be asking that question.]
How many people actually paid for the Netscape browser in the first place? Was selling the browser ever a viable business model?
Netscape could have developed different ways of getting money, like ads on the start page or default search engine. But such schemes could only have worked if they had retained a large market share, which they jeopardized with the rewrite.
I was always a little irritated to see a boxed copy of Netscape in stores, but it did solve the 'how do I get my initial web browser?' problem for people who had no idea what "FTP" is. So I'd say that for the majority of people (the 90% who were not tech people), they bought it once. Once they got later in the adoption cycle that would have dropped off, and the moment IE shipped that would immediately trend to zero.
Netscape was free because Mosaic was free, and Mosaic was free because of the National Science Foundation. (As for getting your initial browser, IIRC you could send away for a copy of Mosaic on 2 floppies for the cost of shipping and media, but that could take weeks.)