A reminder that the Joel Spolsky essay being cited is about people rewriting ugly code that already works.
That nuance is important, because successful companies do invest in rewrites when there's an architecture change, since that's the cleanest, most economical way to do it. My previous comment with Microsoft examples: https://news.ycombinator.com/item?id=19245653
Another example is Google rewriting their early web crawlers from Python/Java into C++ to make them scale more efficiently to thousands of machines. They also rewrote the frontend web server from C++ to Java.
But some rewrites also failed such as Evernote's rewrite to C#/WPF.
I think for the topic of "rewrites bad/good", it's better to list a bunch of famous real-world case studies and extract the common criteria that make rewrites successful.
I was around for the Google Webserver rewrite, and it was actually a rewrite done by refactoring, not a clean slate. It's amazing what you can do with disciplined refactoring, even moving to a new language.
And if you don't have a decent test suite, you might as well build one. So much of a rewrite is archaeology, figuring out what elements of a system are intentional versus accidental. A great way to do that archaeology is to start weaving a net of tests around the existing system. Sort of like Fowler's Strangler Fig pattern: https://martinfowler.com/bliki/StranglerFigApplication.html
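That "net of tests" is often built from characterization tests: tests that record what the legacy code actually does today, intentional or accidental, rather than what anyone thinks it should do. A minimal sketch in Python, where `legacy_slugify` is a hypothetical stand-in for an existing, undocumented routine:

```python
import unittest

def legacy_slugify(title):
    # Hypothetical stand-in for an undocumented legacy routine; in a real
    # strangler effort this would be the existing code, left untouched.
    return "-".join(title.lower().split())

class CharacterizationTests(unittest.TestCase):
    # These tests pin down what the code DOES today, so that a later
    # rewrite (or refactor) can be checked against the same net.

    def test_simple_title(self):
        self.assertEqual(legacy_slugify("Hello World"), "hello-world")

    def test_surplus_whitespace_is_collapsed(self):
        # Intentional or accidental? Either way, the test records it,
        # and a rewrite that changes it will fail loudly.
        self.assertEqual(legacy_slugify("  Hello   World "), "hello-world")
```

Run with `python -m unittest`; each surprising behaviour discovered during the archaeology becomes one more test in the net.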
Yeah, if you don't have a detailed test suite, and can't build one, you probably don't fully understand your requirements, and your rewrite is in trouble already.
Mind you, maintenance is in trouble too. Just a different kind of trouble.
If you have the original code, you can always write a test suite based on it, unless management is averse to this. (Which raises the question of why they're not averse to a rewrite.)
Like any test suite it is not going to be complete.
Ultimately it's best not to let problems pile up: rewrite early and often, in small parts. (The Carmack way.)
This sounds really interesting to read about, especially refactoring into an entirely new language. Are there any articles about what that process looked like for Google at the time? Or about any example that changed languages, really.
Not the OP, and no experience with google, but I have personally worked in a way that sounds as if it could be similar:
First, you start with a slow but correct prototype in Python. This can then be refactored into Python that is closer to C, thinking about the data structures and functions that will be needed in C.
Then, you can refactor these bits into C. I've usually done this one module at a time.
You then write python bindings to this new C code. There are a number of shortcuts available here with differing tradeoffs, but you can just use the Python/C API.
In the end you're left with a C library, some bindings, and some python that just calls to the python bindings for the C library.
It would not be a huge leap to go from that to a full C implementation.
Nothing I've said here is C specific, so you can do the same with C++.
There are upsides and downsides to this whole approach. For me, one of the downsides is the amount of stuff to remember, especially some of the quirks of extending Python.
I find it easier to start with something correct but high-level in Python and work down to something easier to translate into C, even if I skip straight to the writing-in-C bit. I have found this usually leads to a significantly better design while it is still Python. You can also test the behaviour of the C and Python modules against each other.
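The first refactoring step described above can be sketched in pure Python. This is a hypothetical example, not code from any real migration: stage 1 is the idiomatic prototype, stage 2 restates it using only constructs that translate line-for-line into C, and a differential check tests the two (and, later, the C module behind bindings) against each other:

```python
# Stage 1: idiomatic, obviously-correct Python prototype.
def mean_hi(samples):
    return sum(samples) / len(samples)

# Stage 2: the same routine with explicit state and a plain loop,
# using only constructs that map directly onto C.
def mean_c_style(samples):
    total = 0.0
    n = len(samples)
    i = 0
    while i < n:
        total = total + samples[i]
        i = i + 1
    return total / n

# Differential check: run both versions on the same inputs and
# require agreement. The same harness later tests the C library
# (via its Python bindings) against the trusted prototype.
def check_agreement(cases):
    for samples in cases:
        assert abs(mean_hi(samples) - mean_c_style(samples)) < 1e-12
    return True
```

The point isn't that stage 2 is better Python (it isn't); it's that the hard design thinking happens while you still have a fast edit-run cycle and a reference implementation to diff against.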
I think the ironic thing about Spolsky's example is that, in the end, it worked. It eventually gave birth to Firefox. So I think there are many lessons to be learned about how to do a rewrite, how to message it, how to do it in a way that it doesn't stop other forward progress. But if a rewrite fails, that doesn't necessarily mean that not doing a rewrite would have been any more successful in the end, and that's the trap I think people fall into.
Joel's article was written in 2000, and Firefox 1.0 was eventually released in November 2004. In the meantime they lost almost all their market share. They eventually won some of it back, but this was probably only because Microsoft had stopped all development of IE and disbanded the IE team.
Usually you can't count on your competitors stopping all development for half a decade.
Obviously you can rewrite from scratch and end up with a fine product, since the original version of the product was developed from scratch. The question is whether you can accept losing all your customers and wasting years of effort.
There is still more to the story though. When Netscape went under and open sourced their code, they had to rip out all of the proprietary bits. The result? What they had didn't actually run!
It wasn't a question of rewriting working code. It was a question about whether the better path forward was to try to fix what they had and didn't understand, or whether to rewrite into something that they understood.
To write "It was their own code. They could just read it." misses a great deal about code bases. First and foremost, the vast majority of working code bases do NOT have sufficient internal documentation to actually understand the whys and wherefores of that code base.
The specific reasoning for doing something, or for using a particular technique, or the meaning of the magic numbers, is often not put into any form of accessible documentation. This is just a fact of life. Documenting such things can be very tedious and is often thought of as not being needed because the code is "obvious". Yet to someone coming in to work on it in six months, let alone six years, it is vital missing information.
Code often sucks, but a rewrite from scratch doesn't change anything. The rewritten code will suck just as much, but now you just lost a lot of time and have new bugs.
I'm not sure what you mean with "didn't understand". It was their own code. They could just read it.
I meant exactly what I said.
The missing proprietary bits were not their own code. There was nothing to read and no documentation. They could see the call, and that something important happened, but what it did and how it should work was guesswork.
Furthermore key people were gone. How the code that they had was supposed to work is harder to dig out than you might think. Read http://pages.cs.wisc.edu/~remzi/Naur.pdf for an explanation of why.
Just to add to the fun, just getting the browser running wasn't enough. Netscape and IE were incompatible all over the place, on everything from how styling worked to what order event handlers ran. (In IE, the thing you clicked on got the event first. In Netscape, the window did. That's a pretty fundamental difference!) They needed to not just have a running browser, but have a browser with enough IE compatibility that it could be usable on webpages designed for IE. It is easier to design something to be compatible from scratch than to find out what deep assumptions need to be changed in existing code to fix incompatibilities.
As fun as it was for Joel to beat up on them for their choice, their choice was arguably the best way forward given the actual situation.
>The missing proprietary bits were not their own code. There was nothing to read and no documentation. They could see the call, and that something important happened, but what it did and how it should work was guesswork.
To your point, the following link is another famous example of smart programmers not being able to easily understand someone else's source code. Raymond Chen at Microsoft couldn't easily figure out why Pinball didn't work in 64bit Windows and gave up:
I'm sure they could have figured it out if it was important enough. They just decided it was not worth it and scrapped the product instead.
"we couldn’t afford to spend days studying the code trying..."
So the estimated time to fix the bug was on the scale of days. It would certainly have taken longer to rewrite the game from scratch if that was the alternative.
I’ve never understood why they had to port it at all. The 32-bit version of Pinball would have kept working just fine. And apparently, the binary from Windows XP still runs on Windows 10.
Back when Joel's article was written there was also a response to it written by someone who was part of the decision. And that was one of the big points that stuck with me.
There were dozens of proprietary components that had to be removed. Java being the biggest (because it had been integrated into every part of the browser). They succeeded in removing them, but after 2 years of trying to fix the result, everything from print preview to text selection was broken. That was when they decided to scrap and rewrite.
Now consider. You've had 100 engineers working for 2 years and you don't have a functioning product or a realistic timeline for getting one. How many more years are you planning to throw into it before deciding that you're going down a dead end?
Now let's throw a monkey wrench into this. The main reason that the product was being funded was as a bargaining chip. Specifically, AOL had a deal with Microsoft to bundle a specially branded version of IE. And they had every reason to believe that when the contract expired, Microsoft was not going to renew. AOL didn't care about browser market share; they wanted to have a viable alternative if Microsoft refused to negotiate. (As it turns out, Microsoft renewed the deal, but part of the price was that AOL had to stop supporting Mozilla...)
So they had to change the Java integration to some kind of plug-in interface. Perhaps not trivial, but it would not require you to rewrite everything from the ground up. And it was well understood what Java does. If it wasn't, you would still have to figure it out in order to do the rewrite.
> Now consider. You've had 100 engineers working for 2 years and you don't have a functioning product or a realistic timeline for getting one.
You are definitely in serious trouble when it gets to that point. But if you can't make a realistic estimate for fixing text selection, how can you make a realistic estimate for the much larger and more complex task of rewriting everything from scratch? What if they rewrite everything, but then after four years of work still can't get text selection to work?
What if the development team is hit by a bus?
What if they're suddenly replaced with chimpanzees?
Literally collecting all the problems with the current code base and designing around them is the only thing anyone can reasonably do.
If you have a poor development team, you're just screwed - hire a better one. The problem there is figuring out whom you can keep and in which position. Since nobody knows how to grade programmers (and stupid algorithmic tests do not work for design and engineering, which is what's critical), you get to try until you run out of money, and perhaps rely on peer or expert opinion.
As a point of reference, rewriting a GUI with a layout engine (not a web browser but a music editor, which shares a lot of the design) took a small team half a year and solved all of the problems reported, plus a bunch of others, including performance. At 4 months we had an almost matching result, with the exception of a few single issues that took additional redesign. (We asked the client if they already wanted to keep it; they said to finish it.)
And the testing was limited to manual checklists.
A web browser is bigger and testing somewhat easier to automate.
The truly important thing we did was dig into the requirements, including fuzzy ones, and design using good practices that make change easy.
(Including a few times when the client just changed the requirements.)
So they had to change the Java integration to some kind of plug-in interface.
No. They had to remove it entirely, because back in 1998 there was no open-source version of Java that was compatible with their license.
And it wasn't just that they had to rip out Java. The browser had been partly written in Java, and THAT was ripped out as well. And the parts that had been in Java needed to be rewritten.
But if you can't make a realistic estimate for fixing text selection, how can you make a realistic estimate for the much larger and more complex task of rewriting everything from scratch?
Because the second task is inherently easier to estimate.
Fixing a broken large code base involves a lot of chasing down rabbit holes to figure out what isn't working, what depends on what, etc. When you start down a rabbit hole you have no idea how deep it goes, or what else you are impacting. This is why maintenance is harder than writing fresh code. Furthermore entropy says that code will not improve through this process. Costs only go up.
Estimating a rewrite is a question of coming up with a clean design for the rewrite, and then estimating how long that design will take. This is still a notoriously hard problem to solve. But it is much easier in principle than estimating the depth of rabbit holes. And if your design is clean, there is little chance of not being able to get a feature like text selection to work.
But you are assuming the result of the rewrite will be cleaner and have fewer rabbit holes. There is no reason to think that. If it is developed by the same organization, it will probably end up with the same level of quality. Perhaps worse, since it will suffer from second-system effect.
Indeed Netscape 6 was more buggy than any of the previous versions.
Doing emergency surgery on a code base and not knowing how many holes are left is likely to leave more rabbit holes than a rewrite.
Otherwise I agree with you. Rewrites are tricky. And the second system is particularly dangerous. But then it gets better - the third system tends to be a lot better than either of the first two.
I think this is the siren song of putting a clever architectural decision front-and-center in the code.
"Everything is an X."
What happens when it turns out not everything should be an X, or worse, X ends up being harmful? How do you refactor such a system when the biggest mistake is pervasive?
[Edit: I have done it 2-3 times, but I'm not entirely sure I could adequately explain how I did it, which means I should still be asking that question.]
How many people actually paid for the Netscape browser in the first place? Was selling the browser ever a viable business model?
Netscape could have developed different ways of getting money, like ads on the start page or default search engine. But such schemes could only have worked if they had retained a large market share, which they jeopardized with the rewrite.
I was always a little irritated to see a boxed copy of Netscape in stores, but it did solve the 'how do I get my initial web browser?' problem for people who had no idea what "FTP" is. So I'd say that for the majority of people (the 90% who were not tech people), they bought it once. Once they got later in the adoption cycle that would have dropped off, and the moment IE shipped that would immediately trend to zero.
Netscape was free because Mosaic was free, and Mosaic was free because of the National Science Foundation. (As for getting your initial browser, IIRC you could send away for copy of Mosaic on 2 floppies for the cost of shipping and media, but that could take weeks)
A few years ago, I worked on a rewrite of a system that had been implemented in ColdFusion (yes, a few years ago - well into the 21st century). I can't imagine even Spolsky arguing that it shouldn't have been rewritten at that point.
Even with rewrites, there are good rewrites and bad rewrites. For my money, big-bang rewrites are bad. If you can strangle out the old system as functionality is gradually moved across into the new one, that can work, but still relies on actually finishing the transition so the old system goes away.
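The strangler approach hinges on one structural piece: a thin dispatcher that routes migrated functionality to the new system and everything else to the old one. A minimal sketch, with hypothetical names (`old_app`, `new_app`, `MIGRATED` are illustrative, not from any real system):

```python
# Stand-ins for the legacy system and its replacement. In a real
# migration these would be separate services, modules, or processes.
def old_app(path):
    return "legacy:" + path

def new_app(path):
    return "new:" + path

# The set of routes that have been migrated so far. It grows as the
# rewrite progresses; the old system shrinks correspondingly.
MIGRATED = {"/reports", "/login"}

def handle(path):
    # The dispatcher is the only component that knows both systems exist.
    # When MIGRATED covers everything, old_app can finally be deleted --
    # which is exactly the "finishing the transition" step that's easy
    # to abandon.
    if path in MIGRATED:
        return new_app(path)
    return old_app(path)
```

The design point is that the cutover happens route by route behind a stable interface, so the old system can keep running in production until the last route moves.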
My team has been working on a big-bang rewrite for about 5 years now. It started out as the "strangle" approach. Early on, everyone was on board with killing the old thing. But of course, you can't stop developing new features; you still need to make some forward progress in the midst of developing the new app. At the beginning we were 80% rewrite / 20% new feature work. But as time goes on, that balance has pretty much completely flipped; we're now more like 95% new feature work / 5% rewrite. Once we reached a certain critical mass of compatibility, the appetite for finishing the work and completely killing the old thing has pretty much disappeared. And there's a large loud minority that hate change and complain about having to use the new tool vs. the old tool.
FWIW, we were rewriting a Visual Basic 6 application using web technologies. It's VERY hard for me to imagine saying "don't rewrite that VB6 application, just make it better". It's literally not even supported by Microsoft anymore, it was written by one person at the company who has since left, no one else here writes VB6, etc. etc. It was a critical business application, so it wasn't the kind of thing anyone wanted to just leave alone and keep on life support.
Even still, it's hard for me not to wonder if we did the right thing. Maybe we should have just hired a bunch of VB6 contractors to maintain the application and continue adding functionality to it. Maybe moving it to VB.net would have been better. Maybe there's some other migration path for VB6 applications...
Anyway, my biggest takeaway from all of this has been: think very, very hard (and be honest with yourself) about how long a rewrite will take. Make sure you and "the business" are on the same page about that timeline and are making a serious commitment to focus on migrating to the new tool, and if you can, get the conditions necessary to kill the old tool down in writing. If your users are anything like ours, they're going to have to be dragged kicking and screaming to the new tool, because we all hate change, so if you don't have clear conditions for when you're shutting the old stuff off, it's going to be painful.
In saying "If your users are anything like ours, they're going to have to be dragged kicking and screaming to the new tool", I think you miss how the end-user sees things. Having worked with many such end-user teams over decades, what I found was that the end-user is seen as the "troublesome" bit of the organisation and has to be corralled like "wild animals" (to use a euphemism).
If they are consulted and treated as if they have serious input to what is being developed, here I mean what they see and what they do, they become much more amenable to the new system.
What end-users really don't like is being told how to do their jobs by people who do not do their jobs. Especially when they are now required (for management purposes) to do something that does not fit into their workflows. This is where development teams, if they are smart, are well placed to collect the information they need without requiring user intervention.
However, this also means that the development teams have to show that they are NOT just cost centres but are "profit" centres for the business as a whole. Achieving this takes a lot of effort in many organisations, because of the biases other parts of those organisations hold towards the IT challenge.
I agree that it's not uncommon to hear things on the development team that basically amount to "if it weren't for our users our application would be so much simpler" :)
We have spent a significant chunk of time working directly with our users during this migration. This is an internal tool, so we can literally walk over and talk to our user base if we have questions or want suggestions on things. We have done surveys, we track using analytics, we have a user council that we meet with regularly to run ideas by and get feedback on our work and they come to our sprint reviews (well, 3 of them do...) to see what new work we've done etc. etc.
This doesn't change the fact that you can't make everyone happy. One user will tell us that they hate the way a certain page is designed and that they won't use it because it doesn't fit their workflow. Another user will tell us the exact opposite. So we have to take the feedback and make a judgement call, using the data we have available, to try and make the best decision. One great thing about having analytics is it helps you identify squeaky wheels. We had some people on our user council that would basically complain about every single thing we did. If you just listened to them, you would have been tempted to just give up. When we looked at the data, we realized 90% of our user base was getting A LOT of work done, and we weren't hearing anything from them. Obviously it was possible they also hated the app and just didn't want to tell us, but after reaching out it became clear that they were perfectly happy with the way it worked and were just getting shit done. In my experience, the temptation is just to listen to the people you're getting feedback from and assume that they're a representative sample. DON'T. Do the work (or put the systems in place) to help you vet the feedback you're getting.
Another thing that makes this very difficult is one of the things you're touching on; trying to develop a tool for a domain that you don't have intimate knowledge of makes it pretty difficult to know which way to go when you are getting conflicting information. None of the developers is an expert user in the domain (or even a beginner for that matter...) and so we often end up having to look at things from a higher level (i.e. what other applications have we used that function this way? How do they solve this problem? etc.). I think this mostly works, but there are bound to be some misses here or there.
I don't disagree with you that there are end-users who are, shall we say, "difficult". That is the nature of having to work with a wide variety of different types of people. That is one of those "unenviable" things that happen. As far as not being "problem domain experts", I have found that getting the co-operation of the "most-liked" and "most-expert" of your end-users in that area can be helpful as to getting most people on side to give you the best chance of gathering what you need. This is not necessarily an easy find as I have had to work with some who don't like to lose their "control" and that can be frustrating, to say the least.
My original comment wasn't meant to say that getting end-users on side was going to be easy, sometimes it is and sometimes it isn't. But the reputational boost you get when you do, eases the next time something has to be done.
Curious how much the interface changed, and whether you think that plays a role in the sentiment towards it. With VB6 I'd expect a desktop app or Office add-ins; going from that to the web may change more than is necessary for the users.
I think this plays a role in the difficulty of the migration, because yes, the interface has changed quite a bit. In general, we tried to hew as close to the old paradigms as made sense. We never changed anything just because we didn't like it; we tried to make sure that the transition would be easy and the users didn't have to drastically change their mental models. But of course, there are always going to be differences in look/feel/capability between native applications and the web. I will say the web has come a long way since I started as a web developer 15 years ago and it feels like closing this gap is more a priority these days.
But there were parts of the application that were poorly designed (our users told us so...) and so, even though they were the primary ones driving change, there's always pain when changing to a new way of doing things.
Servers vulnerable to ColdFusion URL traversals were the first I cracked as part of the OSCP course. They seemed to be pretty horrible! A rewrite definitely sounds appropriate.
I think we should consider developer morale when it comes to legacy systems - it's hard to hire for them, and a lot of people want to leave if they're stuck with them.
>I think we should consider developer morale when it comes to legacy systems - it's hard to hire for them, and a lot of people want to leave if they're stuck with them.
Basically. Some systems/languages just don't have many people on the market with expertise in those things, and you may not be able to hire someone who does, and get them to relocate to where you are. If you try to stick someone else with that job, they're not going to want that on their resume, so they'll just look for a new job that does fit their career goals. For instance, personally, I have no desire to work on VB.NET code at all, so if my company suddenly decided to make me a full-time VB.NET developer (including training etc.), and wouldn't take no for an answer, I'd immediately look for a new job. Having VB.NET on my resume isn't much better than having a job gap.
How much was the cost of the rewrite (total hours expended), and what was the benefit (energy savings, fewer hours spent bug fixing, fewer customer support calls, ...)?
All writing is rewriting and all factoring is refactoring. It's more a question of how it's actually done, i.e. in-situ or in parallel. And of course there's no technique that's good or bad per se. How many times have we all heard "Never rebase!!"? That's just a crock.