I see a lot of fear of rewrites in the comments. Naturally, someone has posted that daft Spolsky article about how Netscape staying with the disastrous Navigator 4 codebase would somehow have been a good idea. Rewrites are one of the classic bogeymen in software.
So I just want to tell you that I've done successful rewrites!
The biggest was rewriting the e-commerce site of a well-known toy company. They had a homegrown site written in .NET, and we rewrote it on top of a commercial e-commerce framework in Java. It took a couple of years, the rewrite team was 2-5 times the size of the team that built the old site, and the client ended up with all the features of the old site, plus internationalisation, plus thorough automated tests and deployment. They seemed to be really happy with it.
The smallest was a data-aggregation server my team uses. It was written in Node, and kept crashing under load. I rewrote it in Java. It was just me, and it took a week or two. It hasn't crashed since, and we've been able to add loads more features.
Success factors in both: putting effort into thoroughly understanding the old thing; resisting any temptation to add new features unless you are exceptionally confident in how much more work it will take; thoroughly testing what you build as you go.
Everyone has done "successful" rewrites. But a rewrite that takes 2 years and requires a team much larger than the original team isn't successful in the eyes of most stakeholders and is almost NEVER what the dev team proposes as the timeline. Just because something got done eventually at great cost doesn't make it successful.
Quoting parent: "plus internationalisation, plus thorough automated tests and deployment. They seemed to be really happy with it."
Sounds like success.
There is a reason why stakeholders would like the change despite the large cost - if the old system was unstable and a source of stress and fires for the stakeholders, they will go for the more expensive option.
Kind of like after you've had a car that needed to be fixed every few weeks, you will go for a more expensive, higher-quality one next time.
That's why a good strategy is to never put yourself in the position of having to communicate a timeline. If you are knowledgeable enough about a project that you can make an informed decision that a rewrite is required, the strangler pattern can allow you to complete such a project without ever formally requesting a "rewrite".
If the executives don't get a timeline from you, they'll get it from someone less knowledgeable. Estimates made in ignorance are almost always underestimates because they omit critical work the estimator is unaware of. Therefore, this (bad) estimate will also be a "lowball" estimate, but it will be the only hard number executives have to make their decision. So they'll greenlight the rewrite and more than likely put the lowball estimator in charge. Being vocal, honest, and credible about estimates and the true cost of projects is always the best move.
The strategy is to be vocal, honest and credible by estimating features, not the whole rewrite. The rewrite happens by strangling the original application, feature by feature.
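In code terms, the strangler pattern can be as simple as a routing layer that sends already-migrated features to the new implementation and lets everything else fall through to the legacy app. A minimal sketch (the feature paths and handlers here are invented for illustration):

```typescript
// Strangler pattern sketch: route per feature, migrate one feature at a time.

type Handler = (path: string) => string;

// Stand-in for the legacy application, which still handles everything
// that hasn't been migrated yet.
const legacyApp: Handler = (path) => `legacy response for ${path}`;

// New implementations, added one feature at a time as they are rewritten.
const rewritten: Map<string, Handler> = new Map([
  ["/search", (path) => `new search service for ${path}`],
]);

function route(path: string): string {
  // The first path segment identifies the feature, e.g. "/search/toys" -> "/search".
  const feature = "/" + (path.split("/")[1] ?? "");
  const handler = rewritten.get(feature);
  return handler ? handler(path) : legacyApp(path);
}
```

As each feature reaches parity, it gets an entry in `rewritten`; when the map covers everything, the legacy app can be retired without a big-bang cutover.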
Our team is doing a slow replacement of code written in Java to node. At almost every turn we find that the replaced code is 50 to 80 percent smaller and much easier to read. We're doing it slow because we are phasing in new features at the same time to make sure stakeholders get value while we do it. It's been tremendously successful and cost effective keeping the old and the new running simultaneously.
Sadly it's mostly just converting poorly written code that makes no use of re-use. Other than that, we're using multiple TypeScript frameworks for data wrangling, transport, and validation without having to re-define objects as DTOs, models, etc... And taking advantage of GraphQL has been massively helpful.
>> Just because something got done eventually at a great cost doesn't make it successful
> Are you arguing that success is in the eye of the stakeholder? Sounded to me like it was a success.
I think a lot of these stories make sense if you replace software re-write with a home construction project gone overbudget, one where the overage comes out of your pocket.
I'd say that something that gets done eventually at a great cost is not successful if the case made for doing the project was an overly rosy story of how easy/inexpensive it would be to do. It really hurts when you pay for it, and that is the perspective to take.
As for success being in the eye of the stakeholder, sort of -- in my analogy, the construction company might consider it a success all the way to the bank. That project owner may consider it a success to save face. The project sponsor would likely not.
Then I think the right criterion to apply here is: what was the goal of the project?
If it was a quick cash-grab then obviously a long and expensive rewrite is deemed unsuccessful. If you go for the long haul and want to bring real value to your customers and the rewrite helped you do that, then it was a success.
Nothing is ever cheap in IT, in my opinion. I found that being very upfront with the people with the money landed me more projects than trying to lie about it and reveal the extra expenses one by one over the course of a year.
"No, that's gonna be very far from 50k euros. Be prepared to pay 300k and it's not gonna be a year. Optimistically, two; realistically, two and a half to three."
It works surprisingly well with experienced businessmen. They applaud the honesty and we move directly to the next point -- what are the tradeoffs and compromises.
I've done a rewrite of a major website with another HN'er in about a month of close collaboration. Nobody even realized the platform had been changed before we told them. The original team was a large number of people working part time over a large number of years. Enough cruft had accumulated that a rewrite was the only way to deal with it.
Thanks to Charles Pick (phpnode on here) and a pretty tight plan we got the job done in record time. Yes, we could have done even better. But we did not have the budget and the time, so we worked within those constraints and got the job done. Would have been super to work with more people, more budget and several months to do it. But as it was, it was already a huge improvement over what was there before.
At work we're in the position where three of our main competitors are struggling badly after all trying to rewrite their software. Good for us, we've captured a number of their clients due to this.
We've instead focused on replacing parts of the program over time. If we touch some old code that works but needs to be extended for a new feature, we usually rewrite it and bring it up to date as part of implementing that feature. Other times it might be too sensitive and we just modify a small piece of the old code and let it chug along like it has done for over 15 years.
Instead of spending time rewriting the whole shebang we can keep delivering features, and not having a second team reimplementing everything avoids a huge financial burden.
I should add that you cannot forget the sales side when deciding on doing a rewrite. That is, how are you going to recover the cost of the rewrite?
One of our struggling competitors has, as far as I know, a decent replacement in the form of a rewritten product. But to recover the cost, they're pitching it as an upgrade. So a license to the new version is much more expensive.
Apparently a lot of clients either do not see the added value over their current version, or, given the higher price, they're considering alternatives such as us.
Perhaps. For me, saying "we're gonna rewrite this program" means a specific effort over a relatively short timescale. Incremental or not (when switching platforms, for example).
With our approach we will likely have some code that's 10-15 years old now and which will still be untouched for another 5-10 years.
I argue that’s not a rewrite because it’s not a fundamental architecture change that requires a huge lift and shift (with the accompanying monolithic deployment critical section and associated risks). It’s just an application of the strangler pattern to do incremental component changes.
This is the way I prefer to do rewrites too. In the end you usually end up with most code rewritten, without it taking too much time. Then the last bit to get it all rewritten isn't so hard.
>> So I just want to tell you that I've done successful rewrites! [...] It took a couple of years, the rewrite team was 2-5 times the size of the team that built the old site
So what makes this rewrite successful? I am sure that in that time you could have easily added the I18N, tests, and deployment, and refactored the most difficult parts of the code so that it would be more extensible and easier to maintain.
That it was successfully completed and the tech provides a better basis for future features? Sometimes timeframe and team size are almost irrelevant to the success of a rewrite project.
Not saying that time and money is unimportant, but sometimes time and money are the soft(er) constraints. Extreme example: the toolchain is lost and the original application can not be built anymore, but the business currently swims in money.
I am definitely not saying that adding that good stuff to the existing app incrementally was impossible. They could have done that. But for "strategic" reasons, the stakeholders wanted to replatform, and we did that.
This. I was part of a team that did a rewrite of a product at a large well-known company. Also, saw rewrites happen throughout my career that I wasn't directly a part of but were happening in the same company I happened to be working at.
Maybe I've been lucky but they were all raging successes. At the end everyone is relieved and happy and much back slapping ensues.
I like Spolsky's articles but I hate the dogmatic culture a writer like him creates. His article on rewrites is always a thing whenever I've been part of a discussion on rewrites. It's an anecdote, nothing more.
The 'Spolsky article' may be opinionated and simplistic, but the article on which it was based makes some salient points that should be weighed carefully.
People usually accept that the effort to build a product is oftentimes underestimated, but somehow accepting the same kind of underestimation for a rewrite is hard. So what?
Even if the rewrite takes more time and money it can still be successful if it substantially reduces costs, relative to the original software.
The important part is to push through the hard part to achieve adoption.
There is a trick to rewrites: start by writing a full suite of end to end tests. Once done you'll discover that you can easily stabilize your old code and make changes with this security harness. You may discover that you don't really need to rewrite anything.
If you have to maintain untested code I recommend you get yourself a copy of "working effectively with legacy code" as it is mostly a list of recipes on how to add tests to a codebase. I also recommend it to anyone starting to write something new so they can learn what is useful to test.
Forget about unit testing and start working on end-to-end tests. Selenium, Sikuli, Codeception, Wiremock, and siege are the kinds of tools you want more than whateverUnit. Test your application at the UI level. Test its performance. Your client does not care about your design pattern usage: they want something which respects their specifications.
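One way to build that security harness, shown here in miniature as a characterization ("golden master") test rather than a full UI-level suite: record what the legacy code observably does, then require any changed version to reproduce it exactly. The `legacyPriceWithTax` function below is a made-up stand-in for whatever behavior you need to preserve.

```typescript
// Characterization testing sketch: pin down legacy behavior before touching it.

function legacyPriceWithTax(cents: number): number {
  // Quirky legacy rounding we must preserve during the rewrite, bug or not.
  return Math.floor(cents * 1.19);
}

// Step 1: record the legacy outputs over representative inputs.
const inputs = [0, 1, 99, 100, 12345];
const goldenMaster: Map<number, number> = new Map(
  inputs.map((i) => [i, legacyPriceWithTax(i)] as [number, number])
);

// Step 2: any refactored or rewritten version must reproduce them exactly.
function matchesGoldenMaster(candidate: (cents: number) => number): boolean {
  return [...goldenMaster].every(
    ([input, expected]) => candidate(input) === expected
  );
}
```

A replacement that quietly switches `Math.floor` to `Math.round` fails the harness immediately, which is exactly the kind of subtle regression a rewrite tends to introduce.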
This strategy works especially well if your product provides an API and your customers or partners are motivated to work with you. Then, besides your suite of tests, you can use the customers’ products as verification. That is, if the customers’ products still work, then you can be confident that your rewrite works.
I’ve done this several times with success. There was a period of about 10 years where I specialized in rewriting legacy code. The first case, I rewrote 200k lines of C++ (COM/OLE) code in Java. Then, I joined a startup that had to scrap its entire codebase because of scaling issues. Then, I worked for two other companies that had acquired other software companies (with crappy code), and I led the rewrite and integration efforts. These are the notable ones.
Believe me, rewrites can be successful. They must be done carefully. But, there are lots of ways to manage the process and mitigate risks. I’ve done it at least 10 times, and I’ve never had a failure or substantial cost or schedule overrun.
> There is a trick to rewrites: start by writing a full suite of end to end tests. Once done you'll discover that you can easily stabilize your old code and make changes with this security harness.
I would call that refactoring, whereas a rewrite means starting from scratch. I agree that refactoring is the superior choice if at all possible.
Tests fossilize a design. This is generally a good thing and allows you to focus on refactoring and bug fixes with some confidence that you won't break anything. It's also especially good for products that must adhere to a specific interface or API.
But when you want to rewrite, the last thing you want is to fossilize your design. You explicitly don't want the same system you started with; otherwise it wouldn't be a rewrite, it would be a refactoring.
This isn’t necessarily the case. In my experience, the teams that want rewrites have built big monoliths with no design, lots of coupling, and lots of copy-and-paste code (basically a ‘big ball of mud’). So, what the rewrite does is introduce an architecture that decouples (often valuable) functionality into small, targeted functions and libraries. This enables easier maintenance and modification. It also fixes the brittleness and fragility.
So, the goal isn’t to keep the monolith as is. Instead, it’s to break up its functionality into small parts. Then, the parts may be joined so that the old interfaces work as they did before. But, they may also be used to build entirely new interfaces.
Thanks for putting a name to this. Happens way too often when I start rewriting old code.
I usually rename the test class of the old implementation to OldXTest and try to keep the interface of the new code similar enough to enable a reasonably quick transition into the new code with proper unit testing.
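A sketch of what that transition looks like: one shared contract, checked against both the old and the new implementation until the cutover is complete. The slug-generator interface and classes below are invented for illustration.

```typescript
// Run the same spec against old and new implementations via a shared interface.

interface SlugGenerator {
  slugify(title: string): string;
}

class OldSlugGenerator implements SlugGenerator {
  slugify(title: string): string {
    return title.toLowerCase().replace(/\s+/g, "-");
  }
}

class NewSlugGenerator implements SlugGenerator {
  slugify(title: string): string {
    // Same contract, cleaner handling of non-alphanumeric characters.
    return title
      .toLowerCase()
      .replace(/[^a-z0-9]+/g, "-")
      .replace(/^-|-$/g, "");
  }
}

// The shared contract both implementations must satisfy; run it once per class.
function checkSlugContract(impl: SlugGenerator): boolean {
  return (
    impl.slugify("Hello World") === "hello-world" &&
    impl.slugify("Rewrite Done") === "rewrite-done"
  );
}
```

Keeping the interface identical means the switch is a one-line change at the call site, and the old tests (renamed or not) keep running until you delete the old class.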
> There is a trick to rewrites: start by writing a full suite of end to end tests. Once done you'll discover that you can easily stabilize your old code and make changes with this security harness.
Sometimes, but it's not always that simple. Useful software tends to interact with external systems, which might not be amenable to that sort of automated end-to-end testing. Also, the objectives of a rewrite might include enabling new integrations and/or user interfaces that deliberately don't work as drop-in replacements for what was there before, so they wouldn't be expected to pass the same end-to-end test suite. Automated testing is useful in the right context, but IME it's rarely the whole story, and there are often data migration exercises and new integration tests to be done as well.
I don't think he said it's simple. As someone who tried to add unit tests to legacy code, I can assure you it is anything but simple. However, it is a sound approach.
> Useful software tends to interact with external systems, which might not be amenable to that sort of automated end-to-end testing.
That's what mocks are for. Writing effective mocks is an art, though. Again - not trivial.
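A minimal sketch of the idea, with an invented payment-gateway interface standing in for a real external system: the code under test depends only on the interface, while the mock records calls and answers deterministically.

```typescript
// Mocking an external dependency behind an interface.

interface PaymentGateway {
  charge(accountId: string, cents: number): boolean;
}

// The real implementation would call out over the network; the mock
// just records every call and returns a deterministic answer.
class MockPaymentGateway implements PaymentGateway {
  calls: Array<{ accountId: string; cents: number }> = [];

  charge(accountId: string, cents: number): boolean {
    this.calls.push({ accountId, cents });
    return cents > 0; // deterministic stand-in for real gateway behavior
  }
}

// Code under test only ever sees the interface, so the mock slots right in.
function checkout(gateway: PaymentGateway, accountId: string, cents: number): string {
  return gateway.charge(accountId, cents) ? "paid" : "declined";
}
```

The recorded `calls` array lets a test assert not just on the result but on what the system under test actually sent to the external dependency.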
I shall respectfully disagree with you here. I mean, yes, obviously that is literally what mocks are for, but I have never found mocks to be a particularly effective or valuable tool for testing. They can take a disproportionate amount of time to write and maintain if whatever real external system they stand in for is complicated. They are inherently fragile if that system is subject to change. Most importantly, even if those tests pass, you don't actually know whether your real system is going to work, and IMHO the greatest benefits of automated testing are found where you can systematically and repeatably exercise exactly the behaviour and interactions you might see in production.
The pre-requisite here is that you first enumerate a fully complete spec of all behavior that your clients believe it should have. This seems too high a burden for what you're trying to achieve.
My experience has seen it work well as either 1) bottom-up: pick a small enough section of the codebase such that you can fully understand it, then rewrite just that piece, repeat until complete to get an exact copy of your existing application or 2) top-down: you write a new application from scratch completely, starting with understanding what your business's goals are and how to best serve your clients
WEwLC is excellent.
IIRC, its definition of "legacy code" is "untested code".
End to end tests are great for providing confidence that the code does what it says on the box.
Though end to end tests are slower than unit tests, and it can be tricky to track down why an end to end test failed.
A "Test Pyramid" seems a good idea to me.
(Unit tests can be quick, but don't cover much of the system.
E2E tests cover a lot of the system, but aren't quick.
"Test Pyramid" suggests it's better to have more unit tests relative to E2E tests).
"Only unit tests" or "only end to end tests" don't seem like practical things to aim for.
A good rewrite is done bit-by-bit releasing the rewritten work immediately. This forces you to focus on deliverable chunks rather than code-for-code's-sake and to test your assumptions instantly. If you do not release your work there is a very large chance that it will all be for nothing.
A good rewrite happens so subtly that end-users and operators will never really realize that a rewrite is underway.
There are only very rare cases where a rewrite in big-bang fashion is indicated and even then the bulk of those will be incompetence on the part of the tech crew because they see no way to turn the job into an incremental one (or do not want to see a way).
This is 100% correct. We're in the process of doing this now with our products (and our internal systems, also), and it works well.
When you have products and systems that have been around for a decade or more, major portions of them become outdated and need to be rewritten (or largely rewritten, the two are synonymous to me). Best practices in crypto change, operating systems and hardware improve/change, etc., and if you want to keep generating value, you had better change along with it. You can see where organizations don't do this: the developers force weird constraints on IT like needing to use old, obsolete versions of operating systems because the software won't run on newer versions.
Along these lines, one of the problems that I think exists with the software industry today is an inability among developers to recognize that software sticks around a lot longer than one might originally envision. And this phenomenon only gets worse (better, for the end user) as the value provided by the software increases. It's a bit of a Faustian bargain: everyone wants their software to be used and provide value, but often don't realize the "soft commitments" being made in the background that can tie you (or the business) to the code for years (or decades).
It makes me think of a dual of modular system design: a rewrite should aim at finding or validating minimal interfaces so a subsystem can be rewritten at minimal cost and uncertainty (which is the goal of modules, I believe).
I totally agree with you. In most cases this is the most reasonable strategy, unless you really understand what you are doing. A lot of good code bases live this way. There is usually too much accumulated in the existing code. A complete rewrite is worth it only when you know for sure that you can throw away most of the details and nuances of the existing code.
Incremental rewrite isn’t a rewrite at all. It is the better choice though. It is worth noting that refactoring and improving an area of the code without a need to change it or build new features on it is also wasteful.
Your comment is great except for the first sentence. Maybe you could elaborate on what you mean, but of course an incremental rewrite is a rewrite. All rewrites are incremental; the proposal is to choose the order of increments of a system rewrite such that it is fully working both before and after each increment, rather than starting from scratch and waiting until all increments are complete to find out whether the full system works and has feature parity. I’d guess you already knew that, and just have a notion that the word “rewrite” means it needs to be from scratch and all at once for some reason?
I’m just trying to be absolute on the fact that incremental change of a single running system is more efficient and shouldn’t be confused with ‘rewrite’ projects where a second system, that is not used by the business yet, is developed while the business continues to use the ‘old’ one.
Calling incremental improvement a form of rewrite gives it a bad name :)
I’d argue that rewrites that are a second system that (hopefully) eventually get deployed once the features exceed current system is not incremental at all, but is instead Big Bang.
Even if the features of second system are built incrementally, if the business aren’t using it until it is finished then it is Big Bang.
If the plan is to change all the parts one by one until done, it's a rewrite. If the parts are changed one by one for independent reasons, it's just code evolution.
I've seen 2 non-trivial rewrites of the mainline software at 2 companies. Non-trivial = 10-100M in yearly revenue.
In both cases:
- it was promised to take 6-9 months, and ended up taking 3-4 years
- it never really finished, the old software had to remain in production (along with the new "rewrite")
- in some form or the other, while the rewrite was happening, the company lost its customer focus and/or its ability to innovate
- good people left
Instead of rewrites, make incremental changes. The eng. team should never be off doing its own thing.
Also, a good lesson from Facebook: instead of rewriting their PHP codebase, they extended the language by creating the "PHP++" language Hack (along with the HHVM runtime), and incrementally changed their codebase to take advantage of Hack.
> Also, a good lesson from Facebook: instead of rewriting their PHP codebase, they extended the language by creating the "PHP++" language Hack (along with the HHVM runtime), and incrementally changed their codebase to take advantage of Hack
I agree with most of your comment, but I don’t think this part generalizes. This isn’t feasible for most companies unless you’re operating at Facebook scale.
I agree it doesn't generalize, but I don't agree that it's because of "scale". It doesn't generalize because not every company should write its own language (and can't hire good enough engineers for this). And, fortunately, most "legacy" codebases aren't in "bad" languages, so there's no need for this.
Afaik, the team that did Hack/HHVM at Facebook is ~5 people. I don't think you need scale for this (the rewrite of the code itself is not a scale thing, the codebase is usually linear-ish in the number of engineers).
My point is: instead of doing a rewrite, be inventive and avoid it, and this is a great example.
And this made sense for Facebook because it probably would've taken 100s of engineers to rewrite all their php.
Also, not every company could hire the 5 people needed to do their own version of a hack/hhvm project. But at some point it makes sense to find those people instead of rewriting in an unrelated language
It’s not about how many engineers it takes to make a new programming language or runtime.
Facebook has the scale to invest in creating tooling, IDE extensions, core libraries, documentation, training, test frameworks, bindings, etc for a new language that they create. They also have the organizational scale and career development to make it worthwhile for an engineer to learn their proprietary language.
A competitor can also eat your lunch with a greenfield product if you are unwilling to move on from your legacies.
I guess “rewrite” is bad, but “reinvent” with experience from the former system shouldn’t be shied away from. Incremental improvements might only take you so far before competition overtakes you.
The only point I'd like to add is that incremental rewrites often have long-lasting value in terms of a culture that emphasizes continuously paying down tech debt.
Far too often I have found that rewrites, which in many cases involve moving to new hardware and possibly a new OS, tend to drag on because those involved do not know all the touch points of the current system, and also fail to properly account for the investment other teams have made in making connections to the original systems.
I have even seen them done to bypass a management/development team that was in disfavor with the company leadership. In effect, put together a new platform with new leadership.
You say it as a negative, but in practice, the old code is the only reliable specification.[1] Written specs are never comprehensive enough. If you rewrite based off of specs, you're bound to have both bugs and missing features.
[1] Well, if the codebase has very comprehensive tests, that's likely better than the old code. But few projects have this.
I specialize in fixing bad software projects, typically taking the role of principal developer/architect. Over nearly 20 years of experience, every rewrite I took part in failed: either by not delivering the improvements, by causing a lot of unplanned headaches for the business, or by wildly missing all planned deadlines and cost estimates.
On the other hand, a couple of years back I started purposefully taking up projects in bad shape and fixing them by cutting short discussions about rewrites and instead putting work where it matters -- diagnosing the problems and finding solutions to them.
I found that most of the teams I met had stopped, or indeed never really had, any practice of constantly diagnosing and improving their process and application. This is usually why the application is in bad shape, but more often than not the team will blame the organization, their predecessors, or constraints like the old technology they are working with. They channel their frustration by focusing on the idea of the rewrite, which seems like a relief from the frustration of the current codebase. Unfortunately, the rewrite is performed using the same process that failed the previous version, with predictable results.
Think of a person that has a messy house. This person does not have the practice of keeping things clean and in order. The person decides he/she will fix the issue by building another house and burning the old one with all their belongings. The result is predictable.
The solution is to learn to keep things clean and in order instead of burning the old house and re-building it at great cost and effort.
I agree that in most cases it is the fault of the organization that problems were swept under the rug for far too long. There are cases where requirements changed so much over the years that the architecture of the system would no longer support the business without massive cost or overhead.
We have plenty of ideas on how to build new software: agile, TDD, DDD. It seems that there is no proven process for doing rewrites.
> I found that most of the teams I met had stopped, or indeed never really had, any practice of constantly diagnosing and improving their process and application.
Can you elaborate a bit more? Using your analogy I imagine this would be someone tidying up every evening before they go to bed. But can you maybe describe what it would look like/how the process might work for a team maintaining an application?
"Think of a person that has a messy house. This person does not have the practice of keeping things clean and in order. The person decides he/she will fix the issue by building another house and burning the old one with all their belongings. The result is predictable."
I believe your unstated epilogue is supposed to be something like "Rather than build a new house, it is best to learn how to take care of the one that you already have." Great. So let's suppose that happens. Now the team has a new skill. They've learned how to organize their code and fix the problems in it. So should they incrementally improve the code, or should they do a complete re-write? Your analogy does not stretch that far into the future.
I agree with this, and I've had the same experience:
"I found that most of the teams I met had stopped, or indeed never really had, any practice of constantly diagnosing and improving their process and application. This is usually why the application is in bad shape, but more often than not the team will blame the organization, their predecessors, or constraints like the old technology they are working with."
My sense is that what these teams need is new leadership. It doesn't really matter if their code is written in Fortran or C or Javascript or Go, what they need is good quality new leadership. Once they get that, their situation will improve. However, the new leadership needs to have the freedom to take the team in the direction of those skills and experiences that the new leadership has gathered over their lifetime. If the new leadership has a lot of experience in Ruby On Rails, then a complete re-write to Ruby On Rails can be justified, because if the new leadership is good, then they will produce something good in Ruby On Rails, and it will be better than what the old organization had before.
And indeed, I think many real life re-writes happen because of exactly this set of circumstances.
An interesting point: code is (probably) not as ugly as it seems -- it just looks that way because you didn't write it. What looks like cruft is actually the accumulation of years of bug fixes and edge case handling.
... or the accumulation of hacky extra features not in the original scope and architecture? In the beginning it might have been fine code, but not enough time has been allocated to do new features right.
Rewrites might be the wrong term. "Refactoring" is a better one. If you already have functioning code you don't need to rewrite it, but fix the mess.
If you have undocumented legacy code that's a mess, and it doesn't work with quite different new requirements, it might be faster to just rewrite most parts and cherry-pick the code that seems to handle edge cases and do opaque interfacing with other black-box systems.
I aim at spending half my time refactoring and documenting, so in reality I'll end up with at least some time. The average at a random company I've been at is probably around 0.1%.
There's a very big difference between refactoring and rewriting, IMHO. As you say, with rewriting you're throwing away functional code whereas with refactoring you're shifting it around.
Both have their place but refactoring happens regularly whereas rewriting is a more drastic option that should be used only when absolutely necessary. Having a robust test suite is very important with rewrites to prevent regressions.
There is a lot of advice here to do incremental rewrites, refactoring, unit testing, functional/system testing, etc.
Undoubtedly it is all mature and correct advice. However, I feel it is all one-sided and I ought to provide some counter points:
1. The codebase in question may be well beyond the line where any sane person would touch it, seriously.
2. Individuals experienced with the codebase, its structure, implementation, technology stack might be not available (think cobol).
3. Refactoring or incremental rewrite is a process that has to be planned and managed subject to the current product/codebase structure. Often it is this very structure which demands the full rewrite -- because it is too convoluted to allow refactoring to take place.
4. You might want to do a rewrite to refresh the tech. stack -- language, design, frameworks -- it is ok to do so.
5. You do not necessarily need to come with the same feature set. Both you and your customers might want to cut down unnecessary cruft. Technical debt usually starts to show its signs by losing flexibility w.r.t. user requests to change features.
6. Occasionally you might come up with a separate, different product, with a different name and brand -- which might turn out to be even better!
I think this is a pretty reasonable set of counter points, even if I don't agree with them.
Being part of team incremental rewrite, myself, I'd say that you're overestimating the cost of the incremental rewrite and underestimating the cost of the rewrite.
For your first 2 points especially, I'd argue if you can't even begin to incrementally rewrite a system, you're in no position to begin to plan a full rewrite.
It is mainly a matter of decision making. In my case, no refactoring was ever sanctioned by itself. Devs were only allowed to refactor on a small scale, and always backed by a feature request. So by the time the desire to somehow "heal" the codebase is finally acknowledged, it is usually because the codebase is already too "dead" to work with in the first place.
> 1. The codebase in question may be well beyond the line where any sane person would touch it, seriously.
> 2. Individuals experienced with the codebase, its structure, implementation, technology stack might be not available (think cobol).
There are people that reverse engineer binaries, find a place they can use a buffer overflow to insert a jump and then “program” by jumping around to different parts of the existing compiled code.
Your legacy cobol spaghetti code isn’t impossible to understand -- get better, or better motivated, engineers.
> 3. Refactoring or incremental rewrite is a process that has to be planned and managed s.t. current product/codebase structure. Oftencase it is this very structure, which demands the full rewrite -- because of it being too convoluted to allow refactoring to take place.
It does take extensive, often tedious effort, but there’s no such thing as too convoluted. A giant ball of tangled string can be pulled apart one knot at a time.
> 4. You might want to do a rewrite to refresh the tech. stack -- language, design, frameworks -- it is ok to do so.
> 7. You learn a lot in the process.
It probably is very true that a rewrite is often better for the resumes of employees. But employment is a fiduciary relationship. By cashing that paycheck you’ve agreed to put the company’s interests before your own. And rewrites kill companies.
When someone new comes in and within six months of starting starts pushing for a rewrite he should be fired and a hard look should be taken at what went wrong with the hiring process.
I hate to criticize, but people need to know that this is a bad article. If I had to summarize the errors into three sentences, they'd be "You don't know how to write software so your software sucks. That means you won't know how to do a rewrite either. Here's some math I made up."
The simplest way I can show the error? If making a new piece of software is so expensive that it's far, far too expensive to do, how come folks enter the market every day with new software which keeps replacing all of that stuff you think is irreplaceable? And they not only replace your stuff with better stuff, they do it at a fraction of the cost you take just keeping the lights on in your shop.
Math is not going to save you now. Looking at how fast features can be deployed is sticking your head in your ass. It's counting things for the purpose of counting them. It doesn't work like that. You may enjoy counting, graphing, and tracking points-per-sprint or feature-speed-per-team, but nobody buys or uses software based on feature count or team speed. It's not that these things aren't important. It's that you're confusing managing things with value creation.
But the larger issue here is that organizations don't know how to create and harness value. They're really good at hiring, managing, and a few other things. But those other things don't come directly into play here. It'd be great if they did, but they don't.
You don't know how to create value. Step 1 is admitting that. Without that admission, no amount of charting or graphing is going to help. And yes, you can't rebuild your software. Probably lucky you've got it up enough to provide value right now. I'd hang on to it.
I was leading an effort a while back to look at replacing a group of 40-odd systems, tied together by batches, to run a large worldwide retail operation.
After scoping out the work, my recommendation? Build a small app to handle receiving. You do receiving at all of your locations, it's being done by several different separate systems, and it's an opportunity to write a small, cross-platform app that can be used by anybody with zero training right away.
It was shot down! Why? Because "large projects aren't done that way around here".
That has nothing to do with anything, yet it prevented getting started immediately.
As a hired-gun, I moved on to bigger and better things. The org dropped 100M+ on just the kind of rewrite this author is talking about before giving up in failure. (Actually they changed the goalposts so that they won, then had a big party. But there was very little done compared to the money they spent)
It's the wrong mental model. It's painful to watch, like a kid with a big hammer trying to make a large circular block fit inside a small square hole. It's not going to be good even if somehow you make it happen. It's going to be ugly as crap. You end up destroying the thing you're trying to help.
> The org dropped 100M+ on just the kind of rewrite this author is talking about before giving up in failure.
That is evidence that the author has a point. He also said, at the very end, that this is a two-part article, and part 2 will deal with how to address these problems. I would not be surprised if he advocates something like the approach you suggested in this case.
Spot on. It's such a fascinating dynamic - on the one hand, you've got large players with incredibly deep pockets, coasting on market inertia despite utterly dysfunctional internal structures.
On the other, you've got small, scrappy teams whose survival depends on quality, because they only ever have a month or two of payroll in the bank.
To me the biggest issue with re-writes is that it tends to be Engineering working in a vacuum without business input. What that tends to lead to is “blind” 1-to-1 reimplementation of features without reflection on which features are _still_ bringing business value, e.g. if most of your user base transitioned to mobile, is it really worth reimplementing all the things that were built with users on IE on Windows in mind?
Or put another way, the place where you can save a ton of effort in a rewrite and increase your chances of success is in throwing stuff away, but you usually need more than just the engineering team to achieve that.
Oddly enough, my experience is that when IT asks Business "can we finally drop support for feature X? We are pretty sure that Roman Numerals are not used anymore for budget reports..." the answer is "I am not sure, maybe it will become useful again in a year or two. Let it alone".
You haven’t communicated the cost. To the business manager it’s essentially free to leave it in - so why not keep the optionality.
If you could estimate “keeping feature X costs $200,000 per year in added development time” it would be an easy decision. Through this process of cost estimating you may find that while annoying it only costs $10,000. Any ill will generated from removing it would cost more than that so it should be left alone.
This has never worked.
The project has a maintenance budget that covers everything (bug fixing, new features, migration to new versions of OS, DB etc.).
Business requires a certain number of enhancements, the development team is already understaffed and therefore they are often missing the required number of enhancements in a specific release cycle.
So the full discussion goes something like this:
"Can we remove support for Roman Numerals? We are pretty sure nobody uses these anymore, and we estimate that this would cost 20 man/days in this release but result in a saving of 800 man days over the fiscal year..."
"Not sure, it could be useful again, use these 20 days to add support for Aztec calendar to the Insurance reports - you have already postponed this twice!"
(I.e. they get the impression that somehow there are 20 "extra" days available and those have to be diverted to implement something that may have already become obsolete).
I am working on a legacy application which brings in 90% of the revenue of the company (imagine something managing bookings for an hotel chain: of course we get also revenue for what guests pay during the stay, and maybe for cups, pens, and bathrobes with our logo, but the vast majority comes from the bookings, of course).
As such, every different part of the business may request changes/enhancements - our team provides these to lots of different "business departments" across the whole company.
Again, imagine a hotel chain with hotels all over the world, each national branch may require specific changes due to local laws and regulations, or because they need to start a new incentive campaign or participate in a joint venture with a flight company or whatever.
There is one application, and N different (competing) "customers" each one considering only their own specific plans and priorities. (In case of conflicts, the pecking order will be used to solve who gets more attention: biggest hotels, or hotels in regions that bring more revenue have more "clout").
Now, when I say "we have to postpone your request for X in order to recoup 780 more days later" the guy in front of me will immediately conclude that he will not necessarily get a bigger share of these 780 days - he will have to fight for his piece just as strenuously as before, and in any case this will happen maybe in three months, and he needs his stuff yesterday - so he better insist to have his own specific request included in the release, no matter if it costs 21 days, 20 days, 5 days: he wants this to be done because the rest of his business needs it for a specific date, and everything else is just a way for IT to postpone his request once again.
In other words: everybody wants their own specific request implemented as soon as possible, and anything else has absolutely zero interest for them. Especially if it is some promise of future "gains" from guys who are constantly late.
In my experience, this is not so uncommon when you work on an app that has been developed internally (and so there is no unified marketing department which represents a single stakeholder) - note also that precisely because we have to work for a myriad different "internal customers" we tend to accumulate "technical debt" at a faster rate: there is only one codebase, and has to accommodate all these pesky requirements from all over the world...
If you have multiple teams competing for features, it shouldn’t be too hard to sell it that way.
“If we could open up 780 days on the schedule - what new features would you like?” At this point you want the client dreaming of all the extras they never thought they’d get to have.
Then at the end you mention the 20 day delay. At this point they will feel the “loss” of the new features they just imagined. It’s an extremely powerful sales technique that works almost everywhere.
Given your earlier comments about the extra 20 days they seem particularly susceptible to this technique.
If you announce that you have 4 man years free for new tasks, in a team that may not even have 4 developers, after abandoning a feature that already existed and worked, you're delusional and so is the stakeholder if he believes it.
I agree on the principles nonetheless. You want clients to write a formal request for new features, then development has a backlog of requests and can prioritize them.
Clients might not like the prioritization but that's life, limited bandwidth, it's simple to show there is too much to do and not enough resources.
I’m using the numbers I was given as I’m not in a place to judge if they are realistic. Certainly if you make false promises no amount of sales techniques will save you in the long run.
The 200000 vs 10000 (and therefore the 20 vs. 780 man days) come straight from your example.
Which is part of the problem: while I know for sure that there are a lot of things that should be refactored/changed/improved/cut off, it would be very difficult for me to give an estimate of "how much resources we save over the next year" - as a simple example, I could revamp the GUI significantly (God knows how much it needs it)... and in the next six months 85% of the requests for changes have to do with business logic, financial batches running at the end of the month and stuff like that, maybe because there are new regulatory hurdles to clear like GDPR or the EU Travel Directive.
Then the promised extra resources that would have become "available" fail to materialize, and the whole initiative is considered a failure. Good luck trying this again on the next year.
The benefit depends on what part of the system I decide to optimize. If I guess right (and enhance something impacted by many/big requirements in the future) I will reap good results. If I enhance something which stays basically untouched for the rest of the FY I will be considered a fraud or misguided.
Applications developed internally don't necessarily have a roadmap (or might have it and abandon it 2 weeks into the new year because).
Problem is, nobody will consider you a hero for sticking to the (now obsolete) plan.
In general you can't rely on users to give you a reliable answer to any question like that.
I suppose the way to deal with it is a feature switch: turn it off and if nobody notices after a year then remove the code. That, or add some instrumentation to identify what is really being used and what isn't.
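A minimal sketch of that instrumentation idea, assuming a flag lookup you can count (the flag name, the Roman-numeral feature, and all identifiers here are invented for illustration):

```python
import logging
from collections import Counter

log = logging.getLogger("feature_usage")
usage = Counter()                             # how often each flag was actually exercised
FLAGS = {"roman_numeral_reports": False}      # switched off for the trial period

def feature_enabled(name: str) -> bool:
    """Check a flag and count the lookup so real usage can be reviewed later."""
    usage[name] += 1
    log.info("feature checked: %s", name)
    return FLAGS.get(name, False)

def to_roman(n: int) -> str:
    # minimal Roman-numeral formatter for the legacy path
    vals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
            (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
            (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for v, s in vals:
        while n >= v:
            out.append(s)
            n -= v
    return "".join(out)

def format_budget_total(total: int) -> str:
    if feature_enabled("roman_numeral_reports"):
        return to_roman(total)                # legacy path, candidate for removal
    return str(total)                         # plain formatting, the normal path
```

After the trial period, `usage` tells you whether anyone actually hit the legacy path, which is a stronger argument than asking the business to guess.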
I once worked for a company where we as IT decided that there would be no rewrites without new features. It was basically impossible to get funding for rewrites that would not yield new features for the business. I felt this was the best for both business and IT.
I've had a lot of success with rewrites where the outcome is a lot less functionality, but with one or two new features that have a huge amount of value.
For example, rewriting a crufty ordering system that drops 50% of the edge-cases that have accumulated over 30 years, but making it work on mobile devices.
One of my employers once had a queue for pretty much every asynchronous action that had to be done in the platform.
The workloads were categorized by type, and a scheduling script decided on the order by deriving an integer from a timestamp and using it as an id for each category. This mostly worked, too. Sometimes, if previous workloads took too long, the following category would not be triggered for several cycles... But most of the time, everything was done in a somewhat timely manner.
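If I'm reading the mechanism right, the dispatcher derived the active category from the clock. A toy reconstruction (the categories, cycle length, and mapping are guesses, not the actual system):

```python
CATEGORIES = ["emails", "reports", "sync", "cleanup"]
SLOT_SECONDS = 60          # each category gets one time slot per cycle (assumed)

def current_category(now: float) -> str:
    """Map the timestamp to a category index: whichever slot the clock
    is in right now is the only category that gets dispatched."""
    slot = int(now // SLOT_SECONDS) % len(CATEGORIES)
    return CATEGORIES[slot]

# The failure mode described above: if "emails" overruns its slot, the
# dispatcher wakes up inside a later slot, and "reports" is simply
# skipped until the cycle comes around again.
```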
That is a rewrite that needed to be done, and yet would never be doable with your strategy.
There is no new business feature to add, but it still needs to be fixed because as each workload increased, the implemented feature got less stable.
Yep, this is where I've had a lot of success with rewrites: where a lot of the accumulated features in the existing system are no longer required by the business, and you can make a much simpler system to serve the remaining use cases
> it tends to be Engineering working in a vacuum without business input.
Anecdotally, I’ve had precisely the opposite experience at times. Though I can tend toward what you described as a fault, I’ve also experienced abject need to rewrite internally and externally produced software to fulfill business requirements.
Some businesses decide that they want a new thing and find some developers to work on it. Neither seem to care about existing systems, or integrations. It will be one more legacy soon, if it ever gets anywhere.
If the old software is convoluted, it may be because the old business processes are convoluted.
A software rewrite should be done in the context of a business process redesign. Consider how information flows through the organization. Do you really need to do all the things you're doing now? Should different people or departments get different information? Should the product you're selling or the service you're providing change? Should parts of the process be outsourced or insourced?
If an IT department is considering a big rewrite, but doesn't have the authority to look at the business as a whole, then the project is being managed at too low a level. Rewrites make the most sense in the context of a change in the underlying business.
"We'll do a rewrite, simplify everything, get rid of all the edge-cases."
Guess what? It turns out that all those edge-cases were there for a reason. You needed them. And by the time you've added them back, or equivalents, two things have happened. First, you've burned an order of magnitude more time than you budgeted for the rewrite. And second, you have a codebase every bit as gnarly as the one you started with.
First rule of rewrites: don't do it.
Second rule of rewrites (for experts only): don't do it yet.
Sometimes the reason has not been relevant for ten years, though, like when a codebase is filled with horrible hacks to account for the fact that you only have 64MB of RAM, while running in a vm with 4GB.
That is right, but you should be able to point at (preferable several) such assumptions that went into the original system, and verify that they are no longer necessary.
Maybe you used to have much less RAM in the past, but also much less data, or not as many users etc.
I worked on a Java Swing app that was deployed to a phone center as a Citrix app. Each instance was allowed only 64mb. Use of Citrix sidestepped all Java version and deployment issues. An anecdote for why memory limits can be relaxed today.
Junior as in
* have been exposed to few real-world development projects, or,
* have little understanding for the interplay between development and business.
(The junior developer may however be very good at programming in languages X, Y and Z.)
Many developers will be junior under this definition for all of their career.
The exception I have seen from this is consultancy sales people, who want to use the hype of the new, often untested (or for the task non-optimal) technology or paradigm X.
When the original system was written by developers who were not junior, as is often the case with large successful systems that have survived, the result of letting junior developers rewrite the system will be predictable.
Tip: Be cautious when listening to developers that talk about “technical debt” and similar. Make sure they are not junior developers under the definition above, or just not that good at reading code and understanding real-world systems. Similarly, when listening to a sales pitch by a consultancy firm, make sure they have your interest in mind.
One system I worked on was an information system in VB6 which was being used as a base platform to build domain-specific ERP implementations. Deployment in their environment to anything non-browser-based was a nightmare. There was little overlap between domains, although there was shared data (customer and supplier information, product database, that kind of thing). The obvious thing to do here was to let each domain (i.e. team) build their own solutions in the most effective platform for their domain whilst integrating via defined interfaces. That's not what happened; when I left few teams had anything working and those who did spent the majority of their time on deployment issues.
Another system I worked on was initially built by a very junior team under tight time and budget pressure, under a leader who believed in never saying "no" to the customer. They created a horrible mess (most of the code was in a couple of files). The system was deployed by two or three customers. Testing was done during deployment; the first version worked well enough to go into production. A year later everything had ground to a halt; there was no formalised testing and the developers couldn't fix one bug without introducing several more. One very smart developer started to "rewrite from the inside" by implementing new features using clean and modern techniques. At that point I took over the team. We doubled down on this "rewrite" whilst adding manual tests (and later automated tests) to cover the most important flows. We only rewrote features when they needed to change and in some cases removed functionality when particularly hairy features no longer worked.
Otherwise I advise against rewrites; if the system was reasonably built and the platform viable then rewriting is just a waste of money.
The edge cases aren’t removed, they are considered in the architecture of the rewrite while they were square pegs in round holes in the old system.
So rewriting and knowing the full scope is a luxury that allows a much better design.
The problem is that in the new system there will be new cases that don’t fit the new design.
Rewriting shouldn’t be a decision taken lightly but it also shouldn’t be avoided at all cost.
The alternative isn’t massaging your old system when the technical debt weighs it down too much. The alternative is just maintaining it without improving it, which eventually sees the system overtaken by competitors.
This is a harmful tautology. There really is a such thing as bad code. Some edge cases are introduced by the bad code itself and changing the paradigm can make many difficulties disappear.
Having the same people who wrote the code work on the rewrite is silly though. How could they be expected to produce something better?
“I divide my officers into four classes as follows: The clever, the industrious, the lazy, and the stupid. Each officer always possesses two of these qualities.
Those who are clever and industrious I appoint to the General Staff.
Use can under certain circumstances be made of those who are stupid and lazy.
The man who is clever and lazy qualifies for the highest leadership posts. He has the requisite nerves and the mental clarity for difficult decisions.
But whoever is stupid and industrious must be got rid of, for he is too dangerous.”
I would say the 4th class of people are the ones likely to churn out a horrifying codebase. Such people are also unlikely to be able to recognize and learn from their mistakes.
> But whoever is stupid and industrious must be got rid of, for he is too dangerous.
> I would say the 4th class of people are the ones likely to churn out a horrifying codebase. Such people are also unlikely to be able to recognize and learn from their mistakes.
These are the people that do “negative work”. They cause other people to do more work than if they weren’t there.
I disagree. The quote is for where to place personnel in a leadership hierarchy. A leader without the appropriate aptitude is obviously harmful; but I would say there are actually no grunts in software. Every developer has to lead at least themselves.
> Having the same people who wrote the code work on the rewrite is silly though. How could they be expected to produce something better?
Only if you have really, really poor developers on your team.
Every decent developer learns the pros and cons with how they implemented something. They'll bring that with them to the next time they need to implement something similar. And again, and again.
Not necessarily. I’ve worked with an “architect” who started at the company as the only developer and as it grew became the dev lead just because of seniority. He kept making the same mistakes and was the definition of an “expert beginner”.
He ran away everyone who was brought in to bring some outside perspectives. The only people left are those who gave up and go with the flow because they can’t find a better job without moving (small city with only three major technology employees) and those who don’t know any better.
Yup. I'm in the midst of rewriting a certain part of a robot fleet system. A lot of what we are replacing is obsolete or never needed or poorly implemented. But the first generation was good for its own reason: get a functioning system to market before everyone else and start learning everything about what's right and wrong with the design.
I really hate this cute quote about rewriting because it's elitist (you have to be an expert to do it right) and generally sells the false idea that rewrites aren't a valid part of iterative product building.
Yes, there is often a lot of domain knowledge embedded in that legacy code. The sort of information that is difficult to tease out of the business or understand by looking at the code.
A rewrite can also mean a gradual change. A great example of this is the rewrite of 0install in OCaml from Python [0]. With such an approach you change module by module with old code and new code working side by side. It is of course not so easy, but it makes it possible to do on-line. A bit like with bits of Rust code that go in Firefox with each version.
I would like to know if anyone participated in a similar gradual rewrite and how it went.
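For what it's worth, the routing side of such a gradual (strangler-style) rewrite can be sketched in a few lines. Everything here is illustrative, not from 0install or Firefox; a facade keeps the public API stable while individual modules are flipped over one at a time:

```python
def legacy_parse_feed(raw: str) -> list[str]:
    # old implementation, kept alive until its replacement is trusted
    return [line.strip() for line in raw.splitlines() if line.strip()]

def new_parse_feed(raw: str) -> list[str]:
    # rewritten implementation, rolled out module by module
    return [line.strip() for line in raw.splitlines() if line.strip()]

# Flipped per module as each rewritten piece lands and proves itself.
MIGRATED = {"parse_feed": True}

def parse_feed(raw: str) -> list[str]:
    """Stable public API; callers never know which implementation ran."""
    impl = new_parse_feed if MIGRATED["parse_feed"] else legacy_parse_feed
    return impl(raw)
```

The point of the switch is that it can be flipped back instantly if the new module misbehaves, which is exactly what a big-bang rewrite can't offer.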
It may be very hard for some architectures, and sometimes the wrong architecture (a prototype became production, some features were dropped along the way, a new feature is dog slow) may require a big rewrite, even breaking APIs -- then it would be impossible. But if one keeps the API, one can at least run comparison tests as more features are implemented.
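The comparison tests mentioned above can be as simple as fuzzing the old and new implementations behind the same API and demanding identical answers. A sketch, where both implementations are stand-ins for real ported features:

```python
import random

def legacy_total(prices):
    # old implementation (stand-in for a real legacy feature)
    total = 0
    for p in prices:
        total += p
    return total

def new_total(prices):
    # rewritten implementation behind the same API
    return sum(prices)

def test_equivalence(trials=1000):
    """Feed both implementations the same random inputs and demand
    identical answers; run this for every feature as it is ported."""
    rng = random.Random(42)    # seeded, so failures are reproducible
    for _ in range(trials):
        prices = [rng.randint(0, 10_000) for _ in range(rng.randint(0, 50))]
        assert legacy_total(prices) == new_total(prices)
```

Seeding the generator matters: when the two disagree, you want the failing input back, not a heisenbug.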
Rewrite from scratch is a very risky business, especially if it is conducted under the technical view only. The only exception is when the rewrite-team is the same as the original development team, which is almost never the case, except in the continuous delivery model.
For me, the consequence is clear:
- The rewrite cost is so high that it never justifies doing architecture and software engineering sloppily, even in an agile approach.
- Rewriting only critical pieces of software is always the better approach. In most cases, rewriting just a few pieces can satisfy the requirement at least to 80%, be it better performance, higher stability, or interfacing with other software.
- Therefore, for any big piece of software, modularisation is always helpful.
> As people move off of the team, features are forgotten and misunderstood; but as long as those features continue to work, there will be customers continuing to depend on them.
You have this in the non-rewrite scenario, too, it's just part of the technical debt, so it's less obvious to see. And it will bite you just as hard if you need to update the feature.
Rewrites are good at uncovering unknown unknowns, so it might be better to say that the technical debt was underestimated.
I'm approaching the rewrite of one old system in the next year (not really large, I'm estimating that this rewrite will take 2 years for few developers, probably 100-300 KLoC in the end). It's really hard to extend that system. It uses Oracle 9. It's written with Delphi 7 with half of the code being autogenerated PL/SQL procedures (those visual tools for users, too bad that users don't use them anyway). It's distributed, because in 1998 there was no network access everywhere, so there's hierarchical export-import over CD all over the country (of course now everything is connected, so those export-imports are just mailed). Build process is another nightmare. It consists of several DLLs. Every DLL is a separate project which uses its own set of components. You need separate VM to develop and build every component. Sources for some components are lost. DB schema is insane.
Customer wants to migrate from Oracle (because it costs quite a fortune) to free PostgreSQL, and he wants to use the browser. Also, a lot of things changed in those 20 years and many functions are just not used anymore. I think that a rewrite is quite justified in this case (we will rewrite in Java).
Huge, ugly, broken legacy systems are a bear to work with and I've mostly spent time working on them when not at startups. The most successful project of my career was a facade-and-rewrite in which I played a significant role and has billions of $ of annual revenue behind it now, so I think I understand this issue fairly well. I also played a significant role in another similar project that, after I left, spiraled completely out of control and became a disaster that, 8 years later, is still not finished.
The "this is dumb, we should rewrite it" reaction people tend to have when exposed to very large, complex legacy software systems is totally understandable, but really comes from a fundamental misunderstanding of the forces that drive software development at organizations which have legacy code bases. Yes, the re-write, if it gets "done", is basically guaranteed to be "simpler and cleaner" when it arrives; of course it is! It is being compared to a code base that's 20 years old, not just from an era where the technological compromises were completely different, but also from essentially a different world of what was acceptable at the time. In addition to that, all software that lives long enough tends to become a mess. Let's see how the rewrite is in 20 years.
Oh, but wait, this time will be different. This time we won't make quick fixes and little hacks, we'll be diligent about requirements, we'll refactor and clean up when change is needed ...
Everyone who has not should read "The Big Ball of Mud" ( http://www.laputan.org/mud/ ) which is about the clearest description I have seen of the reality of legacy systems.
My team recently completed a rewrite of a major system. It was approached with the following goals.
- Port from Java to C#. The Java teams no longer maintained it beyond critical fixes. Many original experts were gone. My team had more vested interest and expertise. We wanted to be in charge of this system, and we work in C#.
- Switch from horizontal processing to vertical processing. The original system would query one record, query one associated record, query yet another associated record, etc. to form the complete picture of one entity. The new system reads and writes batches of homogeneous data. (Kind of a leap from OOP to DOD in terms of database accesses.) This optimization was largely necessary as the old system could not keep up with load spikes.
- Along with the previous change, we completely separated the data processing into two giant phases: one read-only step and one write-only step. This allows the read-only step to target replication databases. This reduces the stress on the primary database, which is a huge win.
- Detach the system so that it can be run, tested, and released in isolation. The original system was part of a much larger whole, so any fixes or enhancements had to wait for weekly or biweekly monolithic releases.
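The horizontal-to-vertical switch described above is essentially replacing a per-entity query pattern with batched reads plus a group-by. A rough sketch with invented toy data standing in for two database tables (in Python rather than C#, purely for brevity):

```python
# Toy data standing in for an orders table and an order_lines table.
ORDERS = {1: {"id": 1}, 2: {"id": 2}}
ORDER_LINES = [{"order_id": 1, "sku": "A"}, {"order_id": 1, "sku": "B"},
               {"order_id": 2, "sku": "C"}]

def fetch_lines_for(order_id):
    # stands in for one "SELECT ... WHERE order_id = ?" round trip
    return [l for l in ORDER_LINES if l["order_id"] == order_id]

def load_horizontal(order_ids):
    """One entity at a time: N round trips for the lines alone."""
    return [dict(ORDERS[oid], lines=fetch_lines_for(oid)) for oid in order_ids]

def load_vertical(order_ids):
    """One batched read of homogeneous data, then group in memory.
    The batched read can also be pointed at a read replica."""
    wanted = set(order_ids)
    by_order = {}
    for line in ORDER_LINES:              # single pass over the batch
        if line["order_id"] in wanted:
            by_order.setdefault(line["order_id"], []).append(line)
    return [dict(ORDERS[oid], lines=by_order.get(oid, [])) for oid in order_ids]
```

Both return the same shape; the difference is that the vertical version does a fixed number of reads regardless of batch size, which is what lets it survive load spikes.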
Everything you expect from a rewrite happened but not to a devastating degree. It went overbudget, missed key bug fixes, etc. but we worked through them and came out the other side with all the wins we hoped for. Being able to deploy multiple times within a single day is amazing. Our iteration time on new bugs or features is very small. The overall architecture is simpler and easier to navigate. I have confidence that a member of my team could enter the code base cold and fix a bug within a day.
I don't think it had a visible cost in our case. For starters, my team understood the system adequately to not need to reference the original source code so closely. (Much of the work happened before even beginning to read the original source code.) Secondly, we all had enough Java exposure to be comfortable porting behavior over when it was needed.
Coming from data warehousing: rewrites are normally instigated to get rid of data silos. Put everything in one place and model it the same way. Very few of these projects are ever finished, and the irony is that each project ends up being a data silo. It is not unusual to have three or so data warehouses within an organisation, each just slightly different from the next but with 80% the same information. Turns out that last 20% in the 80/20 principle is very hard.
The author's formulae leave an interaction out: as the project is delayed, the catch-up costs increase as a consequence. Furthermore, if the delay leads to an unanticipated shifting of resources back to the old system to keep it running, that can add to the delay. Catch-up cost has a dependency on undiscovered scope.
It is also possible that some of the unknown scope becomes obsolete, but if that is happening, then you are already in a long-drawn-out process.
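That feedback loop can be made concrete with a toy model (my own, not the article's actual formulae): every month the rewrite runs, the old system accumulates more changes that the new system must catch up on, and that catch-up work itself extends the rewrite.

```python
# Toy model of the delay/catch-up interaction. While the rewrite runs for
# `duration` months, the old system keeps shipping changes at
# `old_velocity`; the rewrite team absorbs that scope at `rewrite_velocity`.
# Iterating to a fixed point gives the self-consistent total duration.
def total_duration(base_months, old_velocity, rewrite_velocity, max_iters=100):
    duration = base_months
    for _ in range(max_iters):
        catch_up = duration * old_velocity / rewrite_velocity
        new_duration = base_months + catch_up
        if abs(new_duration - duration) < 1e-9:
            break                      # converged to the fixed point
        duration = new_duration
    return duration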
As a general rule in my experience, a re-write will take at least as long as writing the current software took. There's a reason that much of our modern banking infrastructure still lives in COBOL. But that's not a bad thing; it makes a ton of sense. The cost/benefit analysis of a rewrite should be done beforehand, and this should be taken into consideration. You'd be surprised at how rarely that happens. It's a false assumption that re-writes are cheaper or faster (you are basically re-learning the entire specification of an application). But re-writes also offer a ton of opportunity and benefits, and if those are important to your business, then proceed with a re-write.
If code has reached the point where it can't evolve anymore, it should be rewritten. Otherwise it becomes dead code, and there aren't many fields where dead code can be left in place without killing everything around it.
Rewrites mostly fail because business logic is assumed but never fully understood at a detailed level. Multiply by the size of the rewrite and things get out of hand quickly.
The best way to go through a rewrite or upgrade is to get the business involved at the start and throughout the process. If you fail to do this, then when something is missed, overlooked, or done differently, you will be at fault. If you include them, a gap becomes a feature that turned out not to be necessary, rather than your mistake.
This is a fascinating thread on HN because of how divisive the issue of software rewrites is. I wonder, what do people think of rewrites that are driven by emergent scale and business requirements (rather than technical bitrot or code smells)?
I've been on a project where we had a working system, but it had some severe technical platform & product value limitations, and we knew those limitations were costing us real $$$, both in support burden and market share vs legacy incumbents and competitors.
Plus, we had a "ticking time bomb" because it was a large-scale data system, but the prototype (which became prod) was not designed up-front to handle horizontal sharding, and we were at the limits of vertical scaling, and were projected to hit the "max limit" of that vertical scaling within 12 months, given current growth rate.
Thus, we began a rewrite, with full knowledge of how dangerous it was -- we even circulated Brooks's essay on "the second system effect" and had several team discussions about it during the specification stage of the rewrite.
In the end, the project was a success, and powered 3+ years of scaled growth (the current live data storage of the system is 100x what it was when the rewrite began, and our "hard limit" was around 2x). The rewrite also helped us make the system more scalable, competitive, and mature, not by throwing away edge cases, but by choosing an architecture that didn't cut corners on areas that, we discovered during user feedback from the v1, were non-negotiable core areas of value for our customer use cases. We had relaxed many requirements in the "prototype stage" of the v1, merely in the interest of getting something working out the door in front of customers.
The last piece of de-risking we did is to run both systems in parallel with our users for several months. This allowed us to e.g. let 10% of our users into the new system at first, ensure we weren't breaking any of their use cases, then let 20% in, 50% in, and so on. We could also do user interviews throughout. Since the new system involved not just a better data backend, but also faster response times, a modernized UI, and many new features, lots of people wanted in. We even had a waitlist, at one point.
Then, we cut the stragglers over, and cut the old system loose -- which felt great, BTW! Running two production systems in parallel isn't easy, but was absolutely the right thing to do.
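The gradual cutover described above is usually implemented with stable percentage bucketing: each user hashes to a fixed bucket in [0, 100), and a rollout dial decides which system serves them. This is a hypothetical sketch of the idea, not the commenter's actual code:

```python
import hashlib

# Hash each user id to a stable bucket in [0, 100). Because the bucket
# never changes, turning the dial from 10% to 20% only *adds* users to
# the new system; nobody flaps back and forth between the two.
def bucket(user_id: str) -> int:
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % 100

def serve(user_id: str, rollout_percent: int) -> str:
    return "new_system" if bucket(user_id) < rollout_percent else "old_system"
```

Running both systems behind a router like this is what makes the "10%, then 20%, then 50%" ramp safe: each step is reversible by turning the dial back down, and user interviews can target exactly the cohort already on the new system.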
With hindsight being 20/20, we feel firmly that our first system was a "prototype that went to prod", and that we followed Brooks's advice to "plan to throw one away, because you will anyway". And that we executed a "successful rewrite". But it certainly wasn't easy.
I'm really proud of that project, but I also feel it was a bit of a harrowing experience, especially near the end, when we were concerned some "showstopping bugs" were going to keep the progress bar at 99% for a couple extra months. But we made it through.
Perhaps the reason my outcome is better is because the need to rewrite wasn't driven by a framework or architecture du jour, but by real business requirements and real scaling requirements. Even then, I think sometimes those requirements can be overstated and the ability for an existing architecture to cope can be understated. I feel confident we made the right call, but I think it takes real expertise -- and healthy dose of skepticism -- to take on the full rewrite risk with eyes wide open.
(p.s. now, 3 years later, the same team is being forced to rewrite a significant portion of the backend, not for any business requirement or scale reason, but because of bitrot of a stable open source database engine version which needs to be upgraded to avoid EOL, and wherein the new version introduces backwards-incompatible breaking changes to the API and schemas. At least in this case, it's "only" a backend migration, and not a total rewrite. But, I'll tell you that it sucks to realize this is just required maintenance, thus a pure development cost with little customer benefit, rather than a project to introduce a step-level change to the product and business. C'est la vie!)
My team has been rewriting a winforms application in JS. We just reached feature parity after ~4 years. In order to continue delivering new features without having to catch up with the old application, we released both the new and old applications together. This strategy worked well for us, as it allowed customers to fall back to the old application and to file requests for features still missing from the new one.
IME, the chief roadblock to a successful rewrite is the organization: an organization that produced poorly-factored, unmaintainable code that NEEDS a rewrite is not well-equipped to do something different the second time.
i think the point to start rewriting is when nobody still on the team can maintain the product. i recently got pretty close. i had been researching something for months (speeding up a login), but to no avail. so to avoid looking even more incompetent i said, you know what, an SSO app should be relatively simple to create. just slap some strings together and you're done. it would be quicker than banging my head against the wall for another three months. but somebody on another team still had knowledge, and that ultimately gave us the idea to fix it.
It's pretty obvious what part II will be, considering Part I is basically arguing that a full, ground-up, greenfield, throw-away-everything-and-start-from-a-blinking-cursor rewrite is an expensive mistake.
Half the comments on this thread have already alluded to it: A careful, piece-by-piece replacement, delivering the new system incrementally (running in parallel with the old system if need be.) Or even a gradual refactoring of the original code base if the base technology wasn't the problem. Read the success stories in this thread - you'll see they tend to follow this pattern.
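That piece-by-piece pattern is often called the "strangler fig": a thin router sends each request to the new system once that slice has been migrated, and to the old one otherwise. A minimal sketch, with hypothetical handler names standing in for real endpoints:

```python
# Strangler-style routing: the old system keeps handling everything that
# hasn't been rewritten yet; migrated slices are overridden one at a time.
def old_checkout(req=None):
    return "old:checkout"

def old_search(req=None):
    return "old:search"

def new_search(req=None):        # the one slice rewritten so far
    return "new:search"

MIGRATED = {"search": new_search}                       # grows over time
LEGACY = {"search": old_search, "checkout": old_checkout}

def route(path, req=None):
    handler = MIGRATED.get(path) or LEGACY[path]
    return handler(req)
```

Moving an entry into `MIGRATED` cuts one slice over; deleting it rolls that slice back. The old system is only retired once `LEGACY` is never reached, which is why this approach delivers incrementally instead of betting everything on a single cutover day.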
I think the "undiscovered scope" part is where most people trip.
You can think of software and how it fits with its users purposes as having multiple layers of features, interactions and details.
The top few features usually each have a few sub features that you have to tailor correctly to your users work flows, each of these sub feature often have sub details, interactions with other features, corner cases, data formats or cross compatibility considerations.
It's a fundamentally mathematical issue. Adding a layer of detail is an exponential proposition, and the exponential function is explosive in nature. If your software only has 3 major features and each has 3 sub-features, that is 9 things to consider. If each sub-feature has 3 corner cases you need to get right, that is 27 things. And if each of these 27 things has 3 details, that gets you to 81 considerations. The next level is 243. Each of the 243 points tends to take a similar amount of time to plan out and build, whether it's at the top or the bottom of the pyramid.
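The arithmetic above in two lines, purely as illustration:

```python
# Considerations per layer when every item branches into 3 sub-items:
# 3, 9, 27, 81, 243 -- and the work is the sum over all layers, not
# just the visible top one.
branching = 3
layers = [branching ** depth for depth in range(1, 6)]
total = sum(layers)                      # 363 considerations in all
```

The top layer contributes 3 of the 363; the bottom layer alone contributes 243, which is exactly the part a rewrite plan tends to leave off the whiteboard.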
As a piece of software evolves over the years, the intricate details at the bottom get sculpted out. The software's fit to its purposes can become very fine.
The thing is, people tend to think of software in terms of its top 2-3 layers of details, its 27 most important features. The finer complexity is often not as visible and it's just difficult for humans to keep so many items in mind.
This is true of any complex technology. People tend to think of cars as machines that have properties concerning speed and direction and suspension and braking, but rarely think of the complex chemistry of the fuel aeration and combustion process, the complex internal forces and velocities of parts in the transmission, the carefully tweaked metallurgic alloy work that has gone into each of the thousands of metal parts, the carefully chosen properties of the plastics and the rubbers, the thermal properties of everything, the hundreds of electronic systems, the analog circuits, the thousands of specification lines of many dozens of communication protocols for controlling all these parts, the entertainment system, acoustic properties, etc., etc. All these things took decades to evolve and refine.
It's easy, when planning a rewrite, to plan out the few dozen items visible at the top of the iceberg and underestimate the amount of important detail hiding below.
Sometimes, some of the details might not be important, but often times they are the essence of your business case and at the core of the favourable economics of your product. It's the years of knowledge accumulated in the subtle details that provide you with a moat and makes it difficult for your product to be copied.
If you rebuild with mostly the top few dozens of features in mind, and only vague ideas about everything else, you are likely to be creating a commodity solution, one that your competitors will have a much easier time to copy than your older finely tailored solution.
A closely allied idea to "never rewrite" is "you can always refactor shit into gold." For the first ten years after reading the Refactoring book I believed the second claim was true. Don't get me wrong, I think you can make great improvements to a code base with Refactoring. But sometimes a re-design really is needed.
"except when the original version used MongoDB as a relational store and a home-grown web framework that was a poor mans' Django/Rails (except slower, undocumented and incredibly resource hungry) where moving bit by bit to PostgreSQL wasn't an option, then totally rewrite it properly"