This is a fascinating thread on HN because of how divisive the issue of software rewrites is. I wonder, what do people think of rewrites that are driven by emergent scale and business requirements (rather than technical bitrot or code smells)?
I've been on a project where we had a working system, but it had severe limitations in both the technical platform and the product's value, and we knew those limitations were costing us real $$$, both in support burden and in market share lost to legacy incumbents and competitors.
Plus, we had a "ticking time bomb": it was a large-scale data system, but the prototype (which became prod) was not designed up front to handle horizontal sharding. We were at the limits of vertical scaling and, at the current growth rate, were projected to hit the "max limit" of that vertical scaling within 12 months.
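A projection like that is simple compound-growth arithmetic. Here is a minimal sketch, assuming steady month-over-month growth; the function name and the 6% figure are hypothetical and chosen only for illustration, not numbers from the project.

```python
import math


def months_until_limit(current: float, limit: float, monthly_growth: float) -> float:
    """Months until `current` reaches `limit` under compound monthly growth.

    All inputs are illustrative assumptions: `current` and `limit` share any
    unit (e.g. TB stored), and `monthly_growth` is a fraction (0.06 == 6%).
    """
    if current <= 0:
        raise ValueError("current must be positive")
    if limit <= current:
        return 0.0  # already at or past the limit
    if monthly_growth <= 0:
        return math.inf  # no growth means the limit is never reached
    # Solve current * (1 + g)^m = limit for m.
    return math.log(limit / current) / math.log(1.0 + monthly_growth)


# Hypothetical example: 2x headroom and 6% monthly growth
# leaves a little under 12 months of runway.
runway = months_until_limit(current=1.0, limit=2.0, monthly_growth=0.06)
print(f"{runway:.1f} months of headroom")
```

The useful part is how sensitive the answer is to the growth assumption, which is why it is worth rerunning with pessimistic and optimistic rates before committing to a rewrite.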
Thus, we began a rewrite, with full knowledge of how dangerous it was -- we even circulated Brooks's essay on "the second-system effect" and had several team discussions about it during the specification stage of the rewrite.
In the end, the project was a success, and powered 3+ years of scaled growth (the current live data storage of the system is 100x what it was when the rewrite began, whereas the old system's "hard limit" was around 2x). The rewrite also helped us make the system more scalable, competitive, and mature, not by throwing away edge cases, but by choosing an architecture that didn't cut corners on the areas that user feedback on the v1 had shown to be non-negotiable core areas of value for our customer use cases. We had relaxed many requirements in the "prototype stage" of the v1, merely in the interest of getting something working out the door and in front of customers.
The last piece of de-risking we did was to run both systems in parallel with our users for several months. This allowed us to, e.g., let 10% of our users into the new system at first, ensure we weren't breaking any of their use cases, then let 20% in, 50% in, and so on. We could also do user interviews throughout. Since the new system involved not just a better data backend, but also faster response times, a modernized UI, and many new features, lots of people wanted in. We even had a waitlist at one point.
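A minimal sketch of how a percentage-based cutover like this could be routed, assuming users are identified by a stable string ID. The function names and the salt are hypothetical, not what the project actually used. The idea is to hash each user into a fixed bucket from 0 to 99, so that raising the percentage only ever adds users and nobody bounces back to the old system.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "rewrite-v2") -> int:
    """Map a user to a stable bucket in [0, 100).

    The salt is a hypothetical per-rollout label; changing it reshuffles
    which users land in which bucket.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_system(user_id: str, percent: int) -> bool:
    """True if this user should be served by the new system at `percent` rollout."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return rollout_bucket(user_id) < percent


# Raising the percentage from 10 to 20 keeps the original 10% on the new system.
users = [f"user-{i}" for i in range(1000)]
early = {u for u in users if use_new_system(u, 10)}
later = {u for u in users if use_new_system(u, 20)}
print(early <= later)
```

A waitlist would need an explicit allow-list checked before the hash, which this sketch leaves out.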
Then, we cut the stragglers over, and cut the old system loose -- which felt great, BTW! Running two production systems in parallel isn't easy, but it was absolutely the right thing to do.
With hindsight being 20/20, we feel firmly that our first system was a "prototype that went to prod", that we followed Brooks's advice to "plan to throw one away; you will, anyhow", and that we executed a "successful rewrite". But it certainly wasn't easy.
I'm really proud of that project, but I also feel it was a bit of a harrowing experience, especially near the end, when we were concerned some "showstopping bugs" were going to keep the progress bar at 99% for a couple extra months. But we made it through.
Perhaps the reason my outcome is better is that the need to rewrite wasn't driven by a framework or architecture du jour, but by real business and scaling requirements. Even then, I think those requirements can sometimes be overstated, and the ability of an existing architecture to cope can be understated. I feel confident we made the right call, but I think it takes real expertise -- and a healthy dose of skepticism -- to take on the full rewrite risk with eyes wide open.
(p.s. now, 3 years later, the same team is being forced to rewrite a significant portion of the backend, not for any business or scale reason, but because of bitrot: the stable open source database engine version we run needs to be upgraded to avoid EOL, and the new version introduces backwards-incompatible changes to the API and schemas. At least in this case it's "only" a backend migration, not a total rewrite. But I'll tell you, it sucks to realize this is just required maintenance -- a pure development cost with little customer benefit -- rather than a project that introduces a step change to the product and business. C'est la vie!)