How to Do a Full Rewrite (badsoftwareadvice.substack.com)
91 points by tate on Sept 24, 2023 | 90 comments



Completed a full rewrite of many components of the Kraken.com backend in about 4 years.

The new system is around 1.5M loc of Rust. There was no serious alternative to rewriting; sometimes you find yourself in a corner and need to fix issues and pay the price.

I wrote about it 3 years ago here https://blog.kraken.com/product/engineering/oxidizing-kraken...

Everything in that blog post still rings true, and in hindsight we were right. But it was a massive grind and required extreme dedication to get it done; for a variety of reasons, that work was very taxing.

We also didn't stop feature development and kept the two systems running concurrently (which explains why it took so long, also growing and training a new team 10x the size took time, so there are many factors).

I'm also against rewrites if I can help it, but reality is complex and sometimes we can't help it. Now, however, since we removed the last pieces of legacy that were preventing larger DB schema changes (or required massive, unreasonable changes to the legacy systems), we've been shipping faster and more easily than ever and have caught up on a lot of the accumulated backlog, including some of the more ambitious projects that were unthinkable in the legacy systems due to their limitations.


Huge fan of Kraken.

Looking back, is there anything that you would have done differently? I find that half or more of the rewrites that I have dealt with have been driven by all the wrong motivations. You get inevitable turnover and at some point people dislike code that they didn't write themselves and push for a rewrite, maybe changing the stack to something trendy, justifying it with thin arguments. Once the rewrite starts the company ends up treading water for years while incurring a ton of costs. For me, I think only 1 rewrite that I was part of was a good decision in my 15 years in tech. If I could go back in time, I think I would kill all rewrite discussions the moment that someone first whispers the idea.

How did you guys enjoy switching to Rust? I assume the safety and performance benefits for the trading system are a huge plus (didn't Kraken trading go down for an entire week a few years ago?). Did you also rewrite the webapp backend in Rust as well? How has staffing and budgeting been affected? I would assume that the supply of Rust developers is much lower unless you train them in house. Rust sounds fun, but I can't imagine trying to justify a rewrite of a legacy system, a major tech stack change, and training/building a new team all at the same time.

Sorry for the onslaught of questions. The "rewrite it in rust" fever has spread to my work and I'm fighting myself on how to respond.


With hindsight, considering the cards we were dealt, there's not much I would have done differently. If I had known better before, I would have ensured stronger buy-in, because after a while our internal stakeholders were often pushing back on the effort, and that led to concessions where throwaway code was built in the legacy systems even for weak business outcomes.

Overall I share your concerns. Having the right reasons to rewrite is key. I believe this blog[1] about software as theory building does a great job at describing the challenge with software gardening, and the times when a rewrite is the solution are few. Even then, it's critical to handle the rewrite in ways that can work - in our case, we chose to progressively eat the legacy software without making major changes when we could avoid them. The legacy software we had was mostly the result of one-man heroics and traded off performance and availability for correctness and security. It was also designed to be maintained by a small group. Solid choices if you are early Kraken - but like many successful startups, we were victims of our own success and we needed it all.

When it became clear that we had to rewrite the stack (the 2017 3-day shutdown happened just before that realization), those in charge at the time decided to experiment with Rust. It was a crazy bet in early 2018: it was still Rust 2015, with no NLL, no async, and a far rougher ecosystem. The fact that it became successful enough to warrant pursuing a full rewrite is to the credit of some lucky hires who made it a success.

In that regard, Rust was a very strong talent magnet. In my experience, having hired 200+ Rust engineers over the last 5 years, there are a few kinds of engineers attracted to Rust: (1) some just like shiny things/hype, (2) some are perfectionists and never complete a project, (3) some are just doers who have found that Rust is a particularly effective language.

Overall, Rust has been great to hire for. Many engineers out there want to use Rust, even if it's their 1st professional Rust experience. We were also known in the Rust community for hiring for full-time Rust, and we are probably also the place with the highest density of Rust talent currently (there are massive companies with more Rust devs, but a smaller % overall). Budget-wise, our Rust engineers are not paid particularly better than other engineers in the company, but compensation at Kraken is generally in the higher tier.

At the risk of sounding boastful, in my experience Rust is reasonably easy to learn for experienced/strong developers (we have some very young outstanding Rust devs as well, most of the time they learned before joining). Average developers struggle and may never become productive. Again, we have an engineering excellence culture so it is okay for us, YMMV.

Re scope, yes we use Rust for everything in the backend, including CRUD type of work like Web APIs. We've found we're at least as productive as with other languages (Go / Java+Spring / Ruby / PHP) while having far fewer incidents, and easier maintenance / cheaper KTLO. Rust's ability for reuse is excellent, which means that there are very strong network effects when having more services in Rust, including the Web layer.

A nice "side effect" of a full Rust stack is that our p99.9 latency internally is usually stable around 3-4ms for most operations, even though multiple services are involved. That's coming from a much higher baseline with much more deviation across operations providing the same functionality (60-100ms).
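For readers less used to tail-latency figures, here is a minimal sketch of what a p99.9 number measures, using the nearest-rank method. The function name `percentile_per_mille` and the sample values are made up for illustration; this is not how Kraken computes its metrics, and real monitoring systems usually work from histograms rather than sorting raw samples.

```rust
/// Nearest-rank percentile, with the percentile given in per-mille
/// (999 means p99.9). Returns None for empty input or per_mille > 1000.
/// Hypothetical helper for illustration only.
fn percentile_per_mille(samples: &[u64], per_mille: u64) -> Option<u64> {
    if samples.is_empty() || per_mille > 1000 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let n = sorted.len() as u64;
    // rank = ceil(per_mille * n / 1000), clamped to at least 1 (1-based rank).
    let rank = ((per_mille * n + 999) / 1000).max(1);
    Some(sorted[(rank - 1) as usize])
}

fn main() {
    // Made-up latencies in milliseconds: mostly fast, one slow outlier.
    let latencies_ms = [3, 4, 3, 3, 4, 3, 90, 4, 3, 3];
    println!("p50   = {:?} ms", percentile_per_mille(&latencies_ms, 500));
    println!("p99.9 = {:?} ms", percentile_per_mille(&latencies_ms, 999));
}
```

With so few samples a single outlier dominates the p99.9 value, which is why a stable tail across several services is a stronger statement than a good average.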

Regarding your own rewrite discussions, you're not going to be convinced by a post on HN. I'll just say that I am very reluctant to even think about working at a workplace that doesn't predominantly use Rust. I've been in the industry for 20+ years, across many stacks, and there's a before and an after Rust for me. It has been a superpower and made our lives easier. It makes it easier to model business problems thanks to algebraic data types and their usage for error handling (versus inheritance), traits abstract behavior better than OOP-style interfaces, the absence of data races is a game changer for multi-threaded code, dependency management is trivial, the ecosystem is rich and things work well. A lot of these are properties found in other languages, but no other has the same full package and is on its way to becoming mainstream.

[1] https://www.baldurbjarnason.com/2022/theory-building/
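To make the point about algebraic data types and traits concrete, here is a small sketch. All names (`OrderError`, `FeePolicy`, `place_order`, and so on) are hypothetical and are not taken from Kraken's codebase; the sketch only shows the general shape of modelling failure modes as an enum and behavior as a trait.

```rust
// Hypothetical domain types for illustration only.
#[derive(Debug, PartialEq)]
enum OrderError {
    InsufficientFunds { needed: u64, available: u64 },
    MarketClosed,
}

#[derive(Debug, PartialEq)]
struct Order {
    amount: u64,
}

// A trait abstracts behavior without requiring an inheritance hierarchy.
trait FeePolicy {
    fn fee(&self, amount: u64) -> u64;
}

struct FlatFee(u64);
struct BasisPoints(u64);

impl FeePolicy for FlatFee {
    fn fee(&self, _amount: u64) -> u64 {
        self.0
    }
}

impl FeePolicy for BasisPoints {
    fn fee(&self, amount: u64) -> u64 {
        amount * self.0 / 10_000
    }
}

// Every failure mode is part of the signature, so callers must deal with it.
fn place_order(
    balance: u64,
    market_open: bool,
    amount: u64,
    policy: &dyn FeePolicy,
) -> Result<Order, OrderError> {
    if !market_open {
        return Err(OrderError::MarketClosed);
    }
    let needed = amount + policy.fee(amount);
    if needed > balance {
        return Err(OrderError::InsufficientFunds { needed, available: balance });
    }
    Ok(Order { amount })
}

// The match is exhaustive: adding a new error variant makes this fail to
// compile until the new case is handled.
fn describe(result: Result<Order, OrderError>) -> String {
    match result {
        Ok(order) => format!("placed order for {}", order.amount),
        Err(OrderError::InsufficientFunds { needed, available }) => {
            format!("need {}, have {}", needed, available)
        }
        Err(OrderError::MarketClosed) => "market closed".to_string(),
    }
}

fn main() {
    println!("{}", describe(place_order(1_000, true, 990, &BasisPoints(25))));
    println!("{}", describe(place_order(100, true, 100, &FlatFee(5))));
}
```

The design choice worth noting is that the compiler enforces handling of each error variant, which is the property the comment contrasts with exception hierarchies built on inheritance.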


That's one of the best ads I've ever seen ;)

What about the excessive typing - in practice do you get used to it? Are you allowing Copilot and such inside the business?


If by excessive typing you mean complex trait bounds and the like, it's really something isolated to the most generic libraries, and reasonably rare. Most application code is simply typed - take a struct or an enum, return another. So it's not been a problem.
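As a rough illustration of that contrast, the sketch below puts a "simply typed" application function next to a generic, library-style helper with heavier trait bounds. Both are invented for this example (`WithdrawalRequest`, `Decision`, `review`, and `count_by` are hypothetical names), so treat it as a sketch of the two styles rather than real production code.

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Typical application code: a concrete struct in, a concrete enum out.
struct WithdrawalRequest {
    amount: u64,
    daily_limit: u64,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Approved,
    NeedsReview { over_limit_by: u64 },
}

fn review(req: &WithdrawalRequest) -> Decision {
    if req.amount <= req.daily_limit {
        Decision::Approved
    } else {
        Decision::NeedsReview {
            over_limit_by: req.amount - req.daily_limit,
        }
    }
}

// Library-style code: generic over the input, the key type, and the key
// function. This is where the more involved bounds tend to live.
fn count_by<I, K, F>(items: I, key: F) -> HashMap<K, usize>
where
    I: IntoIterator,
    K: Eq + Hash,
    F: Fn(&I::Item) -> K,
{
    let mut counts = HashMap::new();
    for item in items {
        *counts.entry(key(&item)).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let req = WithdrawalRequest { amount: 1_500, daily_limit: 1_000 };
    println!("{:?}", review(&req));

    let counts = count_by(vec!["btc", "eth", "btc"], |s| s.to_string());
    println!("{:?}", counts.get("btc"));
}
```

Callers of `count_by` never have to write its bounds themselves, which fits the observation that the complexity stays isolated in the generic layer.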

We don't use Copilot or any tool that sends our source code elsewhere, because of our security posture, defense in depth, etc. If and when there is an offering that lets us run the server component ourselves and provides significant productivity improvements, we will subscribe.


Rewrites typically work well when the people behind the rewrite are the same people who wrote the original code and maintain it.

Often, the requirements changed along the way as the problem domain incrementally became better understood. At that point, the original design is not helping, but sabotaging everything.

This is why that first version should always be considered a prototype. And the next version will probably also be.

Not rewriting will have a much larger cost down the line.


Our experience may not have been the same, but I beg to differ. If the old solution is so problematic that nothing can be salvaged from it and the best solution is a full rewrite, instead of some refactoring or modifications, then your problem is not so much that the code is solving the wrong problems. The problem is the team that built it. Every time I was ever called to rewrite a code base, there were certainly some elements that pointed to a clearer understanding of the requirements, but mostly, the necessity to do it from scratch pointed at a people problem.

Many successful projects started as reverse engineered clones of old and established ones, that then became improved versions of their predecessors. Those are rewrites, just not done by the original project's team.

My opinionated rules of full rewrites, informed by experience and observation:

1- Bring in one, just one, project lead who is a specialist of the technical domain. E.g. if you're building a web app, hire a seasoned web developer, instead of relying on your in-house electrical engineer, whom you allowed to architect the previous solution, because they managed to convince you that code is code.

2- Let the new lead vet every member of the old team, including (especially) the old team leaders.

3- Allow the new lead to drop any dead wood.


> If the old solution is so problematic that nothing can be salvaged from it and the best solution is a full rewrite, instead of some refactoring or modifications, then your problem is not so much that the code is solving the wrong problems. The problem is the team that built it.

I think you're too quick to point fingers and too desperate to throw people under the bus to pass yourself as saviour.

There are plenty of everyday scenarios where software piles up technical debt in spite of the developers. All it takes is a single requirement to change for an entire tech stack to become a problem instead of problem-solver, and all it takes is a business goal to be met at record time to pile up quick-and-dirty solutions instead of well-architected implementations. These happen far more often than replacing whole teams, and new project leads solve nothing.

I've worked on a legacy project which started as a multi-platform desktop app that in the meantime became Windows-only. You can imagine the cruft that resulted from this requirements change alone. During the same period, business requirements changed to support new major features, and Microsoft started pushing for Windows 11. Of course we discussed a major rewrite, as the legacy tech stack didn't support native Windows features well and the legacy application was riddled with multi-platform code that made no sense anymore. Switching to vanilla WPF alone would eliminate 90% of the project's pain points.

Tell me exactly how the team created this problem, and how you would be the key to fix it.


Yes, and developers also grow over time - not every company starts with seasoned developers who have 20 years of experience.

Lots of start-ups start with developers that have 2-3 years of experience. Once they reach the stage of a rewrite, those developers (if they stayed for the whole period) are probably 3-5 years more experienced than when they started and might look at things differently than at the beginning. And those years can make a huge difference, especially during a start-up's growth phase.


It sounds like most of the application complexity was the UI layer? In a typical app, the UI is an interface to some functionality or business logic which is the actual value of the app. But if the application was something like a GUI builder, I guess most complexity would be directly coupled to the GUI framework.

Still, in my experience, ripping out code which isn’t used anymore is a lot simpler than rewriting the things you do need from scratch.


> I think you're too quick to point fingers and too desperate to throw people under the bus to pass yourself as saviour.

Thanks for the free armchair psych analysis, but that's not why I came today. I'm here to compare patterns about our practice. My answer is not about me as some kind of savior. It's also not some definitive rule of thumb about the cause of rewrite worthy mess. It's about my own observations as part of one specific domain that I'm good at. If another expert of their domain noticed something different, it provides perspective. If they confirm my view, it reinforces the possibility of a pattern.

> There are plenty of everyday scenarios where software [...]

You're describing your garden variety software development. The SOP for technical debt is refactoring, not rewrite. Despite what's thrown at the dev team, it's part of their practice to keep the code maintainable. If you find that the code base grew beyond salvage, then, yes it's possible that it grew too fast beyond their control. But in my experience, the need for a complete rewrite, instead of carving a bit more time for refactoring, tends to point to people as the root cause. Someone is either preventing or has prevented refactoring, or/and has caused too many problems. Sometimes people do it actively (e.g. team leader doesn't want to spend the time), or passively (e.g. team leader can't communicate to management that the team needs the time). Sometimes, those people are not even in the team, sometimes they're long gone. And yes, sometimes the culprits were newbies who have since grown into much better developers. But unless you identify what made the code so bad that you just need to rewrite from scratch, you're likely just setting yourself up to repeat exactly the same mistakes. If you reread my answer, you'll notice that I advocate to bring in a technical domain expert, but I also advocate to vet the team anew, and get rid of the excess. What about that isn't common sense?

On many occasions, I've found that I was actually not needed at all. Some team members had previously tried to push for exactly the same solutions that I would end up suggesting much later. But due to cultural or political reasons, they were hampered by tenured team leaders. This scenario is so common that it's almost a cliche.

> I've worked on a legacy project which started as [...]

Even after reiterating that I'm neither a savior nor a know-it-all, any attempt at an answer would be futile, since I know nothing of this project.


> Thanks for the free armchair psych analysis, but that's not why I came today. I'm here to compare patterns about our practice.

The Rorschach test you inadvertently did in your post resulted in you throwing whole hypothetical teams under the bus to then try to pass off yourself as the holder of the one true fix. You fabricated the hypothetical problem, you were quick to point fingers to throw blame around, and you were quick to present yourself as the fixer. There is no way around it.

> You're describing your garden variety software development. The SOP for technical debt is refactoring, not rewrite.

I didn't, and you don't refactor out a whole tech stack written in an entirely different programming language targeting entirely different platforms.

Again, you didn't even pay attention to the problem and you're quick to point fingers and present yourself as the fixer. There is a pattern in your behavior.

Humility is a virtue, and I'm not your psychologist. Have a nice day, and try at least to be kind to others instead of succumbing to your need to throw everyone under the bus.


For what it’s worth, I see this pattern too.

Rewrite a legacy .NET MVC app with lots of cruft. Build a new Node.JS app with lots of cruft.


Why can’t the existing code just be adapted to the new requirement? Code is supposed to be malleable. But perhaps the code was designed too rigidly to support changing requirements. In that case you will have exactly the same problem after the rewrite, next time requirements change again.


It depends a little. Sometimes companies (especially start-ups) pivot quite substantially. At a fintech that I've worked with from the beginning (which went through ups and downs, but ultimately ended up quite successful), we started by issuing debit cards and pivoted to loans. At some point we just needed a different app that was properly written for the use case that we ended up serving. It's like the parent said: the rewrite was done by (mostly) the same team. Those were 4 busy months, but it ended up quite successful and imo it was worth it in the end. And I have to admit I was actually very critical of the rewrite approach initially.


If the new app literally does something entirely different, then I wouldn't consider it a rewrite.


Well, you kinda answered yourself: code is supposed to be malleable, but sometimes it isn't.

Even when it follows trends and best practices, you might end up with non-malleable code. Perhaps especially when you over-do trends (eg: metaprogramming, 15 years ago) or best practices (design patterns, 100% unit test coverage, etc).

You will only have the same problem if you fail to solve the malleability problem.


Yes, but a rewrite by the same team who is unable to write malleable code will not solve this problem.


Teams can change and learn from their mistakes.

Or perhaps the problematic people who made the code too rigid already left.

Using different languages and frameworks will also affect the outcome.

Maybe the Product/Business side is more mature and will account for malleability in the requirements.

Also, perhaps they already know how to do it, but "malleability" wasn't a goal for the previous version.


If developers have learned to write malleable code, they are also competent enough to improve existing code by refactoring, since these are two sides of the same coin. Maintaining code is how you learn to write maintainable code, after all.

I don't know how code could be too rigid to change by a competent developer. If it has bad abstractions, just rip them out. If it is unclear and badly documented, you will have the same problem when rewriting.


> I don't know how code could be too rigid to change by a competent developer.

This is an unfair characterization of what I said. I never said it was impossible to improve, just that it is too difficult and costly.

Whether a rewrite is warranted or not is up to each individual team to decide; we shouldn't leave this decision to dogma. I have run into some cases where a rewrite was the correct thing to do, and it did pay off.

> If it is unclear and badly documented, you will have the same problem when rewriting.

A program being unclear and undocumented on the development side doesn't automatically mean it is unclear and undocumented on the business/product side. There is also the possibility of changing workflows or simplifying functionality.

A lot of people in this topic are mentioning full reuse of tests, but not all software rewrites should be done in a vacuum by the development team without continuous input from the rest of the business.

EDIT: And there's another case: sometimes the app itself is actually quite simple, but understanding things from the code side is very difficult. I'm currently finishing a rewrite of one of those: the code is too large and unmanageable due to accumulated cruft (it is difficult to know what can be removed), but some product managers, business owners and business intelligence people have an amazing understanding of the database (better than some past developers, actually), and the database is simple, so they can verify if the new version is correct and even draft requirements based on it.


Obviously satire, but in real life, there are reasons why a full rewrite becomes appealing. Perhaps the solution even.

One is overwhelming technical debt. Code where the project manager didn't believe in encapsulation, or refactoring, or none of that "architectural nonsense", and was only fired 5 years too late. Code that is difficult to understand, maintain, test, debug, change. Code that follows you home after-hours and on the week-ends. Code that nurses you to bed at night, shows up in your dreams, and wakes you up in the morning. Code that has made many a colleague look for employment elsewhere and new hires give up and quit in their first week.

Every time I see someone profess with assurance that you don't rewrite, I just know that that person has never really experienced the hell I've described above.


I am against rewrites of any significance because they generally just end up worse. Joel Spolsky wrote about that a long time ago, as did others; most software that’s older has millions of badly documented changes applied by 1000s of people over the decades, and rewriting tends to take literally forever (never finishes) or at least much longer than anyone estimated, times pi. And then the end result is usually just as crappy but with more bugs.

The Dutch tax software rewrite attempts are an example I am personally familiar with. These attempts made me create a services company to help companies keep legacy software running forever. We support gnarly stuff over 25 years old and still wouldn’t recommend a rewrite for the above reasons.

There are of course cases where rewrites (of significance; rewriting a 50k LoC codebase is not going to be hard) work, but usually the rewrite is done by the same people that did the original, the original wasn’t actually that bad but just too hard to extend in modern times, etc.


Joel published that post the same year that Microsoft announced their .NET framework.

I've said this in a different thread, but I think it's worth repeating here. I've seen many successful projects that started out as reverse engineered clones of older, well established ones, then went on to become better versions of their original model. Those are in effect rewrites, just not done by the same team. For a number of years now, it's been feasible to compete technically with an incumbent in a matter of months. I see it as a trend indicating an increase in software production capability, be it because of better practices, better tools, or more developer availability.

So although I tend to agree with Joel's old post, and still lean toward refactoring as the likely more economical approach, I hold that advice less religiously.


I think reverse engineering is often better than a rewrite from code, but again, as the Dutch tax software system showed (I have seen it in other countries too, but I am intimately familiar with both failed rewrite attempts of this one), it is sometimes just not feasible, no matter the money or the number of experts. The amount of taxpayer money wasted on the attempts is criminal.


> Code where the project manager didn't believe in encapsulation, or refactoring, or none of that "architectural nonsense"

If anything I find the largest proponents to have drunk too deep from that well and cause the rewrites to never be considered, as the time required to do it becomes far too long to be worth the pay-out.

This excludes the worst kind: the over-architected old mess in need of a rewrite because it was based on the wrong assumptions and is now bogged down by 10 layers of abstraction and indirection which don't do anything.


It depends on the level of the architecture. Architecture that splits the project into chunks that you can take a meat cleaver to and refactor at will is great. "Architecture" that takes one of those chunks and adds 15 abstractions to it is awful.

The former lets you recover from the latter without a full rewrite, which I'm guessing is where advice like "never rewrite" comes from.

Posts that are pro/anti 'architecture' could refer to either, so I never know whether to agree with them or not. They're kinda meaningless out of context.


What you are describing sounds more like a problem with leadership to empower ICs and groups to make improvements, or possibly a culture of dumping code, declaring victory, and moving on. If new hires are bailing in their first week, the problems run far deeper than the codebase. Rewriting anything is not likely to change the long term end state unless the company culture has shifted. I have yet to experience an organization where a real cultural shift has happened.


> One is overwhelming technical debt. […]

This sounds like an organisation with bigger problems than a single manager. The devs hate the job, but don't have the clout to convince management to let them do things properly? Run. It's not worth it. An organisation doesn't suddenly "heal" once a bad apple is gone. More likely, everybody is at least five years behind current best practice, and very much used to doing things that way.


I have seen horrible code, but the point is that a rewrite will not solve this. You will just lose years of opportunity and end up in exactly the same place again after the rewrite, because the reasons which led the first version to become a big ball of mud will also cause the rewrite to end in the same state.


Somehow I felt like you were describing my Factorio factory…


I have seen many organisations try this and end up with a second and even third system with fewer features running in parallel with the first, never catching up to be feature complete.

I am of the opinion that the only real way to do this is to take small pieces and replace them while keeping the main system running, slowly replacing parts of it. It will never be complete, but progress can be made in areas that need improvements. A complete rewrite isn't worth it: they fail so often, cost far more than anyone thinks they will, and rarely achieve the magic improvements they were sold on.
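A minimal sketch of that piece-by-piece approach, assuming a routing layer sits in front of both systems and decides per route which one handles a request. The `Router` and `Backend` types and the route names are hypothetical; a real setup would typically do this in a reverse proxy or API gateway rather than in application code.

```rust
use std::collections::HashSet;

// Which system serves a given route. Hypothetical names for illustration.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Legacy,
    New,
}

// A facade in front of both systems. Routes default to the legacy system
// and are moved to the new one individually.
struct Router {
    migrated: HashSet<String>,
}

impl Router {
    fn new() -> Self {
        Router { migrated: HashSet::new() }
    }

    // Mark one route as served by the new system.
    fn migrate(&mut self, route: &str) {
        self.migrated.insert(route.to_string());
    }

    // Undo a migration if the new implementation misbehaves.
    fn roll_back(&mut self, route: &str) {
        self.migrated.remove(route);
    }

    fn backend_for(&self, route: &str) -> Backend {
        if self.migrated.contains(route) {
            Backend::New
        } else {
            Backend::Legacy
        }
    }

    // Stand-in for actually forwarding the request.
    fn handle(&self, route: &str) -> String {
        match self.backend_for(route) {
            Backend::New => format!("new:{}", route),
            Backend::Legacy => format!("legacy:{}", route),
        }
    }
}

fn main() {
    let mut router = Router::new();
    router.migrate("/balances");
    println!("{}", router.handle("/balances"));
    println!("{}", router.handle("/orders"));
}
```

Because each route can be moved or rolled back on its own, the main system keeps running throughout, at the cost of operating both systems side by side for as long as the migration lasts.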


The only way is to accept that the new system is less complete, switch over, and suffer the consequences. The new system will never have all the old features - the question is whether it is good enough to use now and extend in the future. The mission of the builders of the new system should be neither feature completeness nor being better, but to kill the old system quickly while maintaining a reasonable level of future-proofing. This cannot be driven from the bottom (a rewrite for code beauty), as it requires business-level commitment to bear the pain of the changeover. Software is coded business processes, and new software means new processes. Inflexible requirements are a major cost driver for software, and taking old code and processes as gospel is a guarantee for a cost explosion.


This is true in my experience as well. Business needs and expectations should be fully aligned, and there must be commitment. Business/Product can't expect devs to do an unsupervised 1:1 rewrite that magically includes all features. You gotta treat it as if it were new software: do it incrementally and watch things closely, test and approve or ask for changes. Business must also learn to adapt and modify their workflow.

To me this is where a lot of rewrites fail. Not only rewrites, but even implementations of off-the-shelf systems. The company (both management and other workers) is so entrenched in the current processes that they will keep pushing for the time and money available to be spent on constantly tweaking the new system until it is exactly what the old one was, and then all the old problems come back again. I've seen companies blowing millions on consultants because of this.

But this also happens for new features. The system I currently work on is on its 4th permission system, for internal permissions. The problem is not so much that requirements change, but that nobody in product/business really knows them. So each permission system ends up alternating between being too malleable (and then it devolves into chaos in the settings) or too rigid/simple (and then it devolves into people asking for too many permissions).

You gotta fix the business before fixing the software.


The strangler pattern. I agree.


Please be careful with calling it that.

Surveys differ a bit on this but somewhere between 3-10% of women report having been strangled by an intimate partner at some time in their lives, this may be 20% higher when non-intimate partners are also factored in. So like “one in 12” is not crazy-talk. Even if your dev-team right now is all-male, your design docs may live to see you diversify and you probably have female non-technical coworkers who will overhear you talking about it. It's not worth taking a one in 12 chance of potentially reminding them of past domestic abuse, I get that it's not your intention but.

Yes I get that it's by analogy to the “strangler fig,” but,

(a) it was a crappy analogy in the first place[1],

(b) you can just call it the “strangler-fig pattern” and the extra syllable makes it like 20% more rhythmic, 10% more clear and 50% less alienating without sacrificing any googlability as this is what the cloud companies call it,

(c) you didn't really need the analogy, “migrate” was already the established term of art for this pattern, means the same thing and it is already verb-ified for you! “Our first priority is to migrate all our current requests,” vs “our first priority is to use our strangler-pattern to subsume the current requests under the new architecture,” whyyyyyyyyy. So you're gonna use the word “migrate” anyway and if used precisely[3] there is nothing added by evoking the humble strangler-fig, apart from, you know, accidentally sounding positive with regard to domestic abuse.

1. Taking the definition as “incremental replacement behind a proxy until the original system eventually dies,” literally the only thing that is correct about the analogy is “eventually dies”. Even if you are calling a tree’s contribution to the leaf canopy its “feature set” and the developer attention is its “nutrients” to try to save the analogy, you come to the conclusion that strangler figs do “rapid feature development” at first, rather than trying to rewrite the system. And this early development is symbiotic rather than parasitic [2]. Strangler figs don't do the strangler pattern, they do Embrace-Extend-Extinguish.

2. https://link.springer.com/article/10.1007/s13199-017-0484-5

3. So, “migrate functionality” is popular but very imprecise, you want to say that you are migrating the requests, or the request-handling, or the users, to the new platform. The word functionality should probably also be thrown in the trash, it literally adds three clumsy syllables to a word which is already its synonym, the “function” of a thing is already its “functionality” and if you really wanted to not use the mathematical word function, just talk about its “features” or “feature-set.”


I appreciate your intention to care for others, although I disagree with most of your other arguments. I do find it irrationally irritating that you appear to assume I'm male.


Oh, I didn't mean to do either. (That is, either make a coherent argument per se—just like “here are some reasons that this language sucks” but it does not imply “that is a holistic overview and therefore there are no redeeming qualities and it sucks in all contexts” which is what an argument would look like—or to assume your masculinity— ... like obviously if you were in an all-male dev team you would also be male and so I suppose I should have said “everyone else” but really I just figured you were probably not on an all-male dev team and the problem would be obvious that way.)


What I always love about “the rewrite” is the sheer optimism of the people involved - “we’ll be done in six months, and the new system will be a thing of wonder”. Fast forward to several years later. The original proponents will have moved on, leaving behind the accumulated bodges and shortcuts from the increasingly desperate efforts to try and get something live… and then the cycle repeats.

There are ways to do this properly, but no one wants to put the effort into understanding the existing code base and the reason for why it is the shape it is. Everyone is happy with the “who wrote this crap - we can’t work on it” line.


I've only witnessed one full rewrite in my (admittedly short) career so far, and it went shockingly well. The goal was to rewrite a C++ application from the 90s in Java, since the company had largely moved to the web and C++ devs were getting close to retirement. The C++ application was also written in the underfunded startup phase, and then over decades new features had been "tagged on", so the architecture wasn't that robust.

The team tasked with the rewrite set a goal to finish in a year, the first 4-6 months were entirely spent on planning. After that, features were implemented in a modular and iterative approach. I think overall it took a bit longer than a year to be feature complete, but by the end of the year they had a working platform with all the core features implemented.

I think the key here really was very good planning, and the pitfalls you describe can be at least partially ascribed to agile development not being the right tool if you have a very clear and large set of requirements.


Would you mind elaborating on the kind of planning that went on?

I’ve been involved in instances of planning that meant writing out pseudo code on the whiteboard for the entire software, including errors and exceptions, and the final task was ‘just write the code!’. It started off well when coding the big picture modules but as we started getting into the details, bugs, unexpected race conditions and more started coming up.


I wasn't involved myself, just talked over lunch to some of the guys on the team. The customers for the software are other businesses, each with a pretty large contract volume and direct negotiations, which meant there was already a lot of documentation on the actual requirements of the customers. From that a ranking of core requirements was compiled, a ranking of requirements that would need to be fulfilled due to current contracts, as well as a ranking of requirements that would probably be required in the future. I think that was most of the first two months. In addition they probably scouted out options for tech stacks.

With that in hand, the rest of the time was basically spent with UML and Figma. So very large iterative class diagrams and User Journeys -> UI/UX.

I think by far the largest advantage they had is that the project manager and several devs had been with the company and the product since the 90s, so there was a lot of institutional knowledge on the product and customers.

However, after writing the post above I asked the lead how things are going, and they actually still aren't feature complete (3 years on). So while it has fully replaced the old product and 90%+ works, extending to some of the features is still very hard.


> the first 4-6 months were entirely spent on planning

This is the most surprising part of the story. I have never seen this kind of “big design up front” actually work. Can you elaborate on how you made this work?


I see this story here all the time, but I have never seen it play out in the real world. Most rewrites I've seen have been hugely successful. This sounds more like a sticky narrative that everyone keeps repeating to tell a nice story, get attention, and make them look experienced. In reality, no experienced engineer is so naive as to think that the new system will be a perfect thing of wonder or not look at the tradeoffs made in the old codebase. Reality is more nuanced than these simplistic fictional narratives.


I wish that were true. I’ve seen ridiculous catastrophic failure twice. I still don’t know if they actually believed their own bullshit, or just wanted to do the rewrite and lied to make it happen.


I think a rewrite is fine as long as it's incremental and well-integrated each step along the way, rather than starting off from scratch. Unfortunately this seems like the sort of lesson each engineer needs to learn through experience rather than hearing someone tell you about it. Live and learn I guess. :)


I founded a company and reached $6M ARR with >50% net profit margin. After 4 years decided to do a full rewrite. Finished it successfully within a year including full migration of all clients. Was the best decision ever.


Do you consider the rewrite a good decision because it helped increase profits or for other reasons?


Not op but I would say that profits are only one part of the equation. If you have massive profits and you don't put resources in other areas (HR, engineering, etc.), you lose sustainability and might stall and die eventually. It's a tradeoff between long term sustainability and short term money making IMO


Congratulations. It reads like you were in control, the driver seat and into the code base. What I want to say is that I consider a rewrite possible because you controlled most /all aspects. Currently we are moving from several ERP instances/configurations due to acquisitions to one completely new ERP version and architecture. This is a huge project consuming enormous resources and needs a director reporting to the CEO.


Am in the middle of a rewrite now. I'm always against them, but this case is clearly perfect for it.

- The old system used to have many users, but only one large organisation was left.

- The way they use the system is atypical of what it was originally intended for. They use like 20% of the original functionality.

- The old system was started by the very first software our company ever produced (long before they brought in professional software developers), credits to their choice of Django that it worked for fifteen years, but it wasn't very good.

- But in the last nine years of that, hardly any maintenance has been done on it and now none of the build tools work.

We're making a much more focused, modern application now that does everything the customer used in the old one but looks completely different.

So there exists at least one situation where a rewrite is the answer.


How to do a full rewrite:

1. Create a subsidiary company and transfer the code ownership to that.

2. Sell the subsidiary company to an investor for big money and run away quickly.

3. While rolling around in VC cash, green field a better product from the ground up without all the horrible things you did last time.

4. Goto 1


This topic is underrated and extremely important in YOUR career. Youngsters believe rewrites are for old people who didn’t get their code right.

Business apps only last for 15 years max. Every app requires a rewrite after 10 years, apps that don’t do it weren’t actually important for business.

Software stacks are flimsy. Apps need to support NPM, then Kubernetes. There are new security discoveries and ways to prevent them. How do you list your libraries if you are not using NPM? Also we need to move the app to the cloud because installed software is not fashionable. The original software may be reasonable, but assumptions and expectations change.

Why don’t we see them more often, then?

The response is: Because they are worded as “Brand new app from scratch” in job descriptions, and because NPM hasn’t finished its 10-year cycle yet. But rewrites will come, and you better have a methodology for that.

When you build an app, one day, a requirement will be to make apps easily rewritable.


Nope. I work on software that is 40+ years old. The code runs major airlines and airports with very few problems. We are currently upgrading it to run in a browser. Without rewriting it.


> Youngsters believe rewrites are for old people who didn’t get their code right.

Youngsters haven't yet seen the tyranny of managers and their always changing requirements.


The first rule of rewrites is don't. That is almost always the correct decision. Like Lucy and the football, if you're sure that this rewrite is different, you need to consider how that'll happen.

If you build in parallel, you need to decide what work will go into the existing system and how long the new system will take. Do you pause new features? If yes, I guarantee you that'll be an issue. I also guarantee it'll take longer than you think it will. If you don't pause new features, you'll be chasing moving goal posts so it'll take even longer. All of this is significant extra effort that you have to hope to recoup somehow with the new system.

Remember that it won't take that long before your new system is also considered legacy.

Once you have that new system, how do you deploy it? Dropping it in place is generally a high-risk strategy that'll greatly slow your migration.

Instead of the above, what you generally want is to do partial rewrites in-place using infrastructure you may need to build but you'll want anywhere. Here is the progression of, for example, a backend migration:

1. Double-write to the old and new backends. The new system is just a dummy, essentially. You won't be using it (yet). Do offline verification on the old and new backends. Look at metrics for double-writes vs single-writes (eg latency, general performance, crashes);

2. Start reading from the old and new backends. Do a comparison of the data retrieved from both. Log and flag any instances where the data differs. You've now verified the new backend;

3. Start reading from the new backend;

4. Stop writing to the old backend.

This requires things like a robust experiment framework so you can, for example, double-write for 1% of your users and then compare the two groups easily across a wide range of metrics to see if there is any unexpected regression.

The point of all this is you can do partial rollouts of any of these steps and, more importantly, a rollback is trivial.
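A minimal sketch of that four-step progression, assuming both backends can be treated as simple key-value stores. The names `Stage`, `MigratingStore` and `in_rollout` are made up for illustration, and plain dicts stand in for the real backends:

```python
import hashlib
from enum import IntEnum


class Stage(IntEnum):
    OLD_ONLY = 0      # before the migration starts
    DOUBLE_WRITE = 1  # step 1: write to both, read from old
    SHADOW_READ = 2   # step 2: read from both, compare, still serve old
    READ_NEW = 3      # step 3: serve reads from new, keep writing to both
    NEW_ONLY = 4      # step 4: stop writing to old


def in_rollout(user_id: str, percent: int) -> bool:
    """Hash the user into a stable bucket 0..99 so the same users stay in the experiment."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent


class MigratingStore:
    """Hypothetical wrapper; `old` and `new` are dict-like stand-ins for the backends."""

    def __init__(self, old, new, stage=Stage.OLD_ONLY, percent=0):
        self.old, self.new = old, new
        self.stage, self.percent = stage, percent
        self.mismatches = []  # (key, old_value, new_value) found during shadow reads

    def _stage_for(self, user_id):
        # Users outside the rollout percentage keep the old behaviour.
        return self.stage if in_rollout(user_id, self.percent) else Stage.OLD_ONLY

    def write(self, user_id, key, value):
        stage = self._stage_for(user_id)
        if stage < Stage.NEW_ONLY:
            self.old[key] = value
        if stage >= Stage.DOUBLE_WRITE:
            self.new[key] = value

    def read(self, user_id, key):
        stage = self._stage_for(user_id)
        if stage == Stage.SHADOW_READ:
            old_value, new_value = self.old.get(key), self.new.get(key)
            if old_value != new_value:
                self.mismatches.append((key, old_value, new_value))
            return old_value  # the old backend is still the source of truth
        if stage >= Stage.READ_NEW:
            return self.new.get(key)
        return self.old.get(key)
```

In this sketch, rolling back steps 1 to 3 is just lowering `stage` or `percent`, because the old backend keeps receiving every write until step 4.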


The ease of accommodating new features along the way is actually a fine opportunity for the design of the new system to prove its mettle. If new features turn out to be troublesome, the new design is probably flawed as well.


> what you generally want is to do partial rewrites in-place using infrastructure you may need to build but you'll want anywhere.

Worked in a place where they were embarking on their third generation of failed incremental replacement of a system originally built in COBOL. No indication third time was the charm either. Big problem with these projects is they ran for so long that whatever they were using stopped being the hot new thing half way through, so they always needed to restart with a new paradigm before the old one was fully implemented.


I once was involved in a bigger rewrite/refactoring of a backend where in a few areas we didn't have enough tests and I found goreplay to be very helpful as you do not need to explicitly write into two backends but the request data will be "doubled". Still stateful requests were a bit problematic but you can write so called "Middleware" for goreplay (in different languages) and I was able to solve it this way.


> backend migration

That's cute. I'm in the middle of a migration of a vb6 app, no separation of concerns, database access from the app, no tests, no documentation, no code homogenisation, tons of function calls that end up in an empty string and with MS threatening to end vb6 support just around the corner.


Rewriting is so attractive.

If the code has aged and is suffering from many code smells and anti-patterns, a rewrite becomes even more attractive.

"Why should I spend all this time adding a simple feature to this crappy code??"

But writing code is the easy part.

Architecting a correct solution that meets today's business needs and can be built upon, as well as walking the data, users, and business workflows over to the new system, are the hard part.

I've never seen it go smoothly. I want to see it happen, because I'm an optimist, but so far it's not gone well.


Oddly, your username is relevant here. It's a common refrain of mine when I hear about someone undertaking a rewrite that becomes something of an albatross.

The only times I see rewrites succeed (where success is measured by how often your customers notice you breaking stuff) is when there's a comprehensive set of integration tests to write against. That really seems to be the sole determining factor.


I don't know. Tests freeze design so if you make your tests too comprehensive you'll just rebuild the existing system. This might be the correct approach if you are changing platforms/languages but often what you want is a system that is actually fundamentally different from the one you started with.


I would agree if I were considering unit tests, where the implementation details fall into testing scope, but I was deliberate in my word choice.

Integration tests exercise behavior at the boundary of the system under test, where it integrates with other systems, which is ultimately what you want for steering a rewrite.
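One way to picture a boundary-level test for a rewrite is a "golden" comparison: record what the old system returns for a set of requests, then hold the new system to the same responses. This is only a sketch; `legacy_price` and `rewritten_price` are invented stand-ins for the old and new implementations, and the request shape is an assumption.

```python
import json


def legacy_price(request: dict) -> dict:
    """Hypothetical old implementation; its internals do not matter to the test."""
    total = 0
    for item in request["items"]:
        total = total + item["qty"] * item["unit_cents"]
    return {"total_cents": total, "currency": request.get("currency", "USD")}


def rewritten_price(request: dict) -> dict:
    """Hypothetical new implementation with a different internal structure."""
    total = sum(i["qty"] * i["unit_cents"] for i in request["items"])
    return {"total_cents": total, "currency": request.get("currency", "USD")}


def record_golden(handler, requests):
    """Capture request/response pairs exactly as they cross the system boundary."""
    return [
        (json.dumps(r, sort_keys=True), json.dumps(handler(r), sort_keys=True))
        for r in requests
    ]


def check_against_golden(handler, golden):
    """Return every case where the handler's response differs from the recording."""
    failures = []
    for raw_request, expected in golden:
        actual = json.dumps(handler(json.loads(raw_request)), sort_keys=True)
        if actual != expected:
            failures.append((raw_request, expected, actual))
    return failures
```

Because only serialized inputs and outputs are compared, the new code is free to be structured completely differently, which is the distinction from unit tests drawn above.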


That's probably not a bad thing if it's behavior that is frozen. I find the most successful rewrites change as little as possible and iterate gradually while the least successful rewrites try to be a mixed bag of everyone's wishlists.


> But writing code is the easy part.

If it's inherently easier to write new code than maintaining old code, then we should do our best to also make it a good choice operationally. Maybe write a new service/module alongside the old thing and deploy that, instead of opting for a full rewrite immediately. What else are we supposed to do, suffer with codebases that decline over time, instead of using more modern tooling and frameworks?

If adding a new module to a dated legacy Spring system would involve messing around for weeks with XML configuration mechanisms, brittle runtimes and mechanisms in the codebase and integrating everything with the existing solution whilst having risks in regards to overall stability, then you might as well bootstrap a new service in Spring Boot/Quarkus/whatever you're comfortable with and deploy it alongside the old thing. (using Java as an example here, because the old Spring projects I've seen got pretty bad sometimes)

With reverse proxies or other gateway solutions, as well as containers and orchestration, deploying and routing traffic to this new service (even just for new endpoints in your existing domain/API) is not a big issue and many other concerns are covered, in addition to this new service being simpler to change and evolve over time, as well as replace altogether.

However, the situation around the data and integrating various services with one another still is difficult - because with microservices your data model would be strewn across multiple separate databases, whereas with multiple services connecting to the same database you'd end up with something that's hard to reason about. Whereas if you call APIs directly, you're also introducing an unreliable network connection into the mix between the services, which has some overhead as well, or a complex message queue/pub-sub solution.

I'm glad that we're far along enough for the former problems to be at least reasonably solved (things like 12 Factor Apps are great), but I don't think that there are all that many solutions for the latter group of issues, unfortunately. Even worse if you do a full rewrite and there's an expectation of the new thing to 1:1 match the logic and features of the old one, then you're setting yourself up for failure, unless the project is simple.
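The routing half of this (new endpoints go to the new service, everything else stays on the old one) is usually a few lines of reverse proxy config. As a language-neutral illustration, here is a rough Python sketch of the decision a gateway makes; the upstream URLs, path prefixes and the `pick_upstream` name are all invented for the example:

```python
# Hypothetical upstreams; in practice these would live in the proxy/gateway config.
LEGACY = "http://legacy-app:8080"
ROUTES = [
    ("/api/v2/reports", "http://reports-service:8080"),
    ("/api/v2", "http://new-service:8080"),
]


def pick_upstream(path, routes=ROUTES, default=LEGACY):
    """Longest matching path prefix wins; anything unmatched stays on the legacy app."""
    best = None
    for prefix, upstream in routes:
        # Match the prefix itself or anything below it, but not "/api/v2reports".
        if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, upstream)
    return best[1] if best else default
```

Moving one more endpoint over is then a one-line change to `ROUTES`, and removing the line sends traffic back to the old system.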


Wow - I just want to express some gratitude for the thoughtful response here.

You've captured the essence -- although we have solutions for when the impulse to rewrite an application occurs, we don't have solutions that don't bring their own problems.

Further, the natural human expectation for the software engineer to be like Chuck Norris, who can Fix Everything Without Changing Anything, needs to be considered and managed.

The opposite - where, as someone else termed it - the new version is based on a grab-bag/wishlist of everyone's hopes and dreams, generally doesn't work out either.

The true path does seem to be:

* set up integration tests that validate WHAT the system does (blackbox), rather than HOW the system works (whitebox).

* build the new system modularly, offloading one part of the old system's work onto the new system, one part at a time.

* keep the new and old system synchronized, using eventing/message-queues/etc (this is going to be difficult)

* continue replacing the old parts of the system with their new replacements until the old system no longer exists

* Now you've created the old system again, but the code and architecture are clean and capable of the changes that were not possible before

* Move forward confidently adding new features and capabilities to your new system which does not suffer from the problems of the old system.


> When you find large chunks of code that seem to serve no purpose - get rid of those immediately!

Yeah, that's a good recipe to break your software in non-obvious ways that only become apparent later, and will end up costing you a lot of money and customer good-will. Chesterton's Fence applies.

My advice: Don't do a full rewrite. Instead rewrite/refactor the most painful parts in small, well-understood increments. Rewriting only 20% may give you 80% of the benefits. Be very conscious and explicit about what your actual pain points are and what exact benefits you'll get by rewriting specific parts.


Your advice only works if the "large chunk" of code isn't a large chunk of code lol. If you can safely refactor a subset of the code without the full context then of course you'd do that. But if there's a big enough chunk of code that people can't load it into their brain RAM before making a change, it'll just get bigger and bigger as more people tack on drive by changes. If you're past that point, it's time to ctrl-A delete specifically so you can apply Chesterton's Fence on future refactors.


In my experience it's always possible to proceed in small increments, and that's the safe way to go. Not being able to load a big chunk into your "brain RAM" is actually an important reason for only doing incremental refactors. It doesn’t mean you stop and leave it in some intermediate state. It also doesn’t mean "drive-by changes". It's a coordinated, planned-out and continued effort.


I normally agree with this sentiment but I have a story about this. I was in a situation where I was handed over old projects to expand on and the choice to rewrite or build on the same codebase was up to me and another dev (management didn't care or know what should be done in this regard). As "mature" devs we knew rewriting wasn't a great idea. So we code splunked for months, wrote tests, upgraded and cleaned out code bits at a time. Only after succeeding in this did we realize that 85% of the existing code was unused and most of the remaining 15% had pretty substantial bugs in it and naturally ended up mostly rewritten to fix. Turns out the previous dev worked alone for years, didn't use source control, didn't ever delete anything, and that the company has quite a history of pivots. After stepping back from this we realized we should have rewritten, that we basically did anyhow, except we took the longer road to get there but now still have to continue with some of the framework/lib choices the previous dev had made because we opted to stick with them in order to facilitate this piece by piece "rewrite". I don't think we did anything wrong here but it's funny how taking this mature approach, in retrospect, probably didn't help much and now we're beyond the opportunity for a rewrite. I know some folks will say that what we did was actually a first step towards a proper rewrite, and I'd agree, except the time to do that never came.


When I was just starting as a developer, I had to maintain an old system.

I didn't know much, so I assumed all its problems stemmed from my missing understanding of the system.

After working on it for a few years, I came to the conclusion that this wasn't the case. The system was inherently broken and not doing a rewrite would be negligence.

I did a rewrite in the end, but I did it incrementally. Built a new API, and replaced parts of the interface with new implementations that used the new API.

Took a year, but it was worth it.


HN comments often complain about incompetent management, but in my experience these misguided rewrite-from-scratch projects are initiated by developers who want to explore some new framework or platform, or who just think green-field development is more fun. And then they convince management a rewrite is necessary.


I once pulled off a scratch rewrite.

The original author had spent months writing their own custom ORM framework. The big problem was that this framework didn’t support nested transactions, or validators. This meant that when users uploaded data, some of the data would go in, then a later part would fail and you’d get an inconsistent state. Then users would spend days fixing the inconsistent state through a clunky UI.

I rewrote the whole thing with Django in 2 weeks. Each upload was in a single transaction and if something failed it would roll back the whole upload with a message that said “Upload failed with error {}. No data was added to the database. Please fix the error and retry the upload.”

The users loved it.

This was a “mature” piece of software that had been critical for many years.
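The all-or-nothing upload described above can be sketched with nothing more than Python's standard-library `sqlite3` module (this is not the Django code from the comment; the `items` table, its columns and the negative-quantity check are assumptions made for the example):

```python
import sqlite3


def upload_rows(conn, rows):
    """Insert every row or none of them, and return a message for the user."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes it.
        with conn:
            for name, quantity in rows:
                if quantity < 0:  # stand-in for a validator
                    raise ValueError(f"negative quantity for {name!r}")
                conn.execute(
                    "INSERT INTO items (name, quantity) VALUES (?, ?)",
                    (name, quantity),
                )
    except (ValueError, sqlite3.Error) as exc:
        return (
            f"Upload failed with error {exc}. No data was added to the "
            "database. Please fix the error and retry the upload."
        )
    return f"Uploaded {len(rows)} rows."


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT PRIMARY KEY, quantity INTEGER NOT NULL)")
print(upload_rows(conn, [("bolt", 10), ("nut", -3)]))  # second row fails, first is rolled back
print(upload_rows(conn, [("bolt", 10), ("nut", 3)]))   # both rows are committed together
```

The point is the same as in the story: a failure halfway through leaves the table exactly as it was, so there is no inconsistent state to clean up by hand.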


That ORM sounds like prisma ORM


Prisma ORM is way better than what I had to work with.


I've been in charge of a full rewrite of a B2B system that looks to work out pretty well so far (will soon do the full rollout for the first customer after successfully piloting with 20 users).

I think a key to the success is that the old system is so crufty that it's not, in fact, a moving target.

The old system is over 20 years old, a web application written in C. Apart from security considerations, the UI looks archaic, there is no separation between frontend and backend, and data storage is in custom-designed files - no database engine, so it's essentially impossible to add substantial new functionality. Even the most minor config change requires a recompile.


I highly recommend reading “Working Effectively with Legacy Code” BEFORE even considering a rewrite. Isolating and rewriting parts of a legacy system step by step is almost always the right way to do it.


I don't know. I always thought of software as a living thing. As such, it is never complete.

For example, 98% of the atoms in your body are replaced every year. Doesn't it make more sense to think of software in this way, as opposed to e.g. a car or a building? I guess even a building needs maintenance (roof is replaced, windows are replaced, ...), but it's on such long timescales that we think of it as "finished".

If you think of software as a living thing where X% of it needs to be replaced every year just to keep it where it is, then there is no need for a "big rewrite".


It's not a very serious article.


My alternative solution is just to quit lol.


The problem is finding a new job where you don't have to work on legacy systems.


Isn't this what Google have started to do internally? Take already working systems and rewrite them.

Goes to show that if you have a money fountain then your business decisions can be indistinguishable from satire and you'll still be fine


Hey, this guy wrote instructions for our system replacement project! I wish I had read this a few years ago, then I wouldn’t have been so surprised.


I wonder how LLMs take such training data into account


You need to rewrite it if it's not in Rust


I find this kind of cynicism tiring. Full rewrites have major risks but also potentially major upside, let's leave it at that without getting dogmatic.

I've had conversations with "veteran" (i.e. old and no longer well-informed) techies who seem to think that all programming languages do exactly the same job, all DBMSes have the same ergonomics, choice of OS doesn't matter "if you know what you're doing", blah blah blah. This attitude isn't a mature counter to youthful starry-eyed overenthusiasm, it's just hopelessly jaded.

I rewrote the whole backend of my SaaS. I also completely redid the hardware and networking stack. This was because it was all a mess and fixing it would have been more risk and more work than just redoing it. Both projects were successful, and now the backend and the infrastructure are (a) future-proof, because they pass stress tests at 10x our current load, (b) much more reliable than they used to be, as measured by real-life uptime, (c) well-documented and (d) maintainable by people other than me. There's no magic to it, we just thought it through carefully and did some planning.


I’m honestly not sure if this is satire or not given the subdomain.

And definitely didn’t like the “People who don’t work on software… …are stupid.” - can’t stand that kind of chit if it’s not satire.

But I guess, to piggyback on other comments, my experience with full V2 rewrites is: get laid off because the project gets canceled.

The things mentioned in the article can be also useful when migrating/upgrading large features within projects, which may be more common, but is kind of self evident imo

Edit: clause


It's obviously satire, and if you read the article for "people ... are stupid" you will see that is also satire


Even the subtitle "Fix the system, not your understanding" should be fair warning already.

On a different note, I wonder what it does to the reliability of AI models when they are trained on vast portions of the internet that include satire and April 1st articles.



