Long Term Refactors (max.engineer)
65 points by fagnerbrack 7 months ago | 38 comments



You mean... iterative development?

It is simple: you break up whatever improvement work you want to get done into pieces small enough that they can be slotted in to use up any spare capacity during your development cycle.

You always want to have some spare capacity to fill with improvement work. This way you can manage unplanned work by temporarily reducing your improvement work rather than overrunning your project.

The important part is that any improvement needs to be broken up into small enough pieces that each can be shipped separately. You don't want half done, unshipped work to be a continuing feature of your development process putting an overhead on everything you do.

As to refactoring, the best refactoring is usually done in small increments. I look at the codebase, I see something I do not like, I pick one thing I can fix here and now. Lather, rinse, repeat.

Some of the worst mistakes I have seen, ones that really put the existence of a project in question, were bright developers getting their managers to agree to a huge step change that then never got done. This is usually some sort of rewrite, even if the devs try to hide the rewrite. Usually they say "let's create a brand new service where we will keep everything clean and then we will slowly migrate the functionality from the old to the new". And it rarely works.

Because of these experiences, I have pretty much banned rewrites on all projects I work with. I now understand that a rewrite carries with it a huge amount of various risks and that people are biased to overestimate the costs of the ugly stuff they see in the old application and underestimate the risks of things they are not aware of when deciding to do the rewrite.


> best refactoring is usually done in small increments.

Yes. And contrary to many people, I'm fine with the (eternal?) limbo that we then end up with: parts of the codebase using the new stuff, parts still using the old stuff.

I actually prefer that. Just add proper documentation and, if the language/framework supports it, add deprecation notices and move on. Through e.g. "refactor on touch" we can move the old code to the new whenever we are in there anyway, changing other stuff.

Sometimes I'll find that after a while, the old code is used so rarely that now's the time to just rip it out entirely - a small refactoring. Sometimes I'll find that the important pieces - important business logic, the critical performance path, often-touched code, exposed parts, etc. - can be changed to use the better/more-performant/more-testable/safer code. I think just leaving the last 10% is preferable to postponing the entire refactor until we can also move the last 10%.

It's part of what I tend to call "Continuous Incremental Refactoring".


> add deprecation notices

one thing that I think needs to be more common is that you should not write simply "this method is deprecated".

You must always say "...is deprecated *in favour of* FOO".


yes!

And it's one of the first things I'll add to the lib/framework/tools/utils if it doesn't exist: a mechanism to mark code deprecated. It should require a description, the item that should be called instead, and optionally a URL and optionally a time/version after which it will be considered an error instead of a warning.

The deprecation should be configurable (through ENV vars, e.g. DEPRECATIONS=fail|warn) so that I can run a test-suite or some local dev env and have it blow up when it hits a deprecation. Useful when working on them and for raising awareness.

Such a method/macro/function/annotation isn't that hard to build in most languages: just a proxy that calls the deprecated method with the arguments passed in, then formats a message and logs or raises it.
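
A minimal sketch of what such a proxy can look like in Python (the DEPRECATIONS=fail|warn env var is from the comment above; the decorator name and everything else is just illustrative, not any particular library's API):

    import functools
    import os
    import warnings


    def deprecated(reason, use_instead, url=None, fail_after=None):
        """Mark a callable as deprecated; a reason and a replacement are required."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                msg = f"{func.__qualname__} is deprecated ({reason}); use {use_instead} instead."
                if url:
                    msg += f" See {url}."
                if fail_after:
                    msg += f" Becomes an error after {fail_after}."
                # DEPRECATIONS=fail makes a test suite or local dev env blow up on first use.
                if os.environ.get("DEPRECATIONS", "warn") == "fail":
                    raise DeprecationWarning(msg)
                warnings.warn(msg, DeprecationWarning, stacklevel=2)
                return func(*args, **kwargs)
            return wrapper
        return decorator


    @deprecated("superseded by the v2 client", use_instead="fetch_v2",
                url="https://example.com/deprecations/fetch", fail_after="v3.0")
    def fetch(resource_id):
        ...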


> And contrary to many people, I'm fine with the (eternal?) limbo that we then end up with: parts of the codebase using the new stuff, parts still using the old stuff.

That is fine when, as you described it, there are only two types of code: one still using old technique/technology A and the other using newer version B.

Where it becomes a problem is when you haven't finished the transition and realise you need to start moving on to C, or you leave and the new person doesn't understand the difference (and probably also has their own desired replacement). Before you know it, you have three or even more styles of code in play at once, with diminishing ability to keep moving to the latest version. This is sometimes called the lava layer antipattern [1].

[1] https://mikehadlow.blogspot.com/2014/12/the-lava-layer-anti-...


It is an antipattern. But IMO it's still better than "feature freezing" for months, delivering nothing of (business) value for weeks or months, working like madmen trying to remove the last 10% of the old code - which, by the 90/10 rule, probably costs 90% of the budget.

I'd rather have three or four versions - provided they are marked well, documented, and so on. Even if the codebase uses four versions, at any point in time there is still only one that's the right one today.


The part I find hard to convince people of is not that this is a good idea, but that it is always a good idea and always possible.

Everybody thinks it's a good idea. But, their particular problem is a special snowflake that requires a big bang refactor and there's just no way around it.


It's funny how many people push for big bang rewrites and how few people have actually seen a successful big bang rewrite.

I've been at 5 shops across nearly 20 years. I've been on teams mired in promises of a big bang rewrite of our own system. I've been on teams that had competing teams hired to try to replace our system. I've never seen it work, not once.

It is a herculean, multi-year, double-your-current-budget effort requiring you to throw most of the functionality of your existing system out the window. It just never works out.

The naiveté is incredible. I've seen promised 6-month rewrite projects escalate into 2-year projects and fail, leaving the existing system live for another 10 years.


I've worked on many "Big Bang Releases" and consider all of them a poor choice.

Even the single one that was a huge success: in hindsight, I would rather have migrated that through the strangler fig pattern¹. Yes, it worked out fine. We even managed to migrate it roughly on budget and on time. The release went OK, and we didn't experience crucial data loss or lost customers. So by all metrics it was a success. And even then: I won't do it like that again. Because part of the success was just random chance: if the dice had fallen differently, it would've been a much worse migration. And I don't like leaving such stuff to chance.

edit: ¹ https://martinfowler.com/bliki/StranglerFigApplication.html


The thing I am seeing there is a dislike of the codebase rather than a genuine case for a rewrite.

My rule of thumb is this: you need to touch a codebase _more_ to throw it away.

Tell that to whoever wants to rewrite. You will have to maintain the old application anyway (yes, this is a given, you are not getting an exemption for security holes) and you will have to de-risk the project by providing value (== release early, release often, continuously switch parts over - which requires work on the legacy side, too).

If someone wants to do a big bang rewrite, then you need to fight for the importance of still touching the "legacy" codebase. That reluctance to touch it is the actual reason why people are so religious about their rewrite.

Other than that: same thing, ~20 years and nothing good comes out of big bang rewrites and releases.


That's what one should do, but I generally see the incentives setup all wrong.

The most common scenario is that the fresh young smart ** hoodwinks the boss into giving them 1-2 years and 2 subordinates to help them do the rewrite in isolation. They come off all support rotas and have no responsibility for BAU operation of the old codebase. The old codebase is left with a same-sized or larger team maintaining it and still adding new features.

The other way I've seen this done is that senior management really wants the rewrite, shops around for a candidate, internal or external, who promises to do it... and then it's basically the same setup as above.

Setting up an explicit "good" and "bad" team creates awful vibes, the "good" team never delivers and starts exiting as accountability starts to creep in.

Meanwhile the "bad" team hemorrhages staff continuously from the burn out of maintaining the old thing with less people while also being told your job has an expiration date.


> I now understand that a rewrite carries with it a huge amount of various risks and that people are biased to overestimate the costs of the ugly stuff they see in the old application and underestimate the risks of things they are not aware of when deciding to do the rewrite.

I see this a lot with large-scale refactorings, where people usually start with the easiest case and work their way up the complexity ladder, and by the time they reach the hairy stuff it's basically time for another refactor to account for what's been missed. I have started to look at any proof of concept or refactoring proposal that doesn't take into account the complicated bits and the corner cases with a lot of suspicion.


Exactly, this is perfectly put and something I've hit several times myself. I have learned to do the exact opposite: sprint to structure the project just enough to begin to tackle the hairy stuff as soon as possible, so that any changes required for the hairy stuff are being made without a ton of other code and tests to update along the way.

Recent example: the architecture that worked well to generalize validations & mutations for individual records of different types had to change drastically to support relational cascades between multiple records of different types, but it was only a day of refactoring to do it early, instead of a month if it had been left to later.


For projects where the estimated rewrite duration exceeds three months, we have employed an iterative approach to refactoring for several years. This methodology has yielded pretty good results.

We also use a series of Bash scripts designed to monitor the refactoring process. These scripts collect data on the usage of both the old and the new "state" within the codebase. The collected data is then dumped into Grafana, giving us a clear overview of our progress.
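
The commenters mention Bash scripts; as a rough illustration of how small such a tracking script can be, here is a Python sketch (the marker strings and the output format are made up for the example; the real scripts and the Grafana wiring will look different):

    import pathlib
    import time

    # Hypothetical markers: count call sites of the old vs. the new API.
    OLD_MARKER = "legacy_state."
    NEW_MARKER = "new_state."

    old = new = 0
    for path in pathlib.Path("src").rglob("*.py"):
        text = path.read_text(errors="ignore")
        old += text.count(OLD_MARKER)
        new += text.count(NEW_MARKER)

    # One line per run; a cron job can append this to a file or feed it to
    # whatever metrics backend is already in place.
    print(f"refactor_progress ts={int(time.time())} old={old} new={new}")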


An example: “Are We ESMified Yet?” is a Mozilla dashboard tracking an incremental Firefox code migration (1.5 years and counting) from a Mozilla-specific "JSM" JavaScript module system to the standard ECMAScript Module "ESM" system. Current ESMification: 96.69%.

“Are We X Yet?” is a Mozilla meme for dashboards like this.

https://spidermonkey.dev/areweesmifiedyet/


I saw the phrase “are we X yet” used in the Rust community (is Rust ready for games or whatever) but never realised the phrase’s origin with Mozilla. Thank you for the little piece of history.


AFAIK, https://arewefastyet.com/ (AWFY) was the first, registered in 2010.

“Are We Meta Yet?” http://www.arewemetayet.com/ is an incomplete and outdated list of some of these dashboards. Some domains expired and are now squatted.


If you have scripts that can count uses of the deprecated code, you can use them to detect regressions and generate build warnings if someone adds new code using the deprecated code. Periodically you can decrease the script’s max use counter, ratcheting down until you hit zero uses.
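
A hedged sketch of that ratchet, along the same counting lines (MAX_ALLOWED and the marker string are placeholders; the number gets lowered over time):

    import pathlib
    import sys

    MAX_ALLOWED = 42                      # ratchet this down over time until it hits 0
    DEPRECATED_MARKER = "legacy_state."   # placeholder for whatever the script greps for

    count = sum(path.read_text(errors="ignore").count(DEPRECATED_MARKER)
                for path in pathlib.Path("src").rglob("*.py"))

    if count > MAX_ALLOWED:
        print(f"FAIL: {count} uses of deprecated code (max allowed {MAX_ALLOWED})")
        sys.exit(1)
    print(f"OK: {count}/{MAX_ALLOWED} uses of deprecated code remain")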


Oh, the idea of tracking the state of the refactoring process with small scripts is very cool. Obvious in retrospect, too. These scripts would be useful even if they're only like 90% correct.


> Usually they say "let's create a brand new service where we will keep everything clean and then we will slowly migrate the functionality from the old to the new".

The "create a new service" part or the "slowly migrate" part? The latter sounds like it would be iterative in exactly the same way


> The latter sounds like it would be iterative in exactly the same way

I think you're comparing different things.

The alternative to "create a new service" is "refactor existing service iteratively".

The alternative to "slowly migrate" is ... nothing. You don't need to do this if you refactor the existing service (incrementally).

If the service has state (like DB), then this "slowly migrate" usually becomes crazy complex.


I've pulled off several large, successful rewrites so I'm going to respectfully disagree, even though I agree I have seen exactly the failed rewrites that would motivate someone to ban rewrites.

The crux of my argument is that some projects really are costly enough to maintain that a rewrite is an overall lower cost (pricing in risk), while others are fixable in-place and a rewrite is an unnecessary cost, and it's rarely obvious which is which for all of the same reasons that project planning and cost estimation are infamously difficult.

> The important part is that any improvement needs to be broken up into small enough pieces that each can be shipped separately. You don't want half done, unshipped work to be a continuing feature of your development process putting an overhead on everything you do.

This is an ideal but it should not be a requirement. If you "need" this for any improvement, you're going to miss out on a lot of the biggest possible improvements because they have exactly the far-reaching impacts that make them harder to pull off but also worth much more when you do merge them. If you reject any such refactorings, you'll only improve in small local ways and never large global ones.

Best case, you can try to get the best of both worlds by supporting both old & new interfaces to the same improved implementation, slowly migrating old edges to new ones. That's also just an ideal, and there are many ways it can prove impractical, e.g. subtle divergence between old and new types which is useful for the new implementation but makes it harder to interoperate with the old one.
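
As a rough sketch of that "both interfaces over one improved implementation" shape, with every name invented for illustration:

    from dataclasses import dataclass


    @dataclass
    class Order:
        """New, richer type used by the improved implementation."""
        item_id: str
        qty: int

        @classmethod
        def from_legacy(cls, item_id, qty):
            return cls(item_id=item_id, qty=qty)


    class PricingEngine:
        """The single improved implementation, written against the new types."""
        def quote(self, order: Order) -> float:
            return order.qty * 9.99       # placeholder pricing logic


    class LegacyPricingFacade:
        """Old call sites keep their flat signature; we convert and delegate."""
        def __init__(self, engine: PricingEngine):
            self._engine = engine

        def get_price(self, item_id: str, qty: int) -> float:
            return self._engine.quote(Order.from_legacy(item_id, qty))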

The fact is, sometimes a design has a big enough problem that a large change will pay off, and sometimes an implementation can be bad enough that this change cannot safely be made within the existing implementation. Usually when I've seen those, it's because the regression testing was so inadequate that you can't make changes with confidence, and yet the code isn't factored in a way to introduce the testing without the refactoring itself facing risk of regression, etc. and it's a total deadlock that destroys a project.

Many, many real world projects end up in this state. Often it's because leadership prioritized deadlines more than quality, and promised to "fix it once it's launched", but then nobody was confident changing something that sorta kinda worked. If you don't invest in regression testing from the start, it'll always be too risky to add later. At that point, you may as well build a new project factored for safe maintainability including its own regression testing that will pay off forever, including testing that it does not regress on any use cases you can reproduce from the old project. You have to do something like this to break the deadlock, so it may as well fix other deficiencies as well.

I have saved several mission-critical FAANG projects this way, and even my managers agreed that it was a huge success despite the general resistance to rewriting large projects. It even takes less time than people will assume, because once you factor the new project for confident maintenance without regressions, you become far more productive working on it than anyone could ever be on the old project. You get there sooner than you expect to, and it pays off more than anyone can imagine because they're so used to the problems of the old project.

I'd also like to add that while a bad project can limp along for many years, it faces a different kind of problem good managers should fear. Only a very capable engineer can maintain such a project with a low defect rate, but they do it with great stress and frustration on their end, because truly poor maintainability hurts even the best engineers. The better the engineer, the more they feel the deficiencies of the project, and the more likely they are to want to leave. An engineer like that MAY be able to pull off a rewrite, but if you ban it, they're more likely to leave than play along.


> Often it's because leadership prioritized deadlines more than quality, and promised to "fix it once it's launched", but then nobody was confident changing something that sorta kinda worked. If you don't invest in regression testing from the start, it'll always be too risky to add later. At that point, you may as well build a new project factored for safe maintainability including its own regression testing that will pay off forever, including testing that it does not regress on any use cases you can reproduce from the old project. You have to do something like this to break the deadlock, so it may as well fix other deficiencies as well.

I'm not sure what you mean about it being too risky to add regression testing later?

It sounds like one of your first steps in the rewrite is building such a test suite for "any use cases you can reproduce from the old project" but in my experience it's hard to make the rewrite-or-not decision until you have that. Only after the investigation to understand the features and behavior do you have enough info to know if you can fix it in-place or not. (And "in-place" may still be effectively "complete rewrite", just one module at a time in a single codebase instead of in a new one.)


> I'm not sure what you mean about it being too risky to add regression testing later?

Sure, let me expand on that. It's uncontroversial that code has to be factored for testing; projects don't usually end up easily testable by accident. You can't necessarily write a testing harness around a project as a black box: how repeatable is its output at all, how easily can you add new inputs and expected outputs, how long does it take to run, how many other dependencies does it have and how easy are they to isolate, etc.

Some types of projects are more amenable to this than others. In general, if it is a CLI with some input files and some output files, and the output is always deterministic from the input, then regression testing is as simple as collecting some representative inputs and outputs. Services with stateful APIs and multiple backend dependencies are just about the opposite of this, and require a lot of engineering to harness at all.

I have inherited a number of HTTP and RPC services where the outputs were not repeatable at all, for a mix of good and bad reasons. For example, generating random IDs as part of the output sounds like "best practices", but it makes it much more difficult to test a chain of related requests (where the IDs have to be consistent throughout) and assert their responses. At best, it requires a lot of test driver code which is itself hard to maintain. Even including timestamps can interfere with this, though that's generally easier to mock out than random IDs are.
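
One common way out of that is to inject the ID generator and the clock instead of calling them inline, so tests can pin both; a minimal sketch with invented names:

    import time
    import uuid


    class OrderService:
        """The sources of nondeterminism are injected; production uses the defaults."""
        def __init__(self, new_id=lambda: uuid.uuid4().hex, now=time.time):
            self._new_id = new_id
            self._now = now

        def create_order(self, item):
            return {"id": self._new_id(), "item": item, "created_at": self._now()}


    # In tests, pin both so a chain of related requests is fully repeatable:
    ids = (f"order-{i}" for i in range(1000))
    svc = OrderService(new_id=lambda: next(ids), now=lambda: 1_700_000_000.0)
    assert svc.create_order("widget") == {
        "id": "order-0", "item": "widget", "created_at": 1_700_000_000.0,
    }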

That's to say nothing of how many projects think it's fine to create a bunch of global state, often including global "helper" threads that can't be explicitly controlled by anything else. These either have no effect on testing and thus aren't actually tested in any meaningful way, or they have strictly bad effects and make things less deterministic and hygienic.

So if you want to refactor a project to be more deterministic, have fewer and simpler dependency edges to mock out, have no global state or background threads, etc. then that can add up to a huge refactor that, in itself, risks subtle regressions which you don't yet have regression tests to catch.

That's what keeps happening. You can't have thorough regression testing until it's hygienic and deterministic, but refactoring it to be hygienic and deterministic risks regressions that you aren't ready to test for.

Breaking this cycle requires creating deterministic regression tests that you can't yet run against your existing project because it's not deterministic yet. One way to do this is by manually capturing a bunch of real examples (using real data, to cover the edge cases that synthetic tests miss), making a deterministic version of each, and making that a test case for the new version of the code. Then you rewrite or refactor the code until it's deterministic enough to pass these tests. If the existing implementation is a tangled mess, and it usually is, a rewrite can easily be more practical than a refactor.
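
A rough sketch of that capture-and-replay shape using pytest (the fixture layout and handle_request are assumptions for illustration, not the commenter's actual harness):

    import json
    import pathlib

    import pytest

    from myservice.api import handle_request  # assumed entry point of the new code

    CASES = sorted(pathlib.Path("regression_cases").glob("*.json"))


    @pytest.mark.parametrize("case_path", CASES, ids=lambda p: p.stem)
    def test_replays_captured_case(case_path):
        # Each file holds one sanitized real request and the response we expect,
        # with random IDs and timestamps already replaced by fixed values.
        case = json.loads(case_path.read_text())
        assert handle_request(case["request"]) == case["expected_response"]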

I contend that it's not even enough to make it pass each case individually. To prove that the logic is actually hygienic, you have to engineer it to pass all cases as if they were real parallel transactions in a single instance of the system, which also limits the test case design because they have to be independent enough in that dimension while still being inter-dependent enough to prove the corresponding logic works.

I've done all of that and more for a single complex backend with lots of mutation APIs, many having pages of business logic, operating on over a dozen different entity types with subtly different storage- and API-facing schemas, using real requests and responses running in parallel with real threads. It's a huge investment, but after building the test driver itself, all code added to the project has been extremely easy to test and not a single regression has made it past this testing. As much work as this was, it took less work than some individual small refactorings did to the original project, and now it just pays off every day forever. This is not an isolated example, but it's my best and most recent.

> It sounds like one of your first steps in the rewrite is building such a test suite for "any use cases you can reproduce from the old project"

Yeah, but one of my points is that you can't even run this against the old project if it's not deterministic and hygienic yet. It's a set of potential test cases for a future refactor/rewrite that's more testable. Once you start down this road, you have to push through to completion, because every change made to the untestable old project is another way the test cases can fall out of sync and risk the new project diverging.

> but in my experience it's hard to make the rewrite-or-not decision until you have that. Only after the investigation to understand the features and behavior do you have enough info to know if you can fix it in-place or not.

I totally agree, and that's why this might only be possible a year or more into the project. I don't think anybody should attempt this when they first take over a project. I do think people should be free to make this call once they're familiar with a project, and leadership outright banning rewrites out of principle cannot possibly help with that.


I was just on such a project. Output was via a custom binary format sent over BT. Commands were generated through a WPF app. The code did not follow separation of concerns or MVVM at all. Just trying to disentangle Bluetooth from WPF was a major task with several unforeseen side effects (e.g. newer code was just faster, which crashed the receiving BT device...).

Ultimately the contract for me was axed because the root cause (bad leadership) just did not understand that bad code needs fixing before velocity can go up again.


The crux of the classic Spolsky piece about avoiding rewrites is that you’re probably not “smarter” than the original team. True enough.

But, sometimes you are. As a well known example, someone rewrote http/urllib in Python as requests and urllib3. Better on perhaps every single metric, often substantially. Ban rewrites and it wouldn’t exist.

My own example, a recent modernization of an ETL project. Original code was a rickety, amateurish, impenetrable mess written in Java and transcribed into Python. Loops over loops over loops, ten layers deep; I would often get lost tracing a bug. Author didn't understand argparse or logging but used them everywhere. Author didn't understand absolute paths, so there were a dozen and a half custom path-join functions sprinkled around the codebase because std path.join didn't work with relative paths starting with a slash.

For new feeds, over two years I built a brand new streamlined (half the code), dare I say elegant, replacement with tons of quality-of-life and efficiency features: shell completion, syntax highlighting for interactive logs, and conditional download (304) handling. The headline feature: encapsulation of the entire model-update process in a single manager method that does almost everything that previously took bespoke code written forty separate times in the old classes.

As the old feeds are going away all the crap is inching towards /dev/null. Can’t wait for the celebration. Gonna crank Kool and the Gang, I tell ya. :-D


You've highlighted a difference in rewrites I think is worth recognizing.

urllib -> requests & urllib3 is an example of a rewrite that does not preserve the original interface, which means it can fix more things but it has to convince people to adopt it for that to matter. More subtly, it means the old version still has to exist and still be maintained in some form. Best case, the old version wraps the new version, but even that can be a pain and in some cases can even limit the evolution of the new version.

Other kinds of rewrites preserve the old interface in all valid cases. They may be a lot more difficult because of their constraints, but the constraints also free you from having to make design decisions. When you're done, adoption shouldn't be a problem at all, and hopefully nobody ever has to maintain the old code in any way.

That's a great story of a much needed rewrite. Enjoy the well earned permanent retirement from dealing with that.


If you keep substantially the same interfaces, I’d call that a big refactor rather than a rewrite, though it gets blurry at some point.


> "let's create a brand new service where we will keep everything clean

This is maybe the core delusion here.

Somehow people think that the next project they start from scratch will have clean well factored code, even though empirically all their previous projects have not.


The first version evolves over time to new requirements, which introduces cruft as you can never prepare your code to be extended to every new requirement. Subsequent versions/rewrites benefit from that history. Using the existing version as a complete spec, and the history of evolutions as a clue to how to set up extendability, gives re-writes a distinct advantage in producing better code.


Every single time. The code base is what it is because of the business requirements, management, priorities and timescales demanded.

Any new system will end up in the same place very quickly.

Sometimes I've even seen management in on the delusion and promising "this time is different" to the devs.


Don’t call it a refactor if you want to convince biz, call it laying a foundation for a massive speed boost


I find it strange that the string "test" doesn't appear anywhere in the post.


This article attempts to contribute something novel and rarely discussed, namely how to organize a long-term refactor within a team. There are already many great posts and books focusing on the code part.


Mod parent up!


this is a good post, but there's missing context about the kinds of situations where this approach can work, and what distinguishes them from situations where this kind of incremental long-term refactor approach is unlikely to be feasible.

contexts that could make this approach more feasible:

- it's technically feasible to incrementally transition from the current state to the target state. there's a variety of safe stopping points where the system can be in a weird hybrid partially-refactored state, and continue to function, pass QA and ship.

- the changes that need to be made in order to transition from current state to target state are internal to a single component owned by one team, and don't require changes to any external APIs / interfaces, or large-scale coordinated migration activities by internal/external consumers or dependencies or users


Good article - we've helped tons of teams complete long-term refactors at my startup (grit.io) and this approach definitely works better than attempting any big bang migrations.

Having a small and representative example PR is huge for both demonstrating how to do the change and for getting others on board.

The one thing I would add is to set up some basic progress monitoring (cases/files remaining). It's motivating to have a graph tracking progress over time, especially when building momentum early on.


s/refactor/rewrite/




