> Prefix your commit message with R for refactoring and C for the other kind of changes
No one would know what R and C mean, and I'd have to teach that to the whole team. Just use the word "Refactor" if you're refactoring as the first word of the commit, and don't use that word for everything else. You can't just start using your own made up abbreviations without getting buy-in from other people who work on the codebase.
This is a good article and I like the general idea it provides. I want to add a small disagreement, though.
> With the Refactoring hat , you are not allowed to change the way the code works.
This is the only part I have a small disagreement with. Sometimes, the act of refactoring it _does_ change how the code works, but it's a change in how the code handled ambiguous or extremely complicated states.
Sometimes refactoring results in streamlining an if-statement / configuration-detail mess and it's the right time to fix it.
It may be way, way, way extra work to put all the bugs intentionally back into the code in the refactor, and you may not even put the bugs back in correctly.
While refactoring, I have to understand what the component does or should do. I read the original code and sometimes the commit history for certain lines. Then I write tests to ensure those assumptions carry to the new code.
More than once, I ended up finding bugs in the old code this way. The ticket said one thing (ticket IDs in commit messages are a godsend), and the code did another thing. There was also a lot of unexpected business logic that was discovered after the first draft of the code, and not documented.
I had the same experience translating text content into calculators. Text can be ambiguous, but code can't. Text had errors that coded logic wouldn't allow.
In most of the examples of refactoring there's a strong distinction made between changing how something works internally and changing its behavior.
The behavior is its external behavior, not internal. As a somewhat trivial example, in Common Lisp you may use an association list like:
'((a . b) (c . d) (e . f))
This is grossly inefficient for some operations, but functions as a map. You realize, "Oh, we have alists with, potentially, thousands of elements or have to search it over and over, it's killing our performance". So you replace that with a hash table to get that nice O(1) performance on a lot of your operations. Your users though, never see a behavioral change because they still access the data using the same methods as before:
(get-value map 'a)
To them, there has been no change, and consequently for behavioral tests there has also been no change. It performs better (so there is a change, obviously) but it does not impact the guarantees you offered users of the unit.
> This is grossly inefficient for some operations, but functions as a map
I believe it's actually faster than a hash table for a decently large n, around the order of 50. Obviously this is going to depend quite a lot on how clever your compiler is.
edit: I was off by an order of magnitude, 500 -> 50.
I'm not able to find if any of the free Common Lisps do any kind of CDR coding[1].
I've never benchmarked the two as I usually either need something small (so use alists) or of large or indeterminate size (so use hash tables). There's a clear performance difference between the two choices at that second case, but since I never hit the "middle ground" I don't know what the threshold is. It probably also varies with each implementation.
Hash tables, though, should be O(1) for retrieval and updates and amortized O(1) for insertions. Alists are O(n) for retrieval and updates and O(1) for insertions (if you don't check for the presence of the key).
If you want performance, you may be better off using arrays in some cases. Instead of an alist or a hash table, make a struct - or even just a cons cell like: (cons keys values). Then have your association function do a linear search for target key on (car table), e.g. via #'position, and use its index to fetch the value from (cdr table) directly.
Similar to the defactoring-then-refactoring mentioned in the article, one approach I find useful is moving things towards point-free style ( https://en.wikipedia.org/wiki/Tacit_programming ), which can discard lots of extraneous layers and plumbing. Then variables and abstractions can be introduced, to avoid the more horrible aspects of point-free code (e.g. `(.) . (.)` sillyness)
> Automated refactorings are fast and safe—even when you don’t have tests.
Depends on the refactor. At work some people tried to do some automated refactoring with Resharper, which changed names of fields of a class, which broke stuff that relied on them for reflection.
I think that's more an argument against using reflection though. If you use reflection, you're effectively changing the rules of the language. No automated tool will be reliable after that.
Great point. We use method/class names in config files that get parsed and dynamically evaluated. You could break the whole application by changing class names as an init config could fail.
Also persisting stuff in customer DBs that depends on class names (and reflection) is nice source of terror. Shipping a new version of your software with a refactored class names corrupts the whole DB of a customer.
We have a whole framework that handles migrations of costumer DBs after updates. But when you don't even know that your changed class gets persistent (because of spaghetti inheritance/type hierarchy) and you do not enter it in the migration scripts, you duck up.
It's generally a good idea for each commit (or even PR) to be isolated unit of work on its own. Makes much easier to review, and isolate regressions.
- refactoring: one commit
- fixing a bug: one commit
- fixing another unrelated bug: another commit
- new feature: another commit
In old version control systems it might have been difficult to do this, but with git it's a breeze given some practice: even if you're in a middle of something, you can always stop working, do a temporary commit or stash; then checkout main branch, create a new branch; do the thing you want to do (refactoring or bugfix) in isolation; commit; then go back to the other branch, and rebase.
There's also git add -p for making partial commits if you did a lot of changes and want to commit only some of them.
This may be unclear. If there is code that is comfortably _sitting in production_, refactoring _that_ should be its own commit or PR.
If you're designing new code and you write something, tear it down and bring it back up, but it's never touched the main branch, it doesn't matter how much you do to it.
In my standard development flow, I commit often, but don't PR, and then at the end, I rebase my whole branch on top of main and shrink it down to one-to-few commits, usually one.
Since people are only looking at the commit set at the end of my work in the pull request, that's the only thing that needs to follow the 'refactor vs new code' separation.
It's very tempting sometimes, and small refactors are fine and normal practice, but if I need a substantial refactor, I tend to do temp commit; refactor; then rebase and reorder commits with git rebase -i. It really doesn't take a lot of time once you do it a few times.
The thing is, once you get a habit of committing often (also unfinished WIP work) in small portions, with minimal effort you can then reorder and squash commits and make the final PR really much easier to review _even for yourself_, and twice as much for everyone else
I’m the same. Refactoring is a way for me to “takes notes” as I learn a codebase.
Version control makes things so easy I typically always commit the refactoring work to mainline first (code review and all) then fix the bug or repeat with another refactor.
It’s also typically faster, because the refactoring alone is a significantly smaller code review than the refactoring+<change>, and I can keep working on the <change> (i.e. bug fix/feature/more refactoring) while that CR gets looked at.
I don’t try to separate every last code change that doesn’t affect behaviour, but I often find it easier to put major refactorings in their own commits. This has at least two advantages: it means you know there shouldn’t be any test changes showing up in the refactor commits, and it means there is less to understand and review in other commits that do intentionally change the system’s behaviour.
It’s not really a lot of extra work to do that once you get into the habit. Tools like `git add -p` to stage partial changes in files can be useful for separating self-contained refactorings from behaviour changes if you realise they’re happening at the same time and starting to get tangled.
What world do you live in where you can refactor without changing tests?
I mean, it sounds like an ideal one but I’ve never lived in such a world. Tests always seem to couple to the way code is currently organized, and I can barely imagine any refactoring change that wouldn’t require some changes in my tests.
It depends on your tests. Your integration tests should never be impacted by a refactor, or it's not a refactor (using the definition that refactors don't change interfaces). The tests need to be rerun, but they shouldn't be changed. Your unit tests and sub-integration tests (integration of multiple units but not the whole) may be impacted by some refactors, though, but that's pushing the definition of a refactor.
Your public interface to your lower level modules/classes are the only things you need to directly test. If that's your smallest unit for testing, then a refactor should also (again, in theory) not impact them because the refactor should not change that public interface, at most changing the internal (private) functions and structure which the existing tests can verify just as they verified the unit before.
If you restructure your units (split a unit into two, for instance) then you'll have to change your unit tests and possibly some integration tests if you instantiated the previous unit directly as part of the test. But that's pushing the definition of a refactor.
I routinely refactor without changing tests, and it’s a primary concern when I define interfaces and determine what to test. For what it’s worth:
- This works best if you actually embrace BDD to test for behavior rather than results. The behavior is less likely to change, and along with points below so too are the interfaces under test.
- I find the least test change in statically typed languages where defining types is the first “test”. Behavioral interfaces become much more stable as you define them if you actually know what they are upfront.
- Both of the terms “integration test” and “unit test” discourage this. Validating what your program actually does usually sits somewhere in the middle, and allows you to encapsulate a lot more of what would otherwise likely change in tests.
- Stricter interfaces fare better than looser. That doesn’t mean give up on the Robustness Principle, but it does mean don’t let it become infectious (I now want to blog about the “color” of functions which accept “options” or other miscellany). The sooner you parse/validate/have concrete and reliable types, the less your code under test and the tests themselves will change.
And that, while it’s my last point, is the most important: narrow types early. Get stability inside the seams of your program.
If you use very low level tests then sure, maybe you need to adjust your tests to match the adjustments in the rest of your refactored code. But since by definition a refactoring does not change the code’s observable behaviour, no tests above that level should be affected. I find that very low level tests tend to be fragile and have limited value in most situations as a result, so I don’t personally tend to use them much, but YMMV.
> Tools like `git add -p` to stage partial changes in files can be useful
Magit is great for this: it shows an interactive version of `git status` (the "status buffer"), with diffs that can be expanded/contracted for each file and for each hunk; we can also selectively stage only a few lines from a hunk.
Agree here. Helps those who follow to separate clean=up/optimisation work from change of function and specific bug fix work. Seems it would totally help code quality over time.
Someone has to read and like your PR to approve it. Presenting your work as a series of simpler steps helps the PR recipient (and you!) mentally verify it.
Not nearly as much work as untangling the changes-that-broke-something from the changes-that-made-the-code-better. If the first commit is behavior-preserving and the next commit is the one that changes the behavior, it's easy to revert the breakage but keep the improved code.
The other choice is to roll the whole thing back or practice forward-fixing. The latter is an entire discipline in itself.
Probably yet-another-canonical-link issue [0] [1]. This has been happening regularly recently (and I realized just now is why couple of my recent submissions got borked). This is in the HTML source:
I assume the submitter can re-edit the link and fix it, if they notice the problem, so perhaps a warning message to double-check after submission can be included?
No one would know what R and C mean, and I'd have to teach that to the whole team. Just use the word "Refactor" if you're refactoring as the first word of the commit, and don't use that word for everything else. You can't just start using your own made up abbreviations without getting buy-in from other people who work on the codebase.