I've definitely lived with the zombie flags problem. Teams ship experiments that double the size of a piece of code, but never go back to refactor out the unused code branches. In shared codebases this becomes a nightmare of thousands of lines of zombie code and unit tests.
This is a social problem as much as a technical one: even if you have LaunchDarkly, Datadog, etc. making it very clear that a flag isn't used, getting a team to prioritise cleanup is difficult. Especially if their PM leaned on engineers to make the experiment "quick n dirty" and therefore hard to clean up.
At The Guardian we had a pretty direct way to fix this: experiments were associated with expiry dates, and if your team's experiments expired the build system simply wouldn't process your jobs without outside intervention. Seems harsh, but I've found with many orgs the only way to fix negative externalities in a shared codebase is a tool that says "you broke your promises, now we break your builds".
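For illustration, the gate itself doesn't need to be fancy. A minimal sketch, assuming a hypothetical JSON registry of experiments checked into the repo (this is a reconstruction for the sake of the example, not our actual tooling):

```typescript
// check-flag-expiry.ts -- hypothetical CI gate: refuse to build if any
// experiment in the team's registry is past the expiry date agreed when
// the flag was added.
import { readFileSync } from "fs";

interface ExperimentFlag {
  name: string;
  owner: string;    // team on the hook for cleanup
  expires: string;  // ISO date, e.g. "2024-03-01"
}

const flags: ExperimentFlag[] = JSON.parse(readFileSync("experiments.json", "utf8"));
const expired = flags.filter((f) => new Date(f.expires) < new Date());

if (expired.length > 0) {
  for (const f of expired) {
    console.error(`Expired experiment "${f.name}" (owner: ${f.owner}, expired ${f.expires})`);
  }
  console.error("Remove these flags (or renegotiate the expiry) before builds will run again.");
  process.exit(1); // the "we break your builds" part
}
```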
This is the only way, more or less, to enforce any code standards, whether it's refactoring, quality, testing, docs, etc.
If it doesn't break the build (or do anything else that stops things from moving forward), there will always be external pressure to get things out "ASAP", even though in most circumstances "ASAP" isn't actually required. If you can't outright stop what's happening, it becomes exponentially harder to enforce anything.
I agree with you, entirely, but I've had developers fight tooth and nail when the build breaks. "We need to get out there ASAP!" is definitely the cry, and usually "compromises" are made, such as "what if we made the overall CI run not fail if this test fails?" — which is as good as killing the test, IMO.
The impetus is necessary, or the problem is ignored.
The devs are really just proxies for the stress a PM is inappropriately pushing, though. But they are paid to not understand this problem, so getting them on board is impossible.
Agreed, but at least with the override mechanism being as visible as the test suite, you've ensured that more people will know about it (bringing it "to the surface"), and you might even have the override in source control (documented). These aspects increase the chances that someone will say "Let's just follow the process."
In our case, the override mechanism caused its own confusion: "why does my CI run fail on [the security test]?" people asked, when the security test was specifically set not to cause the larger run to fail. The security test would still red-X itself (to visibly indicate that it was, in fact, failing), but not block the entire run.
But people nonetheless went: "the run failed" -> "that test failed" -> "why is this test failing?" (There was a second failure in the run that was the actual reason the build, as a whole, failed, but that was blindly missed.)
That triggered a large discussion, the "compromise" of which was to have the security test green-check itself on failure, so as not to "confuse" people.
And so now it is truly invisible.
(I've actually turned it back to the "red-X but don't block the larger build" mode since then … but it still causes confusion. I do not know how to further help people who cannot understand the output from a build that has two failures, one of which is failing on master which is on the whole green, and one of which is only failing on your branch.)
The big issue I ran into with zombie flags is that new features were always prioritized over cleanup. Engineering could "fight for time" to get things done, but there were always other priorities that needed to be addressed.
No tool you have will solve that; whoever owns the product team's time allocation needs to be on board with the idea of cleaning up old code.
Shops could have people dedicated to such tasks rather than the usual feature work, the same way we now have devops teams, or even devex teams. For example, I actually enjoy refactoring, cleaning up, and tending the garden. And I am sure I am not the only one - but the pressure is always to deliver features, so housekeeping is rarely done.
It's a matter of dev team culture: always leave the code cleaner than you found it. Hire people who understand this, and remind each other through reviews.
As for external pressure from shareholders to skip cleanup: it's better to not describe cleanup as a separate step, but as just another part of feature implementation. It is a fact of software development that work on a feature must be done both before and after that feature is first available in production (whether it's to ensure that all is fine with a finger on the rollback button, to set up relevant production monitoring, or to clean up any scaffolding that was required for the release), and it is possible to have features that are available in production but are not yet _done_. Finishing the work is not optional.
It runs counter to the pressures a developer faces. I've at times been told that tech debt is fine because products only last 3-4 years before a replacement gets made. When you're on that timeline who cares if you've cleaned up after yourself?
You also see this corp-speak from devs, although usually in the form of "I plan to get hired somewhere else before we need to pay this tech debt down."
> products only last 3-4 years before a replacement gets made
Is that your experience?
I imagine it can be a kind of self-fulfilling prophecy, but even then I can't imagine people replacing everything every 4 years. And if that's not your experience, then the point is moot.
That's how long it takes before the person who signed off on it has moved to another job and made the product someone else's problem. At that point the new person will usually kick off a new project, because the old one is bad and releasing a new product is better for their chances of promotion.
I worked at a software shop with very high retention rates (like, 30 years, 10 years on average), and the inverse can also be an issue, "I own this product and it's my problem not yours."
Having seen both situations, I personally believe that it can also be that a new project comes to exist simply because the old one is too complicated to understand; some things you need to work through in order to get.
Someone here on HN said recently, "people forget that the primary job of the software engineer is as a learning agent for the org" or similar, and the more I see, the more I believe it. I used to think it was all about efficient automation, but I'm not so sure anymore.
But isn't spam like that exactly the sort that will ultimately get completely ignored? I probably get a dozen or so such barely-relevant internal emails a day where I work, and have learned how to recognize them by the sender and subject line, and ignore them.
A "leaderboard" email would 100% be one that I ignore as a time-waster.
No because the whole org sees it and the org head can tell your team's manager to get the house in order. We did this at my last company, albeit with migrations rather than feature flags. Same idea.
I believe I read about it in a book (perhaps Software Engineering at Google) in the context of test coverage; using a leaderboard for gamification.
Ahh, so the real target for such an email is management rather than the rank and file? That makes more sense. But surely it would be better to send the email just to those people and put the data up on the company intranet for those devs who are curious.
The point that I was making is that email might not be the best way of distributing this generally, because in many companies (and most of the larger ones), there is a high rate of companywide emails that amount to just spam. This increases the chances that these particular emails will get mentally categorized the same way and ultimately ignored.
I know that I delete unread about 80% of the internal company emails I get because they're not actually useful, and have better uses for my time.
An email like this, assuming that my own projects are not often mentioned in them, would rapidly get ignored, I imagine.
If the goal is to have the affected teams alerted to the situation, it seems like it would be better distributed directly to the teams individually (a targeted email just to the teams involved) rather than spammed to all of the devs in the organization.
> Especially if their PM leaned on engineers to make the experiment "quick n dirty" and therefore hard to clean up.
The PM then just needs to say that they don't have the time to clean up because they have shit to do. And there are folks out there who pretty much live for this, and are usually well regarded by management because they're management's executioner: no matter how shit the idea and execution, they'll get it rammed in.
Our feature flags tend to be "staged deploy" feature flags, and a message hits an internal Slack channel when they reach 100% availability. This usually triggers someone to queue up a ticket for "rip out the old code".
The tricky requirement that ends up existing and torpedoing attempts to clean up feature flags is a requirement for long-term holdback.
e.g. "Test was successful so it's rolling out to all users, minus a 0.5% holdback population for the next 2 years"
This then forces the team to maintain the two paths for the long term, and in the meantime the team might get re-orged or re-prioritize its projects a year later, making the cleanup really hard to eventually enforce.
> "Test was successful so it's rolling out to all users, minus a 0.5% holdback population for the next 2 years"
Man, I couldn't imagine being a user in such a situation. "Oh, I guess I'm just not getting the better functionality?" Even worse if I were a paying customer.
It’s actually usually the paying customers asking via support to be added to the holdback, improved experience or no.
This is more true for larger flags that substantially change the experience and may not implement niche or edge-case functionality. Obviously you want to avoid these kinds of tests if possible but it’s not always possible.
I've seen that easy enough to address with a frequent review (quarterly, per PI, monthly, etc). If you're operating in some methodology that has a consistent cadence, it should be manageable but you do have to be deliberate about it.
Totally agree that it is a social problem as much as a technical problem. This is one reason why I had the thought here of FM tools starting to own some "feature management" jobs that aren't typically placed under the devops umbrella, and may be more of interest to product or marketing stakeholders. Perhaps that would do something to help with the issue of getting buy-in to do the maintenance.
This is not a complete solution, but it seems to me an aspect of the solution is similar to the way the programming world over the past 5-10 years has been acknowledging that dependencies carry a certain cost with them that must be accounted for. Feature flags do too. If you account for them as just the in-the-moment costs of adding a flag for something, then you are grotesquely underestimating their costs.
Personally I tend to resist them, for much this reason. I don't mean that I never use them and you can't find any in my code, but I resist them. They need to prove their utility to me before I add them, in much the same way I tend to make dependencies prove their worth beyond some momentary convenience before they are allowed in. There are times they leap that bar, but I think that generalized resistance has helped keep the code bases in better order than they otherwise would be. I've seen other teams who did not resist and they've developed a proliferation problem.
I worked on a team of hundreds that developed and maintained a vertical-market enterprise app for a few thousand client companies, probably more than 100,000 end-user seats, but fewer than 500,000. My small-sample-size (n=1) observation is that the developer organizations least able to manage feature flags are the ones most likely to buy into such a magic-pill cargo cult solution.
If your software has accumulated or is built to support numerous independent client organizations, it almost certainly has features that are not used by all users, and thereby the software has implicit feature flags embedded in the data that it is already processing. Regardless of whether those feature flags in data work well or work poorly, why in the world would you want to add a second feature-control subsystem? Because it is meta-programming, I suppose, and we all know that meta-programming just adds another level of power to everything, and your first feature-control system may be a little hard to disentangle, and you can make feature-flags work by having the meta-programming done by a select few who really know what they are doing, and it will be a worthwhile challenge, and even if it doesn't work you will learn a lot, and it will look good on your resume, and give everyone a few good laughs when they realize what they were trying to do.
These are real problems but not insurmountable. I think the author does an excellent job of laying out the problem and has pretty decent solutions in mind.
I caution strongly against the proposed solution of failing CI if zombie flags are detected. CI should ONLY fail if there are changes in the branch that cause the failure. Detecting zombie flags (e.g., this branch contains a flag which has been turned on and untouched for 90 days) is setting a CI time bomb. Find another way to alert developers to the zombie instead of failing good code at CI time.
The CI time bomb makes developers' lives worse. You're trying to get a new feature released, and some feature that everyone has forgotten about breaks the build and stops you from getting your feature out now.
If the stakeholders won’t allocate time in the roadmap to remove old feature flags then you shouldn’t use feature flags.
My biggest problem with 3rd party feature flag setups is that I have high expectations for them and it is technically difficult to meet all of them (a rough sketch of the shape I'd want follows the list):
- local/static access: it should not have to call out to a 3rd party server to get basic runtime config
- unused-flag detection: flags should have three reported states: never used, recently used, not recently used. These will be different from the user-controlled states of active, inactive, etc.
- sticky a/b testing: should follow the logged in user until the flag is removed
- integration with logger: I should be able to use it with my logger out of the box to report only relevant feature flags. Alternatively can provide a packed value of all relevant flags, would probably have to do flag state versioning.
- integration with linter: should warn me if flag has not recently been used or I used a flag in the code that is not in our database (alternatively, will upsert the flag automatically if it doesn’t exist)
- hashed flag names on frontend build: prevents the leakage of information; not a perfect solution, but I would want to avoid writing "top-secret-feature" where we can.
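To make that concrete, the shape I'd want is roughly this (a hypothetical interface, not any vendor's actual SDK):

```typescript
// Hypothetical client shape covering the list above.
type UsageState = "never-used" | "recently-used" | "stale";

interface FlagClient {
  // local/static access: resolved from an in-memory snapshot refreshed in
  // the background, so the hot path never calls out to a 3rd-party server
  isEnabled(flag: string, userId?: string): boolean;

  // reported usage, separate from the user-controlled active/inactive state
  usage(flag: string): UsageState;

  // sticky A/B: the same userId always gets the same variant until the flag is removed
  variant(flag: string, userId: string, variants: string[]): string;

  // packed, versioned snapshot of the relevant flags to attach to log lines
  snapshotFor(userId: string): { version: string; flags: Record<string, boolean> };
}

// Sticky bucketing can be as simple as a stable hash of flag + user:
function bucket(flag: string, userId: string, buckets: number): number {
  let h = 0;
  for (const ch of `${flag}:${userId}`) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // cheap deterministic hash
  }
  return h % buckets;
}
```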
I fully acknowledge that a lot of solutions come close, but I haven’t looked at the current state of things in the last few years so it may have improved.
I think a lot of the solutions come close but don't quite get there. It seems like there's kind of a divide between the open source solutions that are probably more sensitive to the day-to-day pain points of developers and the bigger managed service players that seem to be optimizing for contract size.
Hadn't thought of the frontend build hashing idea - like that a lot
I've got a few friends that work at LaunchDarkly, and from what I can tell, they've got a very good handle on the challenge. Better than the equivalent vendors in my business, anyway. I've had some great talks with the LD people, even though, strictly speaking, I don't get my paychecks from programming, per se.
What brought me into the talks was that the feature flag problem is of a similar scope to the central one faced by CCSs (component content systems). By definition, a CCS requires the content equivalent of feature flags, implemented in a variety of ways, depending on... lots of things. That problem is this: both transclusion and conditionals necessarily couple the content to the business or product architecture. Ergo, when the product architecture goes bananas, so does your content system, and you find yourself with documents that aren't meaningful in a linguistic sense, or which just break the processor. This occurs in the content context because the natural language of a unified document is replaced in a CCS with the product or business architecture: how the document is chunked, what business needs the conditions satisfy, at what support level document deliverables are composed. In a code context, the constructed syntax of the programming language is getting chopped up by conditionals driven from the business side; there's even more variance here regarding how code interacts with business.
So not the same problem, but the same class of problem: regular rules that have to integrate with non-regular, non-linguistic business rules.
I have a tiny chip on my shoulder regarding CCS systems, because I have seen so many years flushed down the "re-use craze" by businesses that had zero business trying to re-use anything. Feature flags are somewhat in the same bucket - a lot of things that a business wants to use flags for should really, really, really be built into the code or abstracted away - but of course a programming language has far richer ways to deal with bad abstractions than a markup language does. Which of course can be a double edged sword.
I took one look at their API SDK and could not make heads or tails of it. Not to mention the liability of their service being down or slow one day, or somehow mocking them in our existing test suite. We let our juniors write our feature flag code in less than a day using very simple ORM/SQL and it's just worked.
> We let our juniors write our feature flag code in less than a day using very simple ORM/SQL and it's just worked.
Feature flagging itself is not that hard, it's all the surrounding features that are hard. I can whip up a database engine in a couple of days, but to make it reliable, full of features, and whatever else, will take years.
I work more in the firmware space, so my experience with feature toggles is always with half-baked tooling and limited ability to change deployed products. We do use continuous development within the organization, so there is still a lot of applicability, but it's always interesting to see the way similar problems get addressed in a higher-level and more online environment.
That said, I'm surprised this article doesn't mention the two words that always come to my mind when I see toggles: combinatorial explosion. Several times I've worked on projects that went way too toggle-happy and decided that new functionality should be split into indefinite-life "features", just in case the company someday wants to sell a model without that feature. Of course, when an old toggle finally gets turned off a year later, you realize that it crashes the system because several other features kind of half depend on it.
> decided that new functionality should be split into indefinite life "features"
Yeah, once you do that you have settings, and not feature flags anymore.
Adjustable settings do come with a high risk of combinatorial explosion. Ideally, you separate the system functionality to control this problem, but that's not always possible.
We're attempting to address some of these problems at https://www.flipt.io/gitops. Having your flags defined as configuration and committed to the repository opens up a range of possibilities in terms of static analysis.
Additionally, we've got a prototype static analysis tool for finding calls to our feature flag clients in both Go and Rust.
Hadn't seen this, looks very cool! The static analysis piece seems difficult, but even considering that I've been a little surprised not to see more attempts.
Yeah, it is surprising not to see more attempts out there! GitHub's Tree-sitter sits at the core of our attempt. Definitely feels like the right tool with the right potential. We plan to open source it sometime soon.
as a pm, there's a whole set of jobs that occur post-rollout that have often been poorly handled at companies i've been at. those include packaging, customer operations like allow-listing long-lived features for certain companies, optimization of bundles, etc.
when we've built our own homegrown system, it's opaque and often neglected. when we've used feature flag tools, we co-opt them to do things they're not meant to support (e.g. persistent toggles in admin panels) so end up with complexity in the code and in operational processes around it.
agree wholeheartedly with points in this article ... there are issues with how we manage flags generally, but we also bias towards assuming that once a feature is live, we can and should move on -- the feature is now persistent, part of a package, and it won't change frequently or ever.
the reality is the feature lifecycle takes on a very different shape, and, at least in my experience, current FM tooling isn't built to accommodate that.
Can you describe the different shape? What does it turn into? Once live I clean up the FF. But, may introduce new ones as the now-live feature gets tweaked.
At my work, I have a somewhat clever (or idiotic) technical solution to the problems of feature flags: they are actually implemented as feature modules that monkey-patch the base application in runtime.
There are a few benefits: removing features is dead simple, just delete the whole feature module, and there's no conditional branching in the base application.
There are some drawbacks too: the base application must have entry points for the feature modules to overwrite. Usually the default values are no-op or some default behavior. Features also must implement setup and teardown, which can take longer to write than a conditional.
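A minimal sketch of the pattern, with illustrative names (the real thing has more plumbing around module discovery and ordering):

```typescript
// The base app exposes entry points with no-op defaults; a feature module
// overwrites them on setup and restores them on teardown. Deleting the
// feature means deleting its module -- the base app never changes.
type Hook<T> = (input: T) => T;

class App {
  hooks = {
    decorateSearchResults: ((results: string[]) => results) as Hook<string[]>, // no-op default
  };
}

interface FeatureModule {
  setup(app: App): void;
  teardown(app: App): void;
}

function newRankingModule(): FeatureModule {
  let previous: Hook<string[]> = (r) => r;
  return {
    setup(app) {
      previous = app.hooks.decorateSearchResults;
      app.hooks.decorateSearchResults = (results) => [...results].reverse();
    },
    teardown(app) {
      app.hooks.decorateSearchResults = previous; // restore the prior behaviour
    },
  };
}
```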
That sounds kind of complicated to reason about and thus pretty dangerous for feature flags. At my work, we use them to do partial rollbacks in case the new feature/behavior has unintended consequences. Deployments might take weeks to complete globally, so doing a hotpatch or code rollback is incredibly expensive compared to toggling a feature.
The zombie flags are a huge problem. Management is always pushing for feature completion, and it's done - behind a feature flag. The complication is that they never want to allow time to remove all the dead code paths later, which leaves you dependent on all sorts of things: imports, libraries, maybe even connections. One day they inevitably find out something is "still in prod", get curious, and don't understand why it's still there. Well, feature flags require more TCO, period. They don't want to give you more time, though.
Couple of simple ideas for the zombie flag problem:
- When adding a flag, immediately file a bug to remove the flag by a certain date. Enforce this in code review. The bug count will surface the problem to management.
- When a flag is past its due date, start firing non-fatal incidents. The incident count will also surface the problem to management (rough sketch below).
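A rough sketch of the second idea, with the alerting call standing in for whatever incident tooling you already have:

```typescript
// Hypothetical flag registry entry plus a periodic job that files
// non-fatal incidents for anything past its agreed removal date.
interface RegisteredFlag {
  name: string;
  removeBy: string; // date agreed in the cleanup bug
  ticket: string;   // the bug filed when the flag was added
}

function reportOverdueFlags(flags: RegisteredFlag[], reportNonFatal: (msg: string) => void): void {
  const now = Date.now();
  for (const f of flags) {
    if (Date.parse(f.removeBy) < now) {
      // Nothing breaks, but the incident count becomes a number management
      // can watch trend in the wrong direction.
      reportNonFatal(`Flag "${f.name}" is past its removal date (${f.removeBy}, see ${f.ticket})`);
    }
  }
}
```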
Heh. I've been there and tried to do this with feature flags and a handful of other tech debt work.
It can feel good to make people do their chores, but you can also burn a _lot_ of bridges by forcing relatively minor maintenance work to be high priority like this.
Maybe this is just a terminology mismatch, but when you say "incident" that's been a pretty urgent process everywhere I've worked. Other people are talking about breaking CI, which just sounds miserable.
Raising awareness to management is great though. Especially if you can quantify it in terms of things that matter to them more than "tech debt". Like startup latency or a direct dollar cost.
My story of when bad feature flag hygiene resulted in a real technical problem is when our Redis kicked over one day. We had good monitoring so it was easy to identify the problem: network was saturated at 1 GB/s.
I traced the problem back to the fact that we had 100+ feature flags that were fully launched, but still loaded into the backend whenever "all feature flags" were loaded for a team. The implementation returned all team IDs that had each flag enabled, and some flags had multiple thousand IDs in them.
So 100+ flags, many with 2000+ int entries.
We ended up quickly shipping some code to mark features GA, so they wouldn't be loaded from Redis. Cut usage by 99% instantly.
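Roughly what the fix amounted to; the key layout and client here are illustrative, not the actual code:

```typescript
import { createClient } from "redis";

type RedisClient = ReturnType<typeof createClient>;

interface Flag {
  name: string;
  ga: boolean; // fully launched for everyone
}

// Flags marked GA no longer need their (potentially huge) per-team
// membership sets pulled from Redis on every load.
async function loadTeamFlags(redis: RedisClient, teamId: string, allFlags: Flag[]): Promise<Set<string>> {
  const enabled = new Set<string>();
  for (const flag of allFlags) {
    if (flag.ga) {
      enabled.add(flag.name); // on for every team, no Redis round trip
      continue;
    }
    if (await redis.sIsMember(`flag:${flag.name}:teams`, teamId)) {
      enabled.add(flag.name);
    }
  }
  return enabled;
}
```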
I work at truckstop.com and I came up with a way of managing feature flags that isn't madness. First, I used the feature flags in conjunction with module federation. Then I create 3 flags per product: alpha, beta, rc. They look something like this: mfe-load-search-alpha. The flags are managed by split.io and then tied to a federated endpoint deployment. Which flag gets loaded is determined by a router factory that selects the route with the correct federated endpoint based on the splits. That effectively allows me to decouple a deployment from a release.
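The router factory boils down to something like this; the flag names are ours, but the remote URLs and the flag-check callback are simplified stand-ins for the split.io client:

```typescript
type Stage = "alpha" | "beta" | "rc" | "stable";

// Each stage maps to its own federated endpoint deployment.
const searchRemotes: Record<Stage, string> = {
  alpha:  "https://cdn.example.com/search/alpha/remoteEntry.js",
  beta:   "https://cdn.example.com/search/beta/remoteEntry.js",
  rc:     "https://cdn.example.com/search/rc/remoteEntry.js",
  stable: "https://cdn.example.com/search/stable/remoteEntry.js",
};

// isOn wraps the split lookup; the highest enabled stage wins.
// Releasing is just flipping a split, not redeploying the host app.
function resolveSearchRemote(isOn: (flag: string) => boolean): string {
  if (isOn("mfe-load-search-rc")) return searchRemotes.rc;
  if (isOn("mfe-load-search-beta")) return searchRemotes.beta;
  if (isOn("mfe-load-search-alpha")) return searchRemotes.alpha;
  return searchRemotes.stable;
}
```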
Probably my best story of “zombie flags” was when this guy accidentally deleted a production table. We disabled the feature flag, disabled some code that had been written after it was turned on and expected it to be on, then restored the table from a PIT backup. Finally, we reverted the code changes and the feature flag. We were back up in a matter of hours (the table was hundreds of GB, so it took a while to delete and restore). Some customers noticed the option missing from their options screen, but 99% of customers never noticed the feature downtime.
One thing I see missing in this article is another huge cost to these things.
What happens when your homegrown feature flag microservice (because why pay for a hard cost when you can have the soft cost of making your own) goes down, even temporarily?
Sane defaults at code review time, before launch aren't always the sane defaults after a feature has fully launched, or nearly fully launched.
I've seen more than a few egregious outages due to a feature flagging tool being down and taking the user experience back a year or two.
At my job, feature flags (and other configuration) get distributed as static files that are replaced by config updates, so if there's ever a disruption the hosts still have the last valid configuration values.
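The shape of it, roughly (the file path and flag format here are made up):

```typescript
import { readFileSync } from "fs";

// Last known-good values; config updates rewrite the file atomically.
let current: Record<string, boolean> = {};

export function reloadFlags(path = "/etc/app/flags.json"): void {
  try {
    current = JSON.parse(readFileSync(path, "utf8"));
  } catch {
    // Distribution hiccup or bad payload: keep serving the previous
    // snapshot instead of failing every flag open or closed at once.
  }
}

export function isEnabled(flag: string): boolean {
  return current[flag] ?? false;
}
```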
It's homegrown. This paper talks about it: https://research.facebook.com/file/877841159827226/holistic-...
It also mentions Gatekeeper, which is the rule-based engine I mentioned built on top of it, but there are other feature flag solutions that use configerator for different use cases like killswitches or gradual rollout.
It seems like one of the biggest problems is prioritizing the cleanup of old flags. I know some companies have developed tools like Piranha[0] to automate this process and a few of our customers at grit.io have used it for that as well.
Would love to hear if others have had success with automated flag cleanup.
I worked at Bloomberg for an extended period. Feature flags are used heavily there: more than 10k flags added per month across all code. They came with ample management systems to enable/disable, roll out, and check for complete rollout. Various techniques (shared memory, caching) were used to drive down lookup time.
Removing them came down to team discipline.
Ideally, a Google-like clang analysis of the code would find flags ready for removal and alter the code to remove the old code path. Recall that Google used tools like this to update or migrate deprecated API calls.
Bloomberg, however, never got there. Instead you'd just get various alerts.
Modern feature flag tooling (e.g. LaunchDarkly) covers most of the uses here. It'll even tell you whether flags are useful or not (if you push evaluation data back upstream).
Good point. It's possible the real issues have more to do with price point/positioning and product UX.
My experience with LaunchDarkly has been that a lot of these hygiene-related features exist only in their top-tier enterprise plans, and even below that point the cost of the tool starts to draw attention.
On the product UX side, I've found these tools are designed for engineering/DevOps users but (whether by design or not) used by product, success, and some ops users as well.
If you want a repeatable task done properly every time, you give it to a computer. In this case, manipulating feature flags is a task for partial evaluation / staging. It's relatively well known in the programming language research community but hasn't made it into mainstream production languages. No amount of social process will ever be as effective.
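You can approximate a crude, manual version of this today with build-time constants and a bundler that does constant folding (esbuild's define option, for instance): the flag becomes a compile-time value and the dead branch is stripped from the shipped artifact. A hypothetical sketch:

```typescript
// Built with e.g. `esbuild app.ts --bundle --define:FLAGS_NEW_CHECKOUT=false`;
// the identifier is replaced with a constant and the unused branch is
// eliminated, so no zombie path ships at all.
declare const FLAGS_NEW_CHECKOUT: boolean;

export function checkout(cartSize: number): string {
  if (FLAGS_NEW_CHECKOUT) {
    return `new checkout flow for ${cartSize} items`;
  }
  return `legacy checkout flow for ${cartSize} items`;
}
```

Proper staging would make this systematic per rollout stage rather than a per-build hack, which is the gap I mean.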