One of our internal experimentation systems handles this by capping rollouts to 95%. So the author is forced to clean up their code if they want a full rollout.
I'm sorry, but I don't entirely get this. Does it mean that after I've finished something I want to merge, I have to remove 5% of that? Or of the entire code base? Or something completely different?
I think they're saying that any new feature controlled by a flag is only rolled out to a maximum of 95% of the user base. To make sure it gets rolled out to 100% you have to remove the flag controlling it (and therefore the old version of the code).
I'm not saying it's impossible, but I can't come up with any scenarios where something would work perfectly at 95% but then break at 100%. What could be some examples of things that break like that?
I saw something like that one time - the new version had one rarely-used broken API endpoint; clients who hit it would silently retry until they eventually hit an old instance which worked, and then they’d be on their way. It was rarely used enough that the total number of retries didn’t trigger any alarms, and it failed fast, so failing 19 times before working on the 20th attempt didn’t cause any timeouts. At 100% rollout there would have been no old instances left to retry against, and the breakage would have surfaced immediately.
We now have monitoring of success-rate-per-endpoint, so even if 99.9% of requests are successful, if a single function crashes 100% of the time we should still notice :)
If you have a marketplace of some sort, and users of your new code (at 95% rollout) cannot see postings made by some subset of other users, but the 5% on the old version still can, you might not notice that the volume of sales for things posted before the rollout has dropped by 95%, whereas you would notice if it had dropped by 100%.
Most of the time this happens, the difference between "95% of people can't see a posting" and "100% of people can't see a posting" would be pretty small, though I suppose if the marketplace is very liquid the gap could actually be substantial. But unless you're Uber or somewhere like that, where there's a big difference between "95% of drivers can't see a ride request (so it takes slightly longer to find a driver)" and "100% of drivers can't see a ride request (so no driver is ever found)", I can't imagine many cases where it makes a practical difference.
This assumes that all types of activity are evenly distributed across all users. In practice, I've seen many systems where a small, small subset of users is responsible for a disproportionate amount of volume, and the odds of all of them landing in the 5% allocation of "old" code are higher than you'd think.
Assuming your rollout is truly randomised, I find it highly unlikely that an absolutely random 5% of your users (re-randomised every time as well) are going to be your downfall. It's still possible, but I'm not giving up on an elegant idea for that reason.
Race conditions can definitely do that, speaking from experience. I’ve fixed some that had been in the source code for a year before they were discovered.
Brilliant. This reminds me of the apocryphal tale where NASA spent millions developing a pen that could work in microgravity… and Russia just used a pencil.
Yup. Completely false. The Russians ended up using that pen too or their own version of it; the Americans started off using pencils but it turns out having electrically conductive graphite shavings floating around is bad for electronic equipment.
The flag management system tracks activity pertaining to each flag. After prolonged inactivity, a flag is considered stale and a diff is generated. This is a heuristic (in the absence of an expiry date for the flag), and the final decision on cleanup is made by the flag owner.
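Roughly, the staleness check amounts to something like the sketch below; the 60-day threshold and all of the names are illustrative, not our actual implementation:

import java.time.Duration;
import java.time.Instant;

// Minimal record of per-flag activity tracked by a flag management system (illustrative).
record FlagActivity(String flagName, Instant lastConfigChange, double rolloutPercent) {}

class StaleFlagDetector {
    // Illustrative threshold; a real system would tune this per flag type.
    private static final Duration INACTIVITY_THRESHOLD = Duration.ofDays(60);

    // A flag is a cleanup candidate when its rollout has settled at 0% or 100%
    // and its configuration has not changed for a prolonged period.
    boolean isStale(FlagActivity activity, Instant now) {
        boolean settled = activity.rolloutPercent() == 0.0 || activity.rolloutPercent() == 100.0;
        boolean inactive = Duration.between(activity.lastConfigChange(), now)
                .compareTo(INACTIVITY_THRESHOLD) > 0;
        return settled && inactive;
    }
}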
A while ago I deleted my Uber/UberEats account (they didn't make it easy, btw) because of how utterly unusable their apps were compared to the competition. After my girlfriend and I spent 30 minutes trying to order food on 3 different platforms, which has been a repeated experience with them for years, I decided it was time to give up. Same obvious UI/UX bugs as over a year ago - I'm actually dumbfounded at how this is possible; do they even use their own app? To sum up, I tend to look at tech coming from Uber through a somewhat sarcastic lens these days.
Can someone point to a good write-up on how to introduce and start using feature flags in a repo? Seems like I just can't find any good resources on this topic!
I'm a Developer Advocate at LaunchDarkly, and we've written a bunch of different guides over the years. For a good intro that discusses first steps with flags and how to approach testing flagged code, see our eBook "Effective Feature Management"[1]. (The download page is right behind the form; you don't need to wait for an email or anything.)
One of the best ways to start is to add a simple kill-switch to low-impact code (more about that in the book; there's also a small sketch at the end of this comment). But I have to ask: what are the main issues blocking you from starting? Is it that you need ideas for use cases, are you worried about breaking something, or do you just want to follow best practices from the start?
We're constantly writing new guides, both as blog posts[2] and long-lived guide articles on our documentation site[3]. Feel free to message me on Twitter[4] if you need more help!
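To make the kill-switch idea concrete, here's a minimal sketch. The FeatureFlags interface and the flag name are placeholders for whatever provider you end up using (LaunchDarkly, homegrown, etc.), not a specific SDK:

// A kill-switch is just a boolean flag wrapped around code you may need to turn
// off quickly at runtime, without a deploy.
interface FeatureFlags {
    boolean isEnabled(String flagKey, boolean defaultValue);
}

class RecommendationsSection {
    private final FeatureFlags flags;

    RecommendationsSection(FeatureFlags flags) {
        this.flags = flags;
    }

    String render() {
        // Default to "off" if the flag service is unreachable, so the page still works.
        if (flags.isEnabled("recommendations-kill-switch", false)) {
            return "<recommendations widget>";
        }
        return "";  // feature disabled: degrade gracefully
    }
}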
If you're looking to try it out in a simple way, you could probably dynamically load different functionality or modules based on environment variables. But I'm also looking for a good resource.
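Something like this, for instance (just a sketch; the NEW_CHECKOUT_ENABLED variable name is made up):

// Quick-and-dirty flag via an environment variable. Fine for trying the idea out,
// but the value can't change at runtime without restarting the process.
class EnvFlags {
    static boolean isEnabled(String envVar) {
        // Boolean.parseBoolean(null) returns false, so a missing variable means "off".
        return Boolean.parseBoolean(System.getenv(envVar));
    }

    public static void main(String[] args) {
        if (isEnabled("NEW_CHECKOUT_ENABLED")) {
            System.out.println("Using new checkout flow");
        } else {
            System.out.println("Using old checkout flow");
        }
    }
}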
This is what we are trying now, but it can't do everything. Some teams use LaunchDarkly, but I'm also trying to figure out what level of abstraction we should aim for, along with things such as testing both branches, etc. Moving into feature-flag territory seems like an important architectural refactor, yet I see very little discussion of best practices there.
Piranha co-author here. I'd say you trust it a bit more than we do, then ;) As the paper and blog post mention, we use Piranha to create diffs/PRs against the original author of the code. Developers are expected to code-review the changes before they are ever landed. As used within Uber, Piranha won't ever auto-land code deletions.
That still saves the time of: a) remembering that the flag is stale and must be removed, b) actually removing the code and putting the diff up. Piranha diffs for a single flag are also generally small enough to fully read them during code review.
We do have comprehensive unit tests for both the Piranha tool and the target codebase(s), and the majority (65%) of diffs generated by Piranha are landed without changes (and the most common change in those that are changed is to delete extra lines that Piranha couldn't prove as stale). Thus far we haven't had an outage caused by a Piranha deletion, but certainly there have been incorrect diffs generated and caught either by CI or manual reviewers, requiring us to update the tool. We would not recommend landing diffs generated by Piranha - or other stale code removal tools - to master without any reviews right now :)
Follow up question: In general, when a diff is generated, is the standard process to just do a manual code-review and CI, or do you also test it manually to check for unexpected side-effects?
Manual code review and CI only. In our experience of deleting more than 2.5K flags, testing would have helped in exactly one case, and even there the code was not well tested when the flag was originally introduced.
Just as an extra note: there are manual testing steps (and automated end-to-end tests, internal alpha testing, etc.), independent of whether the diff was Piranha- or developer-authored, but all of those happen on the continuously delivered internal version of the app - that is, after the diff has landed on master but before a release is sent to the app stores. We would count an issue discovered there as having made it past "our tests", even though it could well be caught before it reaches any external users.
Neat! I experimented with something similar and found the state of Java AST transformers to be kind of painful. I'll have to look through the code base to see what y'all used.
We used Google's Error Prone framework. Pros: you get full compiler symbol/type information, in addition to the AST. Cons: you need to build the code to process it (Error Prone runs as a javac plugin), and if you want to do things efficiently then you are restricted to basically a single AST traversal. It's very use-case dependent whether you are better off just using e.g. javaparser.
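To give a feel for what writing an Error Prone check looks like, here's a minimal sketch that just flags call sites of a hypothetical Experiments.isTreated() API - it is not Piranha's actual code, but it shows where the compiler-provided symbol/type information comes in:

import com.google.errorprone.BugPattern;
import com.google.errorprone.BugPattern.SeverityLevel;
import com.google.errorprone.VisitorState;
import com.google.errorprone.bugpatterns.BugChecker;
import com.google.errorprone.bugpatterns.BugChecker.MethodInvocationTreeMatcher;
import com.google.errorprone.matchers.Description;
import com.google.errorprone.matchers.Matcher;
import com.google.errorprone.matchers.method.MethodMatchers;
import com.sun.source.tree.ExpressionTree;
import com.sun.source.tree.MethodInvocationTree;

@BugPattern(
    name = "StaleFlagCall",
    summary = "Call to a feature-flag API for a flag that may be stale",
    severity = SeverityLevel.SUGGESTION)
public class StaleFlagCall extends BugChecker implements MethodInvocationTreeMatcher {

  // Matches instance calls to com.example.Experiments#isTreated using resolved
  // symbol information, not just the textual method name.
  private static final Matcher<ExpressionTree> IS_TREATED =
      MethodMatchers.instanceMethod()
          .onDescendantOf("com.example.Experiments")
          .named("isTreated");

  @Override
  public Description matchMethodInvocation(MethodInvocationTree tree, VisitorState state) {
    if (!IS_TREATED.matches(tree, state)) {
      return Description.NO_MATCH;
    }
    // A real stale-flag tool would inspect the flag argument, decide whether it is
    // stale, and attach a SuggestedFix that rewrites the surrounding code; this
    // sketch just reports the call site.
    return describeMatch(tree);
  }
}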
How would you introduce feature flags in a way that doesn't leave any debt behind in contrast to their approach?
Or are you saying that feature flags themselves are inherently debt-generating (which I think they acknowledge, but are taking on prudently & deliberately) — in which case, what would you do instead?
Actually, what I meant was that this tool is kind of an afterthought stemming from a bad feature-flag lifecycle. When you set a feature flag, you should normally put a deadline on it; after the deadline, you should either choose to keep it, remove it, or extend the deadline.
But it seems like they accumulated a lot of flags that stayed stale over time and went unchecked, and that lack of feedback created the debt.
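To make the deadline idea concrete, I mean something like this at flag creation time (all names here are made up):

import java.time.LocalDate;

// Illustrative flag registration with an explicit owner and expiry date; once the
// date passes, CI or the flag service can start nagging the owner to decide.
record FeatureFlag(String name, String owner, LocalDate expiresOn) {
    boolean isExpired(LocalDate today) {
        return today.isAfter(expiresOn);
    }
}

class FlagDeadlineExample {
    void example() {
        FeatureFlag flag = new FeatureFlag(
                "RIDES_NEW_FEATURE", "rides-team",
                LocalDate.now().plusDays(90));  // must be kept, removed, or extended within ~90 days

        if (flag.isExpired(LocalDate.now())) {
            System.out.println("Flag " + flag.name() + " is past its deadline");
        }
    }
}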
Also, somewhat unrelated, but I prefer two stages for retiring flags: from production and from code. Taking the example from the article, instead of using direct access like:
experiments.isTreated(RIDES_NEW_FEATURE)
I think it is better to refer to it indirectly through a function like this (assuming we are in the context of the RIDES module/class):
function isNewFeatureEnabled() {
    return experiments.isTreated(RIDES_NEW_FEATURE)
}
So when you want to retire this flag to disable the new feature, but want to keep the code around for some time (a few versions) without bloating your binary, you can simply do:

function isNewFeatureEnabled() {
    return false
}
> When you set a feature flag, you should normally put a deadline on it; after the deadline, you should either choose to keep it, remove it, or extend the deadline.
This is actually a feature we identified as important for increasing Piranha's effectiveness, and it is being added to our internal flag tracking, but it doesn't negate the need for the tool. An expiration date makes it easier for the tool to know when to run on a given feature (rather than relying on heuristics based on % rollout and days without changes to the experiment), but it still means that Piranha: a) reduces manual effort by auto-generating the (candidate) removal patch, and b) acts as the reminder portion of the expiration, so it's more actionable than just adding a task.
The thing to note is that, even in a steady state where flags are removed as soon as they go stale, enough new experiments are being created every day that reducing the time spent cleaning them up is valuable.
Also, you definitely don't want to block someone from fixing a crash because they have pending expired flags, so all you can really do with any expiration policy is to remind them. With Piranha, you are reminding them and reducing the friction to solve the issue. After all, the diff is right there for them to review and click 'land' on.
As for the hardcoded flag value in your example above: what does that accomplish? It looks to me like you'd only be shipping dead code, since there is no runtime way of re-enabling the `RIDES_NEW_FEATURE` behavior (which is the main difference between "rolled out to 100%" and "Piranha-removed"). It also makes it harder to remove the related code later, since the semantic information about it being part of the feature is lost. If it's just about having the old code available, version control does that already, no? What am I missing?
I am in favor of all automation, and I totally agree that the tool is valuable. I am sorry if my comments seemed to suggest otherwise.
What I was trying to say is basically that you can manage the debt from feature flags even without tooling.
The answer to the RIDES_NEW_FEATURE question is mainly about your point that "you definitely don't want to block someone from fixing a crash because they have pending expired flags".
When I am disabling a flag, if I set isNewFeatureEnabled to return false, I am basically removing the bloat instantly. Then, when I have time to review the code, I can also remove the dead code. This actually addresses the concerns in your blog post about "accidental activation" and "bloat" without waiting for developers to fix the code. Piranha could set the flag to its stale value first, and then later send the developer a task to clean up the dead code.
My flow is actually a little more complicated than what I described above: I also have assets etc. associated with feature flags, and the CI pipeline removes the unused assets for a flag once it goes stale. So basically I have something more like: function isNewFeatureEnabled() { return isFlagEnabled("RIDES_NEW_FEATURE") && getFlagValue("RIDES_NEW_FEATURE") }
What I am curious about on this topic: do you have any kind of conflict detection for your feature flags?
Basically, I sometimes have conflicting flags that can introduce bugs, especially between some rarely used flag and a new feature. A basic example: two different flags setting the same property to different values on a UI object.
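A made-up illustration of what I mean (none of these names are real):

// Two independent flags both set the same property on the same UI object, so the
// result depends on evaluation order and on which flag combination a user is in.
interface Experiments {
    boolean isTreated(String flagName);
}

class RideHeaderView {
    private int heightDp = 56;  // default

    void applyFlags(Experiments experiments) {
        if (experiments.isTreated("COMPACT_HEADER")) {      // old, rarely used flag
            heightDp = 40;
        }
        if (experiments.isTreated("RIDES_NEW_FEATURE")) {   // new feature assumes the default height
            heightDp = 72;
        }
        // Users in both treatment groups silently get 72 and the compact-header
        // behaviour is lost; nothing crashes, so tests rarely catch it.
    }
}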
More test coverage would probably help, but I am curious whether you have some automation to detect those cases.
Most of the existing Android codebase is Java. There is some Kotlin code, and PiranhaKotlin is certainly on the roadmap somewhere, but, especially since the goal is removing older features, Java makes up the lion's share of the use case for such a tool right now.