The Cult of the Root Cause (reinertsenassociates.com)
180 points by fanf2 on May 19, 2018 | 66 comments



It's also worth pointing out that you may not be able to fix the root cause. The sailing example is great here:

- Can you fix the crack in your hull whilst you're out at sea? Almost certainly not.

- Can you even tell you have a crack in the hull until you've reduced the water in the bilges by running the pumps for a time? Again, quite possibly not.

Treating the symptom is really the only sensible option, unless it's serious enough that you need to put out a Mayday. Again, not a course of action that addresses the root cause but, in some situations, absolutely the right thing to do. To take things up many, many notches, the sinking of the Titanic was an appalling tragedy from which relatively few people were saved, but I guarantee that nobody at all would have been saved if the people on the ship had opted for a series of committee meetings about how to solve the problem of the iceberg. (Not to say there weren't a very large number of hideous blunders in the management of that situation going all the way back to the ship's design and fit-out.)

Another problem with Five Whys, applied heedlessly, is that it's an extraordinarily arrogant philosophy, because it assumes you can know the answers to those five whys. Often you can't, at least not without going on a journey and fixing a few things along the way; without that, you may simply be wasting your time trying to answer questions that you can't answer from your present situation.

And that really, and somewhat poignantly, cuts to the root cause of why I view the kind of thinking frameworks/fads popular in business with a degree of scepticism: over-applied or misapplied, they paralyse people into inaction, and thereby provide fertile stock for breeding mediocrity.


If you get the iceberg as a root cause then you are really doing something wrong; in fact I would suggest the point of a 5-Whys is to move people past thinking that the iceberg is the problem and get to the actual root cause :P.

Why is the ship sinking? We hit an iceberg. Why did we hit an iceberg? [Equipment or command failure] What is wrong with [maintenance strategy/command structure] that allowed this mistake to slip through? [technical details]
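As a toy illustration (the concrete answers below are my own hypothetical fill-ins for the bracketed placeholders above, not historical findings), the chain can be written down as data, with the action item attached to the deepest why rather than the surface symptom:

```python
# A toy 5-Whys chain for the iceberg example. Each entry pairs a "why"
# with the answer uncovered at that depth; the deepest answer is the
# candidate root cause, and that's where the fix should attach.
chain = [
    ("Why is the ship sinking?", "We hit an iceberg."),
    ("Why did we hit an iceberg?", "The lookout spotted it too late."),
    ("Why was it spotted too late?", "The lookout had no binoculars."),
    ("Why were there no binoculars?", "No pre-departure equipment checklist exists."),
]

def candidate_root_cause(chain):
    """Return the deepest answer reached -- the candidate root cause."""
    return chain[-1][1]

print(candidate_root_cause(chain))
# prints: No pre-departure equipment checklist exists.
```

Stopping at the first entry ("we hit an iceberg") yields nothing actionable, which is exactly the failure mode being described.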

If you have a team of people and something goes wrong, it is overwhelmingly likely that a human could make a decision differently or do some task that is not being done that would mitigate the worst of the damage.

People absolutely are saved by the committee process, in the same way that items tend to roll downhill. Pretending that humans have perfect control over their environment and could have done something more has proven remarkably successful at getting results. It isn't very impressive, it feels very unreasonable, and it isn't going to work on its own, but it is a very useful tool for letting people stand up and ask, "Sure, something is going on that is out of our control, so why aren't we ready for it? This happens sometimes and we need to be prepared."

Basically, if you just ask why 5 times without any sense of personal responsibility, you'll get stupid results. That's true of any process. But if an uncontrollable event has impacted your endeavor, it is absolutely worth asking, "Why are we exposing ourselves to a risk we can't control? Could we somehow have avoided this?"


The rule of thumb I give people is that if there is a different path that gets you to the 5th why in two steps it’s not really the 5th why.

However, it’s still very possible to set a narrative with the five whys, because you can steer it to different contributing factors (most problems really do have more than one cause).


Exactly which part of this is supposed to help the ship that just hit an iceberg and is in the process of sinking?

If your answer is "the relevant committee would have met before the accident" or "the committee would reduce / mitigate future accidents," you have missed GP's point.


Root Cause Analysis isn’t really meant for active emergencies like that where you can’t take time to analyze “why”. It’s a retrospective tool. If you’d like a mental model in your toolbox for an active emergency the OODA loop[0] is well tested.

RCA is good for figuring out how to keep others out of the mess you’re in after the fact.

[0] https://www.fs.blog/2018/01/john-boyd-ooda-loop/


While the article starts to digress into a pseudoscientific mess at the end, this method of problem solving is pretty damn effective when you're under the gun. While you don't have time to deeply analyze the situation, you must make some time for information gathering and analysis, whether your deadline is in 5 minutes or 5 days. And the higher the stakes, the more important it is that you don't take shortcuts.


A good 5y is run after all immediate damage has been patched and triaged, and preparation for the 5y is gathering all the evidence needed for the various theories.

5y is expensive to do right. It means that organizations with lots of failure must spend time reflecting and investigating and fixing, rather than shipping new features.

That is the point!


The tech industry is badly re-inventing processes that the rest of the world developed decades ago. Other industries have learned to deal with far more complex and serious issues of quality. If there is arrogance in our approach, it's our failure to learn from other industries; lots of people in the tech industry have heard of the Five Whys, but very few have actually read Taiichi Ohno's Workplace Management, W. Edwards Deming's Out Of The Crisis or Walter A. Shewhart's Economic Control of Quality of Manufactured Product.

I frequently recommend this lecture on the Piper Alpha disaster, a fire on an offshore oil rig that killed 167 men. It eloquently summarises the findings of the Cullen Enquiry, a six-month study of exactly why the disaster happened and what could be done to improve safety in the offshore industry. The enquiry found a complex and interconnected set of factors encompassing process, training, culture and design. It is densely packed with lessons that can be applied to our industry, not least of which is the idea of conducting intensive and systematic inquiries into major failures.

https://www.youtube.com/watch?v=S9h8MKG88_U


Thank you for this very considered response - I've enjoyed all of those books! There is a tendency to read secondary sources that are presented in a currently consumable format. Rare is it that someone has read the primary sources or even referenced them. As a gentle introduction to the impact of engineering for quality, I give people copies of James P. Womack and Daniel T. Jones's The Machine That Changed the World: The Story of Lean Production. If this leads to questions for further study, I point them to Ohno and Deming. As for the topic of this thread, Root Cause Analysis is only one method of problem solving, and other systemic methods may be more appropriate depending on the issue.

For a thorough and in-depth treatise on critical thinking and methods, one could read Problem Solving, Decision Making, and Professional Judgment by Paul Brest and Linda Hamilton Krieger. They provide examples of many decision-making frameworks and cognitive biases in a readable format.

I did appreciate the link to the Piper Alpha resource. For another read along these lines there is the BEA Air France 447 disaster final report. https://www.bea.aero/docspa/2009/f-cp090601.en/pdf/f-cp09060...


The 5 Whys is just meant as a simple, easy-to-use tool; of course it won't solve your big-picture problems, and nobody is questioning that. You're raising a strawman. Any methodology applied heedlessly will lead to bad results.


This is exactly right. It was meant for people with (a) a high school education to (b) fix problems in a repetitive manufacturing process where if you don't fix the problem on this car, it will appear again on the next car, and the one after that.

If your damn boat is sinking, you should just get to safety ASAP, but if you're in the boat making business, it may be of some interest to know why your hulls are cracking.


You're missing the point: if some critical piece of equipment is out of alignment and you fix that, then the next car is not going to have an issue.

Five Whys is about avoiding the next case of a piece of equipment being out of alignment, not the current case. You can have a reasonable policy of waiting 3 weeks before doing the Five Whys so you have a better understanding of what's going on.


Just want to agree with you. Treating the symptom during an event really is about the best you can do.

I'll also agree with "you may not be able to know a why." For systems, that can be a good guide to what instrumentation to add. It sucks, in that you won't prevent the next event; it's good in that you should be able to get more out of the next one.


Much depends on context. If you're the captain of the Titanic it doesn't make sense to focus on the iceberg as the root cause. If you're the CEO of the White Star Line it certainly does.


This article is creating a straw man. You don't do a 5 Whys and fix what you think the root cause was, you fix the issues that were most serious. If there are problems too big to tackle immediately, put in short term mitigations and incorporate your new understanding of the system's reliability into your future plans.


Doubly so because you don't use the 5 whys during the emergency. You use it during the post mortem. After you've unplugged the burning computer. After you've gotten the ship out of the immediate existential threat. If the computers are burning due to faulty wiring no amount of triage is going to stop that from happening next week. If ships are getting holes because the currents have shifted and bergs are appearing in places where hobby sailors frequent then the maps and some public outreach are the right solution.

I dunno who taught these people about the 5 Whys, but someone (possibly themselves) has done them a tremendous disservice.


More importantly, it's about not starting to fix things until you understand the context. The fix might be at any of the levels, but it's pointless to speculate where the fix goes if you don't understand the whole system/network/causality chain to an adequate degree. An "adequate degree" might be shallow overview, a deep dive, or anywhere in between, but it's still necessary.


Couldn't have put it better myself. This article was clearly written by someone who's worked in an entirely dysfunctional environment, or who has read an article on the "five whys" and missed the whole rest of the problem management framework.


If there is a cult of the root cause, I have yet to meet it.

Here's what doing an exercise similar to 5 Why's gets you:

- An understanding of where issues come from. Whether your plan of action starts from the top, bottom, or middle, taking the time to step back and broaden your perspective before you jump in to fix a problem helps to make sure you're going after the right things for the right reasons.

- A culture of _not_ just picking the most expedient and facile solution every time issues come up, and going with that. In companies I've been in, the pressure is almost always on to find the dumbest, hackiest, absolutely fastest path out of trouble. Spend multiple years solving every problem that way, and you are in deep trouble! It takes institutional courage to push back against that, and having a practice in place to force you to stop and think now and again gives you an opportunity to summon that courage.

- A culture of ownership. This seems a little counterintuitive to me, since if you follow root causes deep enough you're liable to stumble onto people and process problems that are way out of your control and pay grade. Looking at root causes this way, you might think it's a process of passing the buck. However, by shining a light on such things, and finding people to address those things where they have no owner, you can push towards a better collective ownership of the real issues that face your company.

No good management idea is free from abuse, of course. You must exercise taste and judgment in deciding how deep to push with root causes, and what to do with the discoveries. I would think it's rather self-evident that 5 Why's doesn't mean you always ask exactly five questions in a strictly linear pattern. But for heaven's sake, make sure you ask more than one!


I would say it's even more basic than that: the 5 whys are a way to push people to gather information and talk to one another before making decisions about solutions. The point is not to achieve perfection, but to consistently not make stupid and easily avoidable mistakes.


I've never taken the 5 Whys literally. It's obvious to me that all root causes are not '5 Deep', therefore, this can't be a literal objective. I see it as a metaphor for being an effective problem solver, as a reminder to second guess the cause I've identified and ask myself or my team, "Is there a deeper, underlying cause?".


The most interesting cases are complex systems where a fault/event results from a combination of multiple factors. In that case, asking for a "root" cause is unproductive, and a better question to ask would be: how might the system be patched with the least pernicious side effects? This is also why I don't always like the drive for more "accountability". As can be seen in the other HN thread on the front page about data-driven medicine and the side effects on the healthcare system, you need to be very careful about which of the causes you decide to intervene on.

When you closely manage something to reduce variation, you also lose any information you might get about the system from the variation of that quantity. This point is nicely made in another post on the same blog: http://reinertsenassociates.com/the-dark-side-of-robustness/

Especially with reflexive systems (involving humans), the appropriate response might sometimes involve performing no intervention, or performing an intervention downstream, to modify its assumptions about what it receives (eg: adding error handling).


My ops postmortem template tried to elicit breadth. Once you’ve got a forest of causes, you can apply Five Whys to add depth.

There’s some overlap among the following questions. The intent is to elicit observations and ideas, not to uniquely categorize them.

* What are all the factors that could have prevented the incident?

* What are all the factors that could have detected the issue before production?

* What are all the factors that could have detected the issue sooner when it did occur?

* What are all the factors that could have accelerated mitigation? (Including, especially, changes that could have reduced the risks of mitigations considered too risky to apply.)

* What are all the factors that could have accelerated remediation?

* What could have reduced the scope or impact?

It’s common to come out of this with a laundry list that overfits the last incident and, if applied, would increase the complexity of the system and add risk. We’d typically apply one or two fixes, and stockpile the rest to see if any of them would have addressed any future incident. Usually most of the “solutions” turn out to be specific to the single incident that prompted them.
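For what it's worth, a breadth-first template like this is easy to encode so every incident gets the same pass. The sketch below is only my rendering of the process above: the question wording is paraphrased, and the apply/stockpile split stands in for "apply one or two fixes, stockpile the rest":

```python
# A minimal breadth-first postmortem template, paraphrasing the
# questions above. Proposed fixes are collected per question, then
# triaged: apply a couple now, stockpile the rest for future incidents.
QUESTIONS = [
    "What could have prevented the incident?",
    "What could have detected the issue before production?",
    "What could have detected the issue sooner in production?",
    "What could have accelerated mitigation?",
    "What could have accelerated remediation?",
    "What could have reduced the scope or impact?",
]

def triage(answers_by_question, apply_limit=2):
    """Flatten proposed fixes, apply a few, stockpile the rest."""
    proposals = [fix for fixes in answers_by_question.values() for fix in fixes]
    return proposals[:apply_limit], proposals[apply_limit:]

apply_now, stockpile = triage({
    QUESTIONS[0]: ["capacity review before launch"],
    QUESTIONS[3]: ["runbook for failover", "automated rollback"],
})
print(apply_now)
# prints: ['capacity review before launch', 'runbook for failover']
```

The stockpile then doubles as the test set described below: after the next incident, check which stockpiled fixes would actually have helped.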


5y is about finding ways to prevent the conditions that let incidents happen, though, not just preventing incidents. That's kind of the point of asking 5 why questions. "We fell over because we lost a database server and didn't have enough spare capacity to run peak load on the standby. We don't have full N+1 because it hasn't been funded. It hasn't been funded because the business didn't have a good model for the risk-adjusted cost. Action item: add the cost of this outage to the next budget forecast, and add a requirement for risk-adjusted cost estimates to all future financial plans."


My favorite go-to on this one: https://blog.acolyer.org/2016/02/10/how-complex-systems-fail...

The truth is that by “fixing” the root cause you will sometimes destabilize the complex system you are running.


Yes, root cause analysis and corrective action should only be done with Cook's insights in mind.

"Post-accident attribution to a ‘root cause’ is nearly always wrong." "Post-accident remedies usually increase the coupling and complexity of the system. This increases the potential number of latent failures and also makes the detection and blocking of accident trajectories more difficult."

How Complex Systems Fail is short but loaded with value; if you haven't read it, go do so now!


I agree with you. Root cause analysis should be informed by an understanding of complex, dynamic systems. The article's assumption, however, that RCA and systems thinking are somehow at odds is incorrect. Root cause doesn't necessarily, as the author implies, mean a single, isolated cause. It can designate the linking of "multiple contributors" as the author advocates.


> Second, it assumes that the best location to intervene in this chain of causality is at its source: the root cause.

It doesn't though, that's just a built-in assumption for the lazy. The point of Root Cause Analysis and the "5 Whys" isn't necessarily to get to the root and fix the root...it's to provide a framework for traversing a problem set. The point of this methodology is so that you traverse the problem, understanding each step along the way...not that you simply jump to the root and try to fix it blindly.


Most of these examples don't seem relevant to RCA at all.

> shifted from pumping, to plugging, to hull repair

Pumping and plugging are immediate response, just like a decision to temporarily shut down the website when a compromise is discovered, or pulling the plug on a smoking computer. What do these have to do with root cause analysis?


The point was that, sometimes, the best thing is not to worry about finding the root cause -- not now anyways -- but, instead, to "treat the symptom".

That is, when your boat is sinking, immediately focusing on fixing the root cause (the crack in the hull) is not necessarily the best course of action. Instead, treat the obvious symptoms that you can to "stop the bleeding" (plug the hole, pump the water out) and you can deal with the root cause when you're in a better position to do so (back in port, not miles out to sea).

See also: metaphor.


Are there actually people on the other side of this argument? Like, is there someone who is saying "Immediately do a root cause analysis — just that, and nothing else — no matter what the situation is"? This seems obvious on the same level as "a piece of kale is not suitable for use as a pacemaker."


Unfortunately, there are folks who jump right to the RCA, with often devastating impact upon their ability to deliver functional products reliably and on time. Moreover, when doing an RCA, you need to be able to objectively examine every aspect of the decision making, implementation, and execution tree. Not allowing every decision to be questioned and reviewed means you enable a route for poor decisions driven by other than business/technical needs to enter in.

The irony in this is that these non-questionable, sacred-cow-like requirements, which cannot be investigated without causing sturm und drang as well as profound political problems, are often themselves the root cause of multiple/many problems that one encounters. That you cannot investigate the entire space of issues means that there will be regions of terra incognita for your analysis.

I've seen this at every single company I've ever worked at. Including my own startup. Much of what I learned about this came from my time at SGI/Cray, where specific holy grails could not be questioned.

It is very ... very ... hard to admit that your systems could somehow be problematic. It is tremendously humbling to see conclusive proof. It changes your engineering mindset when you do accept this, from being defensive to something better. Your architectures improve, and your assumptions are less WAG-ish.


Yes and no.

First, I think the author is (as someone else said and as you seem to agree) raising a strawman.

Second, I don't think there are people on the other side of the argument, but there are people on the other side in situ: I mean that even if rationally they would never admit it, they hijack the immediate "fire extinguishing" or "pulling of the plug" or "pumping of water" to discuss where the problem is coming from.

I've seen those people lacking discernment: not only do they get in the way of the short-term action, they are also pretty bad at cause analysis and conflate it with blaming people or "I told you so".

This is a generalization of course, but the TLDR is that even if those people wouldn't argue against it, they act against it.


This. It's highly problematic when these people are in positions where they are able to derail efforts to stabilize by insisting upon RCA first and only.

Seen this everywhere. Been on both sides of this. Learned to be humble from the experience.


As the author apparently felt that he needed to write such an article, I am assuming there are -- or were, at least (the article is from 2011).


I totally appreciate OP's insights here. It can be so tempting to see things linearly. But obviously the real world is anything but. There's a curiously wicked theory out there by David Abram that part of this seeming human predilection for linearity has occurred somehow through the neural conditioning involved in adopting systems of writing, compared to the crazy fractal ocean of sensory input that living in a real, living landscape begets. The writing has a start. A finish. It progresses in a single visual direction, and what's more, the pieces themselves only represent reality. The letters themselves are completely different from the things they are describing. And they might be one of the first objects in human history that work like this...

Anyway, the book is called The Spell of the Sensuous, and it's dense af, but bursting with fascinating lines of inquiry.


People are notoriously bad at playing five-whys. If you haven't reached an ideological or metaphysical problem by your third 'why' then you aren't playing well enough.


Once again, that's a problem of work culture. The five-whys is super simple; it's up to people to be smart using it. If I had reached an ideological problem after a 5 whys, my team would kindly ask me what the fuck I am doing.


My experience with Root Cause Analysis is it's often an exercise for poor managers to deflect accountability by diluting it among more junior people.

However, 5-whys is very useful as a design principle instead of using it to respond to failure.

It goes something like:

- 0: Build a thing.

- Why 1: Because the customer asked.

- Why 2: Because the thing is what they think they need.

- Why 3: Because it is one solution to a gap in their ability to achieve something.

- Why 4: Because that something is an economic need.

- Why 5: Because there is a market opportunity to fill that gap: maybe with this thing, maybe with something better.


I suspect that this is because a human propensity exists to allocate and apportion blame, particularly in cases where the interactions involved in the immediately prior actions are unclear.

I have actually not even once encountered this 5 Why's method.

Nor had I heard of it before.

But I founded my company in 1996, starting out with almost two hundred years of experience surrounding my incredulous and lucky younger self, including several PhDs and former Fortune 500 board members.

I will hazard that this 5whys technique is fundamentally flawed and easily susceptible to manipulation for procuring a scapegoat.

I only hope that explains why I have never encountered this before. Even more, I hope I can feel a little like something was going on in the right way, in my business, to filter and reject what I think is, and definitely comes across as, bogus to me.


The biggest problem with the 5 whys is that the whole concept is taken out of context. In a manufacturing context every repeatable problem is a problem with your process. So in that context it makes a lot of sense to search for a process change that resolves your problem.

Let's say you make light bulbs, and every fifth bulb comes out misshapen. You would use the 5 whys to trace it back to the molding station, where you discover that bulbs cool at a different rate in one of the machines because the mold uses a better insulator. You could stop there and replace the insulator. But if your job is to increase yield, then you can save the company a lot of future money if you figure out how that mold got there in the first place. You might find that purchasing subbed a cheaper replacement based on an incomplete spec. Or you might find that the supplier recently switched materials.

The point is that when you're trying to establish a controlled, repeatable process, you need to understand where your controls break down.

Once you understand the process problem, then you make a business decision about what to change. It was never meant to be applied to R&D problems. R&D processes are not as concerned with repeatedly doing something correctly. They're concerned with making sure something can be done in the first place. It's a different class of problem.


This reminds me of what we've been doing for a while, the "blameless postmortem." The technique is championed by Etsy, and you can read more about it in an article that introduces their debriefing guide:

https://codeascraft.com/2016/11/17/debriefing-facilitation-g...

From the linked PDF from the article:

> “Adaptability and learning. We learn through honest, blameless reflection on lessons and surprises. We believe that traditional root cause analysis makes learning from mistakes difficult. Our blameless post-mortem process is a widely-cited technique that we believe is becoming best practice among organizations that value innovation. Blameless postmortems drive a significant percentage of our development as we analyze what about our production environment was less than optimal and rapidly make corresponding adjustments.” (Etsy, Inc., 2015)

The idea, boiled down, is to inspect timelines, procedures, and actions and develop a narrative of how an incident came to be. The goal is for everyone to walk away with a (better) understanding of everything. With this, people are better armed to put into place solutions.

One example from the text is where an engineer pushes out a change because they thought the build had zero test failures. The push breaks the system and causes a regression that should have been caught in the tests. During the postmortem, the engineer says, "I thought the tests had zero failures. I guess I need to be more diligent in the future." Upon further timeline investigation, it is noted that the tests actually had eight failures, but the font made eights and zeros look very similar. The fix was not "be more diligent"; the fix is maybe to use a better font or colors for pass/fail.

Overall, I like the ideas proposed in this blameless postmortem style. It runs counter to the natural tendency to "find a problem and fix it," because it feels like we are talking less about the problem and the fix and more about the narrative of the failure. But what I've seen is folks gaining a better understanding of how everyone else works, learning about tools and tricks, and about assumptions. And knowing more about the narrative leads to better solutions.


> Our blameless post-mortem process

Theirs?

Accident investigation agencies such as the AAIB and NTSB have been following "blameless" processes for decades. Find the causes and save lives. Who pressed the button or forgot to connect the oil line is irrelevant compared to the fact that the failure modes were possible.


Yes, their process, as in "the process they have implemented and documented", not as in "they've invented the concept". They widely reference prior art and experiences in other fields (e.g. the second sentence of the post the parent linked)


Shouldn't it be the cult of the single root cause? Most problems seem to have several contributing factors. You can almost randomly pick one factor, improve it, and the whole situation will get better.

You see this a lot in public debate like education or health care. Instead of fixing one of the many problems a lot of time and energy is wasted on finding THE root cause that will fix everything.


This was what came to my mind as well. The fire strawman presented exemplifies this. Finding the root cause is simply a mental model. Discovering that there are three causes and picking the best one to fix (assuming you're limited by time/money/complexity/etc., or it's undesirable to fix the others) is 100% ok. The existence of instances of multiple root causes does not really say anything negative about Root Cause Analysis.

In cases where you can't discover the root cause (i.e., a plane that explodes, destroying the evidence of the root cause), you simply have to go as deep as is reasonable and work from there.

If someone is unwilling to be reasonable and accept a number of root causes between M-N then the issue is with them, not Root Cause Analysis.


In my view, what we see in public debate on these issues, and indeed most political issues, is a sole focus on the symptoms of the problem and never an analysis of the factors causing these symptoms. Adding to the dysfunction, no analysis of potential second-order effects of the symptom-treating solutions is ever done.


I wouldn't take such advice. For me, getting to the root cause of things is one of the core values of a good programmer (and of good operations as well).

Cause #2 is also fictitious: the 5 Whys never says anything about fixing the problem, only understanding it. In fact, for me, not fixing the problem is just as valid a solution: as long as you know what caused it, you can determine whether it's worth fixing at the root, or even at all.

As to the linearity of cause and effect, while it's true that many problems have multiple causes, a solution to a linear problem will prevent alternative causes below it. Besides, the grand majority of issues arising in mature systems arise from a single cause and have linear cause and effect.


Reminded me of this entertaining clip:

http://vooza.com/videos/the-5-whys/

I doubt that most people who use the five whys really take it to be as simplistic as the author of this article suggested, but to those that do, it's a good wake up call.

In fact, life is even more complex than the article suggests, when you throw in effects of chaos, feedback loops, missing information, unknown influences - just to mention a few. Still, tracking down the order in processes has got humanity quite far (at least as far as being able to predict and engineer accordingly) so it's obviously effective.


To save some time watching the vid: it's 'because the Illuminati or something'. It is a great little sketch and does illustrate the major objection to the OP's essay.


Seems hand-wavy, thin on value, and promotional of a consulting business.

It would’ve been better to talk about the real world, including TPMS and the NTSB investigation approach... making cars very reliable and making very complex aircraft safer with strict regulations.


For a deeper look from the systems engineering side, check out http://mit.edu/psas, specifically the book-length treatment in “Engineering a Safer World”.

For applications of similar ideas to cloud software and devops-style environments, https://www.kitchensoap.com/2012/02/10/each-necessary-but-on... is also helpful!


I think that 5W and 5W2H are means to develop a network of logical reasoning. These diagrams (fishbone/Ishikawa) are used as a means to discuss what attack areas you have concerning a problem. They are used in many ways and in many different fields. They are very useful when designing experiments (DoE). 8D and 10D teams use them extensively in (high-tech) manufacturing.


> There are often multiple causes for an effect,

I find that some of the toughest bugs to solve are the ones where the undesirable effect has more than one cause.

To be precise, it's the kind of situation where the bug can be triggered by 2, 3, or more independent causes, i.e. each cause is sufficient on its own to produce the bug.

Often when attempting to solve a bug like that, I'll find one likely cause and address it, but because the bug persists I end up undoing the fix for cause #1, then finding cause #2, and ping-ponging between them till I realise that I have to address multiple causes to make the bug go away.
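A toy sketch of that situation (entirely hypothetical names and inputs): the same visible symptom, a ValueError from a config parser, has two independent causes, each sufficient on its own. Fixing only one leaves the bug reproducible from the other direction.

```python
def parse_pair(line):
    """Buggy parser: the same ValueError symptom has two sufficient causes."""
    key, value = line.split("=")   # assumes no stray "=" or comments
    return key, int(value)         # assumes a plain base-10 integer

# Cause 1 (sufficient alone): a trailing comment on the line
#   parse_pair("retries=3  # default")  -> ValueError from int()
# Cause 2 (sufficient alone): a value written in another base
#   parse_pair("mask=0xff")             -> ValueError from int()

def parse_pair_fixed(line):
    """Both causes must be addressed before the symptom goes away."""
    line = line.split("#", 1)[0]               # strip comments (fixes cause 1)
    key, value = line.split("=", 1)
    return key.strip(), int(value.strip(), 0)  # base-aware int (fixes cause 2)
```

Addressing only cause 1 still lets `mask=0xff` crash, and vice versa, which is exactly the ping-ponging described above.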


Looking at this (thanks anoncept): http://psas.scripts.mit.edu/home/get_file.php?name=STPA_hand...

I'm not sure it takes us all that much further (for general purposes) than John Stuart Mill writing on causation in the nineteenth century; he did very well creating a philosophical foundation for causation-talk. (And the STPA handbook is excellent, as well.) In any case, Mill is the original source for the modern understanding of plural and complex causation.


This article feels very straw man to me.

First, the 5-whys process I've learned and run over the last 10 years takes pains to identify a fix or mitigation at each step, not just at the root.

In fact, for small failures with big fix costs, the root cause is deliberately left alone, because analysis shows the finer-grained fixes are better/cheaper.

Second, root cause trees have been quite common, because there's almost never just one chain, especially when you're running a system that has plugged all the easy holes and fixed all the obvious first level problems.

Straw man article IMO, but I can't figure out for what purpose.


Not a fan of this term "the root cause". It appears to be causing some semantic trouble for the author as well. If you never get it in your head that you're going to find the "root cause" of something (initial conditions of the universe? lol), then you won't be looking for it. You'll be looking for the most promising point of intervention, which is what we do.


The examples in this article are all straw men; where's the compelling specific case where the 5 whys led to some demonstrable mistake?

Nothing to see here.


Whilst I agree with the article that context matters a lot, there are also a lot of great things about RCAs.

A few of the things:

1. It gives visibility into the engineers' abilities

2. It attempts to surface weaknesses in technology choices made above the people who support them

3. It attempts to surface weaknesses in the process (similar to a retro). Yet, in practice, this is rarely addressed.


Stopping before reaching the process is a big mistake many make when doing RCA. For example: Why did the program crash? [Divide by zero.] Why? [Loop iterating the array didn't terminate.] Why? [Input arguments not validated.] Why? [Developer was "sloppy".] And then they stop there.

The answer to the last question should instead be one or many of: no unit tests, no code reviews, working overtime, no QA before shipping, can't concentrate in an open-plan office, compiler warnings disabled, using too low-level a programming language for high-level logic, developers not educated on the current tech stack, and so on. With follow-up whys on all of those.
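The first few links of that why chain can be made concrete with a minimal sketch (hypothetical function names): an unvalidated argument sends a loop past its data and ends in a divide by zero, and validating inputs at the boundary removes both failure modes at once.

```python
def mean_of_first(values, n):
    """Buggy: average the first n values, with no input validation."""
    total = 0
    for i in range(n):      # unvalidated n > len(values): IndexError
        total += values[i]
    return total / n        # unvalidated n == 0: ZeroDivisionError

def mean_of_first_checked(values, n):
    """Same logic, with the arguments validated at the boundary."""
    if not 0 < n <= len(values):
        raise ValueError(f"n must be in 1..{len(values)}, got {n}")
    return sum(values[:n]) / n
```

The comment's point, of course, is that the whys shouldn't stop at the missing check: the next questions are why no test, review, or QA step caught its absence.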


The problem usually is not finding and fixing the causes, but that you could have spotted the bugs earlier, when the damage would have been smaller. People will always make mistakes, but catching as many as possible, as early as possible, lessens the consequences.


This article makes me angry. I think the author has some useful things to say (like: there's usually not a single root cause), but the tone is super arrogant, and most of the arguments start from straw men.


Ultimately, the root cause is always "you fucked up." At least that's what people always seem to be pushing towards when they start dragging out terms like root cause.


I recently had to do a five-whys write up after typo-ing a config file. I'm pretty sure it was some kind of shaming exercise.


Chains are chains are chains. Can you construct a sail?


(2011)



