There is No Root Cause (kitchensoap.com)
77 points by themcgruff on Feb 10, 2012 | 19 comments



The points here seem valid, but I think the author misunderstands 5 Whys analysis a bit. From http://en.wikipedia.org/wiki/5_Whys

"It is interesting to note that the last answer points to a process. This is one of the most important aspects in the 5 Why approach - the real root cause should point toward a process that is not working well or does not exist. Untrained facilitators will often observe that answers seem to point towards classical answers such as not enough time, not enough investments, or not enough manpower. These answers may sometimes be true but in most cases they lead to answers out of our control. Therefore, instead of asking the question why?, ask why did the process fail?"

and

"These tools allow for analysis to be branched in order to provide multiple root causes"


I was a bit puzzled reading the article, because in the places I have worked, 'root cause analysis' isn't looking for the simplistic thing the article describes. If it is simple, great, but more often than not the root cause is poor interaction between things, which can require multiple changes or even under-the-bonnet refactoring. The article's description of root cause analysis sounds like something a first-year undergraduate would think.


I'm getting the impression that that HBR video has had much wider penetration than the 5 Whys concept as a whole. This seems to have led a lot of people to draw the wrong conclusions about even what 5 Whys is really about, let alone how it works.


hrmmm... but not multiple causes that are actually fine on their own, or not in this particular order, or only on that one day when that other (unknown) condition existed? I've been in many tense meetings over those kind


During stressful times (like outages) people involved with response, troubleshooting, and recovery also often mis-remember the events as they happened, sometimes unconsciously neglecting critical facts and timings of observations, assumptions, etc.

This is a great reason to use IRC or some other loggable, text-based medium. After everything's green and you're doing the failure analysis, you can look at the logs to see what happened and when. IRC also makes it easier to collaborate if things are broken in the middle of the night when everyone is at home.
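As a minimal sketch of that kind of after-the-fact reconstruction: the snippet below merges timestamped lines from several sources into one ordered timeline. The line format (an ISO 8601 timestamp, a space, then free text), the source names, and the sample lines are all made up for illustration; real IRC or monitoring logs would need their own parser.

```python
import heapq
from datetime import datetime

def parse(line, source):
    """Split 'TIMESTAMP rest of line' into (datetime, source, text).
    The format is an assumption for this sketch, not a real log format."""
    stamp, _, text = line.partition(" ")
    return (datetime.fromisoformat(stamp), source, text)

def timeline(**sources):
    """Merge any number of named log sources into one time-ordered list."""
    streams = [
        sorted(parse(line, name) for line in lines)
        for name, lines in sources.items()
    ]
    return list(heapq.merge(*streams))

# Hypothetical sample data
chat = [
    "2012-02-10T03:14:07 alice: db01 looks slow",
    "2012-02-10T03:20:00 bob: restarted replica",
]
alerts = ["2012-02-10T03:12:30 disk 95% on db01"]

for when, source, text in timeline(chat=chat, alerts=alerts):
    print(when.isoformat(), f"[{source}]", text)
```

Reading the merged output top to bottom gives the order in which things were observed and said, which is exactly what memory tends to get wrong.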


This is why we have central monitoring systems that aggregate system data. The aggregation makes trend analysis and correlation across systems much easier. Unfortunately it's a home-grown solution; I'm not sure what's available off the shelf.


Great post. A keeper.

We see this same type of thinking when dealing with all sorts of other complex, multi-dimensional systems. Economics, for instance. There are a huge number of people who think economies work in this same linear fashion. Or managing large groups of developers.

In some systems that are becoming more complex, such as avionics systems for large airplanes, it's getting to the point where the old root cause analysis methodology is still being used although it's getting less and less applicable (I'm speaking specifically of the crash over the ocean of the Air France flight, but there are other examples)

Our minds desperately want to live in a world with clear causality. Do X and Y will happen. When the world doesn't live up to our expectations, many times we just get out a bigger hammer.

Looking at this problem solely from a philosophical standpoint, it looks like there is a powerful argument in place for tiered systems to have some sort of distributed goal-seeking self-programming (machine learning), especially when dealing with large numbers of identically programmed/configured computers. That way the same combination of obscure causes wouldn't have such a disastrous multiplicative effect. Would be cool to chase that down further sometime.


The real world still has clear causality. Just because the links of causes and effects (and the involved feedback processes) are often too complex for most people to follow does not mean that they are not there. You might as well say that because most people can't do calculus, calculus isn't really useful.


Yeah, this was not a keeper for me. "Causality is complicated..." yeah, and if a butterfly flaps its wings... blah... blah... blah...

This offers no insight as to how to improve process when failure occurs. Sure dogmatic application of root cause analysis is foolish, but the same obvious conclusion could be reached about dogmatically applying any type of management principle or analytic process. Failure to think outside of the box is a failure too.

What is annoying is that the author suggests that people should be skeptical of root cause analysis and 5 whys, without offering anything concrete as an alternative approach.

Every management technique is simply a practice towards further learning, but failing to practice anything simply because you can find flaws in everything accomplishes exactly nothing.


I think you missed the point here.

It's not that causality doesn't exist, or that you shouldn't use the 5-whys; it's that we have a desire to focus on single causes instead of multiple ones, and to understand systems as simple cases of cause-and-effect.

Systems are coded and tested for common paths. Extremely rare circumstances take systems down pathways engineers may have not planned for. When you layer systems, combinations of rare situations can cause "storms" that take everything down. There was a great article on AWS on HN a few months ago that made a point that bears repeating: they are doing something with PaaS at that scale that is quantitatively different than what's been done before.

But nobody is saying to give up causal analysis, only to recognize the limits of RCA when your causation tree can span hundreds of nodes. The interesting thing here is that the human application of the analysis tool has its own features that are just becoming apparent. Analysis tools have human environments they live in. Once you acknowledge that as a separate issue, there's no reason you can't continue to use those tools (and others) to work through the problems.


I love this article. I've seen this firsthand, having been the "fall guy" for an outage that nobody really understood. There were many causes for the outage, including poor software design, but the entirety of blame and punishment landed at the convenient "single place where a human touched something".


I agree - systems fail for many reasons at once. People like root cause analysis because it allows everyone to point the finger at someone else.

Of course the flip side is equally awful. When folk say "that's just how it works around here" you know you are doomed.


"systems fail for many reasons at once."

Disagree.

Now, I will allow that very complex systems can mask that root cause.

One might _never_ be able to find out the root 'why' for a number of reasons: lack of time, inability to see into the black box where the failure happened. Perhaps everyone is dead, the data you need destroyed, the widget is lost under the ocean.

And that sometimes the root cause failure isn't a hard technical thing but something squishy like 'we failed to budget for disk space' or 'the CIO insisted we do it that way'.

But there is _always_ a root cause.


Blamestorming is not root cause analysis.


So the root cause is not A, but A && (B || C) && D.

It is obvious to any programmer that a bug can depend on many factors happening at once, or in the "right" order, or on years of bad data accumulating, etc. That does not destroy causality, nor does it make analysis useless.
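To make the A && (B || C) && D idea concrete, here is a small sketch that treats an outage as a boolean function of four contributing conditions and asks, for a given incident, which single condition could have been removed to prevent it. The function names and the factor labels are hypothetical; this only illustrates the logic and is not a real analysis method.

```python
from itertools import product

FACTORS = ("A", "B", "C", "D")

def outage(a, b, c, d):
    """The outage happens only when the conditions combine this way."""
    return a and (b or c) and d

def failing_states():
    """Every combination of the four conditions that produces an outage."""
    return [s for s in product([False, True], repeat=4) if outage(*s)]

def single_fixes(state):
    """Factors that were present and whose removal ALONE prevents the outage."""
    fixes = []
    for i, name in enumerate(FACTORS):
        if state[i]:
            flipped = list(state)
            flipped[i] = False
            if not outage(*flipped):
                fixes.append(name)
    return fixes

print(len(failing_states()))                    # 3
print(single_fixes((True, True, True, True)))   # ['A', 'D']
print(single_fixes((True, True, False, True)))  # ['A', 'B', 'D']
```

In the second incident B counts as a cause, while in the first it does not (C would have covered for it), so which factor looks like "the" root cause depends on what else was true that day.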


Across all the really out-of-nowhere, longer-lasting outages I've been hit by, a clear pattern has emerged: it's always the "alley-oop" or "one-two punch combo" issues that really get you.


I get where he's coming from, but for most of these situations it's the complexity that's the root cause, and that's where you end up on your 5th why.


I agree with the point, but disagree with how the word "trigger" is being used here.

An identifiable "trigger" is usually present, but it's not the root cause; it's the last condition to be satisfied, the final event that "triggered" the Rube Goldberg machine that brought the system down.

Identifying that trigger can be valuable, because it sometimes points to a clear design problem, like an error in assumptions about how the system is used.


Sorites paradox; in complex systems - especially ones with feedback - knowing the tipping point trigger isn't very useful.

Traffic jams can be caused by density of traffic. A slight variation in speeds cascades through human delay in reaction, causing a longitudinal wave to ripple backwards through the traffic. If the amplitude of the wave gets high enough, sections of traffic will periodically reach a standstill while the wave works its way through.

There's no meaningful trigger here. Decomposing the whole into parts won't solve the problem. If you didn't know that traffic density causes jams, looking into the root cause would seem mysterious, because the chaotic behaviour that gives rise to the initial perturbation is essentially unimportant. It's the interaction between the parts that matters, not the "trigger".
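The density effect can be reproduced with a toy simulation. The sketch below uses the Nagel-Schreckenberg cellular automaton, a standard textbook traffic model swapped in here for illustration; it is not something the comment above refers to. Cars on a ring road accelerate, brake to avoid the car ahead, and occasionally hesitate at random (a crude stand-in for human reaction delay). All parameter values are arbitrary assumptions.

```python
import random

def mean_speed(n_cars, road_len=200, v_max=5, p_slow=0.3,
               steps=300, warmup=100, seed=0):
    """Average speed (cells per step) on a ring road after a warm-up period."""
    rng = random.Random(seed)
    pos = sorted(rng.sample(range(road_len), n_cars))  # distinct start cells
    vel = [0] * n_cars
    total, samples = 0, 0
    for t in range(steps):
        # Empty cells in front of each car, measured before anyone moves.
        gaps = [(pos[(i + 1) % n_cars] - pos[i] - 1) % road_len
                for i in range(n_cars)]
        for i in range(n_cars):
            v = min(vel[i] + 1, v_max)   # accelerate toward the speed limit
            v = min(v, gaps[i])          # brake so we never hit the car ahead
            if v > 0 and rng.random() < p_slow:
                v -= 1                   # random hesitation
            vel[i] = v
        pos = [(pos[i] + vel[i]) % road_len for i in range(n_cars)]
        if t >= warmup:
            total += sum(vel)
            samples += n_cars
    return total / samples

for n in (10, 50, 100):
    print(f"density {n / 200:.2f}: mean speed {mean_speed(n):.2f}")
```

Every car follows the same rules at every density, and no particular hesitation is special. Only the density changes, yet sparse traffic flows close to the speed limit while dense traffic keeps collapsing into stop-and-go waves.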

And even when you've "solved" this by building more and wider roads to spread the density, you find a different level of homeostasis; better transport infrastructure like roads encourages people to take more journeys, live further apart with more space, further away from work and play, leading to more traffic again.

Sometimes, when solving a problem, looking inwards, to parts, to triggers, to root causes, isn't the right approach; looking outwards, to the holistic whole, running experiments and simulations, creating new theories, is better. But this is a synthetic approach, not an analytic one driven by 5 Whys.



