
Digging Down to the Root Cause


When things are going well inside a software project, there is a grace and harmony about it. One part does one thing well, allowing another part to do what it must do without unnecessary friction or waste of effort. Like a mechanical clock, the parts working together reinforce each other’s efforts.

But note that I said “when.” Things rarely work together as simply as one hopes at design time. In fact, when you start up something complex for the first time, a failure is more common than correct functioning.

While it may be emotionally satisfying for a manager to “point a finger” at someone (or something) and blame them for a failure, doing so without fixing the core issues likely won’t solve your problem. The root cause of the situation has to be found; the easy blaming that can go on within a team is no substitute.

Analysis of the root cause of a problem is a discipline, not a witch hunt, and it should never be undertaken without accepting that as the baseline. Anything else ensures that the team involved will subvert or divert the entire analytical effort out of simple self-preservation.

A Root Cause Diagram

In 1968, Kaoru Ishikawa came up with the diagram that bears his name while he was working in the Kawasaki shipyards. The tool has since spread to many industries beyond manufacturing because it gives a simple visual way to link the relationships among causal factors.

The diagram looks like a fish skeleton viewed from the side, which gives it the name “fishbone diagram.” Groups of causal factors (all of which must be validated as being involved in the specific problem) are arranged around a central line. Sub-causes are linked to the major causes in the same grouping.

The “Fishikawa” diagram is more a roadmap than a hierarchy. Each causal factor flows into another, with no inherent ranking of major or minor effects. The diagram works with the problem as a whole, and the objective is to have every causal factor laid out on it.
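The structure described above can be sketched as a small data model. This is only an illustration: the class names `Cause` and `Fishbone`, the category labels, and the example problem are all hypothetical and not part of any standard tool. Note that the model groups causes without ranking them, in keeping with the roadmap idea.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cause:
    """One bone of the diagram: a validated causal factor and its sub-causes."""
    name: str
    sub_causes: list[Cause] = field(default_factory=list)


@dataclass
class Fishbone:
    """The problem (the head of the fish) plus causes grouped by category."""
    problem: str
    categories: dict[str, list[Cause]] = field(default_factory=dict)

    def add(self, category: str, cause: Cause) -> None:
        # Causes are appended in the order they are found; no ranking is implied.
        self.categories.setdefault(category, []).append(cause)

    def render(self) -> str:
        lines = [f"Problem: {self.problem}"]
        for category, causes in self.categories.items():
            lines.append(f"  {category}")
            for cause in causes:
                lines.extend(_render_cause(cause, depth=2))
        return "\n".join(lines)


def _render_cause(cause: Cause, depth: int) -> list[str]:
    # Indent two spaces per level so sub-causes hang off their parent cause.
    lines = [f"{'  ' * depth}- {cause.name}"]
    for sub in cause.sub_causes:
        lines.extend(_render_cause(sub, depth + 1))
    return lines


# Hypothetical example problem and causes, for illustration only.
diagram = Fishbone("Checkout API returns intermittent 500 errors")
diagram.add("Code", Cause("Unhandled timeout", [Cause("No retry on database call")]))
diagram.add("Infrastructure", Cause("Connection pool exhausted"))
print(diagram.render())
```

A plain-text rendering like this is enough to review the groupings with a team before anyone draws the actual diagram.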

RCA at Work

Root Cause Analysis (RCA) aims to prevent the recurrence of a problem by taking corrective action after the fact. To take corrective action, the problem must be understood well enough to clearly identify its causes. And that is not as easy as it first seems.

The actual analysis is mostly determined by the problem’s area of operation. Engineering problems are analyzed with very different kinds of probable causes than a problem that exists in marketing. Both may use the Fishikawa diagram in the analysis, but the kinds of causes shown in the diagram will differ greatly.

But no matter the specific problem area, the conclusions of the analysis have to be demonstrable from the evidence present. This is an evidence-based system, not interpretive crystal-ball gazing.

There may be more than one root cause for a problem. Determining whether this is the case can take much analytical effort on the team’s part, but it is worth the effort put into the analysis. Identifying all the possible solutions for each root cause will help prevent recurrence of the problem at the lowest cost and in the simplest way. If there are alternatives that are equally effective, then the simplest or lowest-cost approach is generally preferred.

Automating RCA

Inductive analysis is different from just working with facts after a failure. It’s an attempt to prevent problems from happening in the first place. Failure Mode and Effects Analysis (FMEA) is just such an attempt: it is forward-looking (inductive) where RCA is reactive. Each component’s failure modes are analyzed for their effect on the entire system. FMEAs can be performed at the system, subsystem, assembly, subassembly or part level. The FMEA should be a dynamic and open document during development of a design. It should be scheduled and completed concurrently with the design.

This is a critical aspect of FMEA. If the analysis is done at the wrong time in the design cycle (after the software or hardware is locked down, for example) it is useless. Identified failures can only be avoided if they are eliminated or minimized through design changes at the earliest possible point in development.
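As a rough sketch of what an FMEA worksheet might look like in code, the example below lists failure modes per component and sorts them by a risk priority number (RPN). The RPN (severity times occurrence times detection, each scored 1 to 10) is a common FMEA convention that this article does not itself describe, so treat it as an assumption. The components, failure modes, and scores are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One worksheet row: how a component can fail and how bad that is."""
    component: str
    mode: str
    effect: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (almost certain)
    detection: int   # 1 (always caught) .. 10 (effectively undetectable)

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 1-10 scale.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk priority number: a higher value means address it sooner."""
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes: list[FailureMode]) -> list[FailureMode]:
    """Return the failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical worksheet rows, for illustration only.
worksheet = [
    FailureMode("auth service", "token cache stale", "users logged out", 6, 4, 3),
    FailureMode("payment gateway", "request timeout", "orders fail", 9, 3, 5),
    FailureMode("search index", "replica lag", "stale results", 3, 6, 2),
]

for fm in rank_by_risk(worksheet):
    print(f"{fm.rpn:4d}  {fm.component}: {fm.mode} -> {fm.effect}")
```

Because the worksheet is just data, it can be kept under version control alongside the design and updated as the design changes, which fits the idea of the FMEA as a living document.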

FMEA has found a home in software-based monitoring solutions, especially those that work with complex sites and networks. The raw, hard work of fault analysis can be eased with these kinds of automated solutions. A fault will have to occur at least once for an automated RCA to know it exists (FMEA is inductive, but not enough so to predict non-obvious fault patterns), but once that fault happens the automated RCA can be of great use in tracing where the causes of the problem originate.

RCA is a tool that must be used along with other specific tools in order to have a beneficial result. Many situations will require manual fault tree analysis (FTA), a top-down deductive approach. In FTA, there should be only one Top Event, and all concerns must tree down from it. Each situation that could cause that effect is then added to the tree as a series of logic expressions. If one can put probabilities of occurrence next to each part of the tree, FTA can be used by software to do fault probability analysis on large systems. Event trees are similar, but are used more to assess the consequences of failure along an event timeline.

RCA is a designer’s friend, but it is just as useful to someone monitoring operations. Only by looking at what causes a fault can it ever be corrected.
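The probability calculation on a fault tree can be sketched as follows. This is a minimal illustration under two stated assumptions: basic events are statistically independent, and each basic event appears only once in the tree. Under those assumptions an AND gate multiplies its input probabilities and an OR gate computes one minus the product of the complements. The names `BasicEvent`, `Gate`, and `probability`, and the example figures, are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import prod
from typing import Union


@dataclass(frozen=True)
class BasicEvent:
    """A leaf of the tree: a fault with an estimated probability of occurrence."""
    name: str
    probability: float


@dataclass(frozen=True)
class Gate:
    """A logic expression combining lower-level events ("AND" or "OR")."""
    name: str
    kind: str
    inputs: tuple[Union[Gate, BasicEvent], ...]


def probability(node: Union[Gate, BasicEvent]) -> float:
    """Probability that the event at this node occurs.

    Assumes all basic events are independent and none is repeated in the tree.
    """
    if isinstance(node, BasicEvent):
        return node.probability
    child_probs = [probability(child) for child in node.inputs]
    if node.kind == "AND":
        # Every input must occur.
        return prod(child_probs)
    if node.kind == "OR":
        # At least one input occurs: complement of "none of them occurs".
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown gate kind: {node.kind!r}")


# Hypothetical tree with a single Top Event, for illustration only.
top_event = Gate("API unavailable", "OR", (
    Gate("Both app servers down", "AND", (
        BasicEvent("Server A down", 0.02),
        BasicEvent("Server B down", 0.02),
    )),
    BasicEvent("Database down", 0.001),
))

print(f"P({top_event.name}) = {probability(top_event):.7f}")
```

Real fault trees often share basic events between branches, in which case this simple recursion overstates or understates the result and a method based on minimal cut sets would be needed instead.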
