My friend Nick was a quality manager at a company that is respected for its superior software development and quality practices. Nick once said to me, “We only found two major defects in our latest code inspection, but we expected to find between four and six. We’re trying to figure out what’s going on. Did we miss some, or was this code particularly clean for some reason?”
Few organizations can make such precise statements about the quality of their software products. Nick’s organization has stable, repeatable development processes in place, and they’ve accumulated inspection data for several years. Analyzing historical data lets Nick predict the likely defect density in a given deliverable. When a specific inspection’s results depart significantly from the norm, Nick can ask probing questions to understand why. Did the inspectors prepare at the optimum rate, based on the organization’s experience? Did they use suitable analysis techniques? Were they adequately trained and experienced in inspection? Was the author more or less experienced than average? Was the product more or less complex than average? You can’t reach this depth of understanding without data.
This is the first in a series of three articles, adapted from my book Peer Reviews in Software (Addison-Wesley, 2002), that describe how to collect, analyze and interpret metrics from your peer reviews. It doesn’t take as much effort as you might fear to gather and use this kind of data. It’s more a matter of establishing a bit of infrastructure to store the data, and then making it a habit for review participants to record just a few numbers from each review experience. In fact, I think that peer review metrics provide an easy way to begin growing a measurement culture in your organization.
Why Collect Data?
Recording data about the review process and product quality is a distinguishing characteristic of formal peer reviews, such as the type of rigorous peer reviews called inspections. Data answers important questions, provides quantifiable insights and historical perspective, and lets you base decisions on facts instead of perceptions, memories or opinions.
For example, one organization learned that it could inspect requirements specifications written by experienced business analysts twice as fast as those written by novices because they contained fewer defects. This data revealed the need to train and mentor novice BAs. Another organization improved its development process by studying data on defect injection rates and the types of defects their inspections did not catch. This example illustrates the value of recording the life-cycle activities during which each defect is created and discovered.
One way to choose appropriate metrics is the Goal-Question-Metric, or GQM, technique. First, state your business or technical goals. Next, identify questions you need to answer to tell if you are reaching those goals. Finally, select metrics that will let you answer those questions. One goal might be to reduce your rework costs through peer reviews. Answers to the following questions could help you judge whether you’re reaching that worthy goal:
- What percentage of each project’s development effort is spent on rework?
- How much effort do our reviews consume? How much do they save?
- How many defects do we discover by review? What kind? How severe? At what life-cycle stage?
- What percentage of the defects in our products do our reviews remove?
- Do we spend less time testing, debugging, and maintaining products that we reviewed than those we did not?
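Answers to several of these questions reduce to simple arithmetic on project totals. The sketch below shows two such calculations; all the variable names and sample numbers are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical project totals; the names and numbers are illustrative,
# not drawn from any real project in the article.
total_effort_hours = 2000.0
rework_hours = 320.0
defects_found_in_reviews = 40
defects_found_later = 10   # found in testing or operation

# What percentage of the project's development effort is spent on rework?
rework_pct = 100 * rework_hours / total_effort_hours

# What percentage of the known defects did our reviews remove?
# (This can only be computed retrospectively, once later-phase
# defect counts are available.)
effectiveness_pct = 100 * defects_found_in_reviews / (
    defects_found_in_reviews + defects_found_later)

print(f"rework: {rework_pct:.0f}% of project effort")
print(f"review effectiveness: {effectiveness_pct:.0f}%")
```

With these sample figures, rework consumes 16% of project effort and reviews catch 80% of the known defects.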
Figure 1 (below) shows a progression of benefits that peer review measurements can provide to your organization. The individual base metrics you begin collecting from each review don’t tell you much by themselves. Tracking some derived metrics calculated from those base metrics—often as simple sums or ratios—reveals averages and trends of your team’s preparation and inspection rates, defect densities, inspection effectiveness, and other parameters.
These trends help you detect anomalous inspection results, provided that your development and inspection processes are consistent. They also help you estimate how many additional defects you can expect to find in the remaining life-cycle phases or in operation. Correlations between pairs of metrics help you understand, say, how increasing the preparation rate affects inspection efficiency. For the maximum rewards, use defect causal analysis and statistical process control to guide improvements in your development processes and quality management methods. It takes a while to reach this stage of peer review measurement sophistication.
Figure 1. The value of various peer review measurement analyses
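As the text notes, derived metrics are often just sums or ratios of the base metrics recorded at each review. A minimal sketch of that calculation, using hypothetical field names and sample values:

```python
# Hypothetical base metrics recorded for one inspection; the field
# names and numbers are illustrative only.
inspection = {
    "size_loc": 400,        # size of the work product, in lines of code
    "prep_hours": 6.0,      # total preparation effort, person-hours
    "meeting_hours": 2.0,   # total inspection meeting effort, person-hours
    "defects_found": 5,     # major defects discovered
}

# Derived metrics are simple ratios of the base metrics.
defect_density = inspection["defects_found"] / (inspection["size_loc"] / 1000)
prep_rate = inspection["size_loc"] / inspection["prep_hours"]
effort_per_defect = (
    inspection["prep_hours"] + inspection["meeting_hours"]
) / inspection["defects_found"]

print(f"defect density: {defect_density:.1f} defects/KLOC")
print(f"preparation rate: {prep_rate:.0f} LOC per preparation hour")
print(f"effort per defect found: {effort_per_defect:.1f} hours")
```

Accumulating these derived values across many inspections is what reveals the averages and trends shown in Figure 1.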
Don’t make your measurement process so elaborate that it inhibits the reviews themselves. Establishing a peer review culture and finding defects are more important than meticulously recording masses of data. One of my groups routinely conducted various types of peer reviews, recorded numerous base metrics, and tracked several derived metrics. The entire team recognized the value we obtained from the reviews; hence, they became an ingrained component of our software engineering culture.
Some Measurement Caveats
Software measurement is a sensitive subject. It’s important to be honest and nonjudgmental about metrics. Data is neither good nor bad, so a manager must neither reward nor punish individuals for their metrics results. The first time a team member is penalized for some data he reported is the last time that person will submit accurate data. Defects found prior to peer review should remain private to the author. Information about defects found in a specific peer review should be shared only with the project team, not with its managers. You can aggregate the data from multiple reviews to monitor averages and trends in your peer review process without compromising the privacy of individual authors. The project manager should share aggregated data with the rest of the team so they see the insights the data can provide and recognize the peer review benefits.
Beware the phenomenon known as measurement dysfunction. Measurement dysfunction arises when the measurement process or the ways in which managers use the data lead to counterproductive behaviors by the people providing the data. People behave in the ways for which they are rewarded; they usually avoid behaviors that could have unpleasant consequences. Some forms of measurement dysfunction that can arise from peer reviews are: inflating or deflating defect severities; marking as closed defects that really aren’t resolved; and distorting defect densities, preparation times, and defect discovery rates to look more favorable.
There’s a natural tension between a work product author’s desire to create defect-free products and the reviewers’ desire to find lots of bugs. Evaluating either authors or reviewers according to the number of defects found during a review will lead to conflict. If you rate reviewers based on how many defects they find, they’ll report many defects, even if it means arguing with the author about whether every small issue truly is a defect. It’s not necessary to know who identified each defect or to count how many each reviewer found. What is important is that all team members participate constructively in peer reviews. Help managers avoid the temptation to misuse the data for individual performance evaluation by not making individual defect data available to them.
It’s tempting to overanalyze the review data. Avoid trying to draw significant conclusions from data collected shortly after launching your peer review program. There’s a definite learning curve as software people begin participating in systematic reviews and figure out how to do them effectively and constructively. If you begin tracking a new metric, give it time to stabilize and make sure you’re getting reliable data before jumping to any conclusions. The trends you observe are more significant than any single data point.
Now that we have a basic foundation of peer review metrics principles, the next two articles in the series will get into some specific metrics to track and how to analyze the data.