Latent Defects

Summary: Some numbers illustrate the cost of ignoring quality.

Capers Jones' Software Quality in 2011: A Survey of the State of the Art is chock full of interesting data. One can only despair at some of it.

Like most researchers, Mr. Jones prefers function points to lines of code. But few of us in the embedded space really understand function points. When I combine the data in this report with some other data that he sent me privately and convert it all to lines of C code, some patterns emerge.

First, some definitions:

KLOC is thousands of non-comment source lines of code.

Removal efficiency, sometimes called defect removal efficiency, is the percentage of the bugs injected into a project that we fix prior to shipping.

CMMI is the Capability Maturity Model Integration. It's one of the better-known development processes, though only a couple of percent of embedded outfits use it. CMMI defines five levels of process maturity for an organization, ranging from level 1 (ad hoc/chaos, which is typical for the industry) to level 5.

This table shows the typical number of bugs injected into projects and the removal efficiency at each CMMI level. The last column shows the number of delivered bugs per KLOC:

The good news is that embedded systems average a 95% removal efficiency, which is typical of CMMI level 3. However, bug injection rates are CMMI1-ish. So the average number of delivered bugs runs around 2.5/KLOC. (Other data pegs bug injection rates in the embedded space at between 50 and 100 per KLOC.)
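The arithmetic behind that 2.5/KLOC figure is simple enough to sketch. The snippet below is only an illustration, not Mr. Jones' method: the function name is made up for this example, and it assumes that delivered defects are simply the injected defects that removal did not catch.

```python
def delivered_per_kloc(injected_per_kloc: float, removal_efficiency: float) -> float:
    """Defects per KLOC that escape to the field.

    Assumption for this sketch: delivered = injected * (1 - removal efficiency).
    removal_efficiency is a fraction between 0.0 and 1.0, so 95% is 0.95.
    """
    if not 0.0 <= removal_efficiency <= 1.0:
        raise ValueError("removal_efficiency must be between 0.0 and 1.0")
    return injected_per_kloc * (1.0 - removal_efficiency)


# The 50 to 100 per KLOC injection range quoted above, at 95% removal efficiency.
for injected in (50, 100):
    rate = delivered_per_kloc(injected, 0.95)
    print(f"{injected} injected/KLOC -> {rate:.1f} delivered/KLOC")
```

At the low end of the quoted injection range this reproduces the 2.5/KLOC average; at the high end the delivered rate doubles.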

The following chart shows the number of delivered bugs versus project size for each of the CMMI levels:

Remember, these are defects we've delivered to the customer. A million-line project done with chaos (CMMI level 1) will surprise the user with 10K bugs.

But wait, there's more. Mr. Jones' data shows that about 1% of all of the delivered defects will result in catastrophic failure of the system. Another 20% of the bugs fall into the next most severe category, major defects.
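To put those percentages in concrete terms, here is a small sketch that splits a pile of delivered defects by severity. The function and dictionary keys are names invented for this example; the 1% and 20% fractions are the ones quoted above, and everything left over is lumped together as "other".

```python
def severity_breakdown(delivered_defects: int,
                       catastrophic_fraction: float = 0.01,
                       major_fraction: float = 0.20) -> dict:
    """Split delivered defects into the two severity classes named in the text.

    The default fractions come from the article; the 'other' bucket is just
    the remainder and is an assumption of this sketch.
    """
    catastrophic = round(delivered_defects * catastrophic_fraction)
    major = round(delivered_defects * major_fraction)
    return {
        "catastrophic": catastrophic,
        "major": major,
        "other": delivered_defects - catastrophic - major,
    }


# The million-line, CMMI1 project mentioned earlier: 10K delivered bugs.
print(severity_breakdown(10_000))
```

For the 10K-bug project this works out to about 100 catastrophic and 2,000 major defects sitting in the shipped code.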

How many of those will show up in, say, the first year of operation? That depends on the size of the project (a smaller percentage of the total bugs will appear in the first year on big projects than on little ones) and on how many users there are. A bigger user base means more of the latent defects pop up in year one. On a 100 KLOC product with a single user, 12% of the bugs will appear in the first 12 months. That jumps to 90% given 10 million users.

So I asked this question: what is the probability the system will experience a catastrophic failure in the first year? To keep the graph simple I assumed the best possible case: just one user. In reality, for most products, the results are far worse. The unhappy results follow:

The moral: ad hoc approaches essentially guarantee your customer will experience a complete system failure in the first year. Now, the severity will vary; infusion pump makers have a lot more at stake than a company that builds electronic greeting cards.
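One way to reason about that first-year probability is sketched below. This is not necessarily how the chart was produced: it assumes that catastrophic defects surface independently, so the count seen in year one follows a Poisson distribution, and it reuses the 1% catastrophic fraction and the 12% single-user, 100 KLOC first-year fraction quoted above. The function and parameter names are hypothetical.

```python
import math


def first_year_catastrophic_probability(kloc: float,
                                        delivered_per_kloc: float,
                                        catastrophic_fraction: float = 0.01,
                                        first_year_fraction: float = 0.12) -> float:
    """Probability of at least one catastrophic failure in the first year.

    Assumption (not from the source): failures arrive independently, so
    P(at least one) = 1 - exp(-expected count in year one).
    """
    delivered = kloc * delivered_per_kloc
    expected_in_year_one = delivered * catastrophic_fraction * first_year_fraction
    return 1.0 - math.exp(-expected_in_year_one)


# 100 KLOC, single user, at the 2.5/KLOC embedded average delivered rate.
p_single_user = first_year_catastrophic_probability(100, 2.5)
# Same product with the 90% first-year fraction quoted for 10 million users.
p_many_users = first_year_catastrophic_probability(100, 2.5, first_year_fraction=0.90)
print(f"single user: {p_single_user:.0%}, 10 million users: {p_many_users:.0%}")
```

The model is crude, but it shows why the probability climbs so quickly: it depends on the product of code size, delivered defect density and the fraction of bugs that surface, and each of those grows with sloppier processes or a bigger user base.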

Mr. Jones also publishes the cost to repair a defect found in the field, which is also a function of the system's size. He cites numbers from $132 to $938. I think those figures are very low considering the time needed to support the customer, work up a bug report, fix it, test and redeploy. But even using the $132 figure, a project with 10K bugs is going to cost over a megabuck to support. Or, looking at this a different way, going just from CMMI1 to CMMI2 saves about $800k in support for a 1 MLOC project.
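The support-cost claim is a straight multiplication, sketched here with a made-up function name. It assumes every delivered bug eventually gets repaired in the field at a flat per-defect cost; applying the $938 figure to the same project is purely for illustration, since the source says the repair cost varies with system size.

```python
def field_support_cost(delivered_defects: int, cost_per_defect: float) -> float:
    """Total cost of repairing delivered defects in the field.

    Assumption for this sketch: every delivered defect is eventually found
    and fixed, each at the same flat cost.
    """
    return delivered_defects * cost_per_defect


# The 10K-bug project at the lowest and highest per-defect figures cited.
low = field_support_cost(10_000, 132)
high = field_support_cost(10_000, 938)
print(f"${low:,.0f} to ${high:,.0f}")
```

Even the low end comes to $1.32 million, which is where the "over a megabuck" figure comes from.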

The CMMI is one of many disciplined development approaches, but there's little hard data for the others (with the exception of the Personal Software Process).

This is all data collected from projects and so is not necessarily predictive. However, it paints a broad picture, one we should not ignore.

Published May 23, 2012