By Jack Ganssle

Margin

Published 9/23/2005

The video made my jaw drop. Flight 292's nosegear was cocked sideways, twisted 90 degrees from its normal position. As the pilot began his approach I wanted to turn away but was glued to the screen in fascinated horror. The wheel touched down, smoked, burst into flame and the tire tore away, nothing but metal grinding along the runway.

Astonishingly, the strut held fast.

It also shows the human side of the mechanical world. This failure had apparently been experienced before by other Airbus 320s. Something went wrong with the system used by the air industry to eliminate known defects.

I'm struck by the difference between failures in mechanical systems and those in computer programs. Software is topsy-turvy. Mechanical engineers can beef up a strut to add margin, to handle unexpected loads. EEs specify components heftier than needed and wires that can take more current than anticipated. They handle surges with fuses and weak links.

In software if just one bit out of hundreds of millions is wrong the application completely crashes. Margin is difficult, perhaps impossible, to add. Exception handlers can serve as analogs to fuses, but they're notoriously hard to test and generally have a bug rate far higher than that of the application.

Worse, we write code with the assumption that everything will work and there won't be any unexpected inputs. So buffer overflows are rampant. This complacent attitude isn't exclusive to desktop developers; after a software error destroyed Ariane 5 the review board cited a culture that assumed software can't fail. If it works in test it will work forever.

A plane, bridge and dare I say levee must have a reliability vanishingly close to 100%. So mechanical engineers design a structure that takes 110% or 150% of expected loads.

Many software apps require just as much reliability. But we can't add margin, so must build code that's 99.999% correct or better.

Yet humans aren't good at perfection. In school a 90% is an "A". If our code earned an "A," a million line-of-code program would have 100,000 errors.

Software is inherently fragile. We can, and must, add great exception handlers and use the very best methods to produce correct code. But until we find a way to make code that is more robust than the environment it's in, the elusive goal of perfection is our only hope.

What do you think? How can we add design margin to code?