By Jack Ganssle

Margin

Published 9/23/2005

The video made my jaw drop. JetBlue Flight 292's nosegear was cocked sideways, twisted 90 degrees from its normal position. As the pilot began his approach, I wanted to turn away but was glued to the screen in fascinated horror. The wheel touched down, smoked and burst into flame, and the tire tore away, leaving nothing but metal grinding along the runway.

Astonishingly, the strut held fast.

Those seconds summed up a lot of the nature of engineering. The strut held fast under loads that far exceed anything the plane experiences in normal operation. Engineers designed enough margin into the system to handle almost unimaginable and unanticipated forces.

It also shows the human side of the mechanical world. This failure had apparently been experienced before by other Airbus A320s. Something went wrong with the system the aviation industry uses to eliminate known defects.

I'm struck by the difference between failures in mechanical systems and those in computer programs. Software is topsy-turvy. Mechanical engineers can beef up a strut to add margin, to handle unexpected loads. EEs specify components heftier than needed and wires that can take more current than anticipated. They handle surges with fuses and weak links.

In software, if just one bit out of hundreds of millions is wrong, the application can crash completely. Margin is difficult, perhaps impossible, to add. Exception handlers can serve as analogs to fuses, but they're notoriously hard to test and generally have a bug rate far higher than that of the rest of the application.
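To make the fuse analogy concrete, here is a minimal sketch in C of a latching range check. The names `sw_fuse_t` and `sw_fuse_pass` are hypothetical, invented for this illustration, and the sketch assumes a non-negative rating. It is not real margin: the check is itself code, and it is exactly the sort of rarely exercised path that is hard to test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "software fuse": latches open once a demand exceeds its rating.
 * Assumes rating >= 0. */
typedef struct {
    int32_t rating;  /* largest magnitude allowed through */
    bool    blown;   /* latched true once the rating is exceeded */
} sw_fuse_t;

/* Pass `demand` through unchanged while it stays within the rating.
 * Once the rating is exceeded, latch and return `safe_value` on this
 * and every later call, until the caller clears `blown`. */
int32_t sw_fuse_pass(sw_fuse_t *fuse, int32_t demand, int32_t safe_value)
{
    if (!fuse->blown && (demand > fuse->rating || demand < -fuse->rating)) {
        fuse->blown = true;
    }
    return fuse->blown ? safe_value : demand;
}
```

Like a real fuse, it fails toward a known safe state rather than passing the surge along, and it stays open until someone deliberately resets it.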

Worse, we write code with the assumption that everything will work and there won't be any unexpected inputs. So buffer overflows are rampant. This complacent attitude isn't exclusive to desktop developers; after a software error destroyed the first Ariane 5, the review board cited a culture that assumed software can't fail. If it works in test, it will work forever.
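The opposite habit is to assume the input will be wrong. As a sketch only, with the hypothetical name `copy_bounded`, a string copy in C can refuse anything that does not fit instead of trusting the caller:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: copy the NUL-terminated string `src` into `dst`
 * only if it fits, terminator included. On failure nothing is written
 * past dst[dst_size - 1], and `dst` is left as an empty string when it
 * has room for one. */
bool copy_bounded(char *dst, size_t dst_size, const char *src)
{
    if (dst == NULL || src == NULL || dst_size == 0) {
        return false;
    }
    for (size_t i = 0; i < dst_size; i++) {
        dst[i] = src[i];
        if (src[i] == '\0') {
            return true;   /* whole string, including the NUL, fit */
        }
    }
    dst[0] = '\0';         /* too long: reject rather than truncate */
    return false;
}
```

Rejecting oversized input, rather than silently truncating it, is a design choice here; it forces the caller to decide what an unexpected input means.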

A plane, a bridge and, dare I say, a levee must have a reliability vanishingly close to 100%. So mechanical engineers design a structure that takes 110% or 150% of expected loads.

Many software apps require just as much reliability. But we can't add margin, so we must build code that's 99.999% correct or better.

Yet humans aren't good at perfection. In school a 90% is an "A". If our code earned an "A," a million-line program would have 100,000 errors.
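The arithmetic is easy to check. The sketch below uses a hypothetical function, `residual_errors`, and assumes at most one error per line, with correctness expressed in lines per million to keep the math in integers:

```c
#include <stdint.h>

/* Hypothetical helper: errors left in `lines` lines of code when
 * `correct_ppm` lines out of every million are right.
 * Assumes at most one error per line and correct_ppm <= 1000000. */
uint64_t residual_errors(uint64_t lines, uint32_t correct_ppm)
{
    return lines * (1000000u - correct_ppm) / 1000000u;
}
```

For a million lines, `residual_errors(1000000, 900000)` gives the 100,000 errors of a 90% grade, while `residual_errors(1000000, 999990)` shows that even 99.999% correct still leaves 10.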

Software is inherently fragile. We can, and must, add great exception handlers and use the very best methods to produce correct code. But until we find a way to make code that is more robust than the environment it's in, the elusive goal of perfection is our only hope.

What do you think? How can we add design margin to code?