Published in Embedded Systems Programming December 2004
To gear heads like me the history of engineering is rich in
stories and lore, of failings and successes, and of triumphs and defeats of
individual engineers. I remember reading Michener's The Source in high
school and being entranced by his description of how the engineers of Megiddo,
near Jerusalem, dug a tunnel 210 feet long some 2900 years ago. The city was
under siege and its well was located outside the city walls. With uncanny skill
they bored under the walls, secretly, navigating with only the crudest of
instruments, yet somehow targeting the narrow fount perfectly.
Even the humblest of artifacts of technology have
fascinating stories. Friends still make fun of my reading Henry Petroski's 400
page book titled The Pencil. Yet even this simplest of all writing
devices sports a complex and fascinating history, one of engineers and artisans
optimizing materials and designs to give users an efficient writing instrument.
Then there's James Chiles' Inviting Disaster, a
page-turner of engineering failures, from bridge collapses, airline crashes,
and offshore oil platform sinkings to, horrifyingly, near nuke exchanges. Strangely,
Chiles doesn't describe the famous loss of the Tacoma Narrows Bridge, which
succumbed to wind-induced torsional flutter. The bridge earned the nickname
"Galloping Gertie" from its rolling, undulating behavior. Motorists crossing the
2,800-foot center span sometimes felt as though they were traveling on a giant
roller coaster, watching the cars ahead disappear completely for a few moments
as if they had been dropped into the trough of a large wave.
Failures can be successes. When an aircraft goes down the
NTSB sends investigators to determine the cause of the accident. Changes are
made to the plane's design, maintenance, or training procedures. This healthy
feedback loop constantly improves the safety of air travel, to the point where
now it's less dangerous to fly than to walk. That's plenty strange when you
consider the complexity of such a machine. 400,000 pounds of aluminum traveling
at 600 knots 40,000 feet up, in air that's 60 below zero, with turbines rotating
at 10,000 RPM. It's astonishing the thing works at all.
Yet the concept of applying feedback, lessons learned, is
relatively new. Those behind the Tacoma Narrows Bridge certainly ignored all of
the lessons of bridge-building.
Clark Eldridge, the State Highway Department's lead
engineer for the project, developed the bridge's original design. But federal
authorities footed 45% of the bill and required Washington State to hire an
outside, and more prominent, consultant. Leon Moisseiff promised that his design
would cut the bridge's estimated cost in half.
Similar structures built around the same time were
expensive. At $59 million and $35 million respectively, the George Washington
and Golden Gate bridges had spans similar to that of the Tacoma Narrows.
Moisseiff's new design cost a bit over $6 million, clearly a huge savings.
Except it fell down 4 months after opening day.
Moisseiff and others claimed that the wind-induced
torsional flutter which led to the collapse was a new phenomenon, one never seen
in civil engineering before. They seem to have forgotten the Dryburgh Abbey
Bridge in Scotland which collapsed in 1818 for the same reason. Or the 1850
failure of the Basse-Chaine Bridge, a similar loss in 1854 of the Wheeling
Suspension Bridge, and many others. All due to torsional flutter.
Then there was the 1939 Bronx-Whitestone Bridge, a sister
design to Tacoma Narrows, which suffered the same problem but was stiffened by
plate girders before a collapse.
And who designed the Bronx-Whitestone? Leon Moisseiff.
Lessons had been learned, but criminally forgotten. Today
the legacy of the Tacoma Narrows failure lives on in regulations which require
all federally-funded bridges to pass wind tunnel tests designed to detect
torsional flutter.
In the firmware world we, too, have our share of disasters.
Most were underreported; few developers understand the proximate causes and the
lessons that need to be learned. The history of embedded failures shows patterns
we should - must! - identify and eliminate.
Consider the Mars Polar Lander, a 1999 triple failure. The
MPL's goal was to deliver a lander on Mars for half the cost of the
spectacularly successful Pathfinder mission launched two years earlier. At $265
million Pathfinder itself was much cheaper than earlier planetary spacecraft.
Shortly before it began its descent, the spacecraft
released twin Deep Space 2 probes which were supposed to impact the planet's
surface at some 400 MPH and return sub-strata data.
MPL crashed catastrophically. Neither DS2 probe transmitted
even a squeak.
The investigation board made the not-terribly-earth-shaking
observation that tired people make mistakes. The contractor used excessive
overtime to meet an ambitious schedule. Mars is tough on schedules. Slip
by just one day past the end of the launch window and the mission must idle for
two years. In some businesses we can dicker with the boss over the due date, but
you just can't negotiate with planetary geometries.
MPL workers averaged 60 to 80 hours per week for extended
periods of time.
The board cited poor testing. Analysis and modeling
substituted for test and validation. There's nothing wrong with analysis, but
testing is like double-entry bookkeeping - it finds modeling errors and other
strange behavior never anticipated when the product exists only as ethereal
NASA's mantra is to test like you fly, fly what you
tested. Yet no impact test of a running, powered, DS2 system ever occurred.
Though planned, these were deleted midway through the project due to schedule
considerations. Two possible reasons were found for Deep Space 2's twin flops:
electronics failure in the high-g impact, and ionization around the antenna
after the impacts. Strangely, the antenna was never tested in a simulation of
Mars' 6 torr atmosphere.
While the DS2 probes were slamming into the Red Planet
things weren't going much better on MPL. The investigation board believes the
landing legs deployed when the spacecraft was 1500 meters high, as designed.
Three sensors, one per leg, signal a successful touchdown, causing the code to
turn the descent engine off. Engineers knew that when the legs deployed these
sensors could experience a transient, giving a false "down" reading, but somehow
forgot to inform the firmware people. The glitch was latched; at 40 meters
altitude the code started looking at the data, saw the false readings, and
faithfully switched off the engine.
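One defensive pattern for this kind of latched transient can be sketched in a few lines of C: ignore the sensors entirely until the altitude where the code starts to care, throw away anything seen before that point, and demand more than one fresh "down" sample before cutting the engine. This is only an illustration. Every name here (touchdown_monitor_t, touchdown_update, CONFIRM_SAMPLES) and the two-sample threshold are assumptions of mine, not the MPL flight code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LEGS          3     /* one touchdown sensor per leg (from the text) */
#define ENABLE_ALTITUDE_M 40.0  /* altitude where the code starts looking (from the text) */
#define CONFIRM_SAMPLES   2     /* assumption: consecutive "down" samples required */

typedef struct {
    bool    armed;                  /* true once we are at or below 40 m */
    uint8_t down_count[NUM_LEGS];   /* consecutive "down" samples per leg */
} touchdown_monitor_t;

void touchdown_init(touchdown_monitor_t *m)
{
    assert(m != NULL);
    m->armed = false;
    for (size_t i = 0; i < NUM_LEGS; i++) {
        m->down_count[i] = 0;
    }
}

/* Call once per control cycle with the current (not latched) sensor samples.
 * Returns true when the descent engine should be shut down. */
bool touchdown_update(touchdown_monitor_t *m, double altitude_m,
                      const bool raw_down[NUM_LEGS])
{
    assert(m != NULL && raw_down != NULL);

    if (!m->armed) {
        if (altitude_m > ENABLE_ALTITUDE_M) {
            return false;           /* too high: ignore leg-deployment transients */
        }
        m->armed = true;
        for (size_t i = 0; i < NUM_LEGS; i++) {
            m->down_count[i] = 0;   /* discard anything seen before arming */
        }
    }

    bool touchdown = false;
    for (size_t i = 0; i < NUM_LEGS; i++) {
        if (raw_down[i]) {
            if (m->down_count[i] < UINT8_MAX) {
                m->down_count[i]++;
            }
        } else {
            m->down_count[i] = 0;   /* a single glitch never accumulates */
        }
        if (m->down_count[i] >= CONFIRM_SAMPLES) {
            touchdown = true;
        }
    }
    return touchdown;
}
```

The design choice worth noticing is that the decision uses only samples taken after arming. A flag latched at 1500 meters has no path into the shutdown logic.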
A pre-launch system test failed to detect the problem
because the sensors were miswired. After correcting the wiring error the test
was never repeated.
Then there are the twin Mars Exploration Rovers, Spirit and
Opportunity, which at this writing have surpassed all mission goals and continue
to function. We all heard about Spirit's dispiriting shutdown when it tried to
grind a rock. Most of us know that the flash file system directory structure was
full. VxWorks tossed an exception, exactly as it should have, and tried to
reboot. But that required more directory space, causing another exception,
another reboot, repeating forever.
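A common guard against an endless reboot loop is a boot counter kept in memory that survives a reset. The C sketch below shows the idea under stated assumptions: the names (persistent_t, boot_select_mode, boot_mark_healthy), the threshold of three, and the notion of a "minimal" boot that skips the step that keeps faulting are all hypothetical and are not taken from the rovers' flight software.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FAILED_BOOTS 3u   /* assumption: give up on a normal boot after 3 tries */

typedef enum { BOOT_NORMAL, BOOT_MINIMAL } boot_mode_t;

/* Assumed to live in storage that survives a reset (battery-backed RAM,
 * EEPROM and so on). Here it is just a struct the caller owns. */
typedef struct {
    uint32_t failed_boots;
} persistent_t;

/* Call very early in every boot, before anything that might fault. */
boot_mode_t boot_select_mode(persistent_t *p)
{
    assert(p != NULL);
    if (p->failed_boots >= MAX_FAILED_BOOTS) {
        return BOOT_MINIMAL;  /* e.g. come up without mounting the file system */
    }
    p->failed_boots++;        /* assume this boot fails until proven otherwise */
    return BOOT_NORMAL;
}

/* Call once the system has run normally long enough to be trusted. */
void boot_mark_healthy(persistent_t *p)
{
    assert(p != NULL);
    p->failed_boots = 0;
}
```

Counting the boot as failed up front matters: if the crash happens before any later bookkeeping can run, the counter has already recorded it.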
Just as in unlamented DOS, deleted files still consumed
directory space. A lot of old files accumulated on the coast phase to Mars still
Originally planned as a 90 day mission, the spacecraft were
never tested for more than 9 days. In-flight operation of motors and actuators
generated far more files than ever seen during the ground tests. The
investigators wrote: "Although there was limited long duration testing whose
purpose was to identify system memory consumption of this type, no problems were
detected because the system was not exercised in the same way that it would
later be used in flight."
Test like you fly, fly what you tested.
Exception handlers were poorly implemented. They suspended
critical tasks after a memory allocation failure instead of placing the system
in a low-functionality safe mode.
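What "safe mode instead of suspension" might look like can be sketched in C. The example assumes a tiny fixed pool in place of a real allocator so that exhaustion is easy to provoke, and every name in it (pool_alloc, store_record, system_mode) is hypothetical. It does not use or imitate the VxWorks API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define POOL_BYTES 64   /* tiny on purpose so exhaustion is easy to provoke */

typedef enum { MODE_NOMINAL, MODE_SAFE } system_mode_t;

static unsigned char pool[POOL_BYTES];
static size_t        pool_used;
static system_mode_t mode = MODE_NOMINAL;

system_mode_t system_mode(void)
{
    return mode;
}

/* Trivial bump allocator over the fixed pool; returns NULL when full. */
static unsigned char *pool_alloc(size_t n)
{
    if (n > POOL_BYTES - pool_used) {
        return NULL;
    }
    unsigned char *p = &pool[pool_used];
    pool_used += n;
    return p;
}

/* Hypothetical hook: a real system would shed non-essential work here and
 * keep only communications and power management alive. */
static void enter_safe_mode(void)
{
    mode = MODE_SAFE;
}

/* Store a record. On allocation failure the system drops into safe mode
 * and keeps running, rather than suspending the calling task. */
bool store_record(const unsigned char *data, size_t len)
{
    assert(data != NULL);
    if (mode == MODE_SAFE) {
        return false;           /* non-essential work is refused in safe mode */
    }
    unsigned char *slot = pool_alloc(len);
    if (slot == NULL) {
        enter_safe_mode();
        return false;
    }
    memcpy(slot, data, len);
    return true;
}
```

The caller always gets an answer back, and the rest of the system can poll system_mode() to decide what to keep running.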
A source at NASA tells me the same VxWorks memory
allocation failure has caused software crashes on at least 6 other missions. The
OS isn't at fault, but it is a big and complex chunk of code. In all cases the
engineers used VxWorks incorrectly. We seem unable to learn from other people's
disasters. We're allowed to make a mistake - once. Repeating the same mistake
over and over is a form of insanity.
It's easy to blame the engineers, but they diagnosed this
difficult problem using a debugger 100 million miles away from the target
system, found the problem, and uploaded a fix. Those folk rock.
In 1999 a Titan IVB (this is a really big rocket)
blasted off the pad, bound to geosynchronous orbit with a military
communications satellite aboard. Nine minutes into the flight the first stage
shut down and separated properly. The Centaur second stage ignited and
experienced instabilities about the roll axis. That coupled into both yaw and
pitch deviations until the vehicle tumbled. Computers compensated by firing the
reaction control system thrusters until they ran out of fuel. The Milstar
spacecraft wound up in a useless low elliptical orbit.
A number of crucial constants specified the launcher's
flight behavior. That file wasn't managed by a version control system and was
lost. An engineer modified a similar file to recreate the data but entered one
parameter as -0.1992476 instead of the correct -1.992476. That was it - that one
little slipup cost taxpayers a billion dollars. At least there's plenty more
money where that came from.
We all know to protect important files with a VCS - right?
Astonishingly, in 1999 a disgruntled programmer left the FAA, deleting all of
the software needed for en-route control of planes between Chicago O'Hare and
the regional airports. He encrypted it on his home computer. The feds busted
him, of course, but FBI forensics took 6 months to decrypt the key.
Everyone makes mistakes, but no one on the Centaur program
checked the engineer's work. For nearly 30 years we've known that inspections
and design reviews are the most powerful techniques known to prevent errors.
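A human review is the real fix, but a cheap automated cross-check can back it up. The C sketch below compares each re-entered constant against a reference value and counts any that deviate by more than a tolerance. It assumes a trusted reference exists (say, values from a similar earlier file); the struct, the function name, the parameter name "param_a" and the 5% tolerance are all hypothetical. Only the two numbers come from the story above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;       /* label for reporting */
    double      reference;  /* trusted value, e.g. from an earlier flight */
    double      value;      /* value in the recreated file */
} flight_constant_t;

static double abs_d(double x)
{
    return x < 0.0 ? -x : x;
}

/* Returns how many constants differ from their reference by more than
 * max_rel_dev (a fraction, so 0.05 means 5%). A zero reference must be
 * matched exactly. */
size_t count_suspect_constants(const flight_constant_t *c, size_t n,
                               double max_rel_dev)
{
    assert(c != NULL || n == 0);
    size_t suspects = 0;
    for (size_t i = 0; i < n; i++) {
        double scale = abs_d(c[i].reference);
        double dev   = abs_d(c[i].value - c[i].reference);
        if (scale == 0.0) {
            if (dev > 0.0) {
                suspects++;
            }
        } else if (dev / scale > max_rel_dev) {
            suspects++;
        }
    }
    return suspects;
}
```

Fed -0.1992476 against a reference of -1.992476, the relative deviation is 90%, so a slipped decimal point stands out immediately. A check like this flags candidates for a person to look at; it does not replace the inspection.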
The constant file was never exercised in the inertial
navigation system testbed, which had been specifically designed for tests using
real flight data.
Test like you fly, fly what you tested.
A year later Sea Launch (check out the cool pictures of
their ship-borne launch pad at www.sea-launch.com) lost the $100 million ICO F-1
spacecraft when the second stage shut down prematurely.
The ground control software had been modified to
accommodate a slight change in requirements. One line of code, a conditional
meant to close a valve just prior to launch, was somehow deleted. As a result
all of the helium used to pressurize the second stage's fuel tanks leaked out.
Pre-flight tests missed the error.
Test like you fly, fly what you tested.
This failure illustrates the intractability of software.
During countdown, ground software monitored some 10,000 sensors, issuing over a
million commands to the vehicle. Only one was incorrect, a 99.9999% success
rate. In school a 90 is an A. Motorola's famed six sigma quality program
eliminates all but 3.4 defects per million. Yet even 99.9999% isn't good enough
for computer programs.
Software isn't like a bridge, where margins can be added by
using a thicker beam. One bit wrong out of hundreds of millions can be enough to
cause total system collapse. Margin comes from changing the structure in
sometimes difficult ways, like using redundant computers with different code. In
Sea Launch's case, perhaps a line or two of C that monitored the position of the
valve would have made sense.
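Here is roughly what that line or two might look like, padded out into a compilable C sketch. It assumes the valves provide position feedback, which the article does not confirm, and the names (valve_t, valves_ready_for_launch) are hypothetical. The point is that the check compares measured position against the required state, independent of whatever commands were or were not issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { VALVE_OPEN, VALVE_CLOSED } valve_state_t;

typedef struct {
    const char   *name;
    valve_state_t required;  /* state the launch rules demand at commit time */
    valve_state_t measured;  /* assumed position feedback from the hardware */
} valve_t;

/* Launch-commit check: every valve must report the state the rules require.
 * It deliberately ignores the command history, so a deleted "close valve"
 * command shows up as a mismatch instead of slipping through. */
bool valves_ready_for_launch(const valve_t *valves, size_t n)
{
    assert(valves != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (valves[i].measured != valves[i].required) {
            return false;    /* hold the count and report valves[i].name */
        }
    }
    return true;
}
```

Checking against the commanded state would not have helped here, since the command itself went missing. Checking against the requirement does.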
Robert Glass in his Facts and Fallacies of Software
Engineering (Addison-Wesley, 2002, ISBN 0321117425) estimates that for each
25% increase in requirements the code's complexity explodes by 100%. The number
of required tests probably increases at about the same rate. Yet testing is
nearly always left till the end of the project, when the schedule is at max
stress. The boss is shrieking "ship it! Ship it!" while the spouse is wondering
if you'll ever come home again.
The tests get shortchanged. Disaster follows.
The higher levels of the FAA's DO-178B safety critical
standard require code and branch coverage tests on every single line of code and
each conditional. The expense is staggering, but even those ruthless procedures
aren't enough to guarantee perfection.
Next month we'll look at some more failures and draw