Disaster!
Several thousand years ago, wide-eyed and naive, I showed
up for the first day of ENES 101. After an inspiring lecture informing us that
most of us EE-wannabes would flunk out, the instructor flipped on a projector and
showed a film of the failure of the Tacoma Narrows Bridge.
The bridge earned the nickname "Galloping Gertie"
from its rolling, undulating behavior. Motorists crossing the 2,800-foot center
span sometimes felt as though they were traveling on a giant roller coaster,
watching the cars ahead disappear completely for a few moments as if they had
been dropped into the trough of a large wave.
In 1940, just a few months after it opened, the
Tacoma Narrows Bridge collapsed after putting on a dazzling performance captured
on film. A 40 mph wind fed the structure's tendency to vibrate, starting a
resonance mode that caused its collapse. See http://www.ferris.edu/htmls/academics/course.offerings/physbo/MultiM/bridge/bridge.htm
for MPEG clips and more information, or www.me.utexas.edu/~uer/papers/paper_jk.html
for a description of the forensic engineering that uncovered the root causes of
the failure.
Since civil engineers took this failure so seriously,
it was, in a sense, a tremendous success for bridge construction. They
discovered the problem's cause; as a result, for the past 50 years no
suspension bridge has been erected until it passed wind tunnel tests.
Yet after this brief introduction to one engineering
disaster - one that we EEs could hardly relate to - no more disaster stories
emerged. Do only civil engineers suffer from catastrophic failures?
Well, no. Even embedded systems have their share of
debacles, some deadly, some expensive, and some merely embarrassing. Here's a
selection of some purely embedded disasters, presented with the hope that they
make it into the engineering lore so we all can learn from past problems.
The Patriot Missile
During the Persian Gulf crisis - the one in 1991, that is -
Patriot Missile batteries were widely hailed for their effectiveness in
destroying incoming Iraqi Scuds. Yet the successes were accompanied by failures.
The air fields and seaports of Dhahran were protected by
six Patriot batteries. Alpha battery was to protect the Dhahran air base. On
February 25, 1991, Alpha Battery had been in operation for over 100 consecutive
hours. That's the day an incoming Scud struck an Army barracks and killed 28
American soldiers.
The problem wasn't due to a difficult-to-intercept
target; rather, a latent bug in the embedded software, rather like a cancer,
slowly reduced the system's accuracy as time went by.
The Patriots maintained a "time since last boot"
timer in a single precision floating point number. Time, so critical to
navigation and thus to system accuracy, was computed from this number. Patriots
use a 100 msec timebase. Unhappily, 1/10 of a second cannot be
represented exactly in binary floating point. With 24 bit precision, after
about 8 hours of operation enough error accumulated to degrade navigational
accuracy.
After 8 hours the time had drifted by about 0.0275 seconds.
Not much, but enough to yield a 55 meter error. The time error increased to a
third of a second after 100 hours of operation, equivalent to 687 meters of
targeting inaccuracy.
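The drift figures can be reproduced with a few lines of arithmetic. The Python sketch below rests on an assumption the text does not state: that the stored value of 1/10 is the true value chopped (not rounded) to 23 bits after the binary point, a commonly cited model of a 24 bit register. The names FRACTION_BITS, stored_tick, and drift_seconds are invented for this illustration.

```python
from fractions import Fraction
from math import floor

# Assumption: 1/10 is chopped (not rounded) to 23 bits after the binary point.
FRACTION_BITS = 23
TICK = Fraction(1, 10)  # the 100 msec timebase, held exactly


def stored_tick(bits=FRACTION_BITS):
    """Return 1/10 as it would be held after chopping to `bits` fractional bits."""
    scale = 2 ** bits
    return Fraction(floor(TICK * scale), scale)


def drift_seconds(hours, bits=FRACTION_BITS):
    """Clock error accumulated after `hours` of uptime, in seconds."""
    ticks = hours * 3600 * 10  # ten ticks per second
    return float(ticks * (TICK - stored_tick(bits)))


if __name__ == "__main__":
    for hours in (8, 100):
        print(f"{hours:3d} h uptime: clock off by about {drift_seconds(hours):.4f} s")
```

Under that assumption the error per tick is tiny (under a ten-millionth of a second), yet it adds up to roughly 0.0275 seconds at 8 hours and about a third of a second at 100 hours, in line with the figures quoted here. Exact fractions are used so the sketch's own floating point rounding does not muddy the result.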
The problem was known and understood; the solution
sounds something like what we'd hear on a tech support hotline for a PC.
"Can't hit Scuds, huh? Try rebooting once in a while!" In fact,
operational procedure was to reboot at 8 hour intervals until fixed software
arrived.
The crew of Alpha Battery didn't get the reboot
message from tech support. After 100 hours online, it missed the Scud by half a
kilometer.
Therac-25
AECL, at the time a Canadian Crown Corporation, developed
the Therac-25 in the early 80s. It was designed to treat cancers by irradiating
the patient with photons (X-rays) or electrons at computer-controlled energy levels. The
instrument apparently had a number of design flaws, which resulted in operators
constantly being presented with cryptic error messages requiring system
restarts.
Over a two year period six patients received massive doses
of radiation from the eleven machines installed in the US and Canada. Each
incident had similar pathology - the operator would initiate treatment, but get
an error message indicating no dose had been supplied. Used to the machine's
quirky behavior, operators would press the "try again" button -
sometimes several times. In fact, software bugs were indeed dosing the patient
on each trial, with radiation levels sometimes 30 times higher than desired.
Investigation found that, if the operator entered an
incorrect setup value in a menu, and then edited in the correct value within 8
seconds of the initial mistake, the Therac-25's code would generate tens of
thousands of Rads of radiation, yet display "no dose given". Operators
assumed that nothing more than a momentary glitch occurred. Confident that the
"no dose given" display was accurate they'd hit the button and dump
thousands more Rads into the patient.
In fact, the Therac-25's code was apparently
not all that bad. The machine worked well most of the time. The software
failures were mostly due to dynamic issues resulting from running in a
multitasking environment. Each task ran nearly perfectly in isolation. It wasn't
until the tasks started running asynchronously and communicating that problems
surfaced. A race condition, initiated by the operator's quick editing of menu
items, passed bad data to various tasks.
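To make the failure mode concrete, here is a generic Python sketch of this kind of race. It is not the Therac-25's code: the task names, the "X"/"E" mode labels, and the energy numbers are all invented for illustration. A generator's yield stands in for the point at which a task is preempted, so the interleaving is repeatable.

```python
def torn_read_task(shared, applied):
    """Reads two related fields at different times, with no protection."""
    applied["mode"] = shared["mode"]      # first read
    yield                                 # task is preempted here
    applied["energy"] = shared["energy"]  # second read, possibly after an edit


def snapshot_task(shared, applied):
    """Copies both fields in one step before it can be preempted."""
    snapshot = dict(shared)
    yield                                 # task is preempted here
    applied.update(snapshot)


def run(task_factory):
    """Interleave one setup task with an operator's quick correction."""
    shared = {"mode": "X", "energy": 25}  # hypothetical first (mistaken) entry
    applied = {}
    task = task_factory(shared, applied)
    next(task)                            # task runs up to its preemption point
    shared.update(mode="E", energy=10)    # operator edits in the correction
    for _ in task:                        # let the task run to completion
        pass
    return applied


if __name__ == "__main__":
    print("torn read:", run(torn_read_task))
    print("snapshot: ", run(snapshot_task))
```

The unprotected task ends up with a mode and energy pairing the operator never typed, even though each read worked perfectly in isolation. The snapshot version at least stays self-consistent, though it still holds the stale entry; a real design would also have to notice the edit and redo the setup.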
A single programmer wrote all of the Therac's code.
The RTOS was a half-breed homebrew assembly language. No one ever inspected the
code. Oversight was all but non-existent. We know from bitter experience that
code - especially safety critical code - simply must be inspected for accuracy
(most studies conclude that code inspections are about 20 times more efficient
at detecting bugs than testing). Despite the Therac-25, despite the small
disasters we all experience, few of us even now use code inspections. When will
we learn from the past?
For more details about the Therac-25, see IEEE
Computer, July 1993, pages 18-41: "An Investigation of the Therac-25 Accidents,"
by Nancy Leveson and Clark Turner. It's a good story.
Ariane 5
The European consortium formed to build and run launch
vehicles has established an amazing record of success with their Ariane-series
of rockets. Yet, in 1996 the first flight of a new generation of Arianes failed
when 40 seconds after liftoff the launcher went off course, broke up and
exploded.
The staggering cost of all space vehicles means any
important failure is investigated before resuming flight tests. Such was the
case with Ariane 5. Six weeks after the failure the investigating Board
submitted a report that placed the firmware at fault.
A pair of identical redundant Inertial Reference
Systems (whose acronym SRI is derived from the French equivalent of the name)
determines the rocket's attitude and position, transmitting this information to
the vehicle's computer. The primary SRI provides the data, while the secondary
is a hot standby, ready to immediately replace the primary in case of failure.
Within a few seconds of launch the primary SRI
experienced an overflow while converting a 64 bit floating point number to a 16
bit integer. An exception handler noted the problem and shut the unit down,
assuming the backup SRI would take over. It did, but it experienced exactly the
same overflow, causing unit 2 to shut down as well. With no attitude information
the Ariane inevitably broke up.
Interestingly, the SRI code was a modified version of
that used successfully on Ariane 4. Though most of the conversions performed by
the Ada code were protected by explicit checks to ensure their validity, the
variable that caused the crash was not. The Ariane 4 could not cause such an
exception with this variable. Ariane 5's different flight characteristics
meant that the error was now indeed possible, but somehow this possibility was
overlooked during the code port.
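The kind of explicit check described here is easy to sketch. The flight code was written in Ada; the Python below is only an illustration, and the function names and the choice between raising an error and clamping are assumptions, not details from the Board's report.

```python
import math

INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value):
    """Convert a float to a 16 bit signed integer, refusing out-of-range input."""
    if not math.isfinite(value) or not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a 16 bit signed integer")
    return int(value)  # truncates toward zero


def to_int16_saturating(value):
    """Convert a float to a 16 bit signed integer, clamping at the limits."""
    if math.isnan(value):
        raise ValueError("cannot convert NaN")
    return int(max(INT16_MIN, min(INT16_MAX, value)))


if __name__ == "__main__":
    print(to_int16_checked(1234.9))
    print(to_int16_saturating(40000.0))
```

Which policy is right depends on what the consumer of the value can tolerate. The point is that the decision gets made deliberately, for every conversion, and gets revisited whenever the code moves to a vehicle whose inputs have a different range.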
The investigating Board also noted that the exception
handler should never have so cavalierly shut down the SRI. An automatic restart
("Hello, Ariane tech support. Trouble? Sure, just reboot that pesky SRI")
could have prevented the disaster.
I suppose there are a lot of lessons here. One is to
never assume that a language selection will cure all runtime ills (some of us
expect too much from Ada). Another is to be very wary of porting code; every
assumption should be presumed invalid until rechecked. Finally, error handling is profoundly
important, yet is all too often not considered deeply.
The full text of the Board's report is at
www.cert.fr/francais/deri/adele/DOCUMENTS/ariane5.html.
Shuttle Simulator
The second Space Shuttle launch in 1981 was delayed by a
month when a fuel spill loosened a number of its tiles. The crew booked
simulator time in Houston to practice a number of scenarios, including a
"Transatlantic Abort", where the vehicle can neither make orbit nor
return to the launch site. Suddenly, all four main Shuttle computers (in the
simulator) crashed. Part of the abort scenario requires fuel dumps to lighten
the spacecraft. It was during the second of these dumps that the crash occurred.
After a couple of days of analysis, programmers
discovered that the fuel management module, which had done one dump and
successfully exited, when recalled for the second fuel dump, restarted thinking
that this was its first incarnation. Counters in the code had not been
properly reinitialized, though, causing what was essentially a computed GOTO to
branch out of the code, into a random section of memory.
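A toy version of this bug fits in a few lines. The Python sketch below is not the Shuttle's code; the class names and the three made-up steps are purely illustrative. A list of handlers indexed by a persistent counter stands in for the computed GOTO. Python raises an IndexError where the original would have jumped into a random section of memory.

```python
class FuelDumpBuggy:
    """Keeps its step counter between calls and never resets it."""

    def __init__(self):
        self.step = 0
        self.log = []

    def _open_valves(self):
        self.log.append("open")

    def _monitor(self):
        self.log.append("monitor")

    def _close_valves(self):
        self.log.append("close")

    def run(self):
        handlers = [self._open_valves, self._monitor, self._close_valves]
        for _ in range(len(handlers)):
            handlers[self.step]()  # stand-in for the computed GOTO
            self.step += 1


class FuelDumpFixed(FuelDumpBuggy):
    """Reinitializes its state on every entry."""

    def run(self):
        self.step = 0
        super().run()


if __name__ == "__main__":
    fixed = FuelDumpFixed()
    fixed.run()
    fixed.run()
    print(fixed.log)
```

The buggy version runs cleanly the first time, which is exactly why this sort of defect survives ordinary testing: it only shows up when the module is entered a second time.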
A simple fix took care of the problem. More
interesting, though, was the realization that this same sort of bug had appeared
in different modules before. The programmers were committed to producing only
the best code, so decided to see if they could come up with a systematic way to
eliminate these generic sorts of bugs in the future.
They developed a list of seven questions that had a
high probability of isolating similar problems. A random group of programmers
then applied these questions to the bad fuel dumping module (to see if they
would indeed pick up the known problem), as well as other modules.
The questions detected every one of the bugs. Seventeen
additional, previously unknown, problems surfaced!
This is software engineering at its best. Instead of
quickly fixing the bug and moving on - as most of us do - the development team
transformed the defect into an opportunity to improve the code.
High Tech Toilets
A product that works perfectly may also qualify as a
disaster.
My definition of quality, for software or anything else, is
"The quality of any product is exactly what the customer says it is." Just
as no-fault divorce and no-fault car insurance deal with the complexities of
"I did it/you did it" by removing blame from the issue, we too must
accept a no-fault definition of quality. The company will fail if the customer
is unhappy. It matters little if marketing did a lousy job of specifying
requirements or engineering did not meet those specs. All of us are responsible
for delivering something that thrills the customer.
Toto, a Japanese toilet company, has created an
entire line of smart toilets using embedded intelligence to outperform anything
else sculpted in porcelain. The Toto Model 6, a bargain at a few thousand bucks,
includes a bidet, a paper-less cleansing system, and a bottom-dryer. An
automatic lid opener/closer even saves marriages by flipping the seat down when
a man is done.
Perhaps this is a case of too much technology. Maybe
it's simply a poor user interface. Regardless, it seems the
language-independent icon on each button is not enough to tell many users how to
perform certain basic functions. Users frantically trying to simply flush
somehow activate the bidet, spraying the washroom. Confuse the up/down button
with the one for a, ah, "virtual wipe", and you might get knocked onto the
floor.
Words hardly do justice to the control panel. See a
picture of it at www.theimageworks.com/toilet/toilet6.htm.
When customers cannot use your product, or when too
much training is required, the product is a disaster.
(In case you're wondering, an optional remote
control is indeed available.)
Conclusion
As big processors, fancy operating systems, and even
GUIs become more commonplace it's ever harder to draw simple
distinctions between embedded and non-embedded systems. Hey, we know what an
embedded system is when we see it; it's just getting awfully hard to define
what makes embedded unique.
Perhaps a fundamental quality of embedded systems is quality.
Desktop applications that crash are a daily part of the fabric of life, so much
so we scarcely think about the regular system reboots needed to keep our
computers happy. Yet most embedded systems simply cannot crash.
The failures described above are to larger or lesser
degrees disaster stories that should instill the same fear in us that the Tacoma
Narrows Bridge still instills in civil engineers. Smaller disasters, though, are just
as important to us. The toaster oven that catches fire, the car computer that
resets from time to time, a credit card processing machine that loses
transactions - all are intolerable to our customers.
We have a choice - learn and profit from these sorts
of stories, or be doomed to repeat them ourselves.