Disaster Redux!
Those who forget history are condemned to repeat it.
Published in Embedded Systems Programming
The spacecraft descended towards the planet, accelerating
in the high-g field as it drew nearer. Sophisticated electronics measured the
vehicle's position and environment with exquisite precision, waiting for just
the right moment to deploy the parachute.
Nothing happened. The mission crashed on the surface.
Last month I described how Mars Polar Lander was lost due
to a software error. But this scenario played out on yet another planet, the one
known as Earth, when on September 8 the Genesis mission impacted at 200 MPH. As
I write this the Mishap Investigation Board hasn't released a final report. But
they did say the gravity-sensing switches were installed upside down, so they
couldn't detect the Earth's gravitational field.
The origins of Murphy's Law are in some dispute. The best
research I can find (http://www.improb.com/airchives/paperair/volume9/v9i5/murphy/murphy1.html)
suggests that Captain Ed Murphy complained "If there's any way they can do it
wrong, they will" when he discovered that acceleration sensors on a rocket sled
were installed backwards. Nearly 60 years later the same sort of mistake doomed
Genesis.
Perhaps a corollary to Murphy's Law is George Santayana's
oft-quoted "those who forget history are condemned to repeat it."
NASA's mantra "test like you fly, fly like you test" serves
as an inoculation of sorts against the Murphy virus. We don't as yet know why
Genesis' sensors were installed upside down, but a reasonable test regime would
have identified the flaw long before launch.
Last month I focused on high profile failures from the
space business. Few other industries are exempt from their share of firmware
disasters. Some are quite instructive.
Tumor Zappers
The Therac 25 was a radiotherapy instrument designed to
treat tumors with carefully regulated doses of radiation. Occasionally operators
found that when they pressed the "give the patient a dose" button the machine
made a loud clunking sound, and then illuminated the "no dose given" light.
Being normal human-type operators, they did what any normal human-type person
would do: press the "dose" button again. After a few iterations the patients
were screaming in agony.
Between 1985 and 1988 six cases of massive overdosing
resulted in three deaths.
The machines were all shut down during an investigation,
which found that if the backspace button was pressed within 8 seconds of the
"give the patient a dose" control being actuated, the device would give
full-bore max X-rays, cooking the patient.
Software killed.
The code used a homebrew RTOS riddled with timing errors.
Yet today, nearly two decades later, far too many of us continue to write our
own operating systems. This is despite the fact that at least a hundred are
available, for prices ranging from free to infinity, from royalty-free licenses
to ones probably written by pirates. Even those cool but brain-dead PIC
processors that have a max address space of a few hundred words have a $99 RTOS
available.
Developers give me lots of technical reasons why it's
impossible to use a commercial OS. Too big, too slow, wrong API - the reasons
are legion. And mostly pretty dumb. EEs have long used glue logic to make
incompatible parts compatible. They'd never consider building a custom UART just
because of some technical obstacle. Can you imagine going to your boss and
saying "this microprocessor is ideal for our application, except there are two
instructions we could do so much better. So I plan to build our own out of 10
million transistors." The boss would have you committed. Yet we software people
regularly do the same by building our own code from 10 million bits. Crafting a
custom OS is nearly always insane, and in the case of the Therac 25, criminal.
It's tough to pass information between tasks safely in a
multithreaded system, which is why a decent RTOS has sophisticated messaging
schemes. The homebrew version used in the Therac 25 didn't have such features so
global variables were used instead, another contributor to the disaster.
Globals are responsible for all of the evil in the
universe, from male pattern baldness to ozone depletion. Of course there are
instances where they're unavoidable, but those are rare. Too many of us use them
out of laziness. The OOP crowd chants "encapsulation, inheritance,
polymorphism;" the faster they can utter that mantra the closer they are to OOP
nirvana, it seems. Of the three, encapsulation is the most important. Both Java
and C++ support encapsulation, as do assembly and C. Hide the data, bind it to
just the routines that need it. Use drivers both for hardware and to access data
items.
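Encapsulation in plain C takes only a few lines. The module below is a hypothetical sketch: the names dose_set and dose_get and the DOSE_MAX_CGY limit are invented for illustration and have nothing to do with the Therac's actual code. The variable is static at file scope, so no other file can touch it except through the two access routines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit, chosen only for illustration. */
#define DOSE_MAX_CGY 200u

static_assert(DOSE_MAX_CGY <= UINT16_MAX, "limit must fit the storage type");

/* File-scope static: invisible outside this translation unit. */
static uint16_t dose_cgy;

/* The only way to change the dose; out-of-range requests are rejected. */
bool dose_set(uint16_t requested_cgy)
{
    if (requested_cgy > DOSE_MAX_CGY) {
        return false;            /* previous value is left untouched */
    }
    dose_cgy = requested_cgy;
    return true;
}

/* Read access for every other module. */
uint16_t dose_get(void)
{
    return dose_cgy;
}
```

In a multitasking system the access routines are also the natural place to add whatever locking the RTOS provides, since a 16-bit read is not necessarily atomic on a small processor. That detail is left out of this sketch.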
The Therac's code was, as usual, a convoluted mess. Written
mostly by a solo developer, it was utterly unaudited. No code inspections had
been performed.
We've known since 1976 that inspections are the best way to
rid programs of bugs. Testing and debugging simply don't work; most test
regimens only exercise about half the code. It's quite difficult to create truly
comprehensive tests, and some features, like exception handlers, are nearly
impossible to invoke and exercise.
Decent inspections will find about 70% of a system's bugs
for a twentieth of the cost of debugging. The Therac's programmer couldn't be
bothered, which was a shame for those three dead patients.
But 1985 was a long time ago. These things just don't
happen anymore. Or, do they?
Dateline Panama, 2001. Another radiotherapy device, built
by a different company, zapped 28 patients. At least 8 died right after the
overexposures; another 15 either have already died or are expected to die as a
result.
To protect the patient physicians put lead blocks around
the tumor. The operator draws the block configuration on the machine's screen
using a mouse. Developers apparently expected the users to draw each one
individually, though the manual didn't make that a requirement. 50 years of
software engineering has taught us that users will always do unexpected things.
Since the blocks encircled the tumor a number of doctors drew the entire
configuration in one smooth arcing line.
The code printed out a reasonable treatment plan yet in
fact delivered its maximum radiation dose.
Software continues to kill.
The FDA found the usual four horsemen of the software
apocalypse at fault: inadequate testing, poor requirements, no code inspections,
and no use of a defined software process.
Pacemaking
I bet you think pacemakers are immune from firmware
defects. Better think again.
In 1997 Guidant announced that one of their new pacemakers
occasionally drives the patient's heartbeat to 190 beats per minute. Now, I
don't know much about cardiovascular diseases, but suspect 190 BPM to be a
really bad thing for a person with a sick heart.
The company reassured the pacemaking public that there
wasn't really a problem; the code had been fixed and disks were being sent
across the country to doctors. However, the pacemaker is implanted
subcutaneously. There's no 'net connection, no USB port or PCMCIA slot.
Turns out that it's possible to hold an inductive loop over
the implanted pacemaker. A small coil in the device receives energy to charge
the battery. It's possible to modulate the signal and upload new code into
Flash. The robopatients were reprogrammed and no one was hurt.
The company was understandably reluctant to discuss the
problem so it's impossible to get much insight into the nature of what went
wrong. But clearly testing was inadequate.
Guidant is far from alone. A study in the August 15, 2001
Journal of the American Medical Association ("Recalls and Safety Alerts
Involving Pacemakers and Implantable Cardioverter-Defibrillator Generators")
showed that more than 500,000 implanted pacemakers and cardioverters were
recalled between 1990 and 2000. (This month's puzzler: how do you recall
one of these things?)
41% of those recalls were due to firmware problems. The
recall rate increased in the second half of that decade over the first.
Firmware is getting worse. All five US vendors have an increasing recall rate.
The study said: "engineered (hardware) incidents [are]
predictable and therefore preventable, while system (firmware) incidents are
inevitable due to complex processes combining in unforeseeable ways."
Baloney.
It's true that the software embedded into these marvels has
grown steadily more complex over the years. But that's not an excuse for a
greater recall rate. We must build better firmware when the code base
grows. As the software content of the world increases a constant bug rate will
lead to the collapse of civilization. We do know how to build better
code. We choose not to. And that blows my mind.
Plutonium Perils
Remember Los Alamos? Before they were so busily engaged in
losing disks bulging with classified material this facility was charged with the
final assembly of the US's nuclear weapons. Most or all of that work has
stopped, reportedly, but the lab still runs experiments with plutonium.
In 1998 researchers were bringing two subcritical chunks of
plutonium together in a "criticality" experiment, which measured the rate of
change of neutron flux between the two halves. It would be a Real Bad Thing if
the two bits actually got quite close, so they were mounted on small
controllable cars, rather like a model railway. An operator used a joystick to
cautiously nudge them towards each other.
The experiment proceeded normally for a time, the cars
moving at a snail's pace. Suddenly both picked up speed, careening towards each
other at full speed. No doubt with thoughts of a mushroom cloud in his head, the
operator hit the "shut down" button mounted on the joystick.
Nothing happened. The cars kept accelerating.
Finally actuating an emergency SCRAM control, his racing
heart (happily sans defective embedded pacemaker) slowed when the cars stopped
and moved apart.
The joystick had failed. A processor reading this device
recognized the problem and sent an error message, a question mark, to the main
controller. Unhappily, '?' is ASCII 63, the largest number that fits in a 6-bit
field. The main CPU interpreted the message as a big number meaning go real
fast.
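The collision is easy to reproduce. The sketch below assumes a message format that was never made public: one byte whose low six bits carry the speed. All of the names are hypothetical. The first decoder shows how an error character sent in-band turns into a full-speed command; the second carries validity in a separate field and stops on anything suspicious.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SPEED_BITS 6u
#define SPEED_MAX  ((1u << SPEED_BITS) - 1u)

static_assert(SPEED_MAX == 63u, "a 6-bit field tops out at 63");

/* Flawed scheme: the error marker shares the data channel.
 * '?' is 0x3F, so it decodes as the maximum speed. */
uint8_t decode_in_band(uint8_t byte)
{
    return (uint8_t)(byte & SPEED_MAX);
}

/* Safer sketch: validity travels separately from the value. */
typedef struct {
    bool    valid;
    uint8_t speed;
} joystick_msg_t;

uint8_t decode_with_status(joystick_msg_t msg)
{
    if (!msg.valid || msg.speed > SPEED_MAX) {
        return 0u;               /* fail safe: command a stop */
    }
    return msg.speed;
}
```

Calling decode_in_band('?') returns 63, while decode_with_status on a message marked invalid returns 0 whatever the speed field holds.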
Two issues come to mind: the first is to test everything,
even exception handlers. The second is that error handling is intrinsically
difficult and must be designed carefully.
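One way to exercise an error path that real hardware rarely produces is to inject the fault in a test. What follows is a minimal sketch with hypothetical names throughout. The control routine takes the joystick reader as a function pointer, so a test can substitute a stub that always reports a failure and check that the result is a stop.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { JOY_OK, JOY_FAULT } joy_status_t;

/* Reader interface: returns a status and writes the speed via the pointer. */
typedef joy_status_t (*joy_read_fn)(uint8_t *speed_out);

/* Control step under test: any fault commands a stop. */
uint8_t commanded_speed(joy_read_fn read_joystick)
{
    uint8_t speed = 0u;
    if (read_joystick(&speed) != JOY_OK) {
        return 0u;
    }
    return speed;
}

/* Test stubs standing in for the real joystick driver. */
static joy_status_t stub_ok(uint8_t *speed_out)
{
    *speed_out = 5u;
    return JOY_OK;
}

static joy_status_t stub_fault(uint8_t *speed_out)
{
    *speed_out = 63u;            /* garbage value accompanying the fault */
    return JOY_FAULT;
}

void test_commanded_speed(void)
{
    assert(commanded_speed(stub_ok) == 5u);
    assert(commanded_speed(stub_fault) == 0u);
}
```

The same seam works for any driver: production code passes the real reader, and the test harness passes whichever failure it wants to provoke.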
Patterns
The handful of disaster stories I've shared over the last
two columns have many common elements. On Mars Polar Lander and Deep Space 2,
the Mars Exploration Rover and Titan IVB, Sea Launch, pacemakers, Therac 25 and
in Los Alamos, inadequate testing was a proximate cause. We know testing is hard.
Yet it's usually deferred till near the end of the project, so gets shortchanged
in favor of shipping now.
Tired programmers make mistakes. Well, duh. Mars Polar
Lander, Deep Space 2 and the Mars Exploration Rover were lost or compromised by
this well-known and preventable problem.
Crummy exception handlers were one of the proximate causes of
problems with the Mars Exploration Rover, Los Alamos and plenty of other
disasters.
Had a defined software process, including decent
inspections, been used no one would have been killed by the Therac 25. I
estimate about 2% of embedded developers inspect all new code. Yet
properly-performed inspections are a silver bullet that accelerates the schedule
and yields far better code.
I have a large collection of embedded disasters. Some of
the stories are tragic, others enlightening, and some merely funny. What's
striking is that most of the failures stem from just a handful of causes.
Remember the Tacoma Narrows bridge failure I described last month? Leon
Moisseiff was unable to learn from his profession's series of bridge failures
from wind-induced torsional flutter, or even from his own previous encounters
with the same phenomenon, so the bridge collapsed just four months after it
opened.
I fear too many of us in the embedded field are 21st
century Leon Moisseiffs, ignoring the screams of agony from our customers, the
crashed code and buggy products that are, well, everywhere. We do have a lore of
disaster. It's up to us to draw the appropriate lessons.