Several thousand years ago, wide eyed and naïve, I showed
up for the first day of ENES 101. After an inspiring lecture informing us that
most of us EE-wannabes would flunk out, the instructor flipped on a projector and
showed a film of the failure of the Tacoma Narrows Bridge.
The bridge earned the nickname "Galloping Gertie"
from its rolling, undulating behavior. Motorists crossing the 2,800-foot center
span sometimes felt as though they were traveling on a giant roller coaster,
watching the cars ahead disappear completely for a few moments as if they had
been dropped into the trough of a large wave.
In 1940, just a few months after it opened, the
Tacoma Narrows Bridge collapsed after putting on a dazzling performance captured
on film. A 40 mph wind fed the structure's tendency to vibrate, starting a
resonance mode that caused its collapse. See http://www.ferris.edu/htmls/academics/course.offerings/physbo/MultiM/bridge/bridge.htm
for MPEG clips and more information, or www.me.utexas.edu/~uer/papers/paper_jk.html
for a description of the forensic engineering that uncovered the root causes of
the failure.
Civil engineers took this failure so seriously that
it was, in a sense, a tremendous success for bridge construction. They
discovered the problem's cause; as a result, for the past 50 years no
suspension bridge has been erected without first passing wind tunnel tests.
Yet after this brief introduction to one engineering
disaster - one that we EEs could hardly relate to - no more disaster stories
emerged. Do only civil engineers suffer from catastrophic failures?
Well, no. Even embedded systems have their share of
debacles, some deadly, some expensive, and some merely embarrassing. Here's a
selection of some purely embedded disasters, presented with the hope that they
make it into the engineering lore so we all can learn from past problems.
The Patriot Missile
During the Persian Gulf crisis - the one in 1991, that is -
Patriot Missile batteries were widely hailed for their effectiveness in
destroying incoming Iraqi Scuds. Yet the successes were accompanied by failures.
The air fields and seaports of Dhahran were protected by
six Patriot batteries. Alpha Battery was to protect the Dhahran air base. On
February 25, 1991, Alpha Battery had been in operation for over 100 consecutive
hours. That's the day an incoming Scud struck an Army barracks and killed 28
soldiers.
The problem wasn't due to a difficult-to-intercept
target; rather, a latent bug in the embedded software, rather like a cancer,
slowly reduced the system's accuracy as time went by.
The Patriots maintained a "time since last boot"
timer in a single precision floating point number. Time, so critical to
navigation and thus to system accuracy, was computed from this number. Patriots
use a 100 msec timebase. Unhappily, this 1/10 of a second interval cannot be
exactly represented by a binary floating point number. With 24 bit precision,
after about 8 hours of operation enough error accumulated to degrade
navigational accuracy.
After 8 hours the time had drifted by about 0.0275 seconds.
Not much, but enough to yield a 55 meter error. The time error increased to a
third of a second after 100 hours of operation, equivalent to 687 meters of
position error.
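Here's a little C sketch of the arithmetic - my reconstruction, not the
Patriot's actual code - that truncates the binary expansion of 0.1 to the
precision of a 24 bit register and extrapolates the drift:

    #include <stdio.h>

    int main(void)
    {
        /* 0.1 is a repeating fraction in binary, so a 24 bit register
           cannot hold it exactly. Truncating the expansion leaves the
           stored tick slightly short of a true 100 msec. */
        const double stored_tick = (double)(long)(0.1 * (1L << 23)) / (1L << 23);
        const double err_per_tick = 0.1 - stored_tick;  /* about 9.5e-8 seconds */
        const double ticks_per_hour = 36000.0;          /* 10 ticks per second */

        printf("error per 100 msec tick: %.10f s\n", err_per_tick);
        printf("drift after 8 hours:     %.4f s\n", err_per_tick * 8.0 * ticks_per_hour);
        printf("drift after 100 hours:   %.4f s\n", err_per_tick * 100.0 * ticks_per_hour);
        /* Prints roughly 0.0275 and 0.3433 seconds - and at a closing
           speed near 2 km/s, a third of a second is about 687 meters. */
        return 0;
    }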
The problem was known and understood; the solution
sounds something like what we'd hear on a tech support hotline for a PC.
"Can't hit Scuds, huh? Try rebooting once in a while!" In fact,
operational procedure was to reboot at 8 hour intervals until fixed software
could be deployed.
The crew of Alpha Battery didn't get the reboot
message from tech support. After 100 hours on-line, it missed the Scud by over
half a kilometer.
The Therac-25
AECL, at the time a Canadian Crown Corporation, developed
the Therac-25 in the early 80s. It was designed to treat cancers by irradiating
the patient with protons or electrons at computer-controlled energy levels. The
instrument apparently had a number of design flaws, which resulted in operators
constantly being presented with cryptic error messages requiring system
restarts.
Over a two year period six patients received massive doses
of radiation from the eleven machines installed in the US and Canada. Each
incident had similar pathology - the operator would initiate treatment, but get
an error message indicating no dose had been supplied. Used to the machine's
quirky behavior, operators would press the "try again" button -
sometimes several times. In fact, software bugs were indeed dosing the patient
on each trial, with radiation levels sometimes 30 times higher than desired.
Investigation found that, if the operator entered an
incorrect setup value in a menu, and then edited in the correct value within 8
seconds of the initial mistake, the Therac-25's code would generate tens of
thousands of Rads of radiation, yet display "no dose given". Operators
assumed that nothing more than a momentary glitch occurred. Confident that the
"no dose given" display was accurate they'd hit the button and dump
thousands more Rads into the patient.
In fact, the Therac-25's code was apparently
not all that bad. The machine worked well most of the time. The software
failures were mostly due to dynamic issues resulting from running in a
multitasking environment. Each task ran nearly perfectly in isolation. It wasn't
until the tasks started running asynchronously and communicating that problems
surfaced. A race condition, initiated by the operator's quick editing of menu
items, passed bad data to various tasks.
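The flavor of the bug is easy to sketch in C. This is invented code - names
and all, not the Therac's - but it shows how two tasks sharing setup data
without locks can act on a half-updated view:

    #include <stdint.h>

    typedef struct {
        uint8_t  mode;     /* X-ray vs. electron beam */
        uint16_t energy;   /* dose setting */
    } setup_t;

    static volatile setup_t shared_setup;  /* UI task writes, beam task reads */

    /* Keyboard task: runs each time the operator edits a field. */
    void on_operator_edit(uint8_t mode, uint16_t energy)
    {
        shared_setup.mode   = mode;    /* the beam task can preempt between */
        shared_setup.energy = energy;  /* these two writes and pair the new */
    }                                  /* mode with the old energy          */

    /* Treatment task: fires with whatever the shared data holds right now. */
    void fire_beam(void)
    {
        uint8_t  mode   = shared_setup.mode;    /* reads are not synchronized */
        uint16_t energy = shared_setup.energy;  /* with the writer, either    */
        (void)mode; (void)energy;
        /* ...drive the hardware with a possibly inconsistent setup... */
    }

Let the beam task run between the UI task's two writes and it fires with the
new mode but the old energy - just the sort of inconsistent state a fast menu
edit could create.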
A single programmer wrote all of the Therac's code.
The RTOS was a homebrew affair written in assembly language. No one ever inspected the
code. Oversight was all but non-existent. We know from bitter experience that
code - especially safety critical code - simply must be inspected for accuracy
(most studies conclude that code inspections are about 20 times more efficient
at detecting bugs than testing). Despite the Therac-25, despite the small
disasters we all experience, few of us even now use code inspections. When will
we learn from the past?
For more details about the Therac-25, see IEEE
Computer, July 1993, pages 18-41, "An Investigation of the Therac-25 Accidents,"
by Nancy Leveson and Clark Turner. It's a good story.
Ariane 5
The European consortium formed to build and run launch
vehicles has established an amazing record of success with their Ariane-series
of rockets. Yet, in 1996 the first flight of a new generation of Arianes failed
when, 40 seconds after liftoff, the launcher went off course, broke up, and
exploded.
The staggering cost of all space vehicles means any
important failure is investigated before resuming flight tests. Such was the
case with Ariane 5. Six weeks after the failure the investigating Board
submitted a report that placed the firmware at fault.
A pair of identical redundant Inertial Reference
Systems (whose acronym SRI is derived from the French equivalent of the name)
determines the rocket's attitude and position, transmitting this information to
the vehicle's computer. The primary SRI provides the data, while the secondary
is a hot standby, ready to immediately replace the primary in case of failure.
Within a few seconds of launch the primary SRI
experienced an overflow while converting a 64 bit floating point number to a 16
bit integer. An exception handler noted the problem and shut the unit down,
assuming the backup SRI would take over. It did! But it experienced exactly the
same overflow, causing unit 2 to shut down as well. With no attitude information
the Ariane inevitably broke up.
Interestingly, the SRI code was a modified version of
that used successfully on Ariane 4. Though most of the conversions performed by
the Ada code were protected by explicit checks to ensure their validity, the
variable that caused the crash was not. The Ariane 4 could not cause such an
exception with this variable. Ariane 5's different flight characteristics
meant that the error was now indeed possible, but somehow this possibility was
overlooked during the code port.
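Rendered in C (the SRI code was actually Ada; this sketch is mine), the guard
that protected most of the conversions - and was omitted on the fatal one -
looks something like this:

    #include <stdint.h>

    /* Narrow a 64 bit float to a 16 bit integer, reporting out-of-range
       input instead of letting a hardware exception shut the unit down.
       Returns 1 on success, 0 if the value would not fit. */
    int narrow_to_int16(double value, int16_t *out)
    {
        if (value > (double)INT16_MAX || value < (double)INT16_MIN)
            return 0;
        *out = (int16_t)value;
        return 1;
    }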
The investigating Board also noted that the exception
handler should never have so cavalierly shut down the SRI. An automatic restart
("Hello, Ariane tech support. Trouble? Sure, just reboot that pesky SRI")
could have prevented the disaster.
I suppose there are a lot of lessons here. One is to
never assume that a language selection will cure all runtime ills (some of us
expect too much from Ada). Another is to be very wary of porting code; treat
every assumption in the original as invalid until proven otherwise. Finally,
error handling is profoundly
important, yet is all too often not considered deeply.
The full text of the Board's report is at
The Space Shuttle
The second Space Shuttle launch in 1981 was delayed by a
month when a fuel spill loosened a number of its tiles. The crew booked
simulator time in Houston to practice a number of scenarios, including a
"Transatlantic Abort", where the vehicle can neither make orbit nor can
return to the launch site. Suddenly, all four main Shuttle computers (in the
simulator) crashed. Part of the abort scenario requires fuel dumps to lighten
the spacecraft. It was during the second of these dumps that the crash occurred.
After a couple of days of analysis, programmers
discovered that the fuel management module, which had performed one dump and
exited successfully, restarted for the second dump believing it was running
for the first time. Counters in the code had not been properly reinitialized,
though, causing what was essentially a computed GOTO to branch out of the
code, into a random section of memory.
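Here's a hypothetical C sketch, invented names and all, of that failure mode:
state survives in a static counter, which indexes a jump table - the computed
GOTO - with no bounds check:

    #include <stddef.h>

    typedef void (*phase_fn)(void);

    static void open_valves(void)  { /* ... */ }
    static void dump_fuel(void)    { /* ... */ }
    static void close_valves(void) { /* ... */ }

    static const phase_fn phases[] = { open_valves, dump_fuel, close_valves };
    static size_t phase = 0;        /* survives between calls! */

    /* Advance the dump sequence one step. The first dump steps through
       phases 0, 1 and 2 correctly. Recalled for a second dump, phase is
       already 3, and the branch below goes off through garbage - into a
       random section of memory. */
    void fuel_dump_step(void)
    {
        phases[phase++]();          /* no reinitialization, no bounds check */
    }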
A simple fix took care of the problem. More
interesting, though, was the realization that this same sort of bug had appeared
in different modules before. The programmers were committed to producing only
the best code, so decided to see if they could come up with a systematic way to
eliminate these generic sorts of bugs in the future.
They developed a list of seven questions that had a
high probability of isolating similar problems. A random group of programmers
then applied these questions to the bad fuel dumping module (to see if they
would indeed pick up the known problem), as well as other modules.
The questions detected every one of the known bugs, and 17
additional, previously unknown, problems surfaced!
This is software engineering at its best. Instead of
quickly fixing the bug and moving on - as most of us do - the development team
transformed the defect into an opportunity to improve the code.
High Tech Toilets
A product that works perfectly may also qualify as a disaster.
My definition of quality for a software product is simple:
"The quality of any product is exactly what the customer says it is." Just
as no-fault divorce and no-fault car insurance deal with the complexities of
"I did it/you did it" by removing blame from the issue, we too must
accept a no-fault definition of quality. The company will fail if the customer
is unhappy. It matters little if marketing did a lousy job of specifying
requirements or engineering did not meet those specs. All of us are responsible
for delivering something that thrills the customer.
Toto, a Japanese toilet company, has created an
entire line of smart toilets using embedded intelligence to outperform anything
else sculpted in porcelain. The Toto Model 6, a bargain at a few thousand bucks,
includes a bidet, a paper-less cleansing system, and a bottom-dryer. An
automatic lid opener/closer even saves marriages by flipping the seat down when
a man is done.
Perhaps this is a case of too much technology. Maybe
it's simply a poor user interface. Regardless, it seems the
language-independent icon on each button is not enough to tell many users how to
perform certain basic functions. Users frantically trying to simply flush
somehow activate the bidet, spraying the washroom. Confuse the up/down button
with the one for a, ah, "virtual wipe", and you might get knocked onto the
floor.
Words hardly do justice to the control panel. See a
picture of it at www.theimageworks.com/toilet/toilet6.htm.
When customers cannot use your product, or when too
much training is required, the product is a disaster.
(In case you're wondering, an optional remote
control is indeed available.)
As big processors, fancy operating systems, and even
GUIs become more commonplace it's ever harder to draw simple
distinctions between embedded and non-embedded systems. Hey, we know what an
embedded system is when we see it; it's just getting awfully hard to define
what makes embedded unique.
Perhaps a fundamental quality of embedded systems is quality.
Desktop applications that crash are a daily part of the fabric of life, so much
so we scarcely think about the regular system reboots needed to keep our
computers happy. Yet most embedded systems simply cannot crash.
The failures described above are to greater or lesser
degrees disaster stories that should instill the same fear in us that the Tacoma
Narrows Bridge still does in civil engineers. Smaller disasters, though, are just
as important to us. The toaster oven that catches fire, the car computer that
resets from time to time, a credit card processing machine that loses
transactions - all are intolerable to our customers.
We have a choice - learn and profit from these sorts
of stories, or be doomed to repeat them ourselves.