Crash and Burn
Disasters, and what we can learn from them.
Published in ESP, November 2000
By Jack Ganssle
We're not terribly good at learning from our successes; smug with the satisfaction of a job well done, most of us proceed immediately to the next task at hand. It's a shame that we can't look at a job well done and then dig deeply into what happened, how, and when, to suck the educational content of the project dry.
Ah, but failures are indeed a different story. High profile disasters inevitably produce an investigation, calls for Congress to "do something", and in the best of circumstances a change in the way things are built so the accident does not get repeated.
Isn't it astonishing that airplane travel is so reliable? That we can zip around the sky at 600 knots, 7 miles up, in an ineffably complex device created by flawed people? Perhaps aviation's impressive safety record is a by-product of the way the industry manages failures. Every crash is investigated; each yields new training requirements, new design mods, or other system changes to eliminate or reduce the probability of such a disaster striking again.
Though crashes are rare, they do occur, so airliners carry expensive flight data recorders whose sole purpose is to produce post-accident clues to the safety board. What a shame that we firmware folks don't have a similar attitude. Mostly we're astonished when our systems break or a bug surfaces. I hope that in the future we learn to write code proactively, expecting bugs and problems but finding or trapping them early, and leaving a trail of clues as to what went wrong.
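One way to leave that trail of clues is a small in-memory event trace, a poor man's flight data recorder. What follows is only a sketch under stated assumptions: the names (`trace_put`, `trace_recent`), the entry layout, and the buffer depth are all hypothetical, and a real system would likely keep the buffer in RAM that survives a reset or in nonvolatile storage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_DEPTH 8u  /* hypothetical depth; must be a power of two */

typedef struct {
    uint32_t timestamp;  /* caller-supplied tick count */
    uint16_t event_id;   /* hypothetical event code */
    uint16_t data;       /* one word of context for the event */
} trace_entry_t;

typedef struct {
    trace_entry_t entries[TRACE_DEPTH];
    uint32_t      count;  /* total events logged so far (wraps after 2^32) */
} trace_buf_t;

/* Record an event, overwriting the oldest entry once the buffer is full. */
void trace_put(trace_buf_t *t, uint32_t timestamp, uint16_t event_id, uint16_t data)
{
    assert(t != NULL);
    trace_entry_t *e = &t->entries[t->count & (TRACE_DEPTH - 1u)];
    e->timestamp = timestamp;
    e->event_id  = event_id;
    e->data      = data;
    t->count++;
}

/* Copy out the n-th most recent event (0 = newest).
 * Returns 1 on success, 0 if that event is not (or no longer) stored. */
int trace_recent(const trace_buf_t *t, uint32_t n, trace_entry_t *out)
{
    assert(t != NULL && out != NULL);
    uint32_t stored = (t->count < TRACE_DEPTH) ? t->count : TRACE_DEPTH;
    if (n >= stored) {
        return 0;
    }
    *out = t->entries[(t->count - 1u - n) & (TRACE_DEPTH - 1u)];
    return 1;
}
```

After a crash, a debugger or a diagnostic command can walk the buffer with `trace_recent` from 0 upward to see the last few events that led up to the failure.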
I believe we should examine disasters, our own and others', as so many embedded systems crash in similar ways. I collect embedded disaster stories, not from morbid fascination but because I think they offer universal lessons. Here are a few that are instructive.
On December 20, 1998, the Near Earth Asteroid Rendezvous (NEAR) spacecraft, after three years en route to 433 Eros, executed a main engine burn intended to place the vehicle in orbit about the asteroid. The planned 15 minute burn aborted almost immediately; firmware put the spacecraft into a safe mode, as planned in case of such a contingency. But then NEAR unexpectedly went silent. Communications resumed 27 hours later, but ground controllers found that it had dumped most of the mission's fuel.
Controllers spent a few days analyzing data to understand what happened, and then initiated a series of burns that will ultimately lead to NEAR's successful rendezvous with the asteroid. But two thirds of the spacecraft's fuel had been dumped, using all of the mission's reserves. The good news is that there's enough fuel - barely - to complete the original goals of the NEAR mission. But reduced fuel means things happen more slowly, so NEAR's rendezvous with 433 Eros will be 13 months later than planned.
Like so many real-world failures, a series of events, each not terribly critical, led to the fuel dump.
Immediately after the engine fired up for the planned 15 minute burn, accelerometers detected a lateral acceleration that exceeded a limit programmed into the firmware. This momentary transient, lasting under one second, was in fact not out of bounds for the mechanical configuration of the spacecraft. But the propulsion unit is cantilevered from the base of the spacecraft, creating a bending response that, according to the report (see reference 1), "was not appreciated". Quoting further: "In retrospect, the correct thing for the G&C software to have done would have been to ignore (blank out) the accelerometer readings during the brief transient period". In other words, though the transient wasn't anticipated, the software was too aggressive in qualifying accelerometer inputs.
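The "blank out" idea in the report can be sketched in a few lines of C. This is an illustration only, not NEAR's guidance code: the structure, the function name `accel_monitor_update`, the units, and every threshold are assumptions of mine. The persistence count (requiring several consecutive over-limit samples before tripping) is an extra qualification step I've added; the report itself mentions only the blanking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t blank_ms;    /* ignore readings for this long after ignition (assumed) */
    int32_t  limit_mg;    /* lateral acceleration limit in milli-g (assumed) */
    uint32_t persist_n;   /* consecutive over-limit samples needed to trip (assumed) */
    uint32_t over_count;  /* running count of consecutive over-limit samples */
} accel_monitor_t;

/* Feed one accelerometer sample. Returns true when the burn should abort. */
bool accel_monitor_update(accel_monitor_t *m, uint32_t ms_since_ignition, int32_t lateral_mg)
{
    assert(m != NULL && m->persist_n > 0u && m->limit_mg >= 0);

    /* Blanking window: the startup transient is expected, so don't judge it. */
    if (ms_since_ignition < m->blank_ms) {
        m->over_count = 0u;
        return false;
    }

    /* Outside the window, count consecutive samples beyond the limit. */
    if (lateral_mg > m->limit_mg || lateral_mg < -m->limit_mg) {
        m->over_count++;
    } else {
        m->over_count = 0u;
    }
    return m->over_count >= m->persist_n;
}
```

With a one-second blanking window, a large spike 200 ms after ignition is ignored, while a sustained excursion later in the burn still trips the abort.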
With the software figuring lateral movement exceeded a pre-programmed limit, it shut the motor down and put the spacecraft into a safe mode. The firmware used thrusters to rotate NEAR to an earth-safe attitude. Code then ran a script designed to change over from thrusters to reaction wheels (heavy spinning wheels that absorb or impart spin to the spacecraft) for attitude control. According to the report "Due to insufficient review and testing of the clean-up script, the commands needed to make a graceful transition to attitude control using reaction wheels were missing." Wow!
Excessive spacecraft momentum meant that the reaction wheels just weren't up to the task of putting NEAR into the earth-safe mode. The firmware did try, for the programmed 300 seconds, but then gave up and started warming up thrusters, which offer much more kick than the reaction wheels. Now the only chance to save the spacecraft was to go to the lowest level safe mode, "sun-safe", where it spun slowly around an axis pointing towards the sun. This would keep the batteries charged till ground intervention could help out.
Seven minutes later an error in a data structure (i.e., a parameter stored in the firmware) led to the system thinking a momentum wheel that was running at its maximum speed was stopped. A series of race conditions, exacerbated by low batteries, led to some 7900 seconds of thruster firing over the course of many hours. Eventually NEAR did stabilize in sun-safe mode, though now missing 29 kg of critical propellant.
So NEAR's troubles stem ultimately from a transient due to an odd vibration mode - something the firmware design team could not have anticipated. This rather small transient revealed flaws in the firmware that, in large part, led to a near-catastrophe (pun intended).
The review board inspected some, but not all, of the system's 80,000 lines of code (C, Ada, and assembly). They uncovered 9 software bugs and 8 data structure errors. Bugs included poorly designed exception handlers and critical variables that could be erroneously overwritten.
Hindsight is certainly a powerful microscope, especially when zooming in on a specific problem that causes a mishap. But I can't help but wonder why the post-failure review board's firmware review was so much more effective than those - if any - performed during original design. The report's recommendation 1c insists that from now on all command scripts must be tested, especially those critical to spacecraft safety - including abort cases. Well, duh!
You'd think configuration management would be a no-brainer for a mission costing many megabucks. Turns out the flight software was version 1.11... but two different version 1.11s existed. The one not flying had the proper command script to handle the thruster to reaction wheel changeover. Astonished? I sure was. From the report: "Flight code was stored on a network server in an uncontrolled environment." Version control is not rocket science!
NEAR is by no means the only space probe to suffer from software issues; recent failed Mars missions come immediately to mind. Another asteroid-rendezvous spacecraft experienced a somewhat similar failure in 1994. Clementine, which very successfully mapped much of the moon from lunar orbit, was supposed to autonomously rendezvous with near-Earth asteroid 1620 Geographos. A software error caused a series of events that depleted the supply of hydrazine propellant, leaving the spacecraft spinning and unable to complete its mission.
A sequencing error triggered an opening of valves for four of the vehicle's 12 attitude control thrusters, using up all of the propellant. No fuel, no go.
Unfortunately, I've been unable to obtain more detailed information about the nature of the software error. However, there's an enticing - and as yet unavailable - reference in the appendix of the NEAR report to a memo called "How Clementine Really Failed and What NEAR can Learn". Is it possible that NEAR's software failure had been anticipated 4 years earlier?
NASA published a report on the mission (reference 2) that mentions the failure but does not delve into root causes. Clementine was a technology demonstrator operated by the Ballistic Missile Defense Organization; NASA was a partner, not the main force behind the mission. Reference 2, though short on firmware details, does delve into the human price of a schedule that's too tight. Here are a few quotes; there's not much one can add!
"The tight time schedule forced swift decisions and lowered costs, but also took a human toll. The stringent budget and the firm limitations on reserves guaranteed that the mission would be relatively inexpensive, but surely reduced the mission's capability, may have made it less cost-effective, and perhaps ultimately led to the loss of the spacecraft before the completion of the asteroid flyby component of the mission."
"The mission operations phase of the Clementine project appears to have been as much a triumph of human dedication and motivation as that of deliberate organization. The inadequate schedule ... ensured that the spacecraft was launched without all of the software having been written and tested."
"Further, the spacecraft performance was marred by numerous computer crashes. It is no surprise that the team was exhausted by the end of the lunar mapping phase."
In May of 1998 I described the 1996 failure of Ariane 5, the large launch vehicle that tumbled and destroyed itself 40 seconds after blast-off. Since then more information has come to my attention (see reference 3).
Shortly after launch the Inertial Reference System (SRI, the apparently scrambled acronym a result of translation from French to English) detected an overflow when converting a 64 bit floating point number to a 16 bit signed integer. An exception handler noted the problem and shut the SRI down. Due to the incredible expense of these missions (this maiden flight itself had two commercial spacecraft aboard, each valued at about $100 million) a back-up SRI stood ready to take over in case of the primary's failure. SRI number 2 did indeed assume navigation responsibility... but it ran identical code, encountered the same error, and shut down as well.
Why did the overflow occur? This code had been ported from the much smaller Ariane 4. According to the report "...it is important to note that it was jointly agreed [between project partners at several contractual levels] not to include the Ariane 5 trajectory data in the SRI requirements and specification." Clearly, a decision doomed to failure. Here's a case where the firmware was in fact perfect - if perfection is measured by how well the code meets the spec. Again, "The supplier of the SRI was only following the specification given it."
As with NEAR, Ariane's crash resulted from a series of coupled events rather than any single problem. The exception was largely a result of poor specification. But designers did realize that some variables might go out of range; in fact they specifically wrote code to monitor four of the seven critical variables. Why were three left exposed? An assumption was made that physical limits made it impossible for these three to overflow (an assumption that proved expensively faulty). Further, a target of 80% processor loading meant checking all calculations would be prohibitively expensive.
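For a sense of what protecting one of those conversions might look like, here's a minimal sketch in C. It is not the SRI's code, and the function name `double_to_i16_checked` is hypothetical. In C, converting an out-of-range floating point value to an integer type is undefined behavior, so the range check has to come before the cast. This version saturates and reports the problem to the caller rather than raising an exception; whether saturation is the right response is a system-level decision this sketch simply assumes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a double to int16_t with an explicit range check.
 * Returns true if x was in range; otherwise writes a clamped value
 * (or 0 for NaN) and returns false so the caller can react.
 * Note: values just beyond the limits, such as 32767.5, are conservatively
 * treated as out of range even though truncation would have fit. */
bool double_to_i16_checked(double x, int16_t *out)
{
    assert(out != NULL);

    if (x != x) {                       /* NaN is the only value not equal to itself */
        *out = 0;
        return false;
    }
    if (x > (double)INT16_MAX) {
        *out = INT16_MAX;
        return false;
    }
    if (x < (double)INT16_MIN) {
        *out = INT16_MIN;
        return false;
    }
    *out = (int16_t)x;                  /* in range: truncates toward zero */
    return true;
}
```

The check costs a few comparisons per conversion, which is the processor loading trade-off the designers were weighing when they chose which variables to protect.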
But the exception itself didn't cause Ariane's crash. When both SRIs failed, they did so gracefully, and even returned diagnostic data to the vehicle's main computer that indicated the flight data was invalid. But the main computer ignored the diagnostic bit, assumed the data was valid, and used this incorrect information to guide the vehicle.
As a result of trying to use bad data, the computer commanded the engine nozzles to hard-over deflection, resulting in the tumbling and destruction of the rocket.
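The consumer's side of that handshake can be sketched as follows. Everything here is hypothetical: the message layout, the names `nav_consume` and `nav_failed`, and above all the policy of holding the last good value while counting bad samples. The right fallback for a real vehicle is a system design question. The only point is that the validity flag gets checked before the payload is used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool    valid;         /* false means "ignore this data" */
    int16_t attitude_cmd;  /* hypothetical payload */
} nav_sample_t;

typedef struct {
    int16_t  last_good;      /* most recent value received with valid == true */
    uint32_t invalid_count;  /* consecutive invalid samples seen */
} nav_consumer_t;

/* Accept a sample only if its validity flag is set.
 * Returns the value the control loop should use (assumed policy:
 * hold the last good value while the source reports bad data). */
int16_t nav_consume(nav_consumer_t *c, const nav_sample_t *s)
{
    assert(c != NULL && s != NULL);
    if (s->valid) {
        c->last_good = s->attitude_cmd;
        c->invalid_count = 0u;
    } else {
        c->invalid_count++;  /* payload is never read when invalid */
    }
    return c->last_good;
}

/* True once the source has been invalid for max_invalid consecutive samples,
 * at which point the caller should switch to some fallback behavior. */
bool nav_failed(const nav_consumer_t *c, uint32_t max_invalid)
{
    assert(c != NULL);
    return c->invalid_count >= max_invalid;
}
```

A garbage payload marked invalid never reaches the control loop, and `nav_failed` gives the caller an explicit signal that the data source is gone.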
To complicate the picture further, the floating point operation that overflowed was a calculation not even required for normal flight operations. It was left-over code, a relict of the firmware's Ariane 4 heritage, code that had meaning only before lift-off.
The review board also noted that, though testing of the SRI is hard, it's quite possible and (gasp!) maybe even a good idea. "Had such a test been performed by the supplier or as part of the acceptance test the failure mechanism would have been exposed."
To summarize: poorly tested code that should not have been running caused a floating point conversion error because the spec didn't call for an understanding of real flight dynamics. In an effort to keep processor loading low the variables involved weren't monitored, though others were. Two redundant SRIs running the same code performed identically and shut down. The main computer ignored the SRI "bad data" bit and tried to fly using corrupt information.
Another interesting tidbit from the report: "... the view had been taken that software should be considered correct until it is shown to be at fault." This is the rationale behind using identical code on redundant SRIs. It does raise the question of why there was too little testing to isolate those potential software faults.
My embedded disaster collection grows daily. I expect, as embedded systems become ever more pervasive, that there's no end in sight to the firmware crisis we'll all experience.
Several common threads run through many of these stories. The first is that of error handling. Look at Ariane: when the software failed, it properly set a diagnostic bit that meant "ignore this data". Yet the main CPU blithely carried on, instead ignoring the error bit.
Inadequate testing, too, appears repeatedly as a theme in disasters. The NEAR team had simulators and prototypes, but these test platforms worked poorly. Their fidelity was suspect, leaving the engineers to wonder, when problems surfaced, if the simulator or the code was at fault. Ariane, too, had poor simulators and thus only partially tested software. On Clementine it appears that some code was not tested at all.
Interprocessor communications is a constant source of trouble. Though I'm a great believer in using multiple CPUs to reduce software complexity and workload, when too much comm is required problems result. NEAR's computers ran into race conditions. Ariane's error bit was unrecognized.
Those who don't learn from the past are sentenced to repeat it.
Reference 1: The NEAR Rendezvous Burn Anomaly of December 1998 http://near.jhuapl.edu/anom/index.html
Reference 2: Lessons Learned from the Clementine Mission, NASA/CR report 97-207442.
Reference 3: Ariane 5, Flight 501 Failure, Report by the Inquiry Board. rk.gsfc.nasa.gov:80/richcontent/Reports/Failure_reports/Ariane501.htm