Follow @jack_ganssle

The logo for The Embedded Muse For novel ideas about building embedded systems (both hardware and firmware), join the 25,000+ engineers who subscribe to The Embedded Muse, a free biweekly newsletter. The Muse has no hype, no vendor PR. It takes just a few seconds (just enter your email, which is shared with absolutely no one) to subscribe.

By Jack Ganssle

Failure is an Option

Published 7/29/2002

"I know it's male, Dad", my daughter exclaimed, pointing at the viciously-snapping Blue Crab we had just landed in a bucket. I wanted to teach her how to tell the sex of this so-delicious product of the Chesapeake Bay by examining the shape of the creature's shell. "How do you know that, Kristy?", I wondered, my scientific description momentarily preempted. The crab reared back, snapping, angry, and aggressive. "By his attitude," she replied.

Later, pondering this conversation and her surprisingly deep insight into the male psyche, I thought about a chunk of code I'd recently read. (I know, I know, the peril of being a dweeb is we see everything in terms of our work). It was a pretty typical bit of firmware, peppered with both the usual flashes of brilliance and egregious flaws.

This firmware used quite a bit of dynamic memory management. Malloc()s and Free()s abounded. Some developers consider the use of anything other than statics a sin, but it is possible to write really great embedded code with either sort of memory management. But this particular code, like every bit of similar firmware exploiting dynamic memory allocation, was clearly written by a guy. No - there was no clue to the person's sex in the comments, the author's first name nothing but an utterly unrevealing initial. His testosterone came across in the same way exhibited by the crab. Whenever memory was needed, he issued a malloc, and by God, that was a demand from this programmer to get some memory! Gimme some memory, dammit, NOW!

A gentler and more robust approach might be to recognize that malloc has a return code. It might fail. Check the return code and take action if the system is low on resources. Demand memory? No, request it, with a sort of virtual "please", and recognize that it's not unlikely the system will reply, "well, sorry, there's none available now".

I read a lot of code. A lot. And I've learned that we developers are an optimistic and an aggressive lot, blithely expecting that the computer will always respond in the way we hope. The program will always run just as we decided it would. Nothing will fail (other than hardware, of course). Code, being deterministic, will always work properly once we pronounce it "done".

But it just ain't so! I collect embedded disaster stories, and the most common theme that runs through them all is poor or no exception handling. Everything runs fine until something rather simple goes wrong, the exception handler gets called, and (if it exists at all), it's poorly-thought-out, buggy, or totally inadequate, so the entire system crashes. Exception handlers are our last-ditch chance to save the system. Treated as afterthoughts they're sure to exacerbate the problems.

We've learned that datacomm is unreliable, so use TCP/IP which is robust. It'll transfer correct data despite missing and corrupt packets. Yet a friend's autopilot passes data between boxes using a proprietary comm protocol that tolerates no errors. Press a nearby radio's transmit button and the data link gets scrambled, crashing the code. We know that it's a disaster to pass a routine data which is out of range, yet rarely do I see debugging snippets (like assert macros) embedded to catch these common problems; when such constructs do appear they're usually slammed in as acts of panicked desperation. On processors with divide-by-zero traps I nearly never see a handler for this condition, despite the curse such divisions have had on the computer industry for 50 years.

Unexpected stuff happens. Users do astonishingly foolish things. Complex interactions are the norm, not the exception; none of us are smart enough to predict all possible paths.

Why do we persist in writing such fragile systems? When do we accept the fact that failure is normal, weird things happen regularly, and it's our responsibility to catch, process, and safely dispose of such conditions?