By Jack Ganssle

Watchdogs

Published 7/21/2007

Are watchdog timers crutches for lousy developers?

In a number of recent emails some readers claim that great embedded products don't need a watchdog. Correspondents reason that watchdogs are the last line of defense against software crashes, so write great code and your system will be crash-proof.

I disagree.

Software is unique in that it's probably the only human endeavor - and certainly the only engineering field - where it's at least theoretically possible to achieve perfection. Software is unmarred by the gritty realities of poor castings, cyclic loadings and counterfeit parts that mechanical engineers must deal with. It doesn't suffer from EE nightmares like lightning strikes and poor solder joints.

But software isn't something that comes down from on high. It's designed and built by imperfect humans, who craft their code from vague and often-misinterpreted requirements, and interface the software to other complex systems whose behavior is usually poorly specified. Complexity grows exponentially; Robert Glass figures that for every 25% increase in the problem's difficulty the code doubles in size. A many-million line program can assume a number of states whose size no human can grasp.

Perfection, given these challenges, will be elusive at best. And how can one <i>prove</i> their code is perfect?

The review board that studied the software-induced $500 million Ariane 5 failure reached a number of conclusions. One was that the organization had a culture that assumed software cannot fail. A half century of experience has taught us quite the opposite.

Software doesn't run in isolation. It's merely a component of a system. Watchdogs are not "software safeties." They're system safeties, designed to bring the product back to life after any transient event that corrupts operation, like a cosmic ray strike. Xilinx, Intel, Altera and many others have studied these high energy particles and have concluded that our systems are subject to random single event upsets (SEUs) due to these intruders from outer space.

Currently, cosmic ray SEUs are thought to be relatively rare. One processor datasheet suggests we can expect a single error per thousand years per chip. That sounds pretty safe till you multiply it by the millions, or hundreds of millions, of processors shipped per year. But recent research suggests that as geometries scale below 65 nm even SRAMs will be surprisingly vulnerable to random SEUs.
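To put rough numbers on that multiplication, here is a back-of-the-envelope sketch. It takes the datasheet figure of one error per thousand chip-years at face value and assumes every chip from a year's shipments is powered up all year; the fleet sizes and the function name are illustrative, not from any datasheet.

```python
# Assumption: the quoted datasheet rate, one upset per 1000 chip-years.
CHIP_YEARS_PER_UPSET = 1000


def expected_upsets_per_year(chips_in_field: int) -> float:
    """Expected SEUs per year across a fleet, assuming every chip runs all year."""
    return chips_in_field / CHIP_YEARS_PER_UPSET


# Illustrative fleet sizes: "millions, or hundreds of millions" of processors.
for chips in (1_000_000, 100_000_000):
    upsets = expected_upsets_per_year(chips)
    print(f"{chips:>11,} chips: about {upsets:,.0f} upsets per year")
```

Under those assumptions a "once in a thousand years" event happens somewhere in the field on the order of a thousand times a year for every million units.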

Systems and software operate in a hostile world peppered with threats and imperfections that few engineers can completely anticipate or defend against. A watchdog timer, which requires insignificant resources, is cheap and effective insurance. It's the fuse EEs have routinely employed for a hundred years, and it's one that automatically resets.
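The self-resetting fuse idea can be illustrated with a toy model. This is only a simulation sketch, not driver code for any real part: the `Watchdog` class, the `simulate` function and the five-tick timeout are all hypothetical names and values chosen for illustration. Healthy code reloads a down-counter on every pass; if the code wedges, the counter runs out and forces a reset that brings the system back.

```python
class Watchdog:
    """Toy model of a hardware watchdog: a down-counter that forces a reset at zero."""

    def __init__(self, timeout_ticks: int) -> None:
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.resets = 0

    def kick(self) -> None:
        """Healthy software calls this periodically to reload the counter."""
        self.remaining = self.timeout_ticks

    def tick(self) -> bool:
        """Advance one timer tick; return True if the watchdog fired a reset."""
        self.remaining -= 1
        if self.remaining == 0:
            self.resets += 1
            self.remaining = self.timeout_ticks  # counter reloads after the reset
            return True
        return False


def simulate(total_ticks: int, hang_at: int, timeout_ticks: int = 5) -> int:
    """Run the model; the software stops kicking at tick `hang_at` until reset."""
    dog = Watchdog(timeout_ticks)
    hung = False
    for t in range(total_ticks):
        if t == hang_at:
            hung = True  # e.g. a transient upset wedges the main loop
        if not hung:
            dog.kick()
        if dog.tick():
            hung = False  # the reset restarts the software
    return dog.resets


if __name__ == "__main__":
    print("resets after one hang:", simulate(total_ticks=20, hang_at=8))
    print("resets with no hang:  ", simulate(total_ticks=20, hang_at=100))
```

In the model, a system that never hangs never sees a reset, while a hang costs a few ticks of downtime before the product recovers on its own.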

What do you think? Do your products use a watchdog?