An Example of Foolishness
By Jack Ganssle
Published in Embedded Systems Design, June, 2001
It's easier to teach than to do. At least sometimes.
Last month I wrote about bringing a new hardware design to life. To quote one section: "Don't forget Vcc! As a young engineer a wiser older fellow taught me to always round up the usual suspects before looking for exotic failures. On occasion I've forgotten that lesson, only to wander down complex paths, wasting hours and days. Check power first and often."
Just days after finishing that article I decided to fix my boat's radar detector, a device that sets off a shrill alarm when it detects an active ship radar, waking me so I can take evasive action if needed. Over the winter it failed, emitting odd shrieks and trills at random.
The unit last failed in 1992 on a sail to England. I can't remember the symptoms but the cause was bad rechargeable batteries. Figuring that maybe the Ni-Cads were again dead I immediately put a voltmeter on these and found the voltage was just fine.
This is a twenty-year-old design, completely microprocessor-free, that even has a schematic in the manual. A bit of ancient 4000-series CMOS logic plus plenty of analog is spread across an easy-to-access circuit board. An oscillation permeated the circuit, showing up on suspiciously many nodes. How could it be in so many places? Power was OK - the voltmeter proved it - and I spent an embarrassing amount of time chasing ghosts before putting a scope probe on the batteries and seeing the same oscillation. Not large, but clearly modulating the 6 volt supply. My assumptions of what could-be or should-be happening continued to confound reality as I tried to find something coupling into power. Finally a light dawned; replacing the batteries with a power supply completely cured the problem. Apparently as the batteries aged their internal resistance increased. Since the unit consumes just a few milliamps, this condition allowed a signal to couple onto the 6 volts rather than cause a reduction in voltage. Which, of course, is what I assumed would be the failure mode of the Ni-Cads.
I always tell young engineers to check Vcc with a scope, not a meter. For just the reasons exhibited by this problem.
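The arithmetic behind this failure can be sketched in a few lines. This is only a rough model, assuming the battery behaves like an ideal source in series with an internal resistance; the function name and every resistance and current figure are illustrative assumptions, not measurements from the radar detector.

```python
def supply_under_load(v_open, r_internal, i_avg, i_swing):
    """Model a battery as an ideal source plus a series resistance.

    Returns (dc_level, ripple_pk_pk) in volts for a load that draws an
    average current i_avg with a peak-to-peak variation of i_swing (amps).
    """
    dc_level = v_open - i_avg * r_internal   # roughly what a DC voltmeter shows
    ripple_pk_pk = i_swing * r_internal      # what only a scope shows
    return dc_level, ripple_pk_pk


# Hypothetical values: 6 V pack, 3 mA average load, 2 mA of load variation.
for label, r_int in (("fresh", 0.5), ("aged", 50.0)):
    dc, ripple = supply_under_load(6.0, r_int, 0.003, 0.002)
    print(f"{label}: DC = {dc:.3f} V, ripple = {ripple * 1000:.1f} mV p-p")
```

With these assumed numbers the aged pack still reads about 5.85 V on a meter, which looks healthy, while roughly 100 mV of signal rides on the rail. A light load barely sags the DC level, so the meter has nothing to complain about.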
My assumptions proved false, my unwillingness to use previous history (the 1992 failure) obscured the path to truth, and my ignoring a basic rule of troubleshooting I had so grandly and recently written about wasted a couple of hours.
On the other hand, it was great fun to spend an afternoon fiddling with a non-microprocessor-based product! What a delight to work on a probeable, understandable, SMT-less device. Today most consumer products defy repair even by the original designers.
So perhaps this is a good time to review a philosophy of troubleshooting and debugging systems. It matters little whether working on the hardware or delving into firmware bugs; both require the same approach and mindset. Our goal is to extract truth from the system (what's wrong and why) and then to apply a fix. The system itself, our tools, and most importantly our assumptions and approach all conspire to keep the truth buried.
Our chief ally in this search for wisdom: the right world view; a zeitgeist of suspicion tempered by trust in the laws of physics, curiosity dulled only by the determination to stay focused on a single problem, and a zealot's regard for the scientific method. Too often we fall in love with our creations only to be blindsided by the design's faults. We're quick to overtly or subconsciously assume certainty when all too often things are not quite so certain.
Debugging and troubleshooting are not random acts perpetrated by a bewildered engineer. There's a clear process we should follow to ensure that we both find the problems and fix them completely and permanently.
The first step: Observe the system's behavior to find the apparent bug. In other words, determine the symptoms. Always remember that many problems are subtle and exhibit themselves via a confusing set of symptoms. Be wary; all too often we pursue a problem only to finally hit our heads in frustration as we realize that the system is indeed supposed to act this way.
Be sure the problem manifests itself in a repeatable way; when this is not the case work towards simplifying the actions needed to create the problem. A non-repeatable bug is all but impossible to find.
Simplify, simplify, simplify. Work on a single problem at a time. We're not smart enough to deal with multiple bugs all at once - unless they are all manifestations of something more fundamental.
Observe collateral behavior
Watch the system to learn as much related information as possible.
What else is going on when the bug appears? Often there's a correlation between the good things and the bad. Does the display flicker when firing off a big solenoid? Electrical noise, grounding, and power problems related to the properly functioning solenoid may indeed be the root of the problem.
I worked on a system that cycled a house-sized motor back and forth every few seconds; so much EMF was generated that microprocessor-based instruments all over the factory acted oddly. Developers from different companies, working on their products on the factory floor, all were chasing erratic system crashes coming from the one big motor source.
Round up the usual suspects
Lots of computer problems stem from the same few sources. Clocks must be stable and meet very specific timing and electrical specs... or all bets are off. Reset, too, often has unusual timing and electrical parameters. Examine all critical hardware signals with the scope, including NMI, DMA request, clock, wait, etc. Don't assume these are in known safe states.
One 16 bit system I lost too much youth over crashed erratically; assuming the code was at fault I searched for any clue as to what was executing at the time of the problem. Turns out the culprit was a 1 nanosecond glitch - barely detectable with the scope - on the reset input. That in turn came from a poorly-designed power fail circuit.
And of course never, never, never forget to check Vcc. Don't rely on the voltmeter; use a scope as I should have in the example mentioned above. As an ex-emulator vendor I've seen far too many systems where the power wasn't at the right voltage, being off by just a bit. Or with too much ripple. Modern CPUs are totally intolerant of even the slightest Vcc variations. Check the spec; you may be surprised to see just how little margin the part tolerates. A quarter volt isn't unusual.
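One way to make the scope check concrete is to run captured samples through a small script. The sketch below is hypothetical: it assumes the rail samples have already been exported from the scope as a list of volts, and the 5 V nominal and quarter-volt tolerance are placeholders for whatever the part's datasheet actually specifies.

```python
from statistics import mean


def check_rail(samples, nominal, tolerance):
    """Compare captured rail samples (volts) against nominal +/- tolerance.

    Reports the average (roughly what a DC meter shows), the extremes
    (what a scope reveals), and whether every sample stayed in spec.
    """
    lowest, highest = min(samples), max(samples)
    return {
        "average": mean(samples),
        "min": lowest,
        "max": highest,
        "ripple_pk_pk": highest - lowest,
        "in_spec": nominal - tolerance <= lowest and highest <= nominal + tolerance,
    }


# Hypothetical capture: a 5 V rail with one brief dip to 4.6 V.
capture = [5.0, 5.02, 4.98, 5.01, 4.6, 5.0, 5.03, 4.99]
report = check_rail(capture, nominal=5.0, tolerance=0.25)
print(report)
```

In this made-up capture the average sits comfortably inside the tolerance band, yet the single dip fails the check. That is the gap between what the meter reports and what the CPU experiences.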
Does the firmware even get to the particular code you suspect? Don't waste time analyzing and theorizing till you're sure that you're working on the right function. Maybe the interrupt never came, so debugging that ISR is sort of pointless. Code never cares what you assume.
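A hit counter is about the cheapest way to answer that question. The following is a host-side sketch in Python rather than real firmware; on a target the same idea is usually a counter variable or a toggled I/O pin. All names here (`mark`, `uart_rx_isr`, `main_loop`) are invented for illustration.

```python
from collections import Counter

hits = Counter()  # execution counts, keyed by a tag for each code path


def mark(tag):
    """Record that the code path named by tag actually ran."""
    hits[tag] += 1


def uart_rx_isr(byte, buffer):
    """Stand-in for the handler under suspicion."""
    mark("uart_rx_isr")
    buffer.append(byte)


def main_loop(events):
    """Dispatch simulated events; only 'uart' events reach the handler."""
    buffer = []
    for kind, value in events:
        mark("main_loop")
        if kind == "uart":
            uart_rx_isr(value, buffer)
    return buffer


# Simulated run in which the UART interrupt never arrives.
main_loop([("timer", 0), ("timer", 1), ("timer", 2)])
print(dict(hits))
```

If the count for the suspect handler is still zero after the failure, the bug lives upstream of it, and single-stepping the handler would have been wasted effort.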
Generate a hypothesis
Amateurs modify things without a deep understanding of why the system is broken. Sure it's easy to change the code from "if(a>=b)" to "if(a>b)" and hope that solves the problem. It's a foolish approach doomed to failure.
I used to watch analog engineers in awe as they soldered circuits in three-dimensional arrays of resistors and ICs that looked like a new-age sculpture. Some did indeed create a real design and then just isolate problems via this prototyping. Others fiddled with op-amp damping and feedback values without a good design, stopping "when it works". Was it any surprise that so many of those creations couldn't survive temperature extremes or production tolerance variations? I remember one system using a home-made switching power supply; when summertime thunderstorms hit the Midwest thousands of these units failed.
A device built by iterative trial and error will never be as robust as one with a solid, well-understood design. So too for debugging and troubleshooting. The bandaid fix will usually come back to haunt us.
I do feel that our tools have gotten too good, to our detriment. Older readers well remember the development environment we used in the mainframe days. We'd laboriously punch a thick deck of cards containing the program and submit them to the high priest. He'd tell us to come back in a day, maybe two, to get the job's results.
That meant the edit, compile, link and test cycle was 24 hours or more. Fast forward to 2001 and watch a typical developer: the 21 inch monitor has open editor, debugger and compiler windows. He encounters a bug. In a flash he changes something - maybe an "==" to a "!=" - compiles, links and downloads. Five seconds later he's testing again. Was the bug really fixed? Did the engineer deeply understand cause and effect or did the change simply mask the real problem?
Our ability to make changes faster than we can understand is a problem. We need to slow down. Slow down and think. The tools enable a dysfunctional behavior. One solution is yet another tool, this one utterly low tech. Use an engineering notebook and write down symptoms as they appear, before implementing a fix. Figure out what is really going on, and write this down before changing the code. Note your proposed fix. Only then change the code and run the test. The notebook gives us another 30 seconds of perspective on the problem, breaking the vicious cycle of "change something and see what happens".
Perhaps we don't have enough data to formulate a hypothesis. Use your tools, from BDM to ICE to scope and logic analyzer to see exactly what is going on; compare that to what you think should happen. Generate a theory about the cause of the bug from the difference in these.
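Comparing what happened against what should have happened can be partly mechanized. The sketch below assumes both sequences have been reduced to simple lists of event names, for example from a logic analyzer export; the function name and the event names are made up for the example.

```python
from itertools import zip_longest


def first_divergence(expected, observed):
    """Return (index, expected_item, observed_item) at the first mismatch,
    or None when the two traces agree completely. A trace that ends early
    shows up as a mismatch against None."""
    for index, (want, got) in enumerate(zip_longest(expected, observed)):
        if want != got:
            return index, want, got
    return None


# Hypothetical startup sequence versus what the analyzer captured.
expected = ["RESET", "INIT_CLOCK", "INIT_UART", "ENABLE_IRQ", "MAIN_LOOP"]
observed = ["RESET", "INIT_CLOCK", "ENABLE_IRQ", "MAIN_LOOP"]
print(first_divergence(expected, observed))  # (2, 'INIT_UART', 'ENABLE_IRQ')
```

The first point of divergence is rarely the whole story, but it is usually the right place to start forming a theory.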
Test the hypothesis
The scientific method shows us that a theory not backed up by solid experimental evidence is nothing more than a guess. Do you think the system's reset line is noisy? Prove it! Check with a scope. Does it seem the incoming data stream is occasionally corrupt? Instrument the data, or examine it with a debugger, to convince yourself that this indeed is the problem, and that this is truly worth fixing.
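Instrumenting a data stream can be as simple as counting the frames that fail an integrity check. The sketch below assumes each frame arrives as a payload plus an 8-bit additive checksum. That framing is an assumption for illustration, as are the function names; a real link will have its own format and, ideally, a stronger check such as a CRC.

```python
def checksum(payload: bytes) -> int:
    """8-bit additive checksum (assumed framing, for illustration only)."""
    return sum(payload) & 0xFF


def audit_frames(frames):
    """frames: iterable of (payload, received_checksum) pairs.

    Returns (total_frames, indices_of_corrupt_frames).
    """
    total = 0
    corrupt = []
    for index, (payload, received) in enumerate(frames):
        total += 1
        if checksum(payload) != received:
            corrupt.append(index)
    return total, corrupt


# Hypothetical capture of four frames.
frames = [
    (b"\x01\x02\x03", 0x06),
    (b"\x10\x20", 0x30),
    (b"\x01\x02\x07", 0x06),  # one byte altered in transit
    (b"\xff\x02", 0x01),      # sum wraps: 0x101 & 0xFF == 0x01
]
total, corrupt = audit_frames(frames)
print(f"{len(corrupt)} of {total} frames failed the checksum: {corrupt}")
```

A count of zero over a long run is evidence against the corrupt-data theory; a nonzero count tells you how often it happens and which frames to examine.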
It's fine to be wrong; it's inexcusable to be wrong yet blindly careen ahead making changes.
Don't be so enamored of your new grand hypothesis that you miss data that might disprove it! The purpose of a hypothesis is simply to crystallize your thinking - if it is right, you'll know what step to take next. If it's wrong, collect more data to formulate yet another theory. When Chernobyl exploded Moscow sent in the USSR's top reactor experts. They walked through the parking lot, quite literally tripping over graphite chunks blown out of the building, shaking their heads and repeating "it can't have blown up". Yet the evidence was brutally obvious.
One corollary is that a problem that mysteriously goes away tends to just as mysteriously return. When you fix the bug without ever developing an adequate hypothesis, you've likely left a lurking time bomb in the product.
And never use the old "glitch" excuse. There are no glitches. Transient failures come from physical causes; it's our responsibility to find and fix those causes. It's so tempting to dismiss an intermittent bug as a glitch and move on! The Mars Pathfinder mission suffered from a software fault that repeatedly reset the lander's computer shortly after it reached the surface. The mission succeeded anyway, and the designers later very impressively uploaded fixed code to the spacecraft, 40 million miles away. Amazing. But they saw the failure on Earth during test - twice - and characterized it as a "glitch". Beware.
Fix the bug
There's more than one way to fix a problem. Hanging a capacitor on a PAL output to skew it a few nanoseconds is one approach; another is to adjust the design to avoid the race condition entirely. We can try to beat that nasty function no one understands into submission (one more time!) or recode it to make it reliable and no longer a constant source of errors.
Sometimes a quick and dirty fix might be worthwhile to avoid getting hung up on one little point if you are after bigger game. Always, always, revisit the kludge and re-engineer it properly. Electronics has an unfortunate tendency to work in the engineering lab and not go wrong until the 5,000th unit is built. Firmware fails when stressed, when exceptions occur. If a fix feels bad, or if you have to furtively look over your shoulder and glue it in when no one is looking, then it is bad.
Finally, never, ever, fix the bug and assume all is OK because the symptom has disappeared. Apply a little common sense and scope the signals to make sure you haven't serendipitously fixed the problem by creating a lurking new one.
Feedback stabilizes systems, be they electronic circuits or our approach to making things, or even how we deal with relationships. "OK, doing that makes her sad; I'll avoid this in the future."
The very best developers I know (and there are darn few in this category) fix a problem and then look for a way to never have the same problem again. Obviously most problems don't yield to such preventative measures, but it's surprising how many do. In my collection of embedded disasters the most common theme is inadequately-tested exception handlers. Clearly this tells us, if we listen to the feedback, that better ways of testing those portions of the code pay big dividends. One of the very best practices of Extreme Programming is writing detailed tests concurrently with the code, which can in many cases find these problems early.
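Exception handlers go untested mostly because the exception is hard to provoke on demand. One common remedy is to inject the failure from the test. The sketch below is a hypothetical example, not taken from any real driver: the sensor, the timeout exception, and the 0.25 scaling factor are all invented for illustration.

```python
class SensorTimeout(Exception):
    """Raised by the (hypothetical) driver when the sensor does not answer."""


def read_temperature(read_raw, last_good):
    """Return (degrees_c, fresh). Falls back to the last good value if the
    driver times out. read_raw is passed in so a test can force a failure."""
    try:
        raw = read_raw()
    except SensorTimeout:
        return last_good, False   # the path that rarely runs in the lab
    return raw * 0.25, True


def failing_driver():
    """Test double that always times out."""
    raise SensorTimeout("no response")


def test_timeout_path():
    value, fresh = read_temperature(failing_driver, last_good=21.5)
    assert value == 21.5 and fresh is False


def test_normal_path():
    value, fresh = read_temperature(lambda: 100, last_good=21.5)
    assert value == 25.0 and fresh is True


test_timeout_path()
test_normal_path()
print("both paths exercised")
```

Passing the low-level read in as a parameter is the design choice that makes this possible; the error path then runs on every test cycle rather than once a year in the field.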
One developer told me he found a bug where his program attempted to overwrite ROM. In this case there was a symptom leading to a bug. After finding it, he left the logic analyzer connected to continue to find any overwrites to ROM. He found seven more cases. That's a great example of employing corrective feedback.
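The same kind of watch can be applied offline to a captured bus trace. This is a minimal sketch that assumes the trace has been exported as (address, read-or-write, program counter) tuples; the memory map, the tuple layout, and the function name are hypothetical.

```python
ROM_START, ROM_END = 0x0000, 0x7FFF   # hypothetical memory map


def rom_writes(trace):
    """trace: iterable of (address, kind, pc) with kind "R" or "W".

    Returns (address, pc) for every write cycle that lands inside ROM.
    """
    return [(addr, pc) for addr, kind, pc in trace
            if kind == "W" and ROM_START <= addr <= ROM_END]


# Hypothetical capture.
trace = [
    (0x8010, "W", 0x0120),   # ordinary RAM write
    (0x0040, "R", 0x0124),   # ROM read, harmless
    (0x1FF0, "W", 0x0130),   # stray write into ROM
    (0x9000, "R", 0x0134),
    (0x0002, "W", 0x0210),   # another stray write
]
for addr, pc in rom_writes(trace):
    print(f"write to ROM address {addr:#06x} from pc {pc:#06x}")
```

Keeping the program counter alongside each offending write points straight at the code responsible, which is what turns one fixed bug into several.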
And so, after fixing the radar detector and feeling tremendously foolish I decided to repair the back-up, a unit cheaply purchased from a surplus shop. Armed with my previous experience I checked the batteries first, found them bad, so installed new ones. The unit still didn't work, but it was then easy to locate and replace a bad potentiometer. So I guess it is possible to learn, though sometimes human nature gets annoyingly in the way.
That's one reason we need a disciplined process for debugging and troubleshooting - to guide us when chaos reigns, and when we're apt to slip up.