Tricks of the Trade
More troubleshooting hints and kinks.
By Jack Ganssle
Troubleshooting is more art than science. The Grand Masters of troubleshooting draw on a wealth of experience, gleaned from battles fought in the 3 AM getting-a-product-out trenches. One of the biggest challenges faced by any engineer -- and any engineering manager -- is gaining this experience quickly, so troubleshooting becomes yet another finely-honed skill in one's toolbox.
In conversations with engineers I've discovered a troubling pattern: more and more often troubleshooting seems to be relegated to a handful of old-timers. Are we seeing the beginning of the end of these critical skills?
One of the fascinating parts of having young children is learning the limits of wisdom. Kids just have to make many of the same mistakes we made, despite our desire to shield them from these errors, and despite our best wishes to help them over life's rocky paths.
I think we unconsciously adopt the same philosophy in all facets of business - more due to laziness or lack of time than from paternalistically watching our charges pick up things the hard way. Perhaps the real reason we abandon an aggressive teaching strategy is the lack of a codified "state of the art". How many of us, having mastered troubleshooting or any other art, write about these skills? How many of us even make rough notes about a cool trick or clever solution to problems?
Fact is: virtually no one does, so each generation of designers must reinvent the same skillset. Seems pretty silly, doesn't it? I'm always looking for ways of accelerating the learning process for the engineers I work with. Ideally, we as an industry will someday develop a handbook of troubleshooting wisdom. In the meantime the best we can do is pass along our own experiences, and collect other sources of knowledge.
One of the best troubleshooting references is Bob Pease's book, Troubleshooting Analog Circuits (Butterworth-Heinemann, Boston, 1991). Though aimed squarely at the analog designer, it's still a must-read for us digital folks. Never succumb to the temptation to forget that digital electronics is still electronics; those ones and zeroes are merely abstractions of specific voltages.
And so, here are a few tips I've collected, mostly through the school of hard knocks.
Next to a skeptical attitude, biased towards questioning every assumption, the most important tool we use is the oscilloscope. Emulators, logic analyzers, and all of those other nifty pieces of capital equipment have their own very important roles, but nothing measures up to a scope for 90% of normal troubleshooting.
I love toys - the shinier and newer, the more knobs and displays the better. Scopes glitter from the pages of catalogs, each with their own special features luring us into a frenzy of high-tech lust. If you can afford the best available, by all means scarf that puppy up and enjoy the thrill of 2 GHz full digital acquisition at the touch of a button.
The rest of us will often have to make do with something less extravagant. Though there is no substitute for the correct test equipment, clever use of that which we have may often be all that is required. One consultant I know still uses a vacuum-tube based 545 with only about 20 MHz bandwidth. Personally, I think he's working too hard, as spending a few grand on a modern instrument seems like a minimal price of entry to the field, but his deep knowledge of the scope, and troubleshooting skills, makes him quite successful at finding tough problems.
Then, after destroying a couple of chips by accidentally shorting things to ground with that nice alligator ground clip mounted on the probe, we tear it off in frustration, losing it as well. Tip: if you really don't intend to use the ground connection, clip that alligator lead to itself, keeping it out of harm's way but instantly available for use.
Take care of your probes. Keep them off the floor; don't let your chair roll over the leads, squishing the coax and changing its impedance. Buy decent ones before every probe in the shop falls apart. After trying all of the cheap varieties found in general electronic catalogs, I now swallow hard and spend the $150 needed to get high quality probes from Tektronix or HP.
Here's another tip: when using a scope, if a signal looks weird, maybe there's something wrong! Avoid the temptation to rationalize the problem. Instead of blaming the signal on a lousy ground, quickly connect that ground clip and test your assumption.
Never accept something that looks awful. Convince yourself that either it's actually OK, or find the source of the problem.
Walk through your lab. You'll find most of the digital folks have their vertical amplifiers set to 2 V/division, which eases displaying two traces simultaneously. Unfortunately, too many of us seem to think the vertical gain knob is welded into position. It's hard to distinguish a valid zero from one drooling just a little too high with so little resolution! Flip to 1 V/division occasionally to make sure that zero is legitimate.
Every instrument is a lying beast, a source of both information and disinformation. The scope is no exception. A 100 MHz scope will show even a perfect 50 MHz clock as a sine wave, not in its true square form. Digital scopes exhibit aliasing: sweep too slowly, so that the sample rate falls below the Nyquist limit for a given signal, and that 50 MHz clock may look like a perfect 1 kHz signal, causing the inexperienced engineer to go crazy searching for a problem that just does not exist. You have to know your tools to use them effectively.
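To put rough numbers on both lies, here is a minimal Python sketch. The function names are mine, and the model is deliberately crude: it treats the scope's bandwidth as a brick wall (a real front end rolls off gradually) and assumes ideal uniform sampling.

```python
def apparent_frequency(signal_hz: float, sample_rate_hz: float) -> float:
    """Frequency a sampled sine appears to have after aliasing.

    Folds signal_hz into the range 0 .. sample_rate_hz / 2.
    Assumes ideal, uniform sampling (an illustrative simplification).
    """
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    folded = signal_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


def odd_harmonics_in_band(fundamental_hz: float, bandwidth_hz: float) -> list:
    """Odd harmonics of an ideal square wave at or below the bandwidth.

    Brick-wall assumption: anything above bandwidth_hz is treated as gone.
    """
    harmonics = []
    n = 1
    while n * fundamental_hz <= bandwidth_hz:
        harmonics.append(n * fundamental_hz)
        n += 2  # an ideal square wave contains only odd harmonics
    return harmonics


if __name__ == "__main__":
    # Only the fundamental fits in a 100 MHz bandwidth, hence the sine wave.
    print(odd_harmonics_in_band(50e6, 100e6))
    # A 50 MHz clock sampled at 49.999 MS/s folds down to 1 kHz.
    print(apparent_frequency(50e6, 49.999e6))
```

The 49.999 MS/s sample rate is a made-up figure, picked so the alias lands on the 1 kHz mentioned above; any sample rate well below twice the signal frequency will produce some misleading low-frequency image.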
We digital folks deal in ones and zeroes... and tristates. Each condition means something. When troubleshooting you've got to know which of these three (not two!) states a node is in. Our best tool is the scope, yet it is inherently incapable of distinguishing the tristate condition.
In the good old days of LS technology you could be pretty sure a tristated signal would show up at around 1.5 volts - somewhere between a zero and a one. With CMOS this assurance is gone, yet most engineers blithely continue to assume that zero volts means zero. It just ain't so!
My solution is a little tool I made: a 1k resistor with a clip lead on each end. Mine is nicely soldered together and covered with insulation to avoid shorts. To tell the difference between a legal state and high impedance, clip the tool to the node and alternately touch the other end to Vcc and then ground. If the node moves more than a trifle something is wrong. The scope, plus my tool, lets me identify all three possible states. Without the tool I'm guessing, and guessing while troubleshooting always sends you down time-consuming blind alleys.
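Why the 1k resistor works falls out of a simple voltage divider. The sketch below models the node as a voltage source behind an output resistance. The 50 ohm driver resistance, the 10 megohm leakage figure for a floating node, and the 0.5 V "more than a trifle" threshold are all illustrative assumptions of mine, not measured values.

```python
def node_voltage(v_drive: float, r_out: float, v_rail: float,
                 r_pull: float = 1000.0) -> float:
    """Voltage at a node driven through r_out and pulled to v_rail via r_pull."""
    g_out = 1.0 / r_out
    g_pull = 1.0 / r_pull
    return (v_drive * g_out + v_rail * g_pull) / (g_out + g_pull)


def probe_swing(v_drive: float, r_out: float, vcc: float = 5.0,
                r_pull: float = 1000.0) -> float:
    """How far the node moves between touching the resistor to Vcc and to ground."""
    high = node_voltage(v_drive, r_out, vcc, r_pull)
    low = node_voltage(v_drive, r_out, 0.0, r_pull)
    return high - low


def looks_floating(swing_volts: float, threshold_volts: float = 0.5) -> bool:
    """Hypothetical pass/fail rule: a big swing means nothing is driving the node."""
    return swing_volts > threshold_volts


if __name__ == "__main__":
    driven = probe_swing(v_drive=0.0, r_out=50.0)    # assumed CMOS driver
    floating = probe_swing(v_drive=2.5, r_out=10e6)  # assumed leakage only
    print(f"driven node moves {driven:.2f} V, floating node moves {floating:.2f} V")
    print(looks_floating(driven), looks_floating(floating))
```

Under these assumptions a driven node budges by a couple of hundred millivolts, while a tristated one follows the resistor nearly rail to rail, which is exactly the difference the scope shows.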
You can use a variation of this approach when troubleshooting an intermittent problem. If the silly thing refuses to fail when you're working on it - a sure bet, given the perversity of nature - run your fingers over the board's pins. A purely digital board should continue to run despite the slight impedance changes brought about by your fingers, yet these may be enough to drive a floating pin to the other state, hopefully creating the failure you are looking for.
On SMT boards it's tough to get at a device's pins. If there's one pin you are suspicious of, touch it with an X-Acto knife. The sharp blade will precisely align with any tiny pin, and its metal handle will conduct your body impedance to the node. Sometimes I'll connect my trusty pullup/pulldown clip lead to the knife itself to exercise the node more deterministically.
The most effective troubleshooting tool is a keen eye. With a working design, most problems stem from poor manufacturing. How many of us have spent hours troubleshooting a board, only to find a missing chip? Perhaps the wrong part is installed, or the correct one upside down.
In smaller companies engineering is often production's backup for troubleshooting. Don't accept boards unless a technician has performed a careful visual inspection first.
Then, inspect it yourself. It's far faster to find most manufacturing defects by eye than by component-level diagnosis. Look for those missing and backwards chips. Check soldering and solder splashes.
Inspect soldering on through-hole boards using a not-terribly sharp pointer, like an awl. Move it along every pin, using it as a guide for your eye (which will otherwise quickly tire looking at a sea of pins). Scan the board one chip at a time, working in a logical progression from one side of the board to the other. Look for unsoldered and poorly-soldered pins, as well as solder splashes. If it looks bad, it is.
PC board defects are the most frustrating of all problems. Despite modern quality control processes they are still far too common. Keep the PCB artwork around as a reference, so you can see where the tracks run when it's time to fix a short or a design problem.
Often a new design suffers from a problem you just KNOW you can cure by grounding a signal. Be wary of using a clip lead as a grounder: high speed signals will see the lead's inductance as a high impedance. The ground end will be at ground, for sure. The signal end may not look much different than without the clip lead attached. Edges are so fast now, even in slow systems, that wires no longer act like wires. Solder a short - very short - run to ground, perhaps using a discarded resistor lead. I have found that grounding via a clip lead now only works on DC signals. Sigh... in the good old days of slow systems a mountain of clip leads was a troubleshooter's best weapon. Now look warily at that mound, realizing that a wire is not a wire.
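A back-of-the-envelope calculation shows how bad a clip lead can be. This sketch leans on two rules of thumb that are my assumptions rather than measurements: a wire in free air has very roughly 1 microhenry of inductance per meter, and the significant frequency content of an edge extends to about 0.5 divided by its rise time.

```python
import math


def knee_frequency_hz(rise_time_s: float) -> float:
    """Rule-of-thumb upper frequency of interest for an edge (assumed 0.5 / t_rise)."""
    return 0.5 / rise_time_s


def lead_impedance_ohms(frequency_hz: float, length_m: float,
                        inductance_per_m: float = 1e-6) -> float:
    """Inductive reactance of a wire: 2 * pi * f * L.

    inductance_per_m defaults to an assumed ~1 uH/m; real leads vary
    with geometry and with whatever they are draped across.
    """
    return 2.0 * math.pi * frequency_hz * inductance_per_m * length_m


if __name__ == "__main__":
    f_knee = knee_frequency_hz(2e-9)           # a 2 ns edge
    z = lead_impedance_ohms(f_knee, 0.3)       # a 30 cm clip lead
    print(f"{f_knee / 1e6:.0f} MHz, {z:.0f} ohms")
    print(lead_impedance_ohms(0.0, 0.3))       # at DC the lead is a wire again
```

With these figures a 30 cm lead presents several hundred ohms to a 2 ns edge and nothing at all to DC, which matches what the bench tells you.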
Use all of your tools. One of our scopes has a neat digital counter. We use it for tough hardware/software troubleshooting problems. Unsure if an interrupt comes as often as it should? The counter will tell you without a doubt how many come along. Wondering if all interrupts get serviced? Put one counter on the interrupt line, and another on the acknowledge, and see that the values are identical.
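If your scope lacks a counter but can export captured logic levels, the same comparison is easy to do offline. The sketch below assumes each trace is a list of 0/1 samples and that both signals are active high; the function names and sample data are hypothetical.

```python
def count_rising_edges(samples) -> int:
    """Count 0 -> 1 transitions in a sequence of sampled logic levels."""
    edges = 0
    previous = samples[0] if samples else 0
    for level in samples[1:]:
        if level and not previous:
            edges += 1
        previous = level
    return edges


def unserviced_interrupts(request_samples, acknowledge_samples) -> int:
    """Requests minus acknowledges; zero means every interrupt got serviced."""
    return (count_rising_edges(request_samples)
            - count_rising_edges(acknowledge_samples))


if __name__ == "__main__":
    requests = [0, 1, 0, 1, 1, 0, 1]   # made-up capture of the interrupt line
    acks = [0, 0, 1, 0, 0, 1, 0]       # made-up capture of the acknowledge line
    print(unserviced_interrupts(requests, acks))
```

The capture has to sample fast enough to see every pulse, or the counts will lie in their own way.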
Computer systems will crash and burn from a single event. Though digital scopes are wonderful at capturing single-shot signals, it's usually much easier to work with a problem that repeats itself, often, so you can run tests at will. A logic analyzer excels at finding these one-time problems, but most won't help much with electrical (say, marginal signal levels) issues.
Always be on the lookout for ways to cause these events to repeat. For example, the easiest way to troubleshoot reset problems is to use a pulse generator to reset a dead CPU repeatedly, so you can scope the reset sequence.
Years ago we used a shortwave radio to listen to the operation of our systems' code. With a little experience we knew what sort of noise to expect in each of the instrument's important operating modes. With the volume turned to a quiet murmur, any change in its buzz instantly signaled trouble. Troubleshooting is a multi-sensory experience. Wait! What's that? It smells like a resistor burning... a wire-wound, by its odor! The game's afoot!