Troubleshooting 101

Troubleshooting is more art than science. Here are a few ideas.

Published in EDN, November 1995

By Jack Ganssle

I've worked with a lot of engineers over the years. Most have a single area of expertise: they design complex high-speed systems, or they are firmware wizards, or even troubleshooting geniuses. A few, the very best, are adept at every area of embedded design. Surely you've met that solitary genius who quietly and competently creates a paper design, guides it through prototyping, develops a bit of test or application code, and somehow, without fuss, just makes it work.

This industry continues to evolve in fascinating ways. Only a generation ago computer design was about as complex an art as existed. With the invention of the microprocessor this changed. Slap a micro, a couple of memory chips, and some standard peripherals together and voila! You've got an embedded system. Most of the complexity lived in the software.

Now we're oozing back to complexity in hardware design, spurred on by new developments - like FPGAs and complex PLDs - that resemble software in their use.

A wag commented that being a programmer means never having to say you are done. Modern technology has put us hardware folks in the same unhappy situation. "Oh, don't worry - it's just a simple FPGA change" is the new mantra. Worse: "well, just ship it and we'll get a better set of equations out later".

Complex hardware design implies tough troubleshooting problems in bringing up prototype units. High density programmable chips make life much harder as pin limitations yield little insight into the internals of a 10,000 gate part.

We need to elevate efficient troubleshooting from its current status as an art to that of a science. Too many engineers, particularly young ones just out of school, are left adrift with no idea where to turn when the damn thing doesn't work.

Speed Up by Slowing Down

The Jedi master engages an opponent by clearing his mind and calling on the Force. Hey, troubleshooting is hard, so call on anything you can! At the very least follow the Jedi's example by starting with a clear mind, a clean bench, and an organized set of tools.

Too many designers jump into a problem without getting ready to do battle. You see them, empty junk food bags piled atop the poorly maintained test equipment, scattered debris from a dozen other troubleshooting contests buried under the latest set of schematics.

Clean up. Get rid of all of those short-producing solder splashes and old resistor leads. Consider mounting stand-offs on PCBs so they don't lie in the bench debris. Sort out your tools. Make sure you have enough outlets at hand to avoid power plug mania. Get a pile of clip leads. Is your lab notebook open and ready for action? What! You don't use one? Where do you record the things you learn (like mods needed to the board) - on an easily-lost scrap of paper? Get a bound notebook that is always at hand for your lab work. Use it daily. Record poetry, love notes, or ideas for science fiction stories... but log engineering details, experimental setups, and the latest neat idea you'll try first thing in the morning.

Never just do something - automate it. Build batch files to download your code and initialize the tools. Program the logic analyzer setup and save it to disk. Your employer is paying you to think; repetitive tasks that you could have automated could be done by a monkey.
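
As one way to picture this, here is a minimal sketch of a bring-up script. The article suggests batch files; this is the same idea written in Python instead. The function name run_steps and the placeholder commands are assumptions for illustration only; substitute whatever downloader and tool-setup commands your own bench uses.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (label, command) pair in order and stop at the first failure.

    Returns the label of the failing step, or None if every step succeeded.
    """
    for label, command in steps:
        print(f"--> {label}")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"step failed: {label} (exit code {result.returncode})")
            return label
    return None


if __name__ == "__main__":
    # Placeholder steps (hypothetical): sys.executable stands in for a real
    # build, download, or instrument-setup command.
    steps = [
        ("build firmware (placeholder)", [sys.executable, "-c", "print('build ok')"]),
        ("download to target (placeholder)", [sys.executable, "-c", "print('download ok')"]),
    ]
    run_steps(steps)
```

Stopping at the first failed step is a deliberate choice in this sketch: there is little point downloading code that did not build.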

I have a love-hate relationship with the logic analyzer. It's a fantastic tool that yields information obtainable no other way with wonderfully precise timing resolution. It's just such a pain to connect 50 or 100 leads to run an experiment. In digital systems most of the analyzer's leads will go to the address and data buses. Build a standard connector you attach these to. We buy extra analyzer pod-ends we can permanently connect to a "standard" internal connector, greatly speeding the process of connecting the instrument.

Avoid wire-wrapped prototypes. Digital designs are simply too fast now. Rapid turn PCB vendors (look at the ads in this magazine) will produce a 10-layer board in a week for a quite reasonable fee. The PCB will eliminate all of the noise uncertainty inherent in a wire-wrapped design. As an engineering manager, I'm always terrified by that oh-so-common statement "well, this doesn't really work, but the PCB layout probably will." Prove it. Go with PCBs from the outset.

Assumptions

A misspent youth of blaring rock 'n roll left my hearing somewhat impaired, but helped formulate, of all things, my philosophy of troubleshooting digital systems. The title of the Firesign Theatre's "Everything You Know is Wrong" album should be our modern anthem for making progress in the lab.

I hate getting called into a troubleshooting session and finding that the engineer "knows" that x, y, and z are not part of the problem at hand. Everything you know is wrong! Is that 5 V supply really 5 V at the PCB? What makes you think ground goes to the chips - when a single part has 5 or 10 ground connections, make sure all of them are connected. Could the system be dead because there's no clock signal? Are you sure the design isn't really working - could your experiment be flawed?

Assume nothing. Test everything. The PCB may have manufacturing errors on internal layers. Power and ground may not be on the pins you expect - particularly on newer high density SMT parts. Signals labeled without an inversion bar may actually be active low. You might have ROMs mixed up. Perhaps someone loaded the wrong parts on the board.

Never blindly trust your test equipment - know how each instrument works and what its limitations are. If two signals seem impossibly skewed by 15 nsec on the logic analyzer, make sure this is not an artifact of setting it to sample too slowly. When your 100 MHz scope shows a perfectly clean logic level, remember that undetected but virulent strains of 1 nsec glitches can still be running merrily around your circuit.
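
To put rough numbers on those two limits, here is a minimal sketch. It assumes the common rule of thumb that a scope's 10-90% rise time is roughly 0.35 divided by its bandwidth, and that a logic analyzer cannot resolve timing differences smaller than one sample period. The function names and the 50 MHz sample rate are illustrative assumptions, not taken from any instrument's documentation.

```python
def scope_rise_time_ns(bandwidth_mhz):
    """Approximate 10-90% rise time in ns, using the 0.35 / bandwidth rule of thumb."""
    return 0.35 / (bandwidth_mhz * 1e6) * 1e9


def analyzer_sample_period_ns(sample_rate_mhz):
    """Sample period in ns; timing differences smaller than this cannot be resolved."""
    return 1e3 / sample_rate_mhz


if __name__ == "__main__":
    print(f"100 MHz scope rise time: about {scope_rise_time_ns(100):.1f} ns")
    print(f"50 MHz sampling: {analyzer_sample_period_ns(50):.0f} ns per sample")
```

Under these assumptions a 100 MHz scope has a rise time of about 3.5 ns, so a 1 nsec glitch will be badly attenuated if it shows up at all. An analyzer sampling at an assumed 50 MHz takes one sample every 20 ns, so a tiny real skew can be displayed as a full sample period.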

When you do see a glitch, one that seems impossible given the circuit design, remember that manufacturing shorts can do strange things to signals. Is the part hot? A simple finger test may be a good short indicator.

Learn to Estimate

At the peril of sounding like one of the ancients, I do miss the culture of the slide rule. Though accurate answers might have been elusive, we did learn to estimate the answer for every problem before attempting a solution. Alas, it's a skill that is fading away.

Calculator abuse - computing without thinking - is now too ingrained in our society to waste effort fighting. Bummer. Other instruments, though, also tempt us to mentally coast, to do things without thinking. Take the scope: I can't count the times an engineer mentioned that he sees the signal... but has no idea, when I ask, about the width of the pulse. Is it 1 nsec? 1 usec? Perhaps a second wide?

Timing is critical in computers, yet too many of us use the scope as a sort of logic probe. "Hey, the signal is there!" Which signal? If you expect a 10 usec pulse every msec, then any deviation from that norm is simply wrong. Know what to expect, and then ensure the waveforms are approximately correct. A misused scope will generate a morass of misinformation.
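
"Know what to expect" can be written down before you ever touch the probe. The sketch below compares a measured pulse width and period against the expected 10 usec pulse every msec. The 20% tolerance and the function names are assumptions chosen for illustration; pick a tolerance that suits your design.

```python
def within(expected, measured, tolerance=0.2):
    """True if measured is within a fractional tolerance of expected (same units)."""
    return abs(measured - expected) <= tolerance * expected


def check_pulse_train(width_us, period_us,
                      expected_width_us=10.0, expected_period_us=1000.0,
                      tolerance=0.2):
    """Return a list of complaints; an empty list means the waveform looks plausible."""
    problems = []
    if not within(expected_width_us, width_us, tolerance):
        problems.append(f"width {width_us} usec, expected about {expected_width_us} usec")
    if not within(expected_period_us, period_us, tolerance):
        problems.append(f"period {period_us} usec, expected about {expected_period_us} usec")
    return problems


if __name__ == "__main__":
    print(check_pulse_train(10.5, 990.0))   # close to the expected waveform
    print(check_pulse_train(1.0, 1000.0))   # the signal is "there", but the width is wrong
```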

Estimate the performance of firmware before writing it. Sure, it's tough to know how many microseconds an as-yet-unwritten function will chew up, but you can use your general knowledge of systems to make some ballpark estimates about where problems will occur.

For example, a fast serial link might overrun a busy CPU. Estimate! 38,400 baud is about 4000 characters/second, or one character per 250 usec. 250 usec is not a lot of time for any CPU, particularly the typical embedded 8 bitter. Your processor will be pretty busy servicing the data. If polled, then only heroic efforts will keep you within the 250 usec timing margin.
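
The same estimate as a minimal sketch, assuming 10 bits on the wire per character (one start bit, eight data bits, one stop bit). That framing is an assumption; the article only gives the rounded figures.

```python
def chars_per_second(baud, bits_per_char=10):
    """Maximum character rate on a saturated asynchronous serial link."""
    return baud / bits_per_char


def char_time_us(baud, bits_per_char=10):
    """Microseconds between characters when the link is saturated."""
    return bits_per_char / baud * 1e6


if __name__ == "__main__":
    print(f"{chars_per_second(38400):.0f} characters/second")
    print(f"{char_time_us(38400):.0f} usec per character")
```

With that framing, 38,400 baud works out to 3840 characters/second, or about 260 usec per character, which agrees with the ballpark of 4000 characters/second and 250 usec.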

Suppose you chose to implement the serial receive routine as an ISR - what is the overhead? An assembly routine to queue incoming data will need a dozen or two instructions, each of which will no doubt burn up two or three machine cycles. Surely you know roughly how long a machine cycle takes (including wait states) for your system... don't you? Given this information you can get a reasonable timing estimate before writing a line of code.
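
Here is that arithmetic as a minimal sketch. The specific numbers (24 instructions, 3 machine cycles each, a 1 usec machine cycle) are assumptions picked from the ranges above purely for illustration; plug in the figures for your own processor.

```python
def isr_time_us(instructions, cycles_per_instruction, cycle_time_us):
    """Estimated ISR execution time in microseconds."""
    return instructions * cycles_per_instruction * cycle_time_us


def cpu_load(isr_us, budget_us):
    """Fraction of the per-character time budget consumed by the ISR."""
    return isr_us / budget_us


if __name__ == "__main__":
    isr_us = isr_time_us(instructions=24, cycles_per_instruction=3, cycle_time_us=1.0)
    print(f"ISR takes about {isr_us:.0f} usec")
    print(f"CPU load at one character per 250 usec: {cpu_load(isr_us, 250.0):.0%}")
```

With these assumed numbers the ISR takes about 72 usec, roughly 29% of a 250 usec budget. That is worth knowing before a line of code exists.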

Recently an engineer told me that "that initialization loop is clearly the problem." Oh yeah? He was looking for something burning up almost a second of time, when clearly, regardless of processor, 1000h memory zeroing iterations will run in a few milliseconds. Use your tools, one of which is your brain, to make sure you are addressing the real problems.
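
A quick sanity check of that claim, as a sketch. The cycles per iteration and the machine cycle time are assumptions; the first set is deliberately pessimistic for a slow 8-bit part.

```python
def loop_time_ms(iterations, cycles_per_iteration, cycle_time_us):
    """Estimated run time of a simple loop, in milliseconds."""
    return iterations * cycles_per_iteration * cycle_time_us / 1000.0


if __name__ == "__main__":
    # 1000h iterations is 4096 decimal.
    slow = loop_time_ms(0x1000, cycles_per_iteration=20, cycle_time_us=1.0)
    fast = loop_time_ms(0x1000, cycles_per_iteration=10, cycle_time_us=0.1)
    print(f"pessimistic estimate: {slow:.1f} ms")
    print(f"faster processor:     {fast:.1f} ms")
```

Even the pessimistic case comes to roughly 82 ms, more than ten times short of the missing second, and the faster assumed processor lands at about 4 ms. Either way, the loop is not where the second is going.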

Common Sense

Think, don't do. Recently I saw a technician troubleshooting a board that exhibited multiple problems. One chip was hot enough to fry eggs, yet he chose to work on another, "unrelated" symptom. Dumb move - the part was surely ready to self-destruct, which would create yet more grief for the poor tech.

When starting out debugging a very fast system, crank the clock rate down to absurdly low levels. Fix the easy stuff - logic errors and the like - before tackling high speed timing. Why deal with a vast ocean of troubles simultaneously?

When you do find the problem, and then make a change, sometimes the modification won't help. Before doing anything else, double-check the change. Did you solder the wire to the right pin? The right IC? We tend to program ourselves to look for hard problems instead of the all-too-common simple mistakes.

Plan ahead. Think before doing. Don't try something without knowing what the possible outcomes are... and without having some idea what you'll do for any of those outcomes. You may find that the next step will be the same regardless of the results of the experiment. In this case, save time and do something else.

The best troubleshooters are closet chess grandmasters. They think many steps ahead.