The Tao of Diagnostics
Part 2 of a series about embedded diagnostics.
Published in ESP, July, 1990.
A few weeks ago I had our broken microwave apart in my workshop.
It says something about our business that one of my greatest fears
is repairing a microprocessor-based product; a simple chip failure
often consigns an appliance to the landfill. Fortunately, this
was a simple case of the door microswitches not engaging properly.
After a bit of study, I discovered that they had to close in one
particular processor-monitored sequence, no doubt to prevent backyard
mechanics from bypassing them and getting fried. The correct sequence
was (of course) undocumented and difficult to adjust.
This is an all too common example of poor embedded design. Every
system should have some provision for in-the-field repair. Software
engineers have a responsibility to make these adjustments easier.
Why didn't the designers include a little code to show what sequence
the switches engaged in?
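A minimal sketch of what such code might look like, in C. Everything
here is hypothetical: the number of switches, the names, and the
assumption that the firmware samples the door switches as a bit mask.
It simply records the order in which the switches close so that a
diagnostic mode could show it on the oven's display.

```c
#include <stdint.h>

#define NUM_SWITCHES 3   /* assumed number of door switches */

/* Order in which the switches closed; order[0] closed first. */
typedef struct {
    uint8_t order[NUM_SWITCHES];
    uint8_t count;       /* closures recorded so far */
    uint8_t last_state;  /* switch bit mask from the previous sample */
} switch_log_t;

void switch_log_reset(switch_log_t *log)
{
    log->count = 0;
    log->last_state = 0;
}

/* Call on every sample of the door switches. Bit n of 'state' is 1
   when switch n is closed. Switches that have just closed are
   appended to the log in the order they are seen. */
void switch_log_sample(switch_log_t *log, uint8_t state)
{
    uint8_t newly_closed = (uint8_t)(state & ~log->last_state);

    for (uint8_t n = 0; n < NUM_SWITCHES; n++) {
        if ((newly_closed & (1u << n)) && log->count < NUM_SWITCHES) {
            log->order[log->count++] = n;
        }
    }
    log->last_state = state;
}
```

A diagnostic mode would reset the log, ask the user to close the
door, and then display the contents of order[] for comparison against
the sequence the firmware expects.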
As I mentioned last month, it's really impossible to say intelligent
things about the huge range of I/O used in embedded systems. But,
for God's sake, let's use our brains! Be sympathetic to the user's
needs. Remember that the system will fail, either in the field
or in production test. Make it easy to isolate the problem.
Like the microwave oven example, most embedded systems interface
to mechanical and electronic sensors and actuators. Certainly
the mechanical portions are prone to failure; just as certainly
analog I/O is subject to drift, noise, and other effects that
we digital people hate to acknowledge. The tests described last
month (and those you've invented in your never-ending quest to
build a reliable product) will help get the system to boot. The
next step is to give the test technician and end user a "back
door" into a diagnostics suite.
Consider the system's analog circuits. These components all exhibit
slightly different characteristics, so potentiometers are used
to tune offsets and gains. Sometimes, lots of pots are used. It's
interesting to watch a test group calibrate these sorts of instruments;
frequently special test equipment is needed to monitor the voltages
during pot twiddling. Without this equipment these adjustments
simply cannot be made in the field. In most cases a bit of clever
software can take advantage of a panel display to replace the
test gear. Write a bit of code to show raw voltage, or whatever
is being monitored, on the system's own output device. Certainly
you've already written low level routines to get the data (for
use in the main program); spend an afternoon writing a simple
diagnostic that calls this subroutine and formats the output.
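As a rough illustration, such a diagnostic might be little more than
the following C sketch. The 12-bit converter, the 5 volt reference,
and the function names are assumptions made for the example; your
own low level routine would supply the raw counts.

```c
#include <stdint.h>
#include <stdio.h>

#define ADC_FULL_SCALE_COUNTS 4095u   /* assumed 12-bit converter */
#define ADC_REFERENCE_MV      5000u   /* assumed 5 V reference */

/* Convert a raw ADC reading to millivolts using integer math only. */
uint32_t adc_counts_to_mv(uint16_t counts)
{
    return ((uint32_t)counts * ADC_REFERENCE_MV) / ADC_FULL_SCALE_COUNTS;
}

/* Format one line for the panel display, for example "CH2 2.500 V".
   Returns the number of characters snprintf wanted to write. */
int format_voltage_line(char *buf, size_t len, unsigned channel,
                        uint16_t counts)
{
    uint32_t mv = adc_counts_to_mv(counts);

    return snprintf(buf, len, "CH%u %lu.%03lu V", channel,
                    (unsigned long)(mv / 1000u),
                    (unsigned long)(mv % 1000u));
}
```

Called in a loop, with the result sent to the panel display, this
lets the technician watch the reading change while turning the pot.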
Pots are a continuous source of frustration to users. Think -
can you come up with a better, self-calibrating design? Try writing
code that removes offset, gain, and other errors mathematically.
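One common way to do this is a two-point calibration, sketched below
in C. It assumes the errors are linear and that the hardware can
present two known references (zero and full scale) to the input; the
structure and function names are invented for the example.
Nonlinearity and drift with temperature are not handled.

```c
#include <stdint.h>

/* Calibration constants, computed once and stored (in EEPROM, say). */
typedef struct {
    int32_t offset;     /* raw reading at the zero reference */
    int32_t span_raw;   /* raw full-scale reading minus the offset */
    int32_t span_true;  /* true value of the full-scale reference */
} cal_t;

/* Derive the constants from raw readings of the two references.
   Returns 0 on success, -1 if the readings are identical (which
   would make the gain undefined). */
int cal_compute(cal_t *cal, int32_t raw_zero, int32_t raw_full,
                int32_t true_full)
{
    if (raw_full == raw_zero) {
        return -1;
    }
    cal->offset    = raw_zero;
    cal->span_raw  = raw_full - raw_zero;
    cal->span_true = true_full;
    return 0;
}

/* Correct one raw reading: subtract the offset, then scale by the
   measured gain. The 64-bit intermediate avoids overflow. */
int32_t cal_apply(const cal_t *cal, int32_t raw)
{
    int64_t scaled = (int64_t)(raw - cal->offset) * cal->span_true;

    return (int32_t)(scaled / cal->span_raw);
}
```

With constants like these stored in nonvolatile memory, every reading
is corrected in software and there is no pot left to adjust.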
The Scope
When you design diagnostics for field or in-house use, be sure
to bear in mind the sorts of tools users will have available.
One of the most useful troubleshooting tools is the venerable
oscilloscope. Most test and repair technicians rely on the scope
almost exclusively. Logic analyzers, emulators, and the other
tools used in engineering are not nearly so ubiquitous in the
test environment. Remember this when writing diagnostic code.
Yin and yang. While the scope is the universally accepted troubleshooting
tool, computer-based systems are not really well suited to scope
diagnosis. Digital events tend to be wide (requiring many channels,
as for addresses and data), or very intermittent (a 1 microsecond
event once per second). Even the most sophisticated scope can't
capture these signals without some help from the code. The solution?
Write diagnostics that run in repetitive loops, and be sure to
toggle a bit (say, an I/O port) at the start of each loop. The
technician can trigger the scope's sweep (i.e., start the trace
at the left side of the screen) each time the bit is asserted.
This "scope trigger point" gives an essential reference
to the sequencing of events, in many cases making the scope as
useful as a logic analyzer.
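A looping diagnostic with a trigger point can be sketched in a few
lines of C. The port pointer, the choice of bit 0 as the trigger,
and the function names are assumptions for the example; the pass
counter is only a stand-in for a real test.

```c
#include <stdint.h>

#define TRIGGER_BIT 0x01u   /* assumed: bit 0 of a spare output port */

/* Stand-in for a real diagnostic pass; it only counts invocations. */
uint32_t pass_count;
void example_pass(void) { pass_count++; }

/* Run one diagnostic repeatedly, pulsing the trigger bit at the
   start of every pass. 'port' points at the output port register.
   'passes' bounds the loop here; a real diagnostic might loop
   until reset. Other bits of the port are left unchanged. */
void run_with_scope_trigger(volatile uint8_t *port,
                            void (*test_pass)(void),
                            uint32_t passes)
{
    while (passes--) {
        *port |= TRIGGER_BIT;            /* rising edge: scope triggers */
        *port &= (uint8_t)~TRIGGER_BIT;  /* drop it for the next pass */
        test_pass();                     /* the activity being observed */
    }
}
```

The technician triggers on the rising edge of the bit and probes
whatever the test exercises with the other channel.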
The best software engineers regularly make use of scopes during
initial code debugging. It's amazing just how much information
you can extract from the code by watching event synchronization,
port assertions, or even chip selects on a scope's display. If
you are not familiar with this valuable tool, have a hardware
guru give you a lesson in its use. Debugging embedded code is
hard - take advantage of every tool you can find.
Reporting Failures
I've always hated the annoying beep my Macintosh makes on reset.
Until this year, that is, when the computer died with a dramatic
belch of smoke. Where, exactly, was the failure? The screen went
blank - was the CPU dead? Could the power supply have failed?
But wait - on reset the computer still beeps! Power must be OK,
and the CPU is probably working. Indeed, it turned out the problem
was localized to the video circuits, and a $100 mail order board
brought the computer back to life. The once annoying beep saved
an expensive trip to the Mac man.
Years ago Computer Automation installed Go/Nogo LEDs on every
board in their "Naked Mini" (their name, not mine) computers.
Like the Mac's beep, these simple indicators save users a lot
of grief. Nothing this simple is foolproof, but even an 80% success
rate is worthwhile.
Certainly systems with CRTs or other alphanumeric displays can
easily show lots of useful error information. Working in C makes
formatting output especially easy. Use these resources, but don't
depend on them. An awful lot of hardware and software must work
before even a single character can be displayed on a CRT; self
test routines should depend on the absolute minimum of functioning
hardware.
Learn from the automotive companies. Cars have a lot of sensors,
all wired to an under-hood computer. Dozens (at least) of potential
failure nodes exist. Ford, GM, and others let the mechanic put
the computer into a self-test mode, and flag errors by toggling
one bit very slowly. The engineers cleverly realized that a voltmeter
is about all you can count on a mechanic having and understanding,
so their software drives the bit up and down so slowly that even
a meter needle can show the transitions. Error 51 might mean "failed
PCV valve", and is indicated by 5 needle deflections, a pause,
followed by one more. What could be simpler?
A LED is just as effective and even easier to use. If the product
is too cost sensitive to include even a 50 cent LED, provide a
place to clip one on.
If you use a LED rather than a voltmeter, then the flashes can
be quite a bit faster. A subroutine to show one digit of a code
is simple, and typically takes the following form:
Pseudocode:
  Set COUNT = number of flashes wanted
  LOOP: turn LED on
        delay for 1/4 second
        turn LED off
        delay for 1/2 second
        COUNT = COUNT - 1
        Go to LOOP as long as COUNT is non-zero
Avoid using zeroes as part of an error code. While zero might
correspond to "no flash", it is visually very confusing.
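The pseudocode above handles one digit. A C sketch of a routine that
flashes a complete code (5 flashes, a pause, then 1 for error 51)
might look like this. The hardware hooks, the 1.5 second pause
between digits, and all of the names are assumptions; the "sim"
functions are desktop stand-ins so the sketch can be tried off the
target. Codes containing a zero digit are rejected.

```c
#include <stdint.h>

/* Hardware hooks supplied by the caller. Both are hypothetical. */
typedef struct {
    void (*led)(int on);            /* drive the LED on (1) or off (0) */
    void (*delay_ms)(unsigned ms);  /* busy-wait or timer delay */
} flash_hw_t;

/* Desktop stand-ins for the hooks. */
unsigned sim_flashes;      /* times the LED was turned on */
unsigned sim_elapsed_ms;   /* total time "waited" */
void sim_led(int on) { if (on) sim_flashes++; }
void sim_delay_ms(unsigned ms) { sim_elapsed_ms += ms; }

/* Flash one digit: 'count' pulses of 1/4 s on, 1/2 s off. */
static void flash_digit(const flash_hw_t *hw, unsigned count)
{
    while (count--) {
        hw->led(1);
        hw->delay_ms(250);
        hw->led(0);
        hw->delay_ms(500);
    }
}

/* Flash a decimal error code, most significant digit first, with a
   longer pause between digits. Returns -1 without flashing if the
   code is zero or contains a zero digit, 0 otherwise. */
int flash_error_code(const flash_hw_t *hw, unsigned code)
{
    unsigned digits[10];   /* enough for a 32-bit unsigned value */
    int n = 0;

    if (code == 0) {
        return -1;
    }
    for (unsigned c = code; c != 0; c /= 10) {
        if (c % 10 == 0) {
            return -1;
        }
        digits[n++] = c % 10;
    }
    while (n--) {
        flash_digit(hw, digits[n]);
        if (n > 0) {
            hw->delay_ms(1500);   /* assumed pause between digits */
        }
    }
    return 0;
}
```

On the target the two hooks would write the LED's port bit and spin
on a timer, so nothing beyond the CPU and one output bit has to work.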
Showing error codes on a single LED is arguably better than showing
the complete code in a conventional 7-segment or ASCII display.
The single bit approach is more robust; not much hardware support
is needed. If the system has a number of LEDs, consider sending
the same pattern to all of them. A single LED (or port) failure
will then be obvious, and the remaining LEDs will still show the
error code.
ROM Monitors
Let's not forget the sophisticated troubleshooter. We've all had
the unpleasant experience of being called in to find and fix design
flaws. Build in tools to make this sort of work easier for you
and your associates.
If the embedded system includes some sort of terminal interface,
then including a monitor (or "remote debugger") is a
nice way to give the high-end user access to the system's internals.
A ROM monitor may not be as powerful as an emulator or logic analyzer,
but it is easy to invoke. A built-in monitor is like a sleeping
giant, dormant, waiting to be called into action by entering a
secret command. But be careful - I once failed to check for keyboard
overflow in a product, and a user called to complain about the
weird mode (the monitor) that the product entered when his cat
sat on the keyboard.
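The lesson is to bound the input buffer before comparing against
the secret command. A minimal C sketch follows; the buffer size,
the "~MON" command string, and the names are all invented for the
example, and a real product would pick its own.

```c
#include <stddef.h>
#include <string.h>

#define CMD_BUF_SIZE 16        /* assumed input buffer size */
#define MONITOR_KEY  "~MON"    /* hypothetical secret command */

typedef struct {
    char   buf[CMD_BUF_SIZE];
    size_t len;
    int    overflow;   /* set when a line was too long to store */
} line_buf_t;

void line_reset(line_buf_t *lb)
{
    lb->len = 0;
    lb->overflow = 0;
}

/* Feed one received character. Returns 1 when a complete line
   exactly matches the secret command, 0 otherwise. Characters that
   do not fit are discarded, and an overlong line never matches. */
int line_feed(line_buf_t *lb, char c)
{
    if (c == '\r' || c == '\n') {
        int match = !lb->overflow &&
                    lb->len == strlen(MONITOR_KEY) &&
                    memcmp(lb->buf, MONITOR_KEY, lb->len) == 0;
        line_reset(lb);
        return match;
    }
    if (lb->len < CMD_BUF_SIZE) {
        lb->buf[lb->len++] = c;
    } else {
        lb->overflow = 1;   /* drop the character, remember why */
    }
    return 0;
}
```

A cat (or a stuck key) can fill the buffer all day; the worst that
happens is that the line is thrown away at the next carriage return.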
Even a simple monitor lets you change and examine memory and I/O.
Giving the hardware troubleshooter access to I/O can save him
hours of work - entering an input command to see what a port does
is much simpler than trying to capture the event on a logic analyzer.
If you feel really generous with your time, display the status
of all system I/O in a table, converting cryptic hex statuses
to meaningful keywords. "Data ready" is a lot easier
to understand than "02".
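A table-driven decoder keeps this cheap to write. The C sketch below
assumes a single 8-bit status port; the bit assignments are made up
for the example (only "Data ready" at 02 comes from the text), as
are the names.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint8_t     mask;
    const char *name;
} status_bit_t;

/* Hypothetical bit assignments for one status port. */
static const status_bit_t status_bits[] = {
    { 0x01, "TX empty"     },
    { 0x02, "Data ready"   },
    { 0x04, "Overrun"      },
    { 0x08, "Parity error" },
};

/* Write the names of all set bits into buf, separated by commas.
   Returns the number of names written. Stops early if buf is full. */
int decode_status(uint8_t status, char *buf, size_t len)
{
    size_t used = 0;
    int found = 0;

    if (len == 0) {
        return 0;
    }
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof status_bits / sizeof status_bits[0]; i++) {
        if (status & status_bits[i].mask) {
            int n = snprintf(buf + used, len - used, "%s%s",
                             found ? ", " : "", status_bits[i].name);
            if (n < 0 || (size_t)n >= len - used) {
                break;   /* out of room */
            }
            used += (size_t)n;
            found++;
        }
    }
    return found;
}
```

Adding another port to the display is then a matter of adding
another table, not more code.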
A disassembler, assembler, and simple breakpoints are a lot more
work to add, but if you go to the trouble you can then patch
small test routines into the product's RAM. At the very least
have a GO command that starts a program at any address. Then,
you can patch in instruction hex codes and start simple test loops
that perhaps cycle a particular port. The scope-happy technicians
will love you for it. Is a port very occasionally intermittent?
A few bytes of code can monitor this much more effectively than
any other means.
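In C, such a watcher might be as small as the sketch below. It
assumes a memory-mapped input port (on a part with separate I/O
space the read would be an input instruction instead), and the
function name and parameters are invented for the example.

```c
#include <stdint.h>

/* Sample an input port 'samples' times and count how often the
   bits selected by 'mask' differ from their expected value. */
uint32_t count_port_glitches(volatile const uint8_t *port,
                             uint8_t mask, uint8_t expected,
                             uint32_t samples)
{
    uint32_t glitches = 0;

    while (samples--) {
        if ((*port & mask) != (expected & mask)) {
            glitches++;
        }
    }
    return glitches;
}
```

Left running for an hour, the count tells you whether the port ever
misbehaved, which no one watching a scope screen could do.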
A monitor can serve as a diagnostics platform. It is an easy
way to invoke complex test routines, and gives the basis of a
nice interface for communicating test results. Like Microsoft's
new Programmer's Work Bench, it is a sort of software bus to hang
diagnostics and other utilities from.
All of my company's products include such a monitor. Our customers
are not aware of it, but in our lab we regularly invoke it to
diagnose all sorts of problems.
A number of companies sell commercial ROM monitors. First Systems,
Microtec, and Intermetrics all provide quite sophisticated products
that can be included in a design.
Diagnostics Tricks
I could go on at great length about using powerful troubleshooting
aids like emulators and Fluke's Microsystem Troubleshooter. These
sorts of tools quickly find bus shorts and other problems that
prevent the computer from coming up at all. If it doesn't boot,
then all the internal diagnostics in the world are useless. If
the techs don't have decent tools, then they will be reduced to
"shotgunning" - replacing components at random and hoping
for success.
You can make their job a bit easier during the product's hardware
design. (Yes, programmers should be involved in hardware design,
at least to the extent of contributing their expert knowledge
to make the system as close to perfect as possible). A nice way
of finding bus shorts, memory failures, and the like is to execute
a looping program, letting the technician examine each address
and data line with a scope to find the source of the trouble.
Of course, if the memories don't work, or if the address bus is
shorted, how can we run a program?
On the Z80 and 8085 family the RST 7 instruction is a one byte
CALL to location 38 hex. Was Intel clairvoyant, or was it just luck
that caused them to use opcode FF for this instruction? As a result,
if you add pullup resistors to the bus, then simply removing all
memory chips will make the processor execute CALLs to 38 all day
long. The stack pointer will decrement through the processor's
entire address space, so the technician can look at address lines
and check that they cycle properly. The data bus will show the return
address being pushed each time RST 7 executes; since the stack pointer
decrements, the addresses these writes go to will change as well. This trivial test gives the
repetitive signal needed to effectively use a scope to check out
the hardest parts of the system.
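To see why every address line gets exercised, here is a small
desktop simulation in C. It is only a model of the behavior
described above, not of any real part: it tracks the addresses of
the opcode fetch and the two stack writes for each RST 7 and
ignores timing, refresh, and every other bus detail. The initial
stack pointer value and all names are assumptions.

```c
#include <stdint.h>

typedef struct {
    uint16_t pc;
    uint16_t sp;
    uint16_t addr_seen_high;  /* address lines observed at 1 */
    uint16_t addr_seen_low;   /* address lines observed at 0 */
    uint16_t last_pushed;     /* most recent return address written */
} freerun_t;

/* Record one bus cycle at the given address. */
static void bus_cycle(freerun_t *s, uint16_t addr)
{
    s->addr_seen_high |= addr;
    s->addr_seen_low  |= (uint16_t)~addr;
}

void freerun_init(freerun_t *s, uint16_t initial_sp)
{
    s->pc = 0x0000;           /* execution starts at 0 after reset */
    s->sp = initial_sp;       /* arbitrary; any value works */
    s->addr_seen_high = 0;
    s->addr_seen_low = 0;
    s->last_pushed = 0;
}

/* Execute one RST 7: fetch FF from the pulled-up bus, push the
   return address, and jump to 0038 hex. */
void freerun_step(freerun_t *s)
{
    bus_cycle(s, s->pc);      /* opcode fetch reads FF */
    s->pc++;
    s->sp--;
    bus_cycle(s, s->sp);      /* write high byte of return address */
    s->sp--;
    bus_cycle(s, s->sp);      /* write low byte of return address */
    s->last_pushed = s->pc;
    s->pc = 0x0038;
}
```

After 32768 steps the stack pointer has wrapped through all 64K
addresses, so both addr_seen_high and addr_seen_low read FFFF hex:
every address line has been driven both high and low. In this model
the value pushed settles at 0039 hex after the first pass.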
Other CPUs usually have a similar instruction. On the 8088 family
the INT 3 instruction is a similar one byte opcode. A one byte
PUSH might even be better. Since these instructions are not FF
opcodes, pull up the bus and add a jumper field so the technician
can set the proper opcode.
Be sure that at least some of the diagnostics can run with the
absolute minimum amount of the system working, and a minimal number
of boards plugged in. Think about the example set by the Naked
Mini - diagnostics were limited to each card, reducing potential
for interaction between system components.
OK, so you say this is a one-off unit that will never be reproduced,
and that has a design life of only a few weeks. Why spend time writing
diagnostics? This is a valid point, but even in these extreme
cases be sure the system has at least an "easy mode".
That is, be sure that on power up (or by installing a jumper or
setting a switch) a dramatic event occurs - say, a lamp lights.
This way you can tell in a second if the computer is running and
power is applied. You don't want to spend time chasing timing
problems in a complex system when the computer hasn't even started.
As a company, we're all in this together - right? Use your expert
knowledge, and your knowledge of everyone else's job (after all,
we all should strive to be high tech Renaissance Persons), to
make the job of the techs in production test and repair easier
(or even possible).
The microwave oven is fixed, but my workbench is still littered
with broken electronics. Now, if I could only get my car radio's
FM section to work...