The Zen of Diagnostics
This is the first of a two-part series about adding diagnostics
to your programs.
Published in Embedded Systems Programming, June 1990
For some inexplicable reason most of us embedded programmers rarely
concern ourselves with making a product manufacturable. Sure,
we work like slaves meeting performance goals, but the product
must do more than function correctly - it must be designed to
be producible. While our hardware brethren tune their designs
to meet cost, manufacturing, and performance goals, we work in
relative isolation; a sort of black hole where few dare to tread.
In high school I worked for a while in a machine shop, serving
as the lowest sort of helper to highly skilled machinists making
parts for the space program. When they discovered I intended to
go to college to become an engineer, one grizzled veteran warned
me to not "design something we can't make". Now, twenty
years later, I have to admit that these words, so casually and
freely given, have been more important than most of the EE courses
I struggled through later. Designing the most fantastic widget
ever conceived is suicidal if it cannot be made and marketed at
a profit.
In a typical manufacturing operation, boards are stuffed, assembled
into units, tested and perhaps repaired. While the code doesn't
impact board stuffing and assembly operations, it can strongly
influence product test and repair. Smart designers will produce
a product that easily fits into the company's manufacturing operation;
software engineers can contribute by writing code that speeds
the daily grind of production test.
All employees are hopefully working towards common corporate goals,
yet each has a different vision of the company's needs and problems.
To a programmer the word "testing" conjures images of
correctness proofs, exhaustive software trials, and code coverage
analysis. A production person probably has never heard of any
of these concepts; he looks at testing as the daily routine of
ensuring each and every unit works correctly before being shipped.
Conventional software test is a one-time event; once the product
is complete it is over, forever (well, would you believe...).
Product test goes on every day. Very complex products are tested
and repaired by technicians with little formal computer training.
The best are usually culled and assigned to work in engineering
support, leaving production with workers who may be skilled but
who are certainly not rocket scientists. As software engineers
it is our responsibility to the company to give the techs the
tools they need to ship product. As software managers, it is our
responsibility to convince management that this is an important
and desirable goal.
Internal Diagnostics
Quite a few embedded systems include diagnostics as part of the
product's ROM to give a sort of "go/no-go" indication
without using other test equipment. The unit's own display or
status lamps show test results.
Internal diagnostics are worthwhile because they do give the test
technician some ability to track down problems. They're also an
effective marketing tool, giving the customer a (possibly false)
feeling of confidence in the integrity of the product each time
he turns it on.
Though internal diagnostics are often viewed as a universal solution
to finding system problems, their value lies more in giving a
crude test of limited system functions. Not that this isn't valuable.
Internal diagnostics can test quite a bit of the unit's I/O and
some of the "kernel", or the CPU, RAM, and ROM areas.
The computer's kernel frequently defies standalone testing, since
so much of it must be functional for the diagnostics to run at
all. Most systems couple at least the main ROM and RAM closely
to the processor. The result - a single address, data, or control
line short prevents the program from running at all.
It's easy to waste a lot of time coding internal diagnostics that
will never provide useful information. They may satisfy vague
marketing promises of self-testing capability, but why write dishonest
code? Realize that internal diagnostics have intrinsic limitations,
but if carefully designed can yield some valuable information.
Apply your engineering expertise to the diagnostic problem; carefully
analyze the tradeoffs and identify reasonable ways to find at
least some of the common hardware failures. The first step is
to separate the tests into kernel (CPU, RAM, ROM, and decoders)
and beyond-kernel (I/O) tests. Then consider the most likely failure
modes; try to design tests that will first survive the failure,
and second will identify and report it. Yes, the kernel tests
will never be very robust since lots of hardware glitches will
prevent the program from running at all. But, if carefully designed,
they'll really help your buddies in production.
I/O tests can run the gamut from a simple LED blinking routine to
A/D and D/A loopbacks that check converter linearity. I/O is just
too big a subject to address in a short article. Today's range
of peripherals is so mind-boggling that several large books might
not adequately cover the subject of testing even the most common
devices. I won't attempt to delve into a discussion of I/O tests
here.
What portions of the kernel should be tested? Some programmers
have a tendency to test the CPU chip itself, running through a
sequence of instructions "guaranteed" to prove that
this essential chip is operating properly. Witness the ubiquitous
PC's BIOS CPU tests. I wonder just how often failures are detected.
On the PC an instruction test failure makes the code execute a
HALT, causing the CPU to look just as dead as if it never
started. More extensive error reporting with a defective CPU is
a fool's dream. Instruction tests stem from minicomputer days,
where hundreds of discrete ICs were needed to implement the CPU;
a single chip failure might only shut down a small portion of
the processor. Today's highly integrated parts tend to either
work or not; partial failures are pretty rare.
Similarly, memory tests rely on operating memory to run - a high
tech oxymoron that makes one question their value. Obviously,
if the ROM is not functioning, then the program won't start and
the diagnostic will not be invoked. Or, if the diagnostic is coded
with subroutine calls, a RAM (read "stack") failure
will prematurely crash the test, before providing any useful information.
The moral is to design diagnostics to ensure that each test
uses as few unproven components as possible.
A carefully engineered RAM test can be quite valuable. In a multi-ROM
system, it (and all the diagnostics) should be stored in the boot
ROM, preferably near the reset address. The low order address
lines must operate to run even a trivial program, and enough of
the upper ones must work to select the boot ROM; particularly
in a small system with undecoded address lines, try to write test
routines that rely on a minimal number of working address wires.
RAM tests commonly write out a pattern of alternating ones and
zeroes, read the data back, and repeat the test using data that
is the complement of the first set. This amateurish approach reflects
poor analysis of the problem. Before writing code, put on your
hardware hat (or get some help from another member of the design
team) and consider the most likely failure modes. Tailor the test
to identify all or most of these problems. A typical list includes:
- Address/data line shorts - Nine times out of ten these
problems will crash the diagnostic. Sometimes RAM is isolated
from the kernel by buffers; in this case post-buffer problems
can be found.
- No chip select - One or more signals may be needed to turn
each device on. The chip select complexity can range from a simple
undecoded address line to a nightmarish spaghetti of PALs and
logic. Regardless, every RAM must receive a proper chip select
to function.
- Pin not inserted - Socketed RAM devices may not be properly
seated. Sometimes a pin bends under the device.
- Bad device - The semiconductor vendors do a wonderful job
of delivering functioning chips. Rarely, though, a bad one may
slip through (or, through mishandling, one may "toggle to
the bad state"). Device geometry is now so small that it
is unusual to see the pattern sensitivity that once plagued DRAMs.
Usually the entire chip is just plain bad, making it a lot easier
to identify problems.
- Multiple addressing - This is a variant of the chip select
problem. If more than one memory device is used, several can sometimes
be turned on at once.
- Refresh - Dynamic RAMs require periodic accesses to keep
the memory alive. This Refresh signal is generated by external
logic, or by the CPU itself. Sometimes the refresh circuitry fails.
The entire memory array must be refreshed every few milliseconds
to stay within the chips' specifications, but it's surprising
just how long DRAMs can remember their data after losing refresh.
One to two seconds seems to be the extreme limit.
With the above in mind, we can design a routine to test RAMs.
The first criterion is that the test itself certainly cannot make
use of RAM!
Several of the failure modes manifest themselves by the inability
to store data. For example, data bus problems, a bad device, chip
select failures, or an incorrectly inserted pin will usually exhibit
a simple read/write error. The traditional write and read of a
55, followed by an AA, will find these problems quickly.
Writing a pattern of 55s and AAs tests the ability of the devices
to hold data, but it doesn't ensure that the RAMs are being addressed
correctly. Examples of failures that could pass this simple test
are: a post-buffer address short, an open address line (say, from
a pin not being inserted properly), or chip select failures causing
multiple addressing. It's important to run a second routine that
isolates these not-uncommon problems.
An addressing test works by writing a unique data value to each
location, and then reading the memory to see that that value is
still stored. An easy-to-compute pattern is the low order address
of the location; at 100 store 00, at 101 store a 01, etc. This
isn't really unique, since an 8 bit location can only store 256
different values. If we repeat the test, using the locations'
high order address bytes as the data pattern, then (for memory
sizes to 64k) after two passes the entire array will be tested
uniquely. The first test ensures that address lines A0 to A7 function
correctly; the second checks lines A8 to A15 and also the chip
select logic.
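The two-pass addressing test can be sketched as a host-side simulation. This is only an illustration in Python, not the article's assembly; the FaultyRam model, its open_line parameter, and the function names are all assumptions invented for the example. It models a RAM with one open address line (always seen as 0), a fault that the 55/AA pattern test passes and the addressing test catches.

```python
class FaultyRam:
    """Hypothetical model of a RAM with one address line stuck low."""

    def __init__(self, size, open_line=None):
        self.cells = [0] * size
        # Mask that forces the faulty address bit to 0 (-1 keeps all bits).
        self.mask = ~(1 << open_line) if open_line is not None else -1

    def write(self, addr, value):
        self.cells[addr & self.mask] = value & 0xFF

    def read(self, addr):
        return self.cells[addr & self.mask]


def pattern_test(ram, size):
    """Write and read back 55, then AA, at every location."""
    for pattern in (0x55, 0xAA):
        for addr in range(size):
            ram.write(addr, pattern)
        for addr in range(size):
            if ram.read(addr) != pattern:
                return False
    return True


def address_test(ram, size):
    """Pass 1 stores the low address byte, pass 2 the high address byte."""
    for shift in (0, 8):
        for addr in range(size):
            ram.write(addr, (addr >> shift) & 0xFF)
        for addr in range(size):
            if ram.read(addr) != (addr >> shift) & 0xFF:
                return False
    return True


good = FaultyRam(1024)
bad = FaultyRam(1024, open_line=3)   # A3 open: address 8 aliases address 0
print(pattern_test(bad, 1024))       # True: the 55/AA test misses the fault
print(address_test(bad, 1024))       # False: the addressing test finds it
```

In this model an open line among A8 to A15 slips through the first pass, because the aliased locations share the same low address byte, and is caught only by the second.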
Since this test also ensures that the RAM can store data, why
do the 55, AA check? Consider any individual address. At location
0, the addressing diagnostic will write 0 (the low address) and
later another 0 (the high address). While addressability will
be confirmed, some doubt remains about its ability to store data.
The 55, AA check tests every bit.
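The location 0 case can be worked through in a few lines. This is a hedged sketch in Python; stuck_cell is a hypothetical model of one memory cell whose bit 0 is stuck at zero, made up for the illustration.

```python
def stuck_cell(value, stuck_bit=0):
    """Hypothetical faulty cell: what reads back after writing `value`
    when one data bit is stuck at 0."""
    return value & 0xFF & ~(1 << stuck_bit)


addr = 0x0000
address_patterns = [addr & 0xFF, (addr >> 8) & 0xFF]  # low byte, high byte
data_patterns = [0x55, 0xAA]

# The addressing test only ever writes 0 here, so the stuck bit goes unseen.
address_test_passes = all(stuck_cell(p) == p for p in address_patterns)

# 55 and AA between them drive every bit to both 1 and 0.
pattern_test_passes = all(stuck_cell(p) == p for p in data_patterns)

print(address_test_passes)  # True
print(pattern_test_passes)  # False
```

The two patterns are complements (0x55 | 0xAA is 0xFF and 0x55 & 0xAA is 0), which is why the pair exercises every bit in both states.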
The peril of these diagnostics is that the address lines cycle
throughout all of memory as the test proceeds. If the refresh
circuit has failed, most likely the test itself will keep DRAMs
alive. This is the worst possible situation; the process of testing
camouflages failures. A simple solution is to add a long delay
after writing a pattern and before doing the read-back. This delay
should be on the order of several seconds. It is also important
to constrain the test code to a small area, so the CPU's instruction
fetches don't create artificial refreshes.
In the bad old days of small DRAMs, manufacturing defects and alpha
particles caused some memories to exhibit pattern sensitivity
problems; selected cells would not hold a particular byte
if a nearby cell held another specific byte. Elaborate tests were
devised to isolate these problems. The "Walking Ones"
test, in particular, burned an enormous amount of computer time
and could find really complex pattern failures. Fortunately these
sorts of problems just don't show up anymore.
Figure 1 is a routine that performs all of these tests. It is
cumbersomely coded in 8088 assembly language, using no RAM at
all. Simpler, prettier code using CALLs and RETs just will not
be dependable, since it would rely on the very RAM we're testing.
While it is easy to see some justification for testing the product's
RAM, ROM tests are perhaps not as obviously valuable. If the ROM
is not working, how can it test itself? Physician - heal thyself
(but not if you're in a coma). As always, a completely dead kernel,
one that just doesn't even boot, cannot run diagnostics. If the
boot ROM does at least partially work, then some testing is valuable.
In the boot ROM itself we can realistically expect to detect only
a simple failure, like a partially programmed device, although
with some luck it might be possible to find a shorted or open
high order address line. Luck, because if the line floats or is
tied to the wrong level, then the diagnostic code will not start.
ROMs are tedious to program - sometimes technicians will unknowingly,
in their impatience, remove the chip before the programmer is
completely done. If you do elect to include a ROM test, be sure
to locate it early in the code so it stands a chance of executing
even if the ROM is not entirely programmed. It's easier to make
an argument for ROM testing in multiple ROM systems. If the boot
ROM starts, then diagnostics located in it can test all of the
others.
While the memories can fail in a number of ways, probably the
most common is a mis-inserted pin. If you've spent time troubleshooting
electronics, you'll know that it can be awfully hard to tell if
all pins are in the sockets. Other problems cover the usual range
of broken circuit board tracks (i.e., address, data, control lines),
misprogrammed devices, and non-functioning chip select lines.
One way to test ROMs is to read the devices and compare each byte
to a known value. Since such redundancy is impractical, most programmers
simply compute an 8 or 16 bit sum of the data in the ROMs and
compare it to a known value. Usually this is adequate, but a number
of pathological cases will report incorrect results. For example,
a long string of zeroes will always checksum to zero, regardless
of the number of items summed.
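The weakness of a plain sum is easy to demonstrate. A minimal sketch in Python, where checksum16 is an illustrative name and not a routine from the article:

```python
def checksum16(data):
    """16-bit additive checksum: the sum of all bytes, modulo 65536."""
    return sum(data) & 0xFFFF


# A run of zeroes sums to zero whatever its length.
print(checksum16(bytes(16)), checksum16(bytes(4096)))   # 0 0

# Addition ignores ordering, so rearranged bytes go unnoticed too.
original = bytes([0x12, 0x34, 0x56])
shuffled = bytes([0x56, 0x34, 0x12])
print(checksum16(original) == checksum16(shuffled))     # True
```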
A much better approach than a simple checksum is the Cyclic Redundancy
Check (CRC). The CRC register is seeded, typically
with FFFF, and the input data (in this case the ROM data) is then
divided by a fixed polynomial a byte at a time, using the remainder
at each step as the new seed. While mathematically complicated, the CRC is
pretty easy to implement. Its great virtue is that each byte of
repeated strings (say, zeroes) will yield a different CRC. The
CRC is a bit harder to code than a simple checksum, but the code
listed in figure 2 is a cookbook solution. Once again, it is written
in 8088 assembly language to ensure it uses the minimal number
of as-yet-untested CPU resources.
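For readers who want to experiment on a host machine, here is a bit-at-a-time CRC sketch in Python. It is not a transcription of the assembly listing: the polynomial 0x1021 and seed FFFF are an assumption (one widely used 16-bit CRC parameter set), and the name crc16 is illustrative.

```python
def crc16(data, crc=0xFFFF, poly=0x1021):
    """Bitwise 16-bit CRC, most significant bit first.

    The seed and polynomial are assumed parameters; the running
    remainder in `crc` is carried from one byte to the next.
    """
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# Unlike a checksum, runs of zeroes of different lengths give different CRCs.
print(crc16(bytes(1)) != crc16(bytes(2)))           # True

# Swapping two bytes also changes the result.
print(crc16(b"\x12\x34") != crc16(b"\x34\x12"))     # True
```

With these parameters the conventional check string "123456789" yields 29B1 hex, which is a handy sanity test for any reimplementation.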
A CRC or checksum test is easy to code and yields useful information,
but is sometimes a nuisance to implement because the correct value
must be computed and stored in ROM. No assembler or compiler has
a built-in function to build this automatically, so generally
you must run the program under an emulator, record the CRC or
checksum the routine computes, and then manually patch the resulting
value into an absolute location in ROM. The only way to automate
this is to write a short program that CRCs the linker output file
and patches the result into the ROM file.
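Such a patching program might look like the following Python sketch. Everything specific here is an assumption for illustration: the CRC parameters, the little-endian storage of the result, and the choice to keep the stored CRC in the two bytes immediately after the region being checked (so the stored value does not disturb its own CRC). A real tool would read the linker's output file into image and write the patched bytes back out.

```python
def crc16(data, crc=0xFFFF, poly=0x1021):
    """Bitwise 16-bit CRC (assumed parameters, MSB first)."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def patch_crc(image, crc_offset):
    """Return a copy of `image` with the CRC of image[:crc_offset]
    stored little-endian at crc_offset (a hypothetical layout)."""
    if crc_offset + 2 > len(image):
        raise ValueError("no room for the CRC word in the image")
    patched = bytearray(image)
    value = crc16(patched[:crc_offset])
    patched[crc_offset:crc_offset + 2] = value.to_bytes(2, "little")
    return bytes(patched)


def rom_ok(image, crc_offset):
    """Mimic the ROM self-test: recompute the CRC and compare."""
    stored = int.from_bytes(image[crc_offset:crc_offset + 2], "little")
    return crc16(image[:crc_offset]) == stored


# Stand-in for a 1K ROM image followed by a 2-byte placeholder.
image = bytes(range(256)) * 4 + b"\x00\x00"
patched = patch_crc(image, 1024)
print(rom_ok(patched, 1024))    # True

# Flip one bit, as a badly programmed cell might, and the test fails.
corrupt = bytearray(patched)
corrupt[100] ^= 0x01
print(rom_ok(bytes(corrupt), 1024))   # False
```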
In Conclusion
It is best to address the diagnostics problem using the engineering
thought processes our "significant others" have learned
to hate. Examine the system dispassionately and analytically,
looking for all possible failure modes. Study the tradeoffs. Look
for alternate approaches. Then, implement the best possible solution
that does a reasonable job of solving the problem yet doesn't
take too much programming time. No one said it would be easy.
Next month I'll look at another aspect of diagnostics - how do
you report an error? We'll also look at external diagnostics that
help out the technician when the system won't even boot.
Figure 1:
**********************************************************
title Ram test code
code segment public
assume cs:code,ds:data
;
; This routine performs a RAM test on an 80x88 type machine.
;
; This test is sequenced through the following phases:
;
; 55 written to all addresses and read back
; AA written to all addresses and read back
; Low order address written and read
; High order address written and read
;
; Note that this code makes no use of RAM, so some of its
; structure may seem arbitrarily cumbersome.
;
; The test is not coded as a traditional subroutine with a
; "return" instruction. Your code must flow into it - it
; leaves at exit point ram_done.
;
; On entry, the address to test is contained in register
; DS:BX. The length of the test is in CX. It does not
; support overflows into the segment register; be sure
; that BX+CX is no more than FFFF.
;
; On exit, if the test fails register AX will be set non-zero,
; and DS:BX will contain the address of the first failed byte.
; AX will be set to 55 if the 55 test failed, AA if the AA test
; failed, 1 if the low address test failed, and 2 if the high
; test failed. Register CX will contain the number of contiguous
; bytes that are bad; note that only one contiguous block is
; tagged.
;
; If the RAM passes the test, then AX will be 0.
;
ram_test:
mov si,bx ; save start address in si
mov bp,cx ; save length in bp
mov ax,55h
ram_55_load:
mov [bx],al ; load memory with 55s first
inc bx
loop ram_55_load
mov bx,si ; restore start of test address
mov cx,bp ; restore length of test
dec bx
ram_55_check:
inc bx
cmp [bx],al ; see if we still have 55s
loope ram_55_check ; loop as long as we get 55s back
jz ram_aa_test ; ZF from last cmp still set: test passed
mov dx,0 ; dx will be # bad bytes found for now
mov si,bx ; set first bad address
ram_55_bad: ; failure found - find length of bad block
inc si
inc dx ; inc # bad bytes found
jcxz ram_bad1 ; end of test - no bytes left to check
dec cx ; dec total # to test
cmp [si],al ; see if this location is also bad
jnz ram_55_bad
ram_bad1:
jmp ram_bad ; end of the bad block found
;
; Run the AA test
;
ram_aa_test:
mov bx,si ; start address of test
mov cx,bp ; length of test
mov ax,0aah
ram_aa_load:
mov [bx],al ; load memory with AAs first
inc bx
loop ram_aa_load
mov bx,si ; restore start of test address
mov cx,bp ; restore length of test
dec bx
ram_aa_check:
inc bx
cmp [bx],al ; see if we still have AAs
loope ram_aa_check ; loop as long as we get AAs back
jz ram_low_test ; ZF from last cmp still set: test passed
mov dx,0 ; dx will be # bad bytes found for now
mov si,bx ; set first bad address
ram_aa_bad: ; failure found - find length of bad block
inc si
inc dx ; inc # bad bytes found
jcxz ram_bad2 ; end of test - no bytes left to check
dec cx ; dec total # to test
cmp [si],al ; see if this location is also bad
jnz ram_aa_bad
ram_bad2:
jmp ram_bad ; end of the bad block found
;
; Run the low address test
;
ram_low_test:
mov bx,si ; start address of test
mov cx,bp ; length of test
ram_low_load:
mov [bx],bl ; load memory with low address
inc bx
loop ram_low_load
;
; Add in a delay of about 2-3 seconds to let DRAMs "forget" if
; refresh is not active.
;
mov bx,si ; restore start of test address
mov cx,bp ; restore length of test
dec bx
ram_low_check:
inc bx
cmp [bx],bl ; see if we still have low address
loope ram_low_check ; loop as long as we get good data back
jz ram_hi_test ; ZF from last cmp still set: test passed
mov dx,0 ; dx will be # bad bytes found for now
mov si,bx ; set first bad address
ram_low_bad: ; failure found - find length of bad block
inc si
inc dx ; inc # bad bytes found
jcxz ram_low_end ; end of test - no bytes left to check
dec cx ; dec total # to test
mov ax,si ; get address we stored
cmp [si],al ; see if this location is also bad
jnz ram_low_bad
ram_low_end:
mov ax,1 ; 1 indicates low address failure
jmp ram_bad ; end of the bad block found
;
; Run the high address test
;
ram_hi_test:
mov bx,si ; start address of test
mov cx,bp ; length of test
mov ax,0 ; assume test will pass
ram_hi_load:
mov [bx],bh ; load memory with high address
inc bx
loop ram_hi_load
;
; Add in a delay of about 2-3 seconds to let DRAMs "forget" if
; refresh is not active.
;
mov bx,si ; restore start of test address
mov cx,bp ; restore length of test
dec bx
ram_hi_check:
inc bx
cmp [bx],bh ; see if we still have high address
loope ram_hi_check ; loop as long as we get good data back
jz ram_done ; ZF from last cmp still set: test passed
mov dx,0 ; dx will be # bad bytes found for now
mov si,bx ; set first bad address
ram_hi_bad: ; failure found - find length of bad block
inc si
inc dx ; inc # bad bytes found
jcxz ram_hi_end ; end of test - no bytes left to check
dec cx ; dec total # to test
mov ax,si ; get address we stored
cmp [si],ah ; see if this location is also bad
jnz ram_hi_bad
ram_hi_end:
mov ax,2 ; 2 indicates high address failure
jmp ram_bad ; end of the bad block found
ram_bad:
mov cx,dx ; cx=# bad bytes
; ax=test number
; bx=start of bad address
ram_done: ; exit point
code ends
data segment public
data ends
end
********************************************************
Figure 2 follows:
********************************************************
title CRC program
code segment public
assume cs:code,ds:data
;
; this routine will compute a CRC of a block of
; data starting at the address in DS:BX with the
; length in CX. Don't try to exceed a 64k segment (keep
; BX+CX <= FFFF).
;
; The CRC will be computed into register DX and compared to
; a value saved in ROM.
;
rom_test:
mov dx,0ffffh ; initialize CRC to -1
rom_loop:
mov al,[bx] ; get a character to CRC
inc bx ; pt to next value
xor al,dl ; compute crc
mov ah,al ; save temp result
shr al,1 ; shift right 4
shr al,1
shr al,1
shr al,1
xor al,ah ; xor temp with partial product
mov ah,al ; new temp
shl al,1 ; shift left 4
shl al,1
shl al,1
shl al,1
xor al,dh ; combine with high crc
mov dl,al ; save low result
mov al,ah
shr al,1 ; shift right 3
shr al,1
shr al,1
xor al,dl
mov dl,al
mov al,ah ; get temp back
shl al,1 ; shift left 5
shl al,1
shl al,1
shl al,1
shl al,1
xor al,ah
mov dh,al ; high crc result
dec cx ; dec data byte count
jnz rom_loop ; loop till all CRCed
cmp dx,word ptr cs:crc; crc match?
jnz rom_error ; error if no match
jmp rom_ok ; jmp if ok
crc: dw 0 ; save crc here
rom_error: ; error location - flag an error
rom_ok: ; rom crc compare ok
code ends
data segment public
data ends
end