The Zen of Diagnostics
This is the first of a two-part series about adding diagnostics to your programs.
Published in Embedded Systems Programming, June 1990
By Jack Ganssle
For some inexplicable reason most of us embedded programmers rarely concern ourselves with making a product manufacturable. Sure, we work like slaves meeting performance goals, but the product must do more than function correctly - it must be designed to be producible. While our hardware brethren tune their designs to meet cost, manufacturing, and performance goals, we work in relative isolation; a sort of black hole where few dare to tread.
In high school I worked for a while in a machine shop, serving as the lowest sort of helper to highly skilled machinists making parts for the space program. When they discovered I intended to go to college to become an engineer, one grizzled veteran warned me to not "design something we can't make". Now, twenty years later, I have to admit that these words, so casually and freely given, have been more important than most of the EE courses I struggled through later. Designing the most fantastic widget ever conceived is suicidal if it cannot be made and marketed at a profit.
In a typical manufacturing operation, boards are stuffed, assembled into units, tested and perhaps repaired. While the code doesn't impact board stuffing and assembly operations, it can strongly influence product test and repair. Smart designers will produce a product that easily fits into the company's manufacturing operation; software engineers can contribute by writing code that speeds the daily grind of production test.
All employees are hopefully working towards common corporate goals, yet each has a different vision of the company's needs and problems. To a programmer the word "testing" conjures images of correctness proofs, exhaustive software trials, and code coverage analysis. A production person probably has never heard of any of these concepts; he looks at testing as the daily routine of ensuring each and every unit works correctly before being shipped.
Conventional software test is a one-time event; once the product is complete it is over, forever (well, would you believe...). Product test goes on every day. Very complex products are tested and repaired by technicians with little formal computer training. The best are usually culled and assigned to work in engineering support, leaving production with workers who may be skilled but who are certainly not rocket scientists. As software engineers it is our responsibility to the company to give the techs the tools they need to ship product. As software managers, it is our responsibility to convince management that this is an important and desirable goal.
Quite a few embedded systems include diagnostics as part of the product's ROM to give a sort of "go/no-go" indication without using other test equipment. The unit's own display or status lamps show test results.
Internal diagnostics are worthwhile because they do give the test technician some ability to track down problems. They're also an effective marketing tool, giving the customer a (possibly false) feeling of confidence in the integrity of the product each time he turns it on.
Though internal diagnostics are often viewed as a universal solution to finding system problems, their value lies more in giving a crude test of limited system functions. Not that this isn't valuable. Internal diagnostics can test quite a bit of the unit's I/O and some of the "kernel", or the CPU, RAM, and ROM areas.
The computer's kernel frequently defies standalone testing, since so much of it must be functional for the diagnostics to run at all. Most systems couple at least the main ROM and RAM closely to the processor. The result - a single address, data, or control line short prevents the program from running at all.
It's easy to waste a lot of time coding internal diagnostics that will never provide useful information. They may satisfy vague marketing promises of self-testing capability, but why write dishonest code? Realize that internal diagnostics have intrinsic limitations, but if carefully designed can yield some valuable information. Apply your engineering expertise to the diagnostic problem; carefully analyze the tradeoffs and identify reasonable ways to find at least some of the common hardware failures. The first step is to separate the tests into kernel (CPU, RAM, ROM, and decoders) and beyond-kernel (I/O) tests. Then consider the most likely failure modes; try to design tests that will first survive the failure, and second will identify and report it. Yes, the kernel tests will never be very robust since lots of hardware glitches will prevent the program from running at all. But, if carefully designed, they'll really help your buddies in production.
I/O tests can run the gamut from a simple LED blinking routine to A/D and D/A loopbacks that check converter linearity. I/O is just too big a subject to address in a short article. Today's range of peripherals is so mind-boggling that several large books might not adequately cover the subject of testing even the most common devices. I won't attempt to delve into a discussion of I/O tests here.
What portions of the kernel should be tested? Some programmers have a tendency to test the CPU chip itself, running through a sequence of instructions "guaranteed" to prove that this essential chip is operating properly. Witness the ubiquitous PC's BIOS CPU tests. I wonder just how often failures are detected. On the PC an instruction test failure makes the code execute a HALT, causing the CPU to look just as dead as if it never started. More extensive error reporting with a defective CPU is a fool's dream. Instruction tests stem from minicomputer days, where hundreds of discrete ICs were needed to implement the CPU; a single chip failure might only shut down a small portion of the processor. Today's highly integrated parts tend to either work or not; partial failures are pretty rare.
Similarly, memory tests rely on operating memory to run - a high tech oxymoron that makes one question their value. Obviously, if the ROM is not functioning, then the program won't start and the diagnostic will not be invoked. Or, if the diagnostic is coded with subroutine calls, a RAM (read "stack") failure will prematurely crash the test, before providing any useful information.
The moral is to design diagnostics so that each test uses as few unproven components as possible.
A carefully engineered RAM test can be quite valuable. In a multi-ROM system, it (and all the diagnostics) should be stored in the boot ROM, preferably near the reset address. The low order address lines must operate to run even a trivial program, and enough of the upper ones must work to select the boot ROM; particularly in a small system with undecoded address lines, try to write test routines that rely on a minimal number of working address wires.
RAM tests commonly write out a pattern of alternating ones and zeroes, read the data back, and repeat the test using data that is the complement of the first set. This amateurish approach reflects poor analysis of the problem. Before writing code, put on your hardware hat (or get some help from another member of the design team) and consider the most likely failure modes. Tailor the test to identify all or most of these problems. A typical list includes:
- Address/data line shorts - In nine times out of ten these problems will crash the diagnostic. Sometimes RAM is isolated from the kernel by buffers; in this case post-buffer problems can be found.
- No chip select - One or more signals may be needed to turn each device on. The chip select complexity can range from a simple undecoded address line to a nightmarish spaghetti of PALs and logic. Regardless, every RAM must receive a proper chip select to function.
- Pin not inserted - Socketed RAM devices may not be properly seated. Sometimes a pin bends under the device.
- Bad device - The semiconductor vendors do a wonderful job of delivering functioning chips. Rarely, though, a bad one may slip through (or, through mishandling, one may "toggle to the bad state"). Device geometry is now so small that it is unusual to see the pattern sensitivity that once plagued DRAMs. Usually the entire chip is just plain bad, making it a lot easier to identify problems.
- Multiple addressing - This is a variant of the chip select problem. If more than one memory device is used, several can sometimes be turned on at once.
- Refresh - Dynamic RAMs require periodic accesses to keep the memory alive. This refresh signal is generated by external logic, or by the CPU itself. Sometimes the refresh circuitry fails. The entire memory array must be refreshed every few milliseconds to stay within the chips' specifications, but it's surprising just how long DRAMs can remember their data after losing refresh. One to two seconds seems to be the extreme limit.
With the above in mind, we can design a routine to test RAMs. The first criterion is that the test itself certainly cannot make use of RAM!
Several of the failure modes manifest themselves by the inability to store data. For example, data bus problems, a bad device, chip select failures, or an incorrectly inserted pin will usually exhibit a simple read/write error. The traditional write and read of a 55, followed by an AA, will find these problems quickly.
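The write and read-back of 55 and AA might be sketched in C as follows. This is only an illustration of the logic: the name pattern_test and its parameters are invented here, and compiled C leans on a stack, so a routine like this would run on a host or after RAM is already trusted, not in place of a RAM-free routine such as Figure 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fill the region with 0x55, verify it, then repeat
   with 0xAA. Returns 0 on success, or the pattern (0x55 or 0xAA) that
   failed, with *bad_offset set to the first failing byte. */
unsigned pattern_test(volatile uint8_t *ram, size_t len, size_t *bad_offset)
{
    static const uint8_t patterns[] = { 0x55, 0xAA };

    assert(ram != NULL && bad_offset != NULL && len > 0);

    for (size_t p = 0; p < sizeof patterns; p++) {
        for (size_t i = 0; i < len; i++)      /* write pass */
            ram[i] = patterns[p];
        for (size_t i = 0; i < len; i++) {    /* read-back pass */
            if (ram[i] != patterns[p]) {
                *bad_offset = i;
                return patterns[p];
            }
        }
    }
    return 0;
}
```

Between them the two patterns drive every bit of every byte to both 1 and 0, which is why a dead data line or a missing chip select tends to show up at once.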
Writing a pattern of 55s and AAs tests the ability of the devices to hold data, but it doesn't ensure that the RAMs are being addressed correctly. Examples of failures that could pass this simple test are: a post-buffer address short, an open address line (say, from a pin not being inserted properly), or chip select failures causing multiple addressing. It's important to run a second routine that isolates these not-uncommon problems.
An addressing test works by writing a unique data value to each location, and then reading the memory to see that that value is still stored. An easy-to-compute pattern is the low order address of the location; at 100 store 00, at 101 store a 01, etc. This isn't really unique, since an 8 bit location can only store 256 different values. If we repeat the test, using the locations' high order address bytes as the data pattern, then (for memory sizes to 64k) after two passes the entire array will be tested uniquely. The first test ensures that address lines A0 to A7 function correctly; the second checks lines A8 to A15 and also the chip select logic.
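The two addressing passes can be sketched in C like this. The function name, the base parameter (the 16-bit address at which the target sees the first byte), and the choice of return codes are assumptions made for illustration; the codes simply echo the ones Figure 1 documents. As with any compiled routine, it depends on a working stack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper. `base` is the 16-bit address of ram[0] as the
   target sees it. Returns 0 on pass, 1 if the low-address pass failed,
   2 if the high-address pass failed. */
unsigned address_test(volatile uint8_t *ram, uint16_t base, size_t len)
{
    assert(ram != NULL && len > 0 && (size_t)base + len <= 0x10000u);

    for (unsigned pass = 1; pass <= 2; pass++) {
        unsigned shift = (pass == 1) ? 0 : 8;   /* low byte, then high byte */

        for (size_t i = 0; i < len; i++)        /* store part of each address */
            ram[i] = (uint8_t)((base + i) >> shift);
        for (size_t i = 0; i < len; i++)        /* read it all back */
            if (ram[i] != (uint8_t)((base + i) >> shift))
                return pass;
    }
    return 0;
}
```

Each pass writes the whole region before reading any of it back, so a write that lands in the wrong cell has a chance to overwrite a value stored earlier.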
Since this test also ensures that the RAM can store data, why do the 55, AA check? Consider any individual address. At location 0, the addressing diagnostic will write 0 (the low address) and later another 0 (the high address). While addressability will be confirmed, some doubt remains about its ability to store data. The 55, AA check tests every bit.
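To see why both tests are needed, it can help to run them against a toy model of faulty memory. Everything in the sketch below is an assumption made for illustration: a 1K simulated RAM (struct sim_ram) in which chosen address lines can be forced low and chosen data bits of cell 0 can be stuck at 0. It is a host-side experiment, not target code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIM_SIZE 1024u

/* Toy fault model (hypothetical, for illustration only). */
struct sim_ram {
    uint8_t  cell[SIM_SIZE];
    uint16_t addr_stuck_low;    /* address bits forced to 0 */
    uint8_t  cell0_stuck_low;   /* data bits of cell 0 stuck at 0 */
};

void sim_init(struct sim_ram *m, uint16_t addr_fault, uint8_t cell0_fault)
{
    memset(m, 0, sizeof *m);
    m->addr_stuck_low = addr_fault;
    m->cell0_stuck_low = cell0_fault;
}

void sim_write(struct sim_ram *m, uint16_t a, uint8_t v)
{
    assert(a < SIM_SIZE);
    a &= (uint16_t)~m->addr_stuck_low;       /* faulty line aliases the address */
    if (a == 0)
        v &= (uint8_t)~m->cell0_stuck_low;   /* stuck bits never store a 1 */
    m->cell[a] = v;
}

uint8_t sim_read(const struct sim_ram *m, uint16_t a)
{
    assert(a < SIM_SIZE);
    return m->cell[a & (uint16_t)~m->addr_stuck_low];
}

/* 55/AA data test over the whole simulated array. */
bool sim_pattern_ok(struct sim_ram *m)
{
    static const uint8_t pat[] = { 0x55, 0xAA };

    for (int p = 0; p < 2; p++) {
        for (uint16_t a = 0; a < SIM_SIZE; a++)
            sim_write(m, a, pat[p]);
        for (uint16_t a = 0; a < SIM_SIZE; a++)
            if (sim_read(m, a) != pat[p])
                return false;
    }
    return true;
}

/* Low-address pass, then high-address pass. */
bool sim_address_ok(struct sim_ram *m)
{
    for (int shift = 0; shift <= 8; shift += 8) {
        for (uint16_t a = 0; a < SIM_SIZE; a++)
            sim_write(m, a, (uint8_t)(a >> shift));
        for (uint16_t a = 0; a < SIM_SIZE; a++)
            if (sim_read(m, a) != (uint8_t)(a >> shift))
                return false;
    }
    return true;
}
```

With sim_init(&m, 0x0008, 0), which models address line A3 stuck low, the 55/AA test passes while the address test fails. With sim_init(&m, 0, 0x01), which models a dead bit in location 0, the address test passes (it only ever writes 0 there) while the 55/AA test fails.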
The peril of these diagnostics is that the address lines cycle throughout all of memory as the test proceeds. If the refresh circuit has failed, most likely the test itself will keep DRAMs alive. This is the worst possible situation; the process of testing camouflages failures. A simple solution is to add a long delay after writing a pattern and before doing the read-back. This delay should be on the order of several seconds. It is also important to constrain the test code to a small area, so the CPU's instruction fetches don't create artificial refreshes.
In the bad old days, manufacturing defects in small DRAMs and alpha particles caused some memories to exhibit pattern sensitivity problems; selected cells would not hold a particular byte if a nearby cell held another specific byte. Elaborate tests were devised to isolate these problems. The "Walking Ones" test, in particular, burned an enormous amount of computer time and could find really complex pattern failures. Fortunately these sorts of problems just don't show up anymore.
Figure 1 is a routine that performs all of these tests. It is cumbersomely coded in 8088 assembly language, using no RAM at all. Simpler, prettier code using CALLs and RETs just will not be dependable, since it would rely on the very RAM we're testing.
While it is easy to see some justification for testing the product's RAM, ROM tests are perhaps not as obviously valuable. If the ROM is not working, how can it test itself? Physician - heal thyself (but not if you're in a coma). As always, a completely dead kernel, one that just doesn't even boot, cannot run diagnostics. If the boot ROM does at least partially work, then some testing is valuable.
In the boot ROM itself we can realistically expect to detect only a simple failure, like a partially programmed device, although with some luck it might be possible to find a shorted or open high order address line. Luck, because if the line floats or is tied to the wrong level, then the diagnostic code will not start.
ROMs are tedious to program - sometimes technicians will unknowingly, in their impatience, remove the chip before the programmer is completely done. If you do elect to include a ROM test, be sure to locate it early in the code so it stands a chance of executing even if the ROM is not entirely programmed. It's easier to make an argument for ROM testing in multiple ROM systems. If the boot ROM starts, then diagnostics located in it can test all of the others.
While the memories can fail in a number of ways, anyone who has worked on the electronics will know that it can be awfully hard to tell if all pins are in the sockets. Other problems cover the usual range of broken circuit board tracks (i.e., address, data, control lines), misprogrammed devices, and non-functioning chip select lines.
One way to test ROMs is to read the devices and compare each byte to a known value. Since such redundancy is impractical, most programmers simply compute an 8 or 16 bit sum of the data in the ROMs and compare it to a known value. Usually this is adequate, but a number of pathological cases will report incorrect results. For example, a long string of zeroes will always checksum to zero, regardless of the number of items summed.
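The weakness is easy to demonstrate. The sketch below uses a plain 8 bit additive checksum; the function name checksum8 is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain 8-bit additive checksum; carries out of the byte are discarded. */
uint8_t checksum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;

    assert(data != NULL);
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + data[i]);
    return sum;
}
```

Sixteen zero bytes and sixty-four zero bytes both sum to 0, so the checksum cannot tell how many were read. Addition is also blind to order: the bytes 12 34 56 and 56 34 12 produce the same sum (9C), so two swapped bytes would go unnoticed.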
A much better approach than a simple checksum is the Cyclic Redundancy Check (CRC). The CRC is the remainder left after dividing the input data (in this case the ROM data) by a fixed polynomial. The running remainder is seeded, typically with FFFF, and the data is processed a byte at a time, using the remainder from each step as the new seed. While mathematically complicated, the CRC is pretty easy to implement. Its great virtue is that each byte of repeated strings (say, zeroes) will yield a different CRC. The CRC is a bit harder to code than a simple checksum, but the code listed in Figure 2 is a cookbook solution. Once again, it is written in 8088 assembly language to ensure it uses the minimal number of as-yet-untested CPU resources.
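For readers who find C easier to follow than shifts and XORs on 8 bit registers, here is a step-by-step transcription of the Figure 2 loop. The name rom_crc is invented here, and the variables dl and dh stand in for the two halves of DX. By my reading, the result is the common CCITT polynomial (1021 hex) CRC with its two bytes exchanged, but treat that identification as an observation rather than something the article states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the Figure 2 loop; returns the value the assembly
   would leave in DX after the last byte. */
uint16_t rom_crc(const uint8_t *data, size_t len)
{
    uint8_t dl = 0xFF, dh = 0xFF;                 /* seed of FFFF */

    assert(data != NULL);
    for (size_t i = 0; i < len; i++) {
        uint8_t t = (uint8_t)(data[i] ^ dl);      /* xor al,dl */
        t ^= (uint8_t)(t >> 4);                   /* shift right 4, xor */
        dl = (uint8_t)((uint8_t)(t << 4) ^ dh ^ (t >> 3));
        dh = (uint8_t)((uint8_t)(t << 5) ^ t);
    }
    return (uint16_t)((dh << 8) | dl);
}
```

One zero byte yields F0E1 and two zero bytes yield 0F1D, so, unlike the checksum, the CRC changes with every additional zero.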
A CRC or checksum test is easy to code and yields useful information, but is sometimes a nuisance to implement because the correct value must be computed and stored in ROM. No assembler or compiler has a built-in function to build this automatically, so generally you must run the program under an emulator, record the CRC or checksum the routine computes, and then manually patch the resulting value into an absolute location in ROM. The only way to automate this is to write a short program that CRCs the linker output file and patches the result into the ROM file.
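The core of such a patch program might look like the sketch below. It works on a ROM image already loaded into a buffer; reading the linker output and writing the ROM file are left out. The layout is an assumption, not something the article specifies: the last two bytes of the image are reserved for the CRC and excluded from the calculation, and the value is stored low byte first, the way an 8088 dw would hold it. All three function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same byte-wise update as the Figure 2 loop (result as it sits in DX). */
uint16_t image_crc(const uint8_t *data, size_t len)
{
    uint8_t dl = 0xFF, dh = 0xFF;

    for (size_t i = 0; i < len; i++) {
        uint8_t t = (uint8_t)(data[i] ^ dl);
        t ^= (uint8_t)(t >> 4);
        dl = (uint8_t)((uint8_t)(t << 4) ^ dh ^ (t >> 3));
        dh = (uint8_t)((uint8_t)(t << 5) ^ t);
    }
    return (uint16_t)((dh << 8) | dl);
}

/* Assumed layout: image[len-2..len-1] is a reserved CRC slot that is
   not part of the data being CRCed. */
void patch_crc(uint8_t *image, size_t len)
{
    assert(image != NULL && len > 2);

    uint16_t crc = image_crc(image, len - 2);
    image[len - 2] = (uint8_t)(crc & 0xFF);   /* low byte first */
    image[len - 1] = (uint8_t)(crc >> 8);
}

/* What the ROM test would do at power-up: recompute and compare. */
bool crc_matches(const uint8_t *image, size_t len)
{
    assert(image != NULL && len > 2);

    uint16_t stored = (uint16_t)(image[len - 2] | (image[len - 1] << 8));
    return image_crc(image, len - 2) == stored;
}
```

Keeping the stored value outside the range being CRCed avoids the circular problem of a CRC that covers itself. If the slot has to live inside the code, as the crc word does in Figure 2, the range handed to the test must skip it.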
It is best to address the diagnostics problem using the engineering thought processes our "significant others" have learned to hate. Examine the system dispassionately and analytically, looking for all possible failure modes. Study the tradeoffs. Look for alternate approaches. Then, implement the best possible solution that does a reasonable job of solving the problem yet doesn't take too much programming time. No one said it would be easy.
Next month I'll look at another aspect of diagnostics - how do you report an error? We'll also look at external diagnostics that help out the technician when the system won't even boot.
Figure 1:
**********************************************************
        title   Ram test code
code    segment public
        assume  cs:code,ds:data
;
; This routine performs a RAM test on an 80x88 type machine.
;
; This test is sequenced through the following phases:
;
;       55 written to all addresses and read back
;       AA written to all addresses and read back
;       Low order address written and read
;       High order address written and read
;
; Note that this code makes no use of RAM, so some of its
; structure may seem arbitrarily cumbersome.
;
; The test is not coded as a traditional subroutine with a
; "return" instruction. Your code must flow into it - it
; leaves at exit point ram_done.
;
; On entry, the address to test is contained in register
; DS:BX. The length of the test is in CX. It does not
; support overflows into the segment register; be sure
; that BX+CX is no more than FFFF.
;
; On exit, if the test fails register AX will be set non-zero,
; and DS:BX will contain the address of the first failed byte.
; AX will be set to 55 if the 55 test failed, AA if the AA test
; failed, 1 if the low address test failed, and 2 if the high
; test failed. Register CX will contain the number of contiguous
; bytes that are bad; note that only one contiguous block is
; tagged.
;
; If the RAM passes the test, then AX will be 0.
;
ram_test:
        mov     si,bx           ; save start address in si
        mov     bp,cx           ; save length in bp
        mov     ax,55h
ram_55_load:
        mov     [bx],al         ; load memory with 55s first
        inc     bx
        loop    ram_55_load
        mov     bx,si           ; restore start of test address
        mov     cx,bp           ; restore length of test
        dec     bx
ram_55_check:
        inc     bx
        cmp     [bx],al         ; see if we still have 55s
        loope   ram_55_check    ; loop as long as we get 55s back
        je      ram_aa_test     ; last compare matched - test passed
        mov     dx,1            ; dx = # bad bytes found so far
        mov     si,bx           ; scan on from the first bad address
ram_55_bad:                     ; failure found - find length of bad block
        jcxz    ram_55_fail     ; no bytes left to check
        inc     si
        dec     cx              ; dec total # left to test
        cmp     [si],al         ; see if this location is also bad
        je      ram_55_fail     ; good byte - end of the bad block found
        inc     dx              ; inc # bad bytes found
        jmp     ram_55_bad
ram_55_fail:
        jmp     ram_bad         ; ax still holds 55
;
; Run the AA test
;
ram_aa_test:
        mov     bx,si           ; start address of test
        mov     cx,bp           ; length of test
        mov     ax,0aah
ram_aa_load:
        mov     [bx],al         ; load memory with AAs first
        inc     bx
        loop    ram_aa_load
        mov     bx,si           ; restore start of test address
        mov     cx,bp           ; restore length of test
        dec     bx
ram_aa_check:
        inc     bx
        cmp     [bx],al         ; see if we still have AAs
        loope   ram_aa_check    ; loop as long as we get AAs back
        je      ram_low_test    ; last compare matched - test passed
        mov     dx,1            ; dx = # bad bytes found so far
        mov     si,bx           ; scan on from the first bad address
ram_aa_bad:                     ; failure found - find length of bad block
        jcxz    ram_aa_fail     ; no bytes left to check
        inc     si
        dec     cx              ; dec total # left to test
        cmp     [si],al         ; see if this location is also bad
        je      ram_aa_fail     ; good byte - end of the bad block found
        inc     dx              ; inc # bad bytes found
        jmp     ram_aa_bad
ram_aa_fail:
        jmp     ram_bad         ; ax still holds AA
;
; Run the low address test
;
ram_low_test:
        mov     bx,si           ; start address of test
        mov     cx,bp           ; length of test
ram_low_load:
        mov     [bx],bl         ; load memory with low address
        inc     bx
        loop    ram_low_load
;
; Add in a delay of about 2-3 seconds here to let DRAMs "forget" if
; refresh is not active.
;
        mov     bx,si           ; restore start of test address
        mov     cx,bp           ; restore length of test
        dec     bx
ram_low_check:
        inc     bx
        cmp     [bx],bl         ; see if we still have low address
        loope   ram_low_check   ; loop as long as we get good data back
        je      ram_hi_test     ; last compare matched - test passed
        mov     dx,1            ; dx = # bad bytes found so far
        mov     si,bx           ; scan on from the first bad address
ram_low_bad:                    ; failure found - find length of bad block
        jcxz    ram_low_fail    ; no bytes left to check
        inc     si
        dec     cx              ; dec total # left to test
        mov     ax,si           ; get address we stored
        cmp     [si],al         ; see if this location is also bad
        je      ram_low_fail    ; good byte - end of the bad block found
        inc     dx              ; inc # bad bytes found
        jmp     ram_low_bad
ram_low_fail:
        mov     ax,1            ; 1 indicates low address failure
        jmp     ram_bad
;
; Run the high address test
;
ram_hi_test:
        mov     bx,si           ; start address of test
        mov     cx,bp           ; length of test
        mov     ax,0            ; assume test will pass
ram_hi_load:
        mov     [bx],bh         ; load memory with high address
        inc     bx
        loop    ram_hi_load
;
; Add in a delay of about 2-3 seconds here to let DRAMs "forget" if
; refresh is not active.
;
        mov     bx,si           ; restore start of test address
        mov     cx,bp           ; restore length of test
        dec     bx
ram_hi_check:
        inc     bx
        cmp     [bx],bh         ; see if we still have high address
        loope   ram_hi_check    ; loop as long as we get good data back
        je      ram_done        ; last compare matched - test passed
        mov     dx,1            ; dx = # bad bytes found so far
        mov     si,bx           ; scan on from the first bad address
ram_hi_bad:                     ; failure found - find length of bad block
        jcxz    ram_hi_fail     ; no bytes left to check
        inc     si
        dec     cx              ; dec total # left to test
        mov     ax,si           ; get address we stored
        cmp     [si],ah         ; see if this location is also bad
        je      ram_hi_fail     ; good byte - end of the bad block found
        inc     dx              ; inc # bad bytes found
        jmp     ram_hi_bad
ram_hi_fail:
        mov     ax,2            ; 2 indicates high address failure
ram_bad:
        mov     cx,dx           ; cx=# bad bytes
                                ; ax=test number
                                ; bx=start of bad address
ram_done:                       ; exit point
code    ends
data    segment public
data    ends
        end
********************************************************
Figure 2 follows:
********************************************************
        title   CRC program
code    segment public
        assume  cs:code,ds:data
;
; this routine will compute a CRC of a block of
; data starting at the address in DS:BX with the
; length in CX. Don't try to exceed a 64k segment (keep
; BX+CX <= FFFF).
;
; The CRC will be computed into register DX and compared to
; a value saved in ROM.
;
rom_test:
        mov     dx,0ffffh       ; initialize CRC to -1
rom_loop:
        mov     al,[bx]         ; get a character to CRC
        inc     bx              ; pt to next value
        xor     al,dl           ; compute crc
        mov     ah,al           ; save temp result
        shr     al,1            ; shift right 4
        shr     al,1
        shr     al,1
        shr     al,1
        xor     al,ah           ; xor temp with partial product
        mov     ah,al           ; new temp
        shl     al,1            ; shift left 4
        shl     al,1
        shl     al,1
        shl     al,1
        xor     al,dh           ; combine with high crc
        mov     dl,al           ; save low result
        mov     al,ah
        shr     al,1            ; shift right 3
        shr     al,1
        shr     al,1
        xor     al,dl
        mov     dl,al
        mov     al,ah           ; get temp back
        shl     al,1            ; shift left 5
        shl     al,1
        shl     al,1
        shl     al,1
        shl     al,1
        xor     al,ah
        mov     dh,al           ; high crc result
        dec     cx              ; dec data byte count
        jnz     rom_loop        ; loop till all CRCed
        cmp     dx,word ptr cs:crc ; crc match?
        jnz     rom_error       ; error if no match
        jmp     rom_ok          ; jmp if ok
crc:    dw      0               ; save crc here
rom_error:                      ; error location - flag an error
rom_ok:                         ; rom crc compare ok
code    ends
data    segment public
data    ends
        end