The Zen of Diagnostics

This is the first of a two-part series about adding diagnostics to your programs.

Published in Embedded Systems Programming, June 1990

By Jack Ganssle

For some inexplicable reason most of us embedded programmers rarely concern ourselves with making a product manufacturable. Sure, we work like slaves meeting performance goals, but the product must do more than function correctly - it must be designed to be producible. While our hardware brethren tune their designs to meet cost, manufacturing, and performance goals, we work in relative isolation; a sort of black hole where few dare to tread.

In high school I worked for a while in a machine shop, serving as the lowest sort of helper to highly skilled machinists making parts for the space program. When they discovered I intended to go to college to become an engineer, one grizzled veteran warned me to not "design something we can't make". Now, twenty years later, I have to admit that these words, so casually and freely given, have been more important than most of the EE courses I struggled through later. Designing the most fantastic widget ever conceived is suicidal if it cannot be made and marketed at a profit.

In a typical manufacturing operation, boards are stuffed, assembled into units, tested and perhaps repaired. While the code doesn't impact board stuffing and assembly operations, it can strongly influence product test and repair. Smart designers will produce a product that easily fits into the company's manufacturing operation; software engineers can contribute by writing code that speeds the daily grind of production test.

All employees are hopefully working towards common corporate goals, yet each has a different vision of the company's needs and problems. To a programmer the word "testing" conjures images of correctness proofs, exhaustive software trials, and code coverage analysis. A production person probably has never heard of any of these concepts; he looks at testing as the daily routine of ensuring each and every unit works correctly before being shipped.

Conventional software test is a one time event; once the product is complete it is over, forever (well, would you believe...). Product test goes on every day. Very complex products are tested and repaired by technicians with little formal computer training. The best are usually culled and assigned to work in engineering support, leaving production with workers who may be skilled but who are certainly not rocket scientists. As software engineers it is our responsibility to the company to give the techs the tools they need to ship product. As software managers, it is our responsibility to convince management that this is an important and desirable goal.

Internal Diagnostics

Quite a few embedded systems include diagnostics as part of the product's ROM to give a sort of "go/no-go" indication without using other test equipment. The unit's own display or status lamps show test results.

Internal diagnostics are worthwhile because they do give the test technician some ability to track down problems. They're also an effective marketing tool, giving the customer a (possibly false) feeling of confidence in the integrity of the product each time he turns it on.

Though internal diagnostics are often viewed as a universal solution to finding system problems, their value lies more in giving a crude test of limited system functions. Not that this isn't valuable. Internal diagnostics can test quite a bit of the unit's I/O and some of the "kernel", or the CPU, RAM, and ROM areas.

The computer's kernel frequently defies standalone testing, since so much of it must be functional for the diagnostics to run at all. Most systems couple at least the main ROM and RAM closely to the processor. The result - a single address, data, or control line short prevents the program from running at all.

It's easy to waste a lot of time coding internal diagnostics that will never provide useful information. They may satisfy vague marketing promises of self-testing capability, but why write dishonest code? Realize that internal diagnostics have intrinsic limitations, but if carefully designed can yield some valuable information. Apply your engineering expertise to the diagnostic problem; carefully analyze the tradeoffs and identify reasonable ways to find at least some of the common hardware failures. The first step is to separate the tests into kernel (CPU, RAM, ROM, and decoders) and beyond-kernel (I/O) tests. Then consider the most likely failure modes; try to design tests that will first survive the failure, and second will identify and report it. Yes, the kernel tests will never be very robust since lots of hardware glitches will prevent the program from running at all. But, if carefully designed, they'll really help your buddies in production.

I/O tests can run the gamut from a simple LED blinking routine to A/D and D/A loopbacks that check converter linearity. I/O is just too big a subject to address in a short article. Today's range of peripherals is so mind-boggling that several large books might not adequately cover the subject of testing even the most common devices. I won't attempt to delve into a discussion of I/O tests here.

What portions of the kernel should be tested? Some programmers have a tendency to test the CPU chip itself, running through a sequence of instructions "guaranteed" to prove that this essential chip is operating properly. Witness the ubiquitous PC's BIOS CPU tests. I wonder just how often failures are detected. On the PC an instruction test failure makes the code execute a HALT, causing the CPU to look just as dead as if it never started. More extensive error reporting with a defective CPU is a fool's dream. Instruction tests stem from minicomputer days, where hundreds of discrete ICs were needed to implement the CPU; a single chip failure might only shut down a small portion of the processor. Today's highly integrated parts tend to either work or not; partial failures are pretty rare.

Similarly, memory tests rely on operating memory to run - a high tech oxymoron that makes one question their value. Obviously, if the ROM is not functioning, then the program won't start and the diagnostic will not be invoked. Or, if the diagnostic is coded with subroutine calls, a RAM (read "stack") failure will prematurely crash the test, before providing any useful information.

The moral is to design diagnostics so that each test uses as few unproven components as possible.

A carefully engineered RAM test can be quite valuable. In a multi-ROM system, it (and all the diagnostics) should be stored in the boot ROM, preferably near the reset address. The low order address lines must operate to run even a trivial program, and enough of the upper ones must work to select the boot ROM; particularly in a small system with undecoded address lines, try to write test routines that rely on a minimal number of working address wires.

RAM tests commonly write out a pattern of alternating ones and zeroes, read the data back, and repeat the test using data that is the complement of the first set. This amateurish approach reflects poor analysis of the problem. Before writing code, put on your hardware hat (or get some help from another member of the design team) and consider the most likely failure modes. Tailor the test to identify all or most of these problems. A typical list includes:

Address/data line shorts - In nine times out of ten these problems will crash the diagnostic. Sometimes RAM is isolated from the kernel by buffers; in this case post-buffer problems can be found.
No chip select - One or more signals may be needed to turn each device on. The chip select complexity can range from a simple undecoded address line to a nightmarish spaghetti of PALs and logic. Regardless, every RAM must receive a proper chip select to function.
Pin not inserted - Socketed RAM devices may not be properly seated. Sometimes a pin bends under the device.
Bad device - The semiconductor vendors do a wonderful job of delivering functioning chips. Rarely, though, a bad one may slip through (or, through mishandling, one may "toggle to the bad state"). Device geometry is now so small that it is unusual to see the pattern sensitivity that once plagued DRAMs. Usually the entire chip is just plain bad, making it a lot easier to identify problems.
Multiple addressing - This is a variant of the chip select problem. If more than one memory device is used, several can sometimes be turned on at once.
Refresh - Dynamic RAMs require periodic accesses to keep the memory alive. This refresh signal is generated by external logic, or by the CPU itself. Sometimes the refresh circuitry fails. The entire memory array must be refreshed every few milliseconds to stay within the chips' specifications, but it's surprising just how long DRAMs can remember their data after losing refresh. One to two seconds seems to be the extreme limit.

With the above in mind, we can design a routine to test RAMs. The first criterion is that the test itself certainly cannot make use of RAM!

Several of the failure modes manifest themselves by the inability to store data. For example, data bus problems, a bad device, chip select failures, or an incorrectly inserted pin will usually exhibit a simple read/write error. The traditional write and read of a 55, followed by an AA, will find these problems quickly.
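To see why the two complementary patterns are used, here is a small host-side sketch in C. It is an illustration only: the fault models (data line D3 stuck low, and D0 shorted to D1 with the pair assumed to settle at the wired-AND of the two levels) and all of the names are assumptions invented for the demonstration, and C itself needs a working stack, so nothing like this could run on unproven RAM.

```c
#include <stdint.h>

/* Hypothetical fault models for a single byte-wide RAM location. */
typedef enum { FAULT_NONE, FAULT_D3_STUCK_LOW, FAULT_D0_D1_SHORTED } Fault;

/* What reads back after `value` is written through a faulty data path. */
uint8_t store_and_read(uint8_t value, Fault f)
{
    switch (f) {
    case FAULT_D3_STUCK_LOW:
        return (uint8_t)(value & ~0x08u);
    case FAULT_D0_D1_SHORTED: {
        /* Assumption: shorted lines settle at the AND of both levels. */
        uint8_t both = (uint8_t)((value & 1u) & ((value >> 1) & 1u));
        return (uint8_t)((value & ~0x03u) | both | (both << 1));
    }
    default:
        return value;
    }
}

/* Returns 0 if both patterns survive, else the pattern that failed. */
int pattern_test_cell(Fault f)
{
    static const uint8_t patterns[2] = { 0x55, 0xAA };
    for (int i = 0; i < 2; i++) {
        if (store_and_read(patterns[i], f) != patterns[i])
            return patterns[i];
    }
    return 0;
}
```

Under these assumptions the stuck D3 line passes the 55 write (bit 3 of 55 is already 0) and is caught only by AA, which is why both passes are run. The D0/D1 short is caught by 55, yet it would slip past a 00/FF pair, since both lines carry the same level in those patterns.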

Writing a pattern of 55s and AAs tests the ability of the devices to hold data, but it doesn't ensure that the RAMs are being addressed correctly. Examples of failures that could pass this simple test are: a post-buffer address short, an open address line (say, from a pin not being inserted properly), or chip select failures causing multiple addressing. It's important to run a second routine that isolates these not-uncommon problems.

An addressing test works by writing a unique data value to each location, and then reading the memory to see that the value is still stored. An easy-to-compute pattern is the low order address of the location; at 100 store 00, at 101 store a 01, etc. This isn't really unique, since an 8 bit location can only store 256 different values. If we repeat the test, using the locations' high order address bytes as the data pattern, then (for memory sizes to 64k) after two passes the entire array will be tested uniquely. The first test ensures that address lines A0 to A7 function correctly; the second checks lines A8 to A15 and also the chip select logic.

Since this test also ensures that the RAM can store data, why do the 55, AA check? Consider any individual address. At location 0, the addressing diagnostic will write 0 (the low address) and later another 0 (the high address). While addressability will be confirmed, some doubt remains about its ability to store data. The 55, AA check tests every bit.
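The division of labor between the two tests can be seen in a small host-side simulation. The C sketch below is an illustration only, not a substitute for the RAM-free assembly routine: C needs a working stack. The `SimRam` structure, the 1K array size, and the fault model (an open address line that always reads as 0, so several addresses share one cell) are assumptions made for the demonstration.

```c
#include <stdint.h>

#define SIM_SIZE 1024u   /* simulated RAM size; arbitrary choice for the demo */

/* Hypothetical fault model: address bits cleared in addr_mask act as open
 * lines that always read 0, so two addresses select the same cell. */
typedef struct {
    uint8_t  cells[SIM_SIZE];
    uint16_t addr_mask;
} SimRam;

static void sim_write(SimRam *r, uint16_t addr, uint8_t value)
{
    r->cells[addr & r->addr_mask & (SIM_SIZE - 1u)] = value;
}

static uint8_t sim_read(const SimRam *r, uint16_t addr)
{
    return r->cells[addr & r->addr_mask & (SIM_SIZE - 1u)];
}

/* 55/AA data test: returns 0 on success, else the pattern that failed. */
int pattern_test(SimRam *r)
{
    static const uint8_t patterns[2] = { 0x55, 0xAA };
    for (int p = 0; p < 2; p++) {
        for (uint16_t a = 0; a < SIM_SIZE; a++)
            sim_write(r, a, patterns[p]);
        for (uint16_t a = 0; a < SIM_SIZE; a++)
            if (sim_read(r, a) != patterns[p])
                return patterns[p];
    }
    return 0;
}

/* Addressing test: pass 1 stores each location's low address byte, pass 2
 * its high address byte. Returns 0 on success, 1 if the low pass failed,
 * 2 if the high pass failed. */
int address_test(SimRam *r)
{
    for (int pass = 1; pass <= 2; pass++) {
        int shift = (pass == 1) ? 0 : 8;
        for (uint16_t a = 0; a < SIM_SIZE; a++)
            sim_write(r, a, (uint8_t)(a >> shift));
        for (uint16_t a = 0; a < SIM_SIZE; a++)
            if (sim_read(r, a) != (uint8_t)(a >> shift))
                return pass;
    }
    return 0;
}
```

With `addr_mask` set to 0xFFFF both tests pass. Clearing bit 3 (A3 open) leaves `pattern_test` passing while `address_test` returns 1. Clearing bit 9 (A9 open) slips past the low pass as well, because the aliased locations share the same low address byte, and is caught only by the high pass, which returns 2.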

The peril of these diagnostics is that the address lines cycle throughout all of memory as the test proceeds. If the refresh circuit has failed, most likely the test itself will keep DRAMs alive. This is the worst possible situation; the process of testing camouflages failures. A simple solution is to add a long delay after writing a pattern and before doing the read-back. This delay should be on the order of several seconds. It is also important to constrain the test code to a small area, so the CPU's instruction fetches don't create artificial refreshes.
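The camouflage effect can be shown with a toy model. In the C sketch below, every access is assumed to refresh the whole array (a crude stand-in for the address lines cycling through memory), and a DRAM with dead refresh is assumed to lose its contents after `RETENTION_MS` of idling. The structure, the names, and the 2000 ms figure are all assumptions for illustration; real parts behave less tidily.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DRAM_SIZE     256u
#define RETENTION_MS 2000u   /* assumed hold time with no refresh at all */

typedef struct {
    uint8_t  cells[DRAM_SIZE];
    int      refresh_ok;     /* 0 models a failed refresh circuit */
    unsigned idle_ms;        /* time since the last access */
} SimDram;

/* Let simulated time pass with no memory accesses. */
static void sim_idle(SimDram *d, unsigned ms)
{
    d->idle_ms += ms;
    if (!d->refresh_ok && d->idle_ms > RETENTION_MS)
        memset(d->cells, 0, sizeof d->cells);   /* data has leaked away */
}

/* In this toy model any access keeps the array alive. */
static void sim_write(SimDram *d, size_t a, uint8_t v)
{
    d->idle_ms = 0;
    d->cells[a] = v;
}

static uint8_t sim_read(SimDram *d, size_t a)
{
    d->idle_ms = 0;
    return d->cells[a];
}

/* Write a pattern, wait delay_ms, read it back. Returns 0 if it survived. */
int hold_test(SimDram *d, uint8_t pattern, unsigned delay_ms)
{
    for (size_t a = 0; a < DRAM_SIZE; a++)
        sim_write(d, a, pattern);
    sim_idle(d, delay_ms);
    for (size_t a = 0; a < DRAM_SIZE; a++)
        if (sim_read(d, a) != pattern)
            return 1;
    return 0;
}
```

With `refresh_ok` set to 0, `hold_test(&d, 0x55, 0)` still passes because the test's own accesses keep the data alive, while `hold_test(&d, 0x55, 3000)` fails. A healthy part passes either way.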

In the bad old days of small DRAMs, manufacturing defects and alpha particles caused some memories to exhibit pattern sensitivity problems; selected cells would not hold a particular byte if a nearby cell held another specific byte. Elaborate tests were devised to isolate these problems. The "Walking Ones" test, in particular, burned an enormous amount of computer time and could find really complex pattern failures. Fortunately these sorts of problems just don't show up anymore.

Figure 1 is a routine that performs all of these tests. It is cumbersomely coded in 8088 assembly language, using no RAM at all. Simpler, prettier code using CALLs and RETs just will not be dependable, since it would rely on the very RAM we're testing.

While it is easy to see some justification for testing the product's RAM, ROM tests are perhaps not as obviously valuable. If the ROM is not working, how can it test itself? Physician - heal thyself (but not if you're in a coma). As always, a completely dead kernel, one that just doesn't even boot, cannot run diagnostics. If the boot ROM does at least partially work, then some testing is valuable.

In the boot ROM itself we can realistically expect to detect only a simple failure, like a partially programmed device, although with some luck it might be possible to find a shorted or open high order address line. Luck, because if the line floats or is tied to the wrong level, then the diagnostic code will not start.

ROMs are tedious to program - sometimes technicians will unknowingly, in their impatience, remove the chip before the programmer is completely done. If you do elect to include a ROM test, be sure to locate it early in the code so it stands a chance of executing even if the ROM is not entirely programmed. It's easier to make an argument for ROM testing in multiple ROM systems. If the boot ROM starts, then diagnostics located in it can test all of the others.

While the memories can fail in a number of ways, probably the most common is a pin that never made it into the socket. If you have worked on electronics, you'll know that it can be awfully hard to tell if all pins are in the sockets. Other problems cover the usual range of broken circuit board tracks (i.e., address, data, control lines), misprogrammed devices, and non-functioning chip select lines.

One way to test ROMs is to read the devices and compare each byte to a known value. Since such redundancy is impractical, most programmers simply compute an 8 or 16 bit sum of the data in the ROMs and compare it to a known value. Usually this is adequate, but a number of pathological cases will report incorrect results. For example, a long string of zeroes will always checksum to zero, regardless of the number of items summed.
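The weakness is easy to demonstrate. The C sketch below is a hypothetical 16 bit additive checksum (the name `checksum16` is mine, and it is meant for a host machine, not for the boot ROM).

```c
#include <stddef.h>
#include <stdint.h>

/* Simple 16 bit additive checksum: sum of all bytes, modulo 65536. */
uint16_t checksum16(const uint8_t *data, size_t len)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint16_t)(sum + data[i]);
    return sum;
}
```

Sixteen zero bytes and a thousand zero bytes both sum to 0, so a ROM that reads back as all zeroes over the wrong length goes unnoticed. Because addition is commutative, the sum also cannot tell when bytes have been swapped: the sequences 12 34 56 and 56 12 34 produce the same result.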

A much better approach than a simple checksum is the Cyclic Redundancy Check (CRC). The CRC is a polynomial that is seeded, typically with FFFF, and then divided into the input data (in this case the ROM data) a byte at a time, using the remainder at each step as the new seed. While mathematically complicated, the CRC is pretty easy to implement. Its great virtue is that each byte of repeated strings (say, zeroes) will yield a different CRC. The CRC is a bit harder to code than a simple checksum, but the code listed in Figure 2 is a cookbook solution. Once again, it is written in 8088 assembly language to ensure it uses the minimal number of as-yet-untested CPU resources.
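For readers who find the shifts in Figure 2 hard to follow, here is my line-by-line rendering of the same arithmetic in C. It is a sketch for a host machine (useful for computing the expected value), not something to run from an untested boot ROM. The function name `crc_fig2` is an assumption. As far as I can tell the arithmetic matches the common CCITT polynomial (x^16 + x^12 + x^5 + 1) with the two result bytes exchanged in DX, though the listing itself does not name the polynomial.

```c
#include <stddef.h>
#include <stdint.h>

/* C rendering of the Figure 2 loop. dl and dh stand for the two halves
 * of register DX; the return value is what the listing leaves in DX. */
uint16_t crc_fig2(const uint8_t *data, size_t len)
{
    uint8_t dl = 0xFF, dh = 0xFF;            /* mov dx,0ffffh            */

    for (size_t i = 0; i < len; i++) {
        uint8_t t = (uint8_t)(data[i] ^ dl); /* xor al,dl                */
        t ^= (uint8_t)(t >> 4);              /* shift right 4, xor temp  */
        /* new low half: (temp << 4) ^ old high half ^ (temp >> 3)       */
        dl = (uint8_t)((uint8_t)(t << 4) ^ dh ^ (t >> 3));
        /* new high half: (temp << 5) ^ temp                             */
        dh = (uint8_t)((uint8_t)(t << 5) ^ t);
    }
    return (uint16_t)((dh << 8) | dl);
}
```

Working this through by hand, a single zero byte gives F0E1 and two zero bytes give 0F1D, so unlike the simple sum the result changes with every added zero.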

A CRC or checksum test is easy to code and yields useful information, but is sometimes a nuisance to implement because the correct value must be computed and stored in ROM. No assembler or compiler has a built-in function to build this automatically, so generally you must run the program under an emulator, record the CRC or checksum the routine computes, and then manually patch the resulting value into an absolute location in ROM. The only way to automate this is to write a short program that CRCs the linker output file and patches the result into the ROM file.
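The patch step of such a program might look like the C sketch below. It works on a ROM image already loaded into memory, since reading the linker's output format is specific to each toolchain and is left out. The names `patch_crc` and `rom_crc_ok`, the offsets, and the choice to store the value low byte first (the way a `dw` lays out a word on the 8088) are assumptions; `crc_fig2` repeats the Figure 2 arithmetic so the block stands alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same arithmetic as the Figure 2 loop; returns the value left in DX. */
static uint16_t crc_fig2(const uint8_t *data, size_t len)
{
    uint8_t dl = 0xFF, dh = 0xFF;
    for (size_t i = 0; i < len; i++) {
        uint8_t t = (uint8_t)(data[i] ^ dl);
        t ^= (uint8_t)(t >> 4);
        dl = (uint8_t)((uint8_t)(t << 4) ^ dh ^ (t >> 3));
        dh = (uint8_t)((uint8_t)(t << 5) ^ t);
    }
    return (uint16_t)((dh << 8) | dl);
}

/* CRC image[start .. start+len) and store the result, low byte first,
 * at image[slot]. The slot must lie outside the CRCed range, otherwise
 * patching it would change the very value being stored. */
uint16_t patch_crc(uint8_t *image, size_t image_len,
                   size_t start, size_t len, size_t slot)
{
    assert(start + len <= image_len && slot + 2 <= image_len);
    assert(slot + 2 <= start || slot >= start + len);

    uint16_t crc = crc_fig2(image + start, len);
    image[slot]     = (uint8_t)(crc & 0xFF);
    image[slot + 1] = (uint8_t)(crc >> 8);
    return crc;
}

/* What the target's compare amounts to: recompute and check the slot. */
int rom_crc_ok(const uint8_t *image, size_t start, size_t len, size_t slot)
{
    uint16_t stored = (uint16_t)(image[slot] | (image[slot + 1] << 8));
    return crc_fig2(image + start, len) == stored;
}
```

The range check is the design point worth copying: if the stored word sits inside the block being summed, the test length passed to the ROM routine has to stop short of it.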

In Conclusion

It is best to address the diagnostics problem using the engineering thought processes our "significant others" have learned to hate. Examine the system dispassionately and analytically, looking for all possible failure modes. Study the tradeoffs. Look for alternate approaches. Then, implement the best possible solution that does a reasonable job of solving the problem yet doesn't take too much programming time. No one said it would be easy.

Next month I'll look at another aspect of diagnostics - how do you report an error? We'll also look at external diagnostics that help out the technician when the system won't even boot.

Figure 1:
**********************************************************
	title	Ram test code
code	segment public
	assume	cs:code,ds:data
;
;  This routine performs a RAM test on an 80x88 type machine.
;
;  This test is sequenced through the following phases:
;
;	55 written to all addresses and read back
;	AA written to all addresses and read back
;	Low order address written and read
;	High order address written and read
;
;  Note that this code makes no use of RAM, so some of its
; structure may seem arbitrarily cumbersome.
;
; The test is not coded as a traditional subroutine with a
; "return" instruction. Your code must flow into it - it 
; leaves at exit point ram_done.
;
; On entry, the address to test is contained in register
; DS:BX. The length of the test is in CX. It does not
; support overflows into the segment register; be sure
; that BX+CX is no more than FFFF.
;
; On exit, if the test fails register AX will be set non-zero,
; and DS:BX will contain the address of the first failed byte.
; AX will be set to 55 if the 55 test failed, AA if the AA test
; failed, 1 if the low address test failed, and 2 if the high 
; test failed. Register CX will contain the number of contiguous
; bytes that are bad; note that only one contiguous block is 
; tagged.
;
; If the RAM passes the test, then AX will be 0.
;
ram_test:
	mov	si,bx		; save start address in si
	mov	bp,cx		; save length in bp
	mov	ax,55h
ram_55_load:
	mov	[bx],al		; load memory with 55s first
	inc	bx
	loop	ram_55_load
	mov	bx,si		; restore start of test address
	mov	cx,bp		; restore length of test
	dec	bx
ram_55_check:
	inc	bx
	cmp	[bx],al		; see if we still have 55s
	loope	ram_55_check	; loop as long as we get 55s back
				; loope leaves the flags from cmp alone;
				; testing cx here would miss a bad last byte
	je	ram_aa_test	; last compare matched - test passed
	mov	dx,0		; dx will be # bad bytes found for now
	mov	si,bx		; set first bad address
ram_55_bad:			; failure found - find length of bad block
	inc	si
	inc	dx		; inc # bad bytes found
	jcxz	ram_bad1	; no bytes left - end of test
	dec	cx		; dec total # to test
	cmp	[si],al		; see if this location is also bad
	jnz	ram_55_bad
ram_bad1:
	jmp	ram_bad		; end of the bad block found
;
; Run the AA test
;
ram_aa_test:
	mov	bx,si		; start address of test
	mov	cx,bp		; length of test
	mov	ax,0aah
ram_aa_load:
	mov	[bx],al 	; load memory with AAs first
	inc	bx
	loop	ram_aa_load
	mov	bx,si		; restore start of test address
	mov	cx,bp		; restore length of test
	dec	bx
ram_aa_check:
	inc	bx
	cmp	[bx],al 	; see if we still have AAs
	loope	ram_aa_check	; loop as long as we get AAs back
				; loope leaves the flags from cmp alone;
				; testing cx here would miss a bad last byte
	je	ram_low_test	; last compare matched - test passed
	mov	dx,0		; dx will be # bad bytes found for now
	mov	si,bx		; set first bad address
ram_aa_bad:			; failure found - find length of bad block
	inc	si
	inc	dx		; inc # bad bytes found
	jcxz	ram_bad 	; no bytes left - end of test
	dec	cx		; dec total # to test
	cmp	[si],al		; see if this location is also bad
	jnz	ram_aa_bad
	jmp	ram_bad 	; end of the bad block found
;
; Run the low address test
;
ram_low_test:
	mov	bx,si		; start address of test
	mov	cx,bp		; length of test
ram_low_load:
	mov	[bx],bl 	; load memory with low address
	inc	bx
	loop	ram_low_load
;
; NOTE: no delay is coded here. Insert a register-only delay of about
; 2-3 seconds (the loop count depends on your CPU clock) to let DRAMs
; "forget" if refresh is not active.
;
	mov	bx,si		; restore start of test address
	mov	cx,bp		; restore length of test
	dec	bx
ram_low_check:
	inc	bx
	cmp	[bx],bl 	; see if we still have low address
	loope	ram_low_check	; loop as long as we get good data back
				; loope leaves the flags from cmp alone;
				; testing cx here would miss a bad last byte
	je	ram_hi_test	; last compare matched - test passed
	mov	dx,0		; dx will be # bad bytes found for now
	mov	si,bx		; set first bad address
ram_low_bad:			; failure found - find length of bad block
	inc	si
	inc	dx		; inc # bad bytes found
	jcxz	ram_low_end	; no bytes left - end of test
	dec	cx		; dec total # to test
	mov	ax,si		; get address we stored
	cmp	[si],al 	; see if this location is also bad
	jnz	ram_low_bad
ram_low_end:			; every exit from this test sets ax
	mov	ax,1		; 1 indicates low address failure
	jmp	ram_bad 	; end of the bad block found
;
; Run the high address test
;
ram_hi_test:
	mov	bx,si		; start address of test
	mov	cx,bp		; length of test
	mov	ax,0		; assume test will pass
ram_hi_load:
	mov	[bx],bh 	; load memory with high address
	inc	bx
	loop	ram_hi_load
;
; NOTE: no delay is coded here. Insert a register-only delay of about
; 2-3 seconds (the loop count depends on your CPU clock) to let DRAMs
; "forget" if refresh is not active.
;
	mov	bx,si		; restore start of test address
	mov	cx,bp		; restore length of test
	dec	bx
ram_hi_check:
	inc	bx
	cmp	[bx],bh 	; see if we still have high address
	loope	ram_hi_check	; loop as long as we get good data back
				; loope leaves the flags from cmp alone;
				; testing cx here would miss a bad last byte
	je	ram_done	; last compare matched - test passed
	mov	dx,0		; dx will be # bad bytes found for now
	mov	si,bx		; set first bad address
ram_hi_bad:			; failure found - find length of bad block
	inc	si
	inc	dx		; inc # bad bytes found
	jcxz	ram_hi_end	; no bytes left - end of test
	dec	cx		; dec total # to test
	mov	ax,si		; get address we stored
	cmp	[si],ah 	; see if this location is also bad
	jnz	ram_hi_bad
ram_hi_end:			; every exit from this test sets ax
	mov	ax,2		; 2 indicates high address failure
	jmp	ram_bad 	; end of the bad block found
ram_bad:      
	mov	cx,dx		; cx=# bad bytes
				; ax=test number
				; bx=start of bad address
ram_done:			; exit point
code	ends
data	segment public
data	ends
	end
********************************************************
Figure 2:
********************************************************
	title	CRC program
code	segment	public
	assume	cs:code,ds:data
;
;  this routine will compute a CRC of a block of
; data starting at the address in DS:BX with the 
; length in CX. Don't try to exceed a 64k segment (keep
; BX+CX <= FFFF).
;
;   The CRC will be computed into register DX and compared to
; a value saved in ROM.
;
rom_test:
	mov	dx,0ffffh	; initialize CRC to -1
rom_loop:
	mov	al,[bx]		; get a character to CRC
	inc	bx		; pt to next value
	xor	al,dl		; compute crc
	mov	ah,al		; save temp result
	shr	al,1		; shift right 4
	shr	al,1
	shr	al,1
	shr	al,1
	xor	al,ah		; xor temp with partial product
	mov	ah,al		; new temp
	shl	al,1		; shift left 4
	shl	al,1
	shl	al,1
	shl	al,1
	xor	al,dh		; combine with high crc
	mov	dl,al		; save low result
	mov	al,ah
	shr	al,1		; shift right 3
	shr	al,1
	shr	al,1
	xor	al,dl
	mov	dl,al
	mov	al,ah		; get temp back
	shl	al,1		; shift left 5
	shl	al,1
	shl	al,1
	shl	al,1
	shl	al,1
	xor	al,ah
	mov	dh,al		; high crc result
	dec	cx		; dec data byte count
	jnz	rom_loop	; loop till all CRCed
	cmp	dx,word ptr cs:crc; crc match?
	jnz	rom_error	; error if no match
	jmp	rom_ok		; jmp if ok
crc:	dw	0		; save crc here
rom_error:			; error location - flag an error
rom_ok:				; rom crc compare ok 
code	ends
data	segment	public
data	ends
	end