Thanks for the Memories
By Jack Ganssle
It doesn't take much to make at least the kernel of an embedded system run. With a working CPU chip, memories that do their thing, perhaps a dash of decoder logic, you can count on the code starting off... perhaps not crashing until running into a problem with I/O.
Though the kernel may be relatively simple, with the exception of the system's power supply it's by far the most intolerant portion of an embedded system to any sort of failure. The tiniest glitch, a single bit failure in a huge memory array, or any problem with the processor pretty much guarantees that nothing in the system stands a chance of running.
Non-kernel failures may not be so devastating. Some I/O troubles will cause just part of the system to degrade, leaving much of the rest up. My car's black box seems to have forgotten how to run the cruise control, yet it still keeps the fuel injection and other systems running.
In the minicomputer era most machines booted with a CPU test that checked each instruction. That level of paranoia is no longer appropriate, as a highly integrated CPU will generally fail disastrously. If the processor can execute any sort of a self test, it's pretty much guaranteed to be intact.
Dead decoder logic is just as catastrophic. No code will execute if the ROMs can't be selected.
A smart technician can spot a dead decoder in a heartbeat using not much more than a scope. He can make a pretty good guess that the processor is history by looking for bizarre outputs (no clock-out; no read/write; tristated address lines right after reset), or by "shotgunning": replacing the chip with a known-good one and seeing if the problems disappear.
Large memory arrays, though, can suffer from partial failures that are just about impossible to troubleshoot. A defective RAM is tough to find by any method other than shotgunning. A handful of bad locations in ROM are equally difficult to detect.
Lots of designers realize that memories are a potential source of trouble, so include diagnostics in the firmware. Good idea! Given that there's no realistic way for a technician to find a memory problem, a little software designed to pick up these will sure make you friends in the test department.
If your boot ROM is totally misprogrammed or otherwise non-functional, then there's no way a ROM test will do anything other than crash. The value of a ROM test is limited to dealing with partially programmed devices (due, perhaps, to incomplete erasure, or inadvertently removing the device before completion of programming).
There's a small chance that ROM tests will pick up an addressing problem, if you're lucky enough to have a failure that leaves the boot and ROM test working. The odds are against it, and somehow Mother Nature tends to be very perverse.
Some developers feel that a ROM checksum makes sense to ensure the correct device is inserted. This works only if the checksum is stored outside of the ROM under test. Otherwise, inserting a device with the wrong code version will not show an error, as presumably the code will match the (also obsolete) checksum.
In multiple-ROM systems a checksum test can indeed detect misprogrammed devices, assuming the test code lives in the boot ROM. If this one device functions, and you write the code so that it runs without relying on any other ROM, then the test will pick up many errors.
Checksums, though, are passé. It's pretty easy for a couple of errors to cancel each other out. Compute a CRC (Cyclic Redundancy Check) instead: in effect a polynomial division, usually built as a shift register with terms fed back at various stages. CRCs are notoriously misunderstood but are really quite easy to implement. The best reference I have seen to date is "A Painless Guide to CRC Error Detection Algorithms", by Ross Williams. It's available via anonymous FTP from ftp.adelaide.edu.au/pub/rocksoft/crc_v3.txt.
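As a minimal sketch of the difference, the C below puts a simple additive checksum next to a bitwise CRC-16. The function names (`sum8`, `crc16_ccitt`) are hypothetical, and the polynomial (0x1021) and seed (0xFFFF) are just one common choice assumed for illustration; nothing here is tied to a particular ROM layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Simple additive checksum: the sum of all bytes, modulo 256. */
uint8_t sum8(const uint8_t *data, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + data[i]);
    return sum;
}

/* Bitwise CRC-16, MSB first. Polynomial 0x1021 and seed 0xFFFF are
 * assumed, illustrative parameters. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x1021); /* feed back the polynomial */
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}
```

Feed both routines the bytes 10 20 30 40 (hex) and then 11 1F 30 40: one byte is off by +1 and its neighbor by -1, so the sums agree, while the CRCs differ.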
It's not a bad idea to add death traps to your ROM. On a Z80 0xff is RST 38h, a one byte call to location 0x0038. Conveniently, unprogrammed areas of ROMs are usually just this value. Tell your linker to set all unused areas to 0xff; then, if an address problem shows up, the system will generate lots of spurious calls. Sure, it'll trash the stack, but since the system is seriously dead anyway, who cares? Technicians can see the characteristic double write from the call, and can infer pretty quickly that the ROM is not working.
Other CPUs have similar instructions. Browse the op code list with a creative mind.
The days of erratic single bit RAM failures are thankfully gone. Once DRAMs were subject to cosmic ray and even alpha particle problems, so designers came up with exhaustive tests that ensured no bit could interact with any other bit within each chip.
New packaging materials cured these problems once the chip vendors discovered that the plastic material used to encapsulate the silicon was one of the biggest sources of alpha particles. Now it seems most RAM failures stem from good old-fashioned electrical and logic problems.
RAMs fail outright, just as any other part does. Rarely is a single bit bad; generally the entire device, or at least some number of rows or columns, dies. (All memories are organized as matrices; each row and column includes a driver and a sense amplifier that converts the minuscule voltage from memory cells into conventional logic signals. These amplifiers do fail, causing complete loss of data from that row or column.)
Decoders die, preventing the selection of entire RAM devices. Address and data lines may not make it to the chips, or the write signal may just peter out on its way across the board.
All of these problems result in fairly massive access problems. An effective RAM test need not check every possible state of the array, as long as it tests pretty much every location. This simplification results in a huge decrease in the time a RAM test will take to run.
Clearly, any such test cannot require working RAM. In the worst case, where none of the memory works at all, a test that uses CALLs and RETURNs will simply crash horribly at the first RETURN. This has several implications:
- You cannot code the test in C. The code produced by your compiler is difficult to control, and will doubtless use plenty of CALLs, RETURNs, PUSHes, and POPs.
- The RAM test code must be very early in the program - before any more complex activity that requires a functioning stack. Interrupts always make use of the stack, so be sure these are disabled!
- The test itself cannot use subroutines, variables not in registers, or the stack.
These restrictions induce many to use only the simplest of tests. It's common to write 0x55 to each location, read and check the result, and then repeat the process using 0xaa. These two values are each other's complement, so at least every bit gets tested.
If any or all of the address lines in the system are hosed this test will still pass. Bummer, that, but since every location in RAM is set to the same value, you'll never know if you are reading location 0100 instead of 0000.
An alternative is to follow the 0x55, 0xaa test with something that picks up address problems. Try writing the low part of the address to each RAM location, over the entire array, and then reading the memory to test for correctness. For example, write 00 to 0000, 01 to 0001, 02 to 0002, etc. The address, or at least part of it, is encoded into the data, so you can be pretty sure that the RAMs decode properly.
On an 8 bit computer each location holds just one byte, so at location 0100 the pattern restarts at 00. That is, the test writes the same data to 0000, 0100, 0200, etc. Upper address line shorts may not be detected.
Again, add another test. Write the upper part of the address to each RAM location. 00 goes to 0000 through 00ff. Put 01 in 0100 to 01ff, and 02 in 0200 to 02ff.
For arrays up to 64k in length, then, running these four tests ensures that each bit works, and each cell addresses properly. The code is quite simple and easily written without using intermediate variables or the stack. The only downside is that testing large arrays can take a long time: the code writes to every location 4 times, and then reads each 4 times. Even on a lousy 64k RAM this is a half million accesses, each one burdened with all of the housekeeping code needed to sequence the comparisons.
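The four passes can be sketched in C as below. This is only a host-side model for following the logic: as noted earlier, a real power-up test can't be coded in C or use subroutines, so the production version would be hand-written assembly working entirely in registers. All of the names (`cells`, `stuck_low_mask`, `ram_write`, `ram_read`, `test_fill`, `test_address`, `ram_test_four_pass`) are hypothetical, and the "address line stuck low" fault model is an assumption added so the blind spots of each pass can be seen.

```c
#include <stdint.h>

#define RAM_SIZE 0x10000u            /* 64k array, as in the text */

uint8_t  cells[RAM_SIZE];            /* stands in for the RAM chips */
uint32_t stuck_low_mask = 0;         /* simulated fault: address lines forced to 0 */

void ram_write(uint32_t addr, uint8_t v)
{
    cells[addr & ~stuck_low_mask & 0xFFFFu] = v;
}

uint8_t ram_read(uint32_t addr)
{
    return cells[addr & ~stuck_low_mask & 0xFFFFu];
}

/* Fill all of RAM with one pattern, then read it back. Returns 1 on pass. */
int test_fill(uint8_t pattern)
{
    for (uint32_t a = 0; a < RAM_SIZE; a++)
        ram_write(a, pattern);
    for (uint32_t a = 0; a < RAM_SIZE; a++)
        if (ram_read(a) != pattern)
            return 0;
    return 1;
}

/* Write one byte of each address into its own cell, then read back.
 * shift = 0 uses the low byte of the address, shift = 8 the high byte. */
int test_address(unsigned shift)
{
    for (uint32_t a = 0; a < RAM_SIZE; a++)
        ram_write(a, (uint8_t)(a >> shift));
    for (uint32_t a = 0; a < RAM_SIZE; a++)
        if (ram_read(a) != (uint8_t)(a >> shift))
            return 0;
    return 1;
}

/* All four passes in the order described. Returns 1 only if every pass succeeds. */
int ram_test_four_pass(void)
{
    return test_fill(0x55) && test_fill(0xAA) &&
           test_address(0) && test_address(8);
}
```

In this model, setting `stuck_low_mask` to 0x0100 (A8 dead) lets both fill passes and the low-byte pass succeed; only the high-byte pass catches it. Setting it to 0x0001 (A0 dead) is caught by the low-byte pass.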
A faster test will write and read the array just once. Given that we don't expect single bit errors, there's no need to make sure we put a 0 and a 1 in each location as we did with the 0x55 and 0xaa tests.
A fast test must send a reasonable set of different values to memory to make sure that the array is really writable. It must be clever enough to detect addressing problems, a common source of trouble due to the vast number of address lines running over the circuit board, and the likelihood that one or more may be corrupt in some manner.
A very fast, very simple solution is to create a short string of almost random bytes that you repeatedly send to the array until all of memory is written. Then, read the array and compare against the original string.
I use the phrase "almost random" facetiously, but in fact it little matters what the string is, as long as it contains a variety of values. It's best to include the pathological cases, like 00, 0xaa, 0x55, and 0xff. The string is something you pick when writing the code, so it is truly not random, but other than these four specific values you can fill the rest of it with nearly any set of values, since we're just checking basic write/read functions (remember: memory tends to fail in fairly dramatic ways). I like to use very orthogonal values - those with lots of bits changing between successive string members - to create big noise spikes on the data lines.
To make sure this test picks up addressing problems, ensure the string's length is not a factor of the length of the memory array. In other words, you don't want every copy of the string to be aligned on the same low-order addresses, which might cause an address error to go undetected.
For 64k of RAM, a string 257 bytes long is perfect. 257 is prime, and its square is greater than the size of the RAM array. Each instance of the string will start on a different low order address.
257 has another special magic: you can include every byte value (00 to 0xff) in the string without effort. You can skip the actual creation of a string in ROM by producing the values as needed, incrementing a counter that overflows at 8 bits.
To summarize this algorithm: set an 8 bit counter to 0, and the start address to the beginning of RAM. Write the counter's value to RAM. Increment both the counter and the address, and repeat until 257 locations have been written. Now reset the counter to 0 and iterate until all of RAM is done.
Reset the counter to 0 and the address to the start of RAM and repeat, this time reading instead of writing, checking each memory location against the counter value.
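Here is a rough C model of the write pass and the read pass just summarized. It is a sketch for studying the algorithm on a host, not boot code: the real thing would be assembly that keeps the counter, the string position, and the address in registers. The names (`mem`, `dead_lines`, `mem_write`, `mem_read`, `ram_test_string`) are hypothetical, the stuck-address-line fault model is an assumption, and the string length is made a parameter only so a badly chosen length can be compared against 257.

```c
#include <stdint.h>

#define MEM_SIZE 0x10000u      /* 64k of RAM, as in the text */

uint8_t  mem[MEM_SIZE];        /* stands in for the RAM chips */
uint32_t dead_lines = 0;       /* simulated fault: address lines stuck at 0 */

void mem_write(uint32_t addr, uint8_t v)
{
    mem[addr & ~dead_lines & 0xFFFFu] = v;
}

uint8_t mem_read(uint32_t addr)
{
    return mem[addr & ~dead_lines & 0xFFFFu];
}

/* Write a repeating string of string_len bytes over all of RAM, then read
 * it back. The string is never stored: it comes from an 8 bit counter that
 * wraps from 0xff to 0 and is reset at the end of each string.
 * Returns -1 on pass, otherwise the first address that miscompared. */
long ram_test_string(uint32_t string_len)
{
    uint8_t  value = 0;        /* the 8 bit counter */
    uint32_t pos = 0;          /* position within the current string */

    for (uint32_t addr = 0; addr < MEM_SIZE; addr++) {
        mem_write(addr, value);
        value++;
        if (++pos == string_len) { pos = 0; value = 0; }
    }

    value = 0;
    pos = 0;
    for (uint32_t addr = 0; addr < MEM_SIZE; addr++) {
        if (mem_read(addr) != value)
            return (long)addr;
        value++;
        if (++pos == string_len) { pos = 0; value = 0; }
    }
    return -1;
}
```

With `dead_lines` set to 0x0100 (A8 stuck low) in this model, `ram_test_string(256)` passes because every copy of the string lands on the same low-order addresses, while `ram_test_string(257)` reports a miscompare at address 0001.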
Some folks skip the read and compare step, instead reading and checksumming or CRCing the data. This may be marginally faster, but you cannot tell where the failure occurred unless you stop CRCing at the end of each 257 byte block, and make the comparison there.
When speed is a major concern modify the algorithm by skipping most of memory. Instead of incrementing the address at each step, add a small prime number to the address. You'll test a lot less of the RAM so you may miss some failures, but if the prime number address offset is much smaller than the row and column sizes of the RAM chips then you'll surely pick up most common problems.
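A sketch of the sparse variant, again as a host-side C model rather than real boot code. The stride of 7, the names (`ram`, `dead_start`, `dead_end`, `ram_put`, `ram_get`, `ram_test_sparse`), and the fault model (a block of addresses that always reads back 0xff, standing in for a dead row or chip) are all assumptions for illustration.

```c
#include <stdint.h>

#define RAM_BYTES 0x10000u     /* 64k of RAM */

uint8_t  ram[RAM_BYTES];       /* stands in for the RAM chips */
uint32_t dead_start = 0;       /* simulated dead block [dead_start, dead_end): */
uint32_t dead_end   = 0;       /* reads from it always return 0xff */

void ram_put(uint32_t addr, uint8_t v)
{
    ram[addr] = v;
}

uint8_t ram_get(uint32_t addr)
{
    if (addr >= dead_start && addr < dead_end)
        return 0xFF;
    return ram[addr];
}

/* Same 257 byte counter string as the full test, but only every
 * stride-th location is written and checked.
 * Returns -1 on pass, otherwise the first address that miscompared. */
long ram_test_sparse(uint32_t stride)
{
    uint8_t  value = 0;
    uint32_t pos = 0;

    for (uint32_t addr = 0; addr < RAM_BYTES; addr += stride) {
        ram_put(addr, value);
        value++;
        if (++pos == 257u) { pos = 0; value = 0; }
    }

    value = 0;
    pos = 0;
    for (uint32_t addr = 0; addr < RAM_BYTES; addr += stride) {
        if (ram_get(addr) != value)
            return (long)addr;
        value++;
        if (++pos == 257u) { pos = 0; value = 0; }
    }
    return -1;
}
```

With a stride of 7, a dead 256 byte block is hit dozens of times and gets caught, but a dead run of four bytes can fall entirely between samples. Also worth noting about this particular sketch: two sampled addresses never differ by exactly a power of two, so a single stuck address line would slip past it.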
DRAMs have memories rather like mine - after 2 to 4 milliseconds go by they will probably forget unless external circuitry nudges them with a gentle reminder. This is known as "refreshing" the devices, and is a critical part of every DRAM-based circuit extant.
More and more processors include built-in refresh generators, but plenty of others still rely on rather complex external circuitry. Any failure in the refresh system is a disaster.
Any RAM test should pick up a refresh fault - shouldn't it? After all, it will surely take a lot longer than 2-4 msec to write out all of the test values to even a 64k array.
Unfortunately, refresh is basically the process of cycling address lines to the DRAMs. A completely dead refresh system won't show up with the test indicated, since the processor will be merrily cycling address lines like crazy as it writes and reads the devices. There's no chance the test will find the problem. This is the worst possible situation: the process of running the test camouflages the failure!
The solution is simple: after writing to all of memory, just stop toggling those pesky address lines. Run a tight do-nothing loop for a while (very tight: the more instructions you execute per iteration, the more address lines will toggle), and only then do the read test. Reads will fail if the refresh logic isn't doing its thing.
Though DRAMs are typically spec'ed at a 2-4 msec maximum refresh interval, some hold their data for surprisingly long times. When memories were smaller and cells larger, each had so much capacitance you could sometimes go for dozens of seconds without losing a bit. Today's smaller cells are less tolerant of refresh problems, so a 1 to 2 second delay is probably adequate.
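The ordering (write everything, sit idle, then read) can be sketched as follows. This is a toy host-side model with the outcome built into its assumptions: it assumes that the test's own accesses keep the cells alive, and that a pause with dead refresh wipes the array. On hardware `idle_without_access` would be a very tight do-nothing loop lasting a second or two. All names (`dram`, `refresh_works`, `idle_without_access`, `dram_test`) are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define DRAM_SIZE 0x10000u

uint8_t dram[DRAM_SIZE];       /* simulated DRAM array */
int     refresh_works = 1;     /* simulated state of the refresh circuitry */

/* Hardware: a tight delay loop that touches as few addresses as possible.
 * Simulation (assumed behavior): with refresh dead, idle cells lose their data. */
void idle_without_access(void)
{
    if (!refresh_works)
        memset(dram, 0, sizeof dram);
}

/* Write the 257 byte counter string to all of DRAM, optionally pause,
 * then read and compare. Returns 1 on pass, 0 on fail. */
int dram_test(int pause_before_read)
{
    for (uint32_t a = 0; a < DRAM_SIZE; a++)
        dram[a] = (uint8_t)(a % 257u);

    if (pause_before_read)
        idle_without_access();

    for (uint32_t a = 0; a < DRAM_SIZE; a++)
        if (dram[a] != (uint8_t)(a % 257u))
            return 0;
    return 1;
}
```

In this model a dead refresh circuit sails through `dram_test(0)` and is only caught by `dram_test(1)`.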
Capacitance causes another insidious problem that is easy to deal with: the read that follows a write to a location that doesn't exist (perhaps due to a completely dead RAM) will often return correct data! Follow the algorithm above and write all of memory before starting the read - capacitance can remember but a single value, not the complex sequence you've written.
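A small C model of that effect, under a deliberately crude assumption: no RAM responds at all, and the floating data bus simply holds whatever value was driven onto it last. The names (`bus_latch`, `bus_write`, `bus_read`, `test_write_then_read_each`, `test_write_all_then_read_all`) are hypothetical.

```c
#include <stdint.h>

#define SPAN 0x10000u          /* address range under test */

uint8_t bus_latch = 0;         /* assumed: bus capacitance holds the last value driven */

/* No RAM is present: the write goes nowhere, but it does charge the bus. */
void bus_write(uint32_t addr, uint8_t v)
{
    (void)addr;
    bus_latch = v;
}

/* Nothing drives the bus on a read, so the stale value comes back. */
uint8_t bus_read(uint32_t addr)
{
    (void)addr;
    return bus_latch;
}

/* Flawed ordering: read each location right after writing it. Returns 1 on pass. */
int test_write_then_read_each(void)
{
    for (uint32_t a = 0; a < SPAN; a++) {
        uint8_t v = (uint8_t)(a % 257u);
        bus_write(a, v);
        if (bus_read(a) != v)
            return 0;
    }
    return 1;
}

/* Ordering from the text: write all of memory, then read all of it. */
int test_write_all_then_read_all(void)
{
    for (uint32_t a = 0; a < SPAN; a++)
        bus_write(a, (uint8_t)(a % 257u));
    for (uint32_t a = 0; a < SPAN; a++)
        if (bus_read(a) != (uint8_t)(a % 257u))
            return 0;
    return 1;
}
```

With no memory present at all, the first routine reports a pass and the second reports a failure, which is the point of finishing every write before starting the reads.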
Including system tests is a good idea if, and only if, the test has more meaning than just adding an "Includes Full Diagnostics" line to the marketing blurbs. Good algorithms are as easy to implement as poor ones - just think the failure modes through carefully, before writing a lot of useless code.