Thanks for the Memories
Here's some advice about testing RAM and ROMs
in your embedded system.
It doesn't take much to make at least the kernel of an embedded
system run. With a working CPU chip, memories that do their thing,
perhaps a dash of decoder logic, you can count on the code starting
off... perhaps not crashing until running into a problem with
I/O.
Though the kernel may be relatively simple, with the exception
of the system's power supply it's by far the most intolerant portion
of an embedded system to any sort of failure. The tiniest glitch,
a single bit failure in a huge memory array, or any problem with
the processor pretty much guarantees that nothing in the system
stands a chance of running.
Non-kernel failures may not be so devastating. Some I/O troubles
will cause just part of the system to degrade, leaving much of
the rest up. My car's black box seems to have forgotten how to
run the cruise control, yet it still keeps the fuel injection
and other systems running.
In the minicomputer era most machines booted with a CPU test that checked
each instruction. That level of paranoia is no longer appropriate,
as a highly integrated CPU will generally fail disastrously. If
the processor can execute any sort of a self test, it's pretty
much guaranteed to be intact.
Dead decoder logic is just as catastrophic. No code will execute
if the ROMs can't be selected.
A smart technician can spot a dead decoder in a heartbeat using
not much more than a scope. He can make a pretty good guess that
the processor is history by looking for bizarre outputs (no clock-out;
no read/write; tristated address lines right after reset), or
by "shotgunning": replacing the chip with a known-good
one and seeing if the problems disappear.
Large memory arrays, though, can suffer from partial failures
that are just about impossible to troubleshoot. A defective RAM
is tough to find by any method other than shotgunning. A handful
of bad locations in ROM are equally difficult to detect.
Lots of designers realize that memories are a potential source
of trouble, so include diagnostics in the firmware. Good idea!
Given that there's no realistic way for a technician to find a
memory problem, a little software designed to pick up these will
sure make you friends in the test department.
Testing ROM
If your boot ROM is totally misprogrammed or otherwise non-functional,
then there's no way a ROM test will do anything other than crash.
The value of a ROM test is limited to dealing with partially programmed
devices (due, perhaps, to incomplete erasure, or inadvertently
removing the device before completion of programming).
There's a small chance that ROM tests will pick up an addressing
problem, if you're lucky enough to have a failure that leaves
the boot and ROM test working. The odds are against it, and somehow
Mother Nature tends to be very perverse.
Some developers feel that a ROM checksum makes sense to ensure
the correct device is inserted. This works only if the checksum
is stored outside of the ROM under test. Otherwise, inserting
a device with the wrong code version will not show an error, as
presumably the code will match the (also obsolete) checksum.
In multiple-ROM systems a checksum test can indeed detect misprogrammed
devices, assuming the test code lives in the boot ROM. If this
one device functions, and you write the code so that it runs without
relying on any other ROM, then the test will pick up many errors.
Checksums, though, are passé. It's pretty easy for a couple
of errors to cancel each other out. Compute a CRC (Cyclic Redundancy
Check) instead: a polynomial division done with a shift register
that has terms fed back at various stages. CRCs
are notoriously misunderstood but are really quite easy to implement.
The best reference I have seen to date is "A Painless Guide
to CRC Error Detection Algorithms", by Ross Williams. It's
available via anonymous FTP from ftp.adelaide.edu.au/pub/rocksoft/crc_v3.txt.
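To see why a CRC beats a simple sum, here is a minimal sketch in Python.
It is a host-side illustration only, not boot code. It assumes the common
CRC-16/CCITT polynomial (0x1021) with an initial value of 0xFFFF; the
function names and the 256-byte sample image are made up for the example.

```python
def additive_checksum(data):
    """Sum of all bytes, truncated to 16 bits."""
    return sum(data) & 0xFFFF


def crc16_ccitt(data, crc=0xFFFF):
    """Bitwise CRC-16/CCITT (polynomial 0x1021), most significant bit first."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# Hypothetical ROM image, then a copy with two single-bit errors that
# cancel in the sum: one byte goes up by 1, another goes down by 1.
good = bytes(range(256))
bad = bytearray(good)
bad[10] += 1   # 0x0A -> 0x0B
bad[21] -= 1   # 0x15 -> 0x14
bad = bytes(bad)

print(additive_checksum(good) == additive_checksum(bad))  # True: the sum is fooled
print(crc16_ccitt(good) == crc16_ccitt(bad))              # False: the CRC differs
```

A table-driven version runs faster, but the bit-at-a-time loop needs no
storage beyond a register or two, which suits a boot-time test.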
It's not a bad idea to add death traps to your ROM. On a Z80, 0xff
is RST 38h, a one-byte call to location 0x0038. Conveniently, unprogrammed areas of
ROMs are usually just this value. Tell your linker to set all
unused areas to 0xff; then, if an address problem shows up, the
system will generate lots of spurious calls. Sure, it'll trash
the stack, but since the system is seriously dead anyway, who
cares? Technicians can see the characteristic double write from
the call, and can infer pretty quickly that the ROM is not working.
Other CPUs have similar instructions. Browse the op code list
with a creative mind.
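Many linkers and hex utilities have a fill option; check yours first. If
the tools won't cooperate, the image can be padded after the fact. A
minimal Python sketch, assuming the code sits at the start of the device
and everything after it is unused (pad_rom_image and its parameters are
names invented for this example):

```python
def pad_rom_image(code, rom_size, fill=0xFF):
    """Return code extended to rom_size bytes, with unused space set to fill."""
    if len(code) > rom_size:
        raise ValueError("code does not fit in the ROM")
    return bytes(code) + bytes([fill]) * (rom_size - len(code))


# Three bytes standing in for real code, padded out to a 32k device.
image = pad_rom_image(b"\xC3\x00\x01", 32 * 1024)
print(len(image), hex(image[3]), hex(image[-1]))
```

On a CPU whose trap opcode is something other than 0xff, pass that value
as fill instead.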
Testing RAM
The days of erratic single bit RAM failures are thankfully gone.
Once DRAMs were subject to cosmic ray and even alpha particle
problems, so designers came up with exhaustive tests that ensured
no bit could interact with any other bit within each chip.
New packaging materials cured these problems once the chip vendors
discovered that the plastic material used to encapsulate the silicon
was one of the biggest sources of alpha particles. Now it seems
most RAM failures stem from good old-fashioned electrical and
logic problems.
RAMs fail outright, just as any other part does. Rarely is a single
bit bad; generally the entire device, or at least some number of
rows or columns, die. (All memories are organized as matrices;
each row and column includes a driver and a sense amplifier that
converts the minuscule voltage from memory cells into conventional
logic signals. These amplifiers do fail, causing complete loss of
data from that row or column.)
Decoders die, preventing the selection of entire RAM devices.
Address and data lines may not make it to the chips, or the write
signal may just peter out on its way across the board.
All of these problems result in fairly massive access problems.
An effective RAM test need not check every possible state of the
array, as long as it tests pretty much every location. This simplification
results in a huge decrease in the time a RAM test will take to
run.
Clearly, any such test cannot require working RAM. In the worst
case, where none of the memory works at all, a test that uses
CALLs and RETURNs will simply crash horribly at the first RETURN.
This has several implications:
- You cannot code the test in C. The code produced by your compiler
is difficult to control, and will doubtless use plenty of CALLs,
RETURNs, PUSHes, and POPs.
- The RAM test code must be very early in the program - before
any more complex activity that requires a functioning stack. Interrupts
always make use of the stack, so be sure these are disabled!
- The test itself cannot use subroutines, variables not in registers,
or the stack.
These restrictions induce many to use only the simplest of tests.
It's common to write 0x55 to each location, read and check the
result, and then repeat the process using 0xaa. These two values
are each other's complement, so at least every bit gets tested.
If any or all of the address lines in the system are hosed, this
test will still pass. Bummer, that, but since every location in
RAM is set to the same value, you'll never know if you are reading
location 0100 instead of 0000.
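A quick simulation makes the blind spot obvious. The Python below is a
host-side model, not the real test (which, per the restrictions above,
has to be register-only assembly). SimRam is a toy invented for this
example: it can hold an address line low or kill a data bit.

```python
class SimRam:
    """Toy byte-wide RAM. Address lines in stuck_low always act as 0;
    data bits in dead_bits always store as 0."""

    def __init__(self, size, stuck_low=0, dead_bits=0):
        self.cells = bytearray(size)
        self.addr_mask = ~stuck_low
        self.data_mask = ~dead_bits & 0xFF

    def write(self, addr, value):
        self.cells[addr & self.addr_mask] = value & self.data_mask

    def read(self, addr):
        return self.cells[addr & self.addr_mask]


def pattern_test(ram, size):
    """Fill with 0x55 and verify, then fill with 0xAA and verify.
    Returns the first failing address, or None if everything matched."""
    for pattern in (0x55, 0xAA):
        for addr in range(size):
            ram.write(addr, pattern)
        for addr in range(size):
            if ram.read(addr) != pattern:
                return addr
    return None


SIZE = 0x1000
print(pattern_test(SimRam(SIZE), SIZE))                    # None: healthy RAM
print(pattern_test(SimRam(SIZE, dead_bits=0x04), SIZE))    # 0: bad data bit caught
print(pattern_test(SimRam(SIZE, stuck_low=0x0100), SIZE))  # None: A8 fault missed
```

With A8 held low, location 0100 really is location 0000, yet both hold
the same pattern, so the comparison never notices.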
An alternative is to follow the 0x55, 0xaa test with something
that picks up address problems. Try writing the low part of the
address to each RAM location, over the entire array, and then
reading the memory to test for correctness. For example, write
00 to 0000, 01 to 0001, 02 to 0002, etc. The address, or at least
part of it, is encoded into the data, so you can be pretty sure
that the RAMs decode properly.
On an 8 bit computer each location holds just one byte, so at
location 0100 the pattern restarts at 00. That is, the test writes
the same data to 0000, 0100, 0200, etc. Upper address line shorts
may not be detected.
Again, add another test. Write the upper part of the address to
each RAM location. 00 goes to 0000 through 00ff. Put 01 in 0100
to 01ff, and 02 in 0200 to 02ff.
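Here is how the two address tests divide the work, again as a host-side
Python model rather than real boot code. SimRam and address_test are
names invented for this sketch; shift=0 stores the low byte of the
address, shift=8 the high byte.

```python
class SimRam:
    """Toy byte-wide RAM; address lines in stuck_low always act as 0."""

    def __init__(self, size, stuck_low=0):
        self.cells = bytearray(size)
        self.addr_mask = ~stuck_low

    def write(self, addr, value):
        self.cells[addr & self.addr_mask] = value & 0xFF

    def read(self, addr):
        return self.cells[addr & self.addr_mask]


def address_test(ram, size, shift):
    """Write one byte of each address into that location, then read it
    all back. Returns the first failing address, or None."""
    for addr in range(size):
        ram.write(addr, (addr >> shift) & 0xFF)
    for addr in range(size):
        if ram.read(addr) != (addr >> shift) & 0xFF:
            return addr
    return None


SIZE = 0x1000
a3_bad = 0x0008   # low-order address line stuck
a8_bad = 0x0100   # high-order address line stuck
print(address_test(SimRam(SIZE, a3_bad), SIZE, shift=0))  # 0: low test catches A3
print(address_test(SimRam(SIZE, a3_bad), SIZE, shift=8))  # None: high test misses it
print(address_test(SimRam(SIZE, a8_bad), SIZE, shift=0))  # None: low test misses A8
print(address_test(SimRam(SIZE, a8_bad), SIZE, shift=8))  # 0: high test catches A8
```

Neither pass is enough alone, which is why both are run.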
For arrays up to 64k in length, then, running these four tests
ensures that each bit works, and each cell addresses properly.
The code is quite simple and easily written without using intermediate
variables or the stack. The only downside is that testing large
arrays can take a long time: the code writes to every location
4 times, and then reads each 4 times. Even on a lousy 64k RAM
this is a half million accesses (8 x 65,536 = 524,288), each one
burdened with all of the housekeeping code needed to sequence the
comparisons.
A faster test will write and read the array just once. Given that
we don't expect single bit errors, there's no need to make sure
we put a 0 and a 1 in each location as we did with the 0x55 and
0xaa tests.
A fast test must send a reasonable set of different values to
memory to make sure that the array is really writable. It must
be clever enough to detect addressing problems, a common source of
trouble due to the vast number of address lines running over the
circuit board, and the likelihood that one or more may be corrupt
in some manner.
A very fast, very simple solution is to create a short string
of almost random bytes that you repeatedly send to the array until
all of memory is written. Then, read the array and compare against
the original string.
I use the phrase "almost random" facetiously, but in
fact it little matters what the string is, as long as it contains
a variety of values. It's best to include the pathological
cases, like 00, 0xaa, 0x55, and 0xff. The string is something
you pick when writing the code, so it is truly not random, but
other than these four specific values you fill the rest of it
with nearly any set of values, since we're just checking basic
write/read functions (remember: memory tends to fail in fairly
dramatic ways). I like to use very orthogonal values - those with
lots of bits changing between successive string members - to create
big noise spikes on the data lines.
To make sure this test picks up addressing problems, ensure the
string's length is not a factor of the length of the memory array.
In other words, you don't want the string to be aligned on the
same low-order addresses, which might cause an address error to
go undetected.
For 64k of RAM, a string 257 bytes long is perfect. 257 is prime,
and its square is greater than the size of the RAM array. Each
instance of the string will start on a different low order address.
257 has another special magic: you can include every byte value
(00 to 0xff) in the string without effort. You can skip the actual
creation of a string in ROM by producing the values as needed,
incrementing a counter that overflows at 8 bits.
To summarize this algorithm: set an 8 bit counter to 0, and the
start address to the beginning of RAM. Write the counter's value
to RAM. Increment it, and repeat until 257 locations have been written.
Now reset the counter to 0 and iterate until all of RAM is done.
Reset the counter to 0 and the address to the start of RAM and
repeat, this time reading instead of writing, checking each memory
location against the counter value.
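The summary above maps onto very little code. What follows is a Python
model for trying the idea on a desk machine, not the real thing: the
production version would be a handful of assembly instructions with the
counter and address held in registers. SimRam, pattern_bytes, and
fast_ram_test are names invented for this sketch.

```python
PATTERN_LENGTH = 257   # prime, so it never divides a power-of-two array size


class SimRam:
    """Toy byte-wide RAM; address lines in stuck_low always act as 0."""

    def __init__(self, size, stuck_low=0):
        self.cells = bytearray(size)
        self.addr_mask = ~stuck_low

    def write(self, addr, value):
        self.cells[addr & self.addr_mask] = value & 0xFF

    def read(self, addr):
        return self.cells[addr & self.addr_mask]


def pattern_bytes(count):
    """Yield count test bytes from an 8-bit counter that is forced back
    to 0 after every 257 values (each run is 0..255 followed by one 0)."""
    counter = 0
    emitted = 0
    for _ in range(count):
        yield counter
        counter = (counter + 1) & 0xFF
        emitted += 1
        if emitted == PATTERN_LENGTH:
            counter = 0
            emitted = 0


def fast_ram_test(ram, size):
    """Write the whole array once, then read it once and compare.
    Returns the first failing address, or None."""
    for addr, value in enumerate(pattern_bytes(size)):
        ram.write(addr, value)
    for addr, expected in enumerate(pattern_bytes(size)):
        if ram.read(addr) != expected:
            return addr
    return None


SIZE = 0x10000
print(fast_ram_test(SimRam(SIZE), SIZE))                    # None: healthy RAM
print(fast_ram_test(SimRam(SIZE, stuck_low=0x0100), SIZE))  # 1: A8 fault caught
```

Because the pattern slips by one byte every 256 locations, two addresses
that alias each other almost never hold the same value, and a single
pass exposes the fault.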
Some folks skip the read and compare step, instead reading and
checksumming or CRCing the data. This may be marginally faster,
but you cannot tell where the failure occurred unless you stop
CRCing at the end of each 257 byte block, and make the comparison
there.
When speed is a major concern modify the algorithm by skipping
most of memory. Instead of incrementing the address at each step,
add a small prime number to the address. You'll test a lot less
of the RAM so may potentially miss some failures, but if the prime
number address offset is much smaller than the row and column
sizes of the RAM chips then you'll surely pick up most common
problems.
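A sparse version might look like the following Python model (host-side
illustration only; the stride of 7, the SimRam class with its dead
region, and the function name are all assumptions made for this sketch).

```python
class SimRam:
    """Toy byte-wide RAM; addresses in dead_range always read back 0xFF."""

    def __init__(self, size, dead_range=range(0)):
        self.cells = bytearray(size)
        self.dead_range = dead_range

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        return 0xFF if addr in self.dead_range else self.cells[addr]


def sparse_ram_test(ram, size, stride=7):
    """Visit every stride-th address, writing an 8-bit counter that
    restarts after 257 values; then revisit the same addresses and
    compare. Returns the first failing address, or None."""
    for step, addr in enumerate(range(0, size, stride)):
        ram.write(addr, (step % 257) & 0xFF)
    for step, addr in enumerate(range(0, size, stride)):
        if ram.read(addr) != (step % 257) & 0xFF:
            return addr
    return None


SIZE = 0x1000
dead_row = range(0x0400, 0x0440)   # a 64-byte stretch that has died
print(sparse_ram_test(SimRam(SIZE), SIZE))            # None: healthy RAM
print(sparse_ram_test(SimRam(SIZE, dead_row), SIZE))  # 1029: first visit in the dead row
```

One caveat from playing with this model: a single stuck address line
slips through, since two visited addresses never differ by an exact
power of two when the stride is an odd prime. Treat the sparse pass as a
check for dead devices, rows, and columns.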
Other Gotchas
DRAMs have memories rather like mine - after 2 to 4 milliseconds
go by they will probably forget unless external circuitry nudges
them with a gentle reminder. This is known as "refreshing"
the devices, and is a critical part of every DRAM-based circuit
extant.
More and more processors include built-in refresh generators,
but plenty of others still rely on rather complex external circuitry.
Any failure in the refresh system is a disaster.
Any RAM test should pick up a refresh fault - shouldn't it? After
all, it will surely take a lot longer than 2-4 msec to
write out all of the test values to even a 64k array.
Unfortunately, refresh is basically the process of cycling address
lines to the DRAMs. A completely dead refresh system won't show
up with the test indicated, since the processor will be merrily
cycling address lines like crazy as it writes and reads the devices.
There's no chance the test will find the problem. This is the
worst possible situation: the process of running the test camouflages
the failure!
The solution is simple: after writing to all of memory, just stop
toggling those pesky address lines for a while. Run a tight do-nothing
loop (very tight: the more instructions you execute per iteration,
the more address lines will toggle), and only then do the read test.
Reads will fail if the refresh logic isn't doing its thing.
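The ordering is what matters: write everything, go quiet, then read. A
toy Python model shows the difference. ToyDram is invented for this
sketch and is crude on purpose: with refresh dead, its idle() method
simply clears every cell, standing in for charge leaking away while no
address lines toggle.

```python
class ToyDram:
    """Toy DRAM. Normal accesses keep the cells alive (as the CPU's own
    address activity does); only an idle period exposes dead refresh."""

    def __init__(self, size, refresh_works):
        self.cells = bytearray(size)
        self.refresh_works = refresh_works

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        return self.cells[addr]

    def idle(self):
        """Stand-in for the tight do-nothing loop."""
        if not self.refresh_works:
            self.cells = bytearray(len(self.cells))   # everything forgotten


def ram_test(ram, size, pause):
    """Write a 257-byte repeating pattern, optionally pause, then verify.
    Returns the first failing address, or None."""
    for addr in range(size):
        ram.write(addr, (addr % 257) & 0xFF)
    if pause:
        ram.idle()
    for addr in range(size):
        if ram.read(addr) != (addr % 257) & 0xFF:
            return addr
    return None


SIZE = 1024
print(ram_test(ToyDram(SIZE, refresh_works=False), SIZE, pause=False))  # None: fault hidden
print(ram_test(ToyDram(SIZE, refresh_works=False), SIZE, pause=True))   # 1: fault exposed
print(ram_test(ToyDram(SIZE, refresh_works=True), SIZE, pause=True))    # None: healthy
```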
Though DRAMs are typically spec'ed at a 2-4 msec maximum refresh
interval, some hold their data for surprisingly long times. When
memories were smaller and cells larger, each had so much capacitance
you could sometimes go for dozens of seconds without losing a
bit. Today's smaller cells are less tolerant of refresh problems,
so a 1 to 2 second delay is probably adequate.
Capacitance causes another insidious problem that is easy to deal
with: the read that follows a write to a location that doesn't
exist (perhaps due to a completely dead RAM) will often return
correct data! Follow the algorithm above and write all of memory
before starting the read - capacitance can remember but a single
value, not the complex sequence you've written.
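The effect is easy to model in Python. MissingRam below is a toy invented
for this sketch: it stores nothing at all, but "remembers" the last
value driven onto the data lines, the way bus capacitance does.

```python
class MissingRam:
    """Toy model of a dead or absent RAM: the only memory is whatever
    value was last driven onto the data bus."""

    def __init__(self):
        self.bus = 0xFF

    def write(self, addr, value):
        self.bus = value & 0xFF

    def read(self, addr):
        return self.bus


def write_then_read_each(ram, size):
    """Bad idea: verify each location immediately after writing it."""
    for addr in range(size):
        value = (addr % 257) & 0xFF
        ram.write(addr, value)
        if ram.read(addr) != value:
            return addr
    return None


def write_all_then_read(ram, size):
    """Better: finish every write before starting the reads."""
    for addr in range(size):
        ram.write(addr, (addr % 257) & 0xFF)
    for addr in range(size):
        if ram.read(addr) != (addr % 257) & 0xFF:
            return addr
    return None


SIZE = 1024
print(write_then_read_each(MissingRam(), SIZE))   # None: a missing RAM "passes"
print(write_all_then_read(MissingRam(), SIZE))    # 0: failure caught at once
```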
Including system tests is a good idea if, and only if, the test
has more meaning than just adding an "Includes Full Diagnostics"
line to the marketing blurbs. Good algorithms are as easy to implement
as poor ones - just think the failure modes through carefully,
before writing a lot of useless code.