RAM Tests
The best ways to check your system RAM.
In "A Day in the Life" John Lennon wrote "He blew
his mind out in a car; he didn't notice that the lights had changed." As a
technologist this always struck me as a profound statement about the complexity
of modern life. Survival in the big city simply doesn't permit even a very
human bit of daydreaming. 20th century life means keeping a level of
awareness and even paranoia that our ancestors would have found inconceivable.
Since this song's release in 1967 survival has
become predicated on much more than the threat of a couple of tons of steel
hurtling through a red light. As I write this there's some concern that a
software error in the equipment in Guam contributed to the death of more than
200 people on the Korean airliner that crashed there in early August. Perhaps a
single bit, something so ethereal that it is nothing more than the charge held
in an impossibly small well, was incorrect. Today's version of the Beatles
song might include the refrain "he didn't notice that the bit had
flipped."
Beyond software errors lurks the specter of a
hardware failure that causes our correct code to die, possibly creating horrors
similar to those in Guam, or maybe just infuriating a customer. Many of us write
diagnostic code to help contain the problem.
Keep an eye on comp.arch.embedded and you'll see,
almost like clockwork, a posting for help with RAM test algorithms. No other
diagnostic stimulates so much discussion, nor so many misguided replies.
Developers often
adhere to beliefs about the right way to test RAM that are as polarized
as disparate feelings about politics and religion. I'm no exception, and
happily have this forum for blasting my own thoughts far and wide, so I will
shamelessly do so.
Obviously, a RAM problem will destroy most embedded
systems. Errors reading from the stack will surely crash the code. Problems,
especially intermittent ones, in the data areas may manifest bugs in subtle
ways. Often you'd prefer a system that just doesn't boot to
one that occasionally returns incorrect answers.
Some embedded systems are pretty tolerant of memory
problems. We hear of NASA spacecraft from time to time whose core or RAM
develops a few bad bits, yet somehow the engineers patch their code to operate
around the faulty areas, uploading the corrections over the distances of
billions of miles.
Most of us work on systems with far less human
intervention. There are no teams of highly trained personnel anxiously monitoring
the health of each part of our products. It's our responsibility to build a
system that works properly when the hardware is functional.
In some applications, though, a certain amount of
self-diagnosis either makes sense or is required; critical life support
applications should use every diagnostic concept possible to avoid disaster due
to a sub-micron RAM imperfection.
So, my first belief about diagnostics in general, and
RAM tests in particular, is to clearly define your goals. Why run the test? What
will the result be? Who will be the unlucky recipient of the bad news in the
event an error is found, and what do you expect that person to do?
Will a RAM problem kill someone? If so, a very
comprehensive test, run regularly, is mandatory.
Is such a failure merely a nuisance? For instance, if
it keeps a cell phone from booting, and there's nothing the customer can do
about the failure anyway, then perhaps there's no reason for doing a test. As
a consumer I couldn't care less why the damn phone stopped working; if it's
dead I'll take it in for repair or replacement.
Is production test - or even engineering test - the
real motivation for writing diagnostic code? If so, then define exactly what
problems you're looking for and write code that will find those sorts of
troubles.
Next, inject a dose of reality into your evaluation.
Remember that today's hardware is often very highly integrated. In the case of
a microcontroller with on-board RAM the chances of a memory failure that
doesn't also kill the CPU are small. Again, if the system is a critical life
support application it may indeed make sense to run a test as even a minuscule
probability of a fault may spell disaster.
Does it make sense to ignore RAM failures? If your
CPU has an illegal instruction trap, there's a pretty good chance that memory
problems will cause a code crash you can capture and process. If the chip
includes protection mechanisms (like the x86 protected mode), count on bad stack
reads immediately causing protection faults your handlers can process. Perhaps
RAM tests are simply not required given these extra resources.
Inverting Bits
The USENET postings often suggest using the simplest of
tests - writing alternating 0x55 and 0xAA values to the entire memory array, and
then reading the data to ensure it remains accessible. It's a seductively easy
approach that will find an occasional problem (like, someone forgot to load all
of the RAM chips), but that detects few real world errors.
Remember that RAM is an array divided into columns
and rows. Accesses require proper chip selects and addresses sent to the array
-- and not a lot more. The 0x55/0xAA symmetrical pattern repeats massively all
over the array; accessing problems (often more common than defective bits in the
chips themselves) will create references to incorrect locations, yet almost
certainly will return what appears to be correct data.
Consider the physical implementation of memory in
your embedded system. The processor drives address and data lines to RAM - in a
16 bit system there will surely be at least 36 of these. Any short or open on
this huge bus will create bad RAM accesses. Problems with the PC board are
far more common than internal chip defects, yet the 0x55/0xAA test is singularly
poor at picking up these, the most likely, failures.
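To see why, here's a minimal sketch in C that simulates a board fault rather than testing real hardware. Everything in it is an assumption made for illustration: the 4 KB array, the `ram_write` and `ram_read` accessors, and a fault model in which address line A8 is stuck low. The alternating 0x55/0xAA test sails through despite the broken bus.

```c
#include <stdint.h>
#include <stddef.h>

#define RAM_SIZE       4096u
/* Hypothetical fault: address line A8 is shorted low, so any access
 * with A8 set lands on the location that has A8 clear. */
#define STUCK_LOW_MASK 0x0100u

static uint8_t cells[RAM_SIZE];   /* stands in for the physical RAM */

static void ram_write(size_t addr, uint8_t value)
{
    cells[addr & ~(size_t)STUCK_LOW_MASK] = value;
}

static uint8_t ram_read(size_t addr)
{
    return cells[addr & ~(size_t)STUCK_LOW_MASK];
}

/* Alternate 0x55 and 0xAA across the array, then read everything back.
 * Returns 0 if every location reads back as written, -1 otherwise. */
int alternating_test(void)
{
    for (size_t a = 0; a < RAM_SIZE; a++)
        ram_write(a, (a & 1u) ? 0xAA : 0x55);

    for (size_t a = 0; a < RAM_SIZE; a++)
        if (ram_read(a) != ((a & 1u) ? 0xAA : 0x55))
            return -1;

    return 0;
}
```

With this fault every access to an address with A8 set lands 256 bytes lower, on a cell that holds the same value, so `alternating_test()` returns 0. Of the address lines, only A0 decides which of the two values a location holds, so in this model it is the only one whose failure the test can notice.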
Yet, the
simplicity of this test and its very rapid execution have made it an old
standby used much too often. Isn't there an equally simple approach that will
pick up more problems?
If your goal is to detect the most common faults (PCB
wiring errors and chip failures more substantial than a few bad bits here or
there), then indeed there is. Create a short string of almost random bytes that
you repeatedly send to the array until all of memory is written. Then, read the
array and compare against the original string.
I use the phrase "almost random" facetiously, but
in fact it matters little what the string is, as long as it contains a variety
of values. It's best to include the pathological cases, like 0x00, 0xaa,
0x55, and 0xff. The string is something you pick when writing the code, so it is
truly not random, but other than these four specific values you fill the rest of
it with nearly any set of values, since we're just checking basic write/read
functions (remember: memory tends to fail in fairly dramatic ways). I like to
use very orthogonal values - those with lots of bits changing between successive
string members - to create big noise spikes on the data lines.
To make sure this test picks up addressing problems,
ensure the string's length is not a factor of the length of the memory array.
In other words, you don't want the string to be aligned on the same low-order
addresses, which might cause an address error to go undetected. Since the string
is much shorter than the length of the RAM array, you ensure it repeats at a
rate that is not related to the row/column configuration of the chips.
For 64k of RAM, a string 257 bytes long is perfect.
257 is prime, and its square is greater than the size of the RAM array. Each
instance of the string will start on a different low order address. 257 has
another special magic: you can include every byte value (00 to 0xff) in the
string without effort. Instead of manually creating a string in your code, build
it in real time by incrementing a counter that restarts every 257 entries and
writing its low 8 bits.
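Here's one way the 257-byte scheme might look in C, again run against a simulated array so the effect of a bad address line is visible. The names (`pattern_test`, `stuck_low_mask`, the accessors) and the fault model are my own assumptions for this sketch. I've also assumed the string is the byte values 0x00 through 0xff followed by one extra 0x00, which is what you get from a counter that restarts every 257 entries when only its low 8 bits are written.

```c
#include <stdint.h>
#include <stddef.h>

#define RAM_SIZE    65536u   /* 64k, as in the text */
#define PATTERN_LEN 257u     /* prime, so it never lines up with the array */

static uint8_t cells[RAM_SIZE];      /* stands in for the physical RAM */
static size_t stuck_low_mask = 0;    /* hypothetical: address bits forced low */

static void ram_write(size_t addr, uint8_t value)
{
    cells[addr & ~stuck_low_mask] = value;
}

static uint8_t ram_read(size_t addr)
{
    return cells[addr & ~stuck_low_mask];
}

/* Write the repeating string to ALL of RAM first, then verify it.
 * Returns 0 on success, -1 at the first mismatch. */
int pattern_test(void)
{
    unsigned k = 0;
    for (size_t a = 0; a < RAM_SIZE; a++) {
        ram_write(a, (uint8_t)k);        /* low 8 bits of the counter */
        if (++k == PATTERN_LEN) k = 0;   /* restart after 257 entries */
    }

    k = 0;
    for (size_t a = 0; a < RAM_SIZE; a++) {
        if (ram_read(a) != (uint8_t)k) return -1;
        if (++k == PATTERN_LEN) k = 0;
    }
    return 0;
}
```

With `stuck_low_mask` left at 0 the test passes. Set it to 0x0100 (A8 stuck low) or 0x8000 (A15 stuck low) and it fails, because locations that alias each other were written with different bytes of the string.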
Critical to this, and every other RAM test algorithm,
is that you write the pattern to all of RAM before doing the read test. Some
people like to do non-destructive RAM tests by testing one location at a time,
then restoring that location's value, before moving on to the next one. Do
this and you'll be unable to detect even the most trivial addressing problem.
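A short simulation in C makes the point. As before, the array, the accessors, the function name and the stuck address line (A8 low) are assumptions made for this sketch only. Each location is saved, tested with two values, and restored before moving on.

```c
#include <stdint.h>
#include <stddef.h>

#define RAM_SIZE       4096u
#define STUCK_LOW_MASK 0x0100u   /* hypothetical fault: A8 shorted low */

static uint8_t cells[RAM_SIZE];  /* stands in for the physical RAM */

static void ram_write(size_t addr, uint8_t value)
{
    cells[addr & ~(size_t)STUCK_LOW_MASK] = value;
}

static uint8_t ram_read(size_t addr)
{
    return cells[addr & ~(size_t)STUCK_LOW_MASK];
}

/* "Non-destructive" test: check one location at a time and put the
 * original value back. Returns 0 if every location passes, -1 otherwise. */
int one_at_a_time_test(void)
{
    for (size_t a = 0; a < RAM_SIZE; a++) {
        uint8_t saved = ram_read(a);

        ram_write(a, 0x55);
        if (ram_read(a) != 0x55) return -1;

        ram_write(a, 0xAA);
        if (ram_read(a) != 0xAA) return -1;

        ram_write(a, saved);     /* restore before moving on */
    }
    return 0;
}
```

The function returns 0 even though half of the array is unreachable. Each write and its read-back travel over the same broken address bus to the same cell, so the two always agree.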
This algorithm writes and reads every RAM location
once, so is quite fast. Improve the speed even more by skipping bytes, perhaps
writing and reading every 3rd or 5th entry. The test will
be a bit less robust yet will still find most PCB and many RAM failures.
Some folks like to run a test that exercises each and
every bit in their RAM array. Though I remain skeptical of the need since most
semiconductor RAM problems are rather catastrophic, if you do feel compelled to
run such a test, consider adding another iteration of the algorithm just
described, with all of the data bits inverted.
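A sketch of that two-pass version in C, this time written against a plain pointer so it could be aimed at a real memory region. The function names and the choice of XORing with a mask to invert the data are mine, not a standard API, and the 257-entry counter is the same assumption described earlier in this section.

```c
#include <stdint.h>
#include <stddef.h>

#define PATTERN_LEN 257u

/* One fill-then-verify pass. Every byte of the string is XORed with
 * `invert`: 0x00 gives the plain pattern, 0xff its complement. */
static int pattern_pass(volatile uint8_t *ram, size_t size, uint8_t invert)
{
    unsigned k = 0;
    for (size_t i = 0; i < size; i++) {
        ram[i] = (uint8_t)(k ^ invert);
        if (++k == PATTERN_LEN) k = 0;
    }

    k = 0;
    for (size_t i = 0; i < size; i++) {
        if (ram[i] != (uint8_t)(k ^ invert)) return -1;
        if (++k == PATTERN_LEN) k = 0;
    }
    return 0;
}

/* Two passes, so every bit of every location is written as both a
 * zero and a one. Returns 0 on success, -1 on the first failure. */
int bit_coverage_test(volatile uint8_t *ram, size_t size)
{
    if (pattern_pass(ram, size, 0x00) != 0) return -1;
    return pattern_pass(ram, size, 0xff);
}
```

The test is destructive: when it returns, the region holds the inverted pattern. On real hardware the code and its stack must live somewhere other than the region under test.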
Detailed Diagnostics
Sometimes, though, you'll want a more thorough test,
something that looks for difficult hardware problems at the expense of speed.
When I speak to groups I'll often ask "what makes
you think the hardware really
works?" The response is usually a shrug of the shoulders, or an off-the-cuff
remark about everything seeming to function properly, more or less, most of the
time.
These qualitative responses are simply not adequate
for today's complex systems. All too often a prototype that seems perfect
harbors hidden design faults that may only surface after you've built a
thousand production units. Recalling products due to design bugs is unfair to
the customer and possibly a disaster to your company.
Assume the design is absolutely ridden with problems.
Use reasonable methodologies to find the bugs before building the first
prototype, but then use that first unit as a testbed to find the rest of the
latent troubles.
Large arrays of RAM are a constant source of
reliability problems. It's indeed quite difficult to design the perfect RAM
system, especially with the minimal margins and high speeds of today's 16 and
32 bit systems. If your system uses more than a couple of RAM parts, count on
spending some time qualifying its reliability via the normal hardware diagnostic
procedures. Create software RAM tests that hammer the array mercilessly.
Probably one of the most common forms of reliability
problems with RAM arrays is pattern sensitivity. Now, these are not the famous
pattern problems of yore, where the chips (particularly DRAMs) were sensitive to
the groupings of ones and zeroes. Today the chips are just about perfect in this
regard. No, today pattern problems come from poor electrical characteristics of
the PC board, decoupling problems, electrical noise, and inadequate drive
electronics.
PC boards were once nothing more than wiring
platforms, slabs of tracks that propagated signals with near perfect fidelity.
With very high speed signals, and edge rates (the time it takes a signal to go
from a zero to a one or back) under a nanosecond, the PCB itself assumes all of
the characteristics of an electronic component - one whose virtues are almost
all problematic. It's a big subject (refer to "High Speed Digital
Design: A Handbook of Black Magic" by Howard Johnson and Martin Graham, 1993,
PTR Prentice Hall, NJ, for the canonical words of wisdom on this subject), but
suffice to say a poorly designed PCB will create RAM reliability problems.
Equally important are the decoupling capacitors
chosen, as well as their placement. Inadequate decoupling will create
reliability problems as well.
Modern DRAM arrays are massively capacitive. Each
address line might drive dozens of chips, with 5 to 10 pF of loading per chip.
At high speeds the drive electronics must somehow drag all of these
pseudo-capacitors up and down with little signal degradation. Not an easy job!
Again, poorly designed drivers will make your system unreliable.
Electrical noise is another reliability culprit,
sometimes in unexpected ways. For instance, CPUs with multiplexed address/data
buses use external address latches to demux the bus. A signal, usually named ALE
(Address Latch Enable) or AS (Address Strobe), drives the clock to these latches.
The tiniest, most miserable amount of noise on ALE/AS will surely, at the time
of maximum inconvenience, latch the data part of the cycle instead of the
address. Other signals are also vulnerable to small noise spikes.
Many run-of-the-mill RAM tests, run for several hours
as you cycle the product through its design environment (temperature, etc.),
will show intermittent RAM problems. These are symptoms of the design faults
I've described, and always show a need for more work on the product's
engineering.
Unhappily, all too often the RAM tests show no
problem when hidden demons are indeed lurking. The algorithm I've described,
as well as most of the others commonly used, trade off speed against
comprehensiveness. They don't pound on the hardware in a way designed to find
noise and timing problems.
Digital systems are most susceptible to noise when
large numbers of bits change all at once. This fact was exploited for data
communications long ago with the invention of the Gray Code, a variant of binary
counting, where no more than one bit changes between codes. Your worst
nightmares of RAM reliability occur when all of the address and/or data bits
change suddenly from zeroes to ones.
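A few lines of C put numbers on this. The helper names here are made up for illustration; the conversion itself is the usual reflected Gray code, `n ^ (n >> 1)`. Counting the bits that differ between two words shows how violent an ordinary binary step can be, and how gentle the same step is in Gray code.

```c
#include <stdint.h>

/* Convert a binary count to its reflected Gray code. */
uint16_t to_gray(uint16_t n)
{
    return (uint16_t)(n ^ (n >> 1));
}

/* Number of bit positions that differ between two 16-bit words,
 * i.e. how many lines must switch to go from one to the other. */
int bits_changed(uint16_t a, uint16_t b)
{
    uint16_t x = (uint16_t)(a ^ b);
    int count = 0;

    while (x) {
        count += (int)(x & 1u);
        x >>= 1;
    }
    return count;
}
```

Stepping from 0x00ff to 0x0100 in plain binary switches nine lines at once, while the Gray codes for those two counts differ in a single bit. Going from 0x0000 to 0xffff switches all sixteen, which is as bad as a 16-bit bus gets.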
For the sake of engineering testing, write RAM test
code that exploits this known vulnerability. Write 0xffff to address 0x0000 and
then to address 0xffff, and do a read-back test. Then write zeroes. Repeat as
fast as your loop will let you go.
Depending on your CPU, the worst locations might be
at 0x00ff and 0x0100, especially on 8 bit processors that multiplex just the
lower 8 address lines. Hit these combinations, hard, as well.
Other addresses often exhibit similar pathological
behavior. Try 0x5555 and 0xaaaa, which also have complementary bit patterns.
The trick is to write these patterns back-to-back. Don't
just test all of RAM on the assumption that both 0x0000 and 0xffff will show up
somewhere in the test. You'll stress the system most effectively by driving the bus
massively up and down all at once.
Don't even think about writing this sort of code in
C. Any high level language will inject too many instructions between those that
move the bits up and down. Even in assembly the processor will have to do fetch
cycles from wherever the code happens to be, which will slow down the pounding
and make it a bit less effective.
There are some tricks, though. On a CPU with a
prefetcher (all x86, 68k, etc.) try to fill the execution pipeline with code, so
the processor does back-to-back writes or reads at the addresses you're trying
to hit. And, use memory-to-memory transfers when possible. For example:
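mov si,0xaaaa        ; source address
mov di,0x5555        ; destination address (complementary bit pattern)
mov byte [si],0xff   ; write all ones to the source
movsb                ; x86 has no mov mem,mem; movsb copies [ds:si] to [es:di] (assumes es = ds)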
The Moral
As with most design decisions, before writing RAM test code
question your motivations deeply and select a testing strategy that
makes sense for your application. Trade off speed against test
comprehensiveness to meet your goals.
Possibly the hardest decision is what to do when a
failure crops up. That, though, is a subject for another column.