Non-Volatile RAM
How to build a back-up circuit. Originally in Embedded Systems Programming,
April, 1999.
Many of the embedded systems that run our lives try
to remember a little bit about us, or about their application domain, despite
cycling power, brownouts, and all of the other perils of fixed and mobile
operation. In the bad old days before microprocessors we had core memory, a
magnetic medium that preserved its data whether powered or not.
Today we face a wide range of choices. Sometimes Flash or
EEPROM is the natural choice for non-volatile applications. Always remember,
though, that these devices have limited numbers of write cycles. Worse, in some
cases writes can be very slow.
Battery-backed up RAMs still account for a large
percentage of non-volatile systems. With robust hardware and software support
they'll satisfy the most demanding of reliability fanatics; a little less
design care is sure to result in occasional lost data.
Supervisory Circuits
In the early embedded days we were mostly blissfully
unaware of the perils of losing power. Virtually all reset circuits were nothing
more than a resistor/capacitor time constant. As Vcc ramped from 0 to 5 volts,
the time constant held the CPU's reset input low - or lowish - long enough for
the system's power supply to stabilize at 5 volts.
Though an elegantly simple design, RC time constants
were flawed on the back end, when power goes away. Turn the wall switch off, and
the 5 volt supply quickly decays to zero. Quickly only in human terms, of
course, as many milliseconds go by while the CPU is powered by something
between 0 and 5 volts. The RC circuit is, of course, at this point still at a
logic one (not-reset), so it allows the processor to run.
And run they do! With Vcc down to 3 or 4 volts most
processors execute instructions like mad. Just not the ones you'd like to see.
Run a CPU with out-of-spec power and expect random operation. There's a good
chance the machine is going wild, maybe pushing and calling and writing and
generally destroying the contents of your battery backed up RAM.
Worse, brown-outs, the plague of summer air
conditioning, often cause small dips in voltage. If the AC mains decline to 80
volts for a few seconds a power supply might still crank out a few volts. When
AC returns to full rated values the CPU is still running, back at 5 volts, but
now horribly confused. The RC circuit never notices the dip from 5 to 3 or so
volts, so the poor CPU continues running in its mentally unbalanced state.
Again, your RAM is at risk.
Motorola, Maxim, and others developed many ICs
designed specifically to combat these problems. Though features and specs vary,
these supervisory circuits typically manage the processor's reset line,
battery power to the RAM, and the RAM's chip selects.
Given that no processor will run reliably outside of
its rated Vcc range, the first function of these chips is to assert reset
whenever Vcc falls below about 4.7 volts (on 5 volt logic). Unlike an RC circuit
which limply drools down as power fails, supervisory devices provide a snappy
switch between a logic zero and one, bringing the processor to a sure, safe
stopped condition.
They also manage the RAM's power, a tricky problem
since it's provided from the system's Vcc when power is available, and from
a small battery during quiescent periods. The switchover is instantaneous to
keep data intact.
With RAM safely provided with backup power and the
CPU driven into a reset state, a decent supervisory IC will also disable all
chip selects to the RAM. The reason? At some point after Vcc collapses you
can't even be sure the processor, and your decoding logic, will not create
rogue RAM chip selects. Supervisory ICs are analog beasts, conceived outside of
the domain of discrete ones and zeroes, and will maintain safe reset and chip
select outputs even when Vcc is gone.
But check the specs on the IC. Some disable chip
selects at exactly the same time they assert reset, asynchronously to what the
processor is actually doing. If the processor initiates a write to RAM, and a
nanosecond later the supervisory chip asserts reset and disables chip select,
that write cycle will be one nanosecond long. You
cannot play with write timing and expect predictable results. Allow any
write in progress to complete before doing something as catastrophic as a reset.
Some of these chips also assert an NMI output when
power starts going down. Use this to invoke your "oh_my_god_we're_dying"
routine.
(Since processors usually offer but a single NMI
input, never connect any other NMI source when using a supervisory circuit.
Otherwise you'll need to combine the two signals somehow; doing so with logic
is a disaster, since the gates will surely go brain dead due to Vcc starvation).
Check the specs on the parts, though, to ensure that
NMI occurs before the reset clamp
fires. Give the processor a handful of microseconds to respond to the interrupt
before it enters the idle state.
There's a subtle reason why it makes sense to have
an NMI power-loss handler: you want to get the CPU away from RAM. Stop it from
doing RAM writes before reset occurs.
If reset happens in the middle of a write cycle, there's no telling what will
happen to your carefully protected RAM array. Hitting NMI first causes the CPU
to take an interrupt exception, first finishing the current write cycle if any.
This also, of course, eliminates troubles caused by chip selects that disappear
at the same instant as reset.
Every battery-backed up system should use a decent
supervisory circuit; you just cannot expect reliable data retention otherwise.
Yet, these parts are no panacea. The firmware itself is almost certainly doing
things destined to defeat any bit of external logic.
Multi-byte Writes
Steve Lund wrote recently about a very subtle failure mode
that afflicts all too many battery-backed up systems. He observed that in a
kinder, gentler world than the one we inhabit all memory transactions would
require exactly one machine cycle, but here on Earth 8 and 16 bit machines
constantly manipulate large data items. Floating point variables are typically
32 bits, so any store operation requires two or four distinct memory writes.
Ditto for long integers.
The use of high level languages accentuates the size of
memory stores. Setting a character array, or defining a big structure, means
that the simple act of assignment might require tens or hundreds of writes.
Consider the simple statement:
a=0x12345678;
A 16-bit x86 compiler will typically generate code like:
mov word ptr [bx], 5678h
mov word ptr [bx+2], 1234h
which is perfectly reasonable and seemingly robust.
In a system with a heavy interrupt burden it's
likely that sooner or later an interrupt will switch CPU contexts between the
two instructions, leaving the variable "a" half-changed, in what is possibly
an illegal state. This serious problem is easily defeated by avoiding global
variables - as long as "a" is a local, no other task will ever try to use it
in the half-changed state.
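To make the half-changed state concrete, here is a minimal C sketch that models the two 16-bit stores on a host machine. The type split32_t and the functions store32 and load32 are hypothetical names invented for this illustration, and the writes_allowed parameter only simulates an interruption; none of it comes from a real compiler or library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 32-bit variable on a 16-bit machine:
   two separately written 16-bit halves, low word first. */
typedef struct {
    uint16_t lo;
    uint16_t hi;
} split32_t;

/* Store 'value' as two 16-bit writes. 'writes_allowed' models how many of
   the writes complete before an interrupt or power loss (2 = uninterrupted). */
static void store32(split32_t *dst, uint32_t value, int writes_allowed)
{
    assert(writes_allowed >= 0 && writes_allowed <= 2);
    if (writes_allowed >= 1)
        dst->lo = (uint16_t)(value & 0xFFFFu);   /* first mov: low word   */
    if (writes_allowed >= 2)
        dst->hi = (uint16_t)(value >> 16);       /* second mov: high word */
}

/* Reassemble the 32-bit value the way another task would read it. */
static uint32_t load32(const split32_t *src)
{
    return ((uint32_t)src->hi << 16) | src->lo;
}
```

If the variable previously held 0xAABBCCDD and only the first store completes, load32 returns 0xAABB5678, a value nobody ever assigned.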
Power-down concerns twist the problem in a more
intractable manner. As Vcc dies off a seemingly well designed system will
generate NMI while the processor can still think clearly. If that interrupt
occurs during one of these multi-byte writes - as it eventually surely will,
given the perversity of nature - your device will enter the power-shutdown code
with data now corrupt. It's quite likely (especially if the data is
transferred via CPU registers to RAM) that there's no reasonable way to
reconstruct the lost data.
The simple expedient of eliminating global variables has no
benefit to the power-down scenario.
Can you imagine the difficulty of
finding a problem of this nature? One that occurs maybe once every
several thousand power cycles, or less? In many systems it may be entirely
reasonable to conclude that the frequency of failure is so low the problem might
be safely ignored. This assumes you're not working on a safety-critical
device, or one with mandated minimal MTBF numbers.
Before succumbing to the temptation to let things
slide, though, consider implications of such a failure. Surely once in a while a
critical data item will go bonkers. Does this mean your instrument might then
exhibit an accuracy problem (for example, when the numbers are calibration
coefficients)? Is there any chance things might go to an unsafe state? Does the
loss of a critical communication parameter mean the device is dead until the
user takes some presumably drastic action?
If the only downside is that the user's TV set
occasionally - and rarely - forgets the last channel selected, perhaps there's
no reason to worry much about losing multi-byte data. Other systems are not so
forgiving.
Steve suggested implementing a data integrity check on
power-up, to ensure that no partial writes left big structures partially
changed. I see two different directions this approach might take.
The first is a simple power-up check of RAM to make
sure all data is intact. Keep a CRC over the critical data; every time a truly
critical bit of data changes, update the CRC, so the boot-up check can see if
the data is intact. If not, at least let the user know that the unit is sick,
data was lost, and some action might be required.
A second, and more robust, approach is to complete
every data item write with a checksum or CRC of just that variable. Power-up
checks of each item's CRC then reveal which variable was destroyed. Recovery
software might, depending on the application, be able to fix the data, or at
least force it to a reasonable value while warning the user that, whilst all is
not well, the system has indeed made a recovery.
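The per-variable approach might be sketched in C as follows. This is an illustration under assumptions, not Steve's code: the guarded_u32_t type, the function names, and the choice of CRC-16/CCITT-FALSE are all made up for the example, and the fallback value is whatever "reasonable value" suits the application. The whole-RAM check is the same idea with a single CRC computed over the entire critical block.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF). */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    while (len--) {
        crc ^= (uint16_t)((unsigned)*data++ << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Hypothetical critical variable stored together with its own CRC. */
typedef struct {
    uint32_t value;
    uint16_t crc;
} guarded_u32_t;

/* Complete every write by storing a CRC of just this variable. */
static void guarded_write(guarded_u32_t *item, uint32_t value)
{
    item->value = value;
    item->crc = crc16_ccitt((const uint8_t *)&item->value, sizeof item->value);
}

/* Power-up check. Returns true if the item is intact. Otherwise forces the
   item to 'fallback' and returns false so the caller can warn the user. */
static bool guarded_check(guarded_u32_t *item, uint32_t fallback)
{
    uint16_t expect = crc16_ccitt((const uint8_t *)&item->value,
                                  sizeof item->value);
    if (item->crc == expect)
        return true;
    guarded_write(item, fallback);
    return false;
}
```

Note that a write interrupted between the value and its CRC is detected but not repaired; the old value is already gone, which is the weakness discussed next.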
Though CRCs are an intriguing and seductive solution I'm
not so sanguine about their usefulness. Philosophically it
is important to warn the user rather than to crash or use bad data.
But it's much better to never crash at all.
We can learn from the OOP community and change the
way we write data to RAM (or, at least the critical items for which battery
back-up is so important).
First, hide critical data items behind drivers. The
best part of the OOP triptych mantra "encapsulation, inheritance,
polymorphism" is
"encapsulation". Bind the data items with the code that uses them. Avoid
globals; change data by invoking a routine, a method, that does the actual work.
Debugging the code becomes much easier, and reentrancy problems diminish.
Second, add a "flush_writes" routine
to every device driver that handles a critical variable. "Flush_writes"
finishes any interrupted write transaction. Flush_writes relies on
the fact that only one routine - the driver - ever sets the variable.
Next, enhance the NMI power-down code to invoke all
of the flush_writes
routines. Part of the power-down sequence then finishes all pending
transactions, so the system's state will be intact when power comes back.
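One way a driver and its flush routine might look is sketched below. The article does not prescribe a mechanism, so the staging copy, the one-byte "pending" flag, and every name here (cal_driver_t, cal_set, cal_flush_writes, power_fail_nmi) are assumptions made for illustration. The idea is that the driver records what it intends to write before touching the real variable, so the NMI code can finish the job.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver state for one critical 32-bit variable in
   battery-backed RAM. */
typedef struct {
    volatile uint32_t value;    /* the protected variable                  */
    volatile uint32_t staged;   /* copy of the value being written         */
    volatile uint8_t  pending;  /* one byte: set while a write is in flight */
} cal_driver_t;

/* The only routine that ever sets the variable. */
static void cal_set(cal_driver_t *d, uint32_t new_value)
{
    d->staged  = new_value;  /* 1: stage the new value                      */
    d->pending = 1;          /* 2: single-byte store commits the intent     */
    d->value   = new_value;  /* 3: multi-byte write that may be interrupted */
    d->pending = 0;          /* 4: transaction finished                     */
}

/* Finish an interrupted write. Safe to call more than once. */
static void cal_flush_writes(cal_driver_t *d)
{
    if (d->pending) {
        d->value   = d->staged;
        d->pending = 0;
    }
}

/* Assumed list of all critical variables in the system. */
static cal_driver_t gain, offset;
static cal_driver_t *const critical[] = { &gain, &offset };

/* Called from the NMI power-fail handler. */
static void power_fail_nmi(void)
{
    for (size_t i = 0; i < sizeof critical / sizeof critical[0]; i++)
        cal_flush_writes(critical[i]);
}
```

If NMI arrives during step 1 the flag is still clear and the old value is untouched; from step 2 onward the flush completes the write from the staged copy. Because the flush is idempotent, the power-up code could call it again in case reset arrived before the NMI routine finished.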
The downside to this approach is that you'll need a
reasonable amount of time between detecting that power is going away, and when
Vcc is no longer stable enough to support reliable processor operation.
Depending on the number of variables needing flushing this might mean hundreds of
microseconds.
Firmware people are often treated as the scum of the
earth, as they inevitably get the hardware (late) and are still required to get
the product to market on time. Worse, too many hardware groups don't listen
to, or even solicit, requirements from the coding folks before cranking out
PCBs. This, though, is a case where the firmware requirements clearly drive the
hardware design. If the two groups don't speak, problems will result.
Some supervisory chips do provide advance warning of
imminent power-down. Maxim's (www.maxim-ic.com) MAX691, for example, detects
Vcc falling below some value before shutting down RAM chip selects and slamming
the system into a reset state. It also includes a separate voltage threshold
detector designed to drive the CPU's NMI input when Vcc falls below some value
you select (typically by selecting resistors). It's important to set this
threshold above the point where the part goes into reset. Just as critical is
understanding how power fails in your system. The capacitors, inductors, and
other power supply components determine how much "alive" time your NMI
routine will have before reset occurs. Make sure it's enough.
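For a first-order feel for that "alive" time, the capacitor relation t = C * dV / I gives a rough budget. The sketch below is only a back-of-the-envelope estimate: it assumes a constant load current drawn from a single capacitance sitting directly on the monitored rail, which real supplies (with regulators and upstream bulk capacitors) do not strictly obey. The function name and the example figures are illustrative, not from any datasheet; measure the real thing with a scope.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated time between the NMI threshold and the reset threshold.
   t = C * dV / I; with C in microfarads, dV in millivolts and I in
   milliamps the result comes out in microseconds. */
static uint32_t holdup_time_us(uint32_t cap_uF, uint32_t nmi_mV,
                               uint32_t reset_mV, uint32_t load_mA)
{
    assert(load_mA > 0 && nmi_mV >= reset_mV);
    return (uint32_t)(((uint64_t)cap_uF * (nmi_mV - reset_mV)) / load_mA);
}
```

With an assumed 1000 uF on the rail, an NMI threshold of 4.85 V, reset at 4.70 V, and a 500 mA load, the estimate is 300 microseconds. Compare that figure against the worst-case run time of the power-down routine.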
I mentioned the problem of power failure corrupting
variables to Scott Rosenthal, one of the smartest embedded guys I know. His
casual "yeah, sure, I see that all the time" got me interested. It seems
that one of his projects, an FDA-approved medical device, uses hundreds of
calibration variables stored in RAM. Losing any one means the instrument has to
go back for readjustment. Power problems are just not acceptable.
His solution is a hybrid between the two approaches
just described. The firmware maintains two separate RAM areas, with critical
variables duplicated in each. Each variable has it's own driver.
When it's time to change a variable, the driver
sets a bit that indicates "change in process". The variable is updated, and a CRC is
computed for that data item and stored with the item. The driver un-asserts the
bit, and then performs the exact same function on the variable stored in the
duplicate RAM area.
On power-up the code checks to ensure that the CRCs
are intact. If not, that indicates
the variable was in the process of being changed, and is not correct, so data
from the mirrored address is used. If both CRCs are OK, but the "being
changed" bit is asserted, then the data protected by that bit is invalid, and
correct information is extracted from the mirror site.
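The scheme as described can be sketched in C like this. It is a reconstruction from the description above, not Scott's actual code: the type and function names, the CRC-16/CCITT-FALSE polynomial, and the step that repairs the bad copy from the good one are all assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16/CCITT-FALSE (polynomial 0x1021, initial value 0xFFFF). */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    while (len--) {
        crc ^= (uint16_t)((unsigned)*data++ << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000u) ? (uint16_t)((crc << 1) ^ 0x1021u)
                                  : (uint16_t)(crc << 1);
    }
    return crc;
}

static uint16_t crc_of(uint32_t v)
{
    uint8_t bytes[4] = { (uint8_t)v, (uint8_t)(v >> 8),
                         (uint8_t)(v >> 16), (uint8_t)(v >> 24) };
    return crc16_ccitt(bytes, sizeof bytes);
}

/* One stored copy: "change in process" flag, data, and the data's CRC. */
typedef struct {
    volatile uint8_t  changing;
    volatile uint32_t value;
    volatile uint16_t crc;
} copy_t;

/* The same variable held in two separate RAM areas. */
typedef struct {
    copy_t primary;
    copy_t mirror;
} mirrored_u32_t;

static void copy_update(copy_t *c, uint32_t v)
{
    c->changing = 1;       /* assert "change in process"   */
    c->value    = v;       /* multi-byte write             */
    c->crc      = crc_of(v);
    c->changing = 0;       /* un-assert: this copy is done */
}

/* The driver: update the first copy completely, then the duplicate. */
static void mirrored_write(mirrored_u32_t *m, uint32_t v)
{
    copy_update(&m->primary, v);
    copy_update(&m->mirror, v);
}

static bool copy_ok(const copy_t *c)
{
    return !c->changing && c->crc == crc_of(c->value);
}

/* Power-up recovery. Returns false only if both copies are bad. */
static bool mirrored_recover(mirrored_u32_t *m)
{
    if (copy_ok(&m->primary)) {
        if (!copy_ok(&m->mirror) || m->mirror.value != m->primary.value)
            copy_update(&m->mirror, m->primary.value);
        return true;
    }
    if (copy_ok(&m->mirror)) {
        copy_update(&m->primary, m->mirror.value);
        return true;
    }
    return false;
}
```

Because the two copies are never modified at the same time, a single power failure can damage at most one of them, so one good copy always survives.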
The result? With thousands of instruments in the
field, over many years, not one has ever lost RAM.
Testing
Good hardware and firmware design leads to reliable
systems. You won't know for sure, though, if your device really meets design
goals without an extensive test program. Modern embedded systems are just too
complex, with too much hard-to-model hardware/firmware interaction, to expect
reliability without realistic testing.
This means you've got to pound on the product, and
look for every possible failure mode. If you've written code to preserve
variables around brown-outs and loss of Vcc, and don't conduct a meaningful
test of that code, you'll probably ship a subtly broken product.
In the past I've hired teenagers to mindlessly and
endlessly flip the power switch on and off, logging the number of cycles and the
number of times the system properly comes to life. Though I do believe in
bringing youngsters into the engineering labs to expose them to the cool parts of
our profession, sentencing them to mindless work is a sure way to convince them
to become lawyers rather than techies.
Better, automate the tests. The Poc-It, from
Microtools (www.microtoolsinc.com/products.htm) is an indispensable $250 device
for testing power-fail circuits and code. It's
also a pretty fine way to find uninitialized variables, as well as isolating those
awfully hard-to-initialize hardware devices like some FPGAs.
The Poc-It brainlessly turns your system on and off,
counting the number of cycles. Another counter logs the number of times a logic
signal asserts after power comes on. So, add a bit of test code to your firmware
to drive a bit up when (and if) the system properly comes to life. Set the Poc-It
up to run for a day or a month; come back and see if the number of power cycles
is exactly equal to the number of successful assertions of the logic bit.
Anything other than equality means something is dreadfully wrong.
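The test hook in the firmware can be tiny. The sketch below is hypothetical: test_port stands in for whatever memory-mapped output register your board provides, BOOT_OK_BIT is an assumed pin wired to the tester's logic input, and the two "ok" flags represent whatever start-up checks your system performs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a memory-mapped output port. On real hardware this would be
   a device register at a board-specific address. */
static volatile uint8_t test_port;

#define BOOT_OK_BIT 0x01u   /* hypothetical pin wired to the tester's input */

/* Call once at the very end of start-up, after all checks have run. */
static void report_boot_result(bool hardware_ok, bool nvram_ok)
{
    if (hardware_ok && nvram_ok)
        test_port |= BOOT_OK_BIT;             /* tester counts this assertion */
    else
        test_port &= (uint8_t)~BOOT_OK_BIT;   /* stay low: a failed cycle     */
}
```

Driving the bit only after the non-volatile data has been verified means a cycle with corrupted RAM shows up as a missing count, not as a success.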
Conclusion
When embedded processing was relatively rare, the
occasional weird failure meant little. Hit the reset button and start over.
That's less of a viable option now. We're surrounded by hundreds of CPUs,
each doing its thing, each affecting our lives in different ways. Reliability
will probably be the watchword of the next decade as our customers refuse to put
up with the quirks that are all too common now.
The current drive is to add the maximum number of
features possible to each product. I see cell phones that include games.
Features are swell, if they work, if the product always fulfills its intended
use. Cheat the customer out of reliability and your company is going to lose.
Power cycling is something every product does, and is too important to ignore.
Thanks to Steve Lund for his thoughts and concerns,
and to Scott Rosenthal (www.sltf.com) for his ideas.