Metastability and Firmware
Last month I discussed the general problem of making
software that reads asynchronous hardware reliable. Some very simple situations
- like a timer that uses an interrupt service routine - can result in rare
but quite serious faults. Whenever we have a physical input to the computer that
requires more than one I/O read, and that continues to run during input,
there's a chance the data will be corrupt.
Suppose a robot uses a 10 bit encoder to monitor the
angular location of a wrist joint. As the wrist rotates the encoder sends back a
binary code, 10 bits wide, representing the joint's current position. An 8 bit
processor requires two distinct I/O instructions - two byte-wide reads - to
get the data. No matter how fast the computer might be there's a finite time
between the reads during which the encoder data may change.
The wrist is rotating. A "get_position"
routine reads 0xff from the low part of the position data. Then, before the next
instruction, the encoder rolls over to 0x100. "get_position"
reads the high part of the data
- now 0x1 - and returns a position of 0x1ff, clearly in error and perhaps
even impossible.
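The failure is easy to reproduce in a small simulation. The sketch below is illustrative only: the encoder_t type, the read_low/read_high helpers and the "step" parameter are hypothetical stand-ins for two real input ports and a moving wrist, not code from any actual system.

```c
#include <stdint.h>

/* Simulated 10 bit encoder. On real hardware this would be two input ports. */
typedef struct {
    uint16_t position;              /* live encoder value, 0..0x3ff */
} encoder_t;

static uint8_t read_low(const encoder_t *e)
{
    return (uint8_t)(e->position & 0xffu);          /* bits 0..7 */
}

static uint8_t read_high(const encoder_t *e)
{
    return (uint8_t)((e->position >> 8) & 0x03u);   /* bits 8..9 */
}

/* Naive two-read routine. 'step' models how far the encoder moves
 * in the interval between the two I/O reads. */
uint16_t get_position_torn(encoder_t *e, uint16_t step)
{
    uint8_t low = read_low(e);

    /* The wrist keeps rotating between the reads. */
    e->position = (uint16_t)((e->position + step) & 0x3ffu);

    uint8_t high = read_high(e);
    return (uint16_t)(((uint16_t)high << 8) | low);
}
```

Starting at 0x0ff with a step of 1, the routine pairs the old low byte (0xff) with the new high part (0x1) and returns 0x1ff. With a step of 0 it returns the correct 0x0ff.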
This is a common problem. Handling input from a two axis
controller? If the hardware continues to move during our reads, then the X and Y
data will be slightly uncorrelated, perhaps yielding impossible results. One
friend tracked a rare autopilot failure to the way the code read a flux-gate
compass, whose output is a pair of related quadrature signals. Reading them at
disparate times, while the vessel continued to move, yielded impossible heading
data.
Input Capture Register
Hardware folks have dealt with similar problems for
decades. Their usual solution is to add an input capture register between the
I/O device and the processor. The register is nothing more than a parallel
latch, as wide as the input data. The 10 bit encoder has a 10 bit register; the
encoder's output goes to the register's inputs. A single clock line drives
each flip-flop in the latch; when strobed it locks the data into the register.
The output is fed to a pair of processor input ports.
When it's time to read a safe, unchanging value the code
issues a "hold the data now" command which strobes encoder values into the
latch. So all 10 bits are stored and can be read by the software at any time,
with no fear of things changing between reads.
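In software terms the sequence is "strobe, then read both halves". Here is a minimal sketch of that idea, written as a simulation so it can run anywhere. The capture_t type and its functions are hypothetical names, and the model assumes an ideal latch: it deliberately ignores set-up and hold time.

```c
#include <stdint.h>

/* Simulated encoder plus a 10 bit input capture register. */
typedef struct {
    uint16_t live;      /* encoder outputs, still changing */
    uint16_t latch;     /* contents of the capture register */
} capture_t;

/* "Hold the data now": copy all 10 bits into the latch at once. */
void capture_strobe(capture_t *c)
{
    c->latch = (uint16_t)(c->live & 0x3ffu);
}

uint8_t capture_read_low(const capture_t *c)
{
    return (uint8_t)(c->latch & 0xffu);
}

uint8_t capture_read_high(const capture_t *c)
{
    return (uint8_t)((c->latch >> 8) & 0x03u);
}

/* 'step' models encoder motion between the two byte-wide reads. */
uint16_t get_position_latched(capture_t *c, uint16_t step)
{
    capture_strobe(c);
    uint8_t low = capture_read_low(c);

    /* The encoder moves on, but the latch does not. */
    c->live = (uint16_t)((c->live + step) & 0x3ffu);

    uint8_t high = capture_read_high(c);
    return (uint16_t)(((uint16_t)high << 8) | low);
}
```

With the encoder at 0x0ff and a step of 1, this returns 0x0ff even though the live value has already advanced to 0x100. Both reads see the same snapshot.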
Some designers tie the register's clock input to one of
the port control lines. The I/O read instruction then automatically strobes data
into the latch, assuming one is wise enough to ensure the register latches data
on the leading edge of the clock.
The input capture register is a very simple way to suspend
moving data for the duration of a couple of reads. At first glance it seems
perfectly safe. But a bit of analysis shows that for asynchronous inputs it is
not reliable. We're using hardware to fix a software problem, so we must be
aware of the limitations of physical logic devices.
To simplify things for a minute, let's zoom in on that
input capture register and examine just one of its bits. Each gets stored in a
flip-flop, a bit of logic that might have only three connections: data in, data
out, and clock. When the input is a one, strobing clock puts a one at the
output.
But suppose the input changes at about the same time clock
cycles? What happens? The short answer is that no one knows.
Metastable States
Every flip-flop has two critical specifications we violate
at our peril. "Set-up time" is the minimum number of nanoseconds that input
data must be stable before clock comes. "Hold time" tells us how long
to keep the data present after clock transitions. These specs vary
depending on the logic device. Some might require tens of nanoseconds of set-up
and/or hold time; others need an order of magnitude less.
If we tend to our knitting we'll respect these parameters
and the flip-flop will always be totally predictable. But when things are
asynchronous - say, the wrist rotates at its own rate and the software does
a read whenever it needs data - there's a chance that we'll violate set-up
or hold time.
Suppose the flip-flop requires 3 nanoseconds of set-up
time. Our data changes within that window, flipping state perhaps a single
nanosecond before clock transitions. The device will go into a metastable state
where the output gets very strange indeed.
When we violate the spec the device really doesn't know if
we presented a zero or a one. Its output goes not to a logic state but
either to a half-level (in between the digital norms) or into oscillation,
toggling wildly between states. The flip-flop is metastable.
This craziness doesn't last long; typically after a few
to 50 nanoseconds the oscillations damp out or the half-state disappears,
leaving the output at a valid one or zero. But which one is it? This is a
digital system, and we expect ones to be ones, and zeroes zeroes.
The output is random. Bummer, that. You cannot
predict which level it will assume. That sure makes it hard to design
predictable digital systems!
Hardware folks feel that the random output isn't a
problem. Since the input changed at almost exactly the same time the clock
strobed, either a zero or a one is reasonable. If we had clocked just a hair
ahead or behind we'd have gotten a different value, anyway.
Philosophically, who knows which state we measured? Is this really a big
deal? Maybe not to the EEs, but this impacts our software in a big way, as
we'll see shortly.
Metastability occurs only when clock and data arrive
almost simultaneously; the odds increase as clock rates soar. An equally
important factor is the type of logic component used; slower logic (like 74HCxx)
has a much wider metastable window than faster devices (say, 74FCTxx). Clearly
at reasonable rates the odds of the two asynchronous signals arriving closely
enough in time to cause a metastable situation are low, but they are measurable
and certainly important. With a 10 MHz clock and 10 kHz data rate, using typical
but not terribly speedy logic, metastable errors occur about once a minute.
Though infrequent, no reliable system can stand that failure rate.
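The dependence on clock rate, data rate and logic family is commonly summarized by a failure model of the following form. Treat it as a general sketch rather than a formula from this article: the symbols t_r, tau and T_0 are introduced here for illustration, and the device constants have to come from the manufacturer's characterization data.

```latex
\mathrm{MTBF} = \frac{e^{\,t_r/\tau}}{T_0 \, f_{\mathrm{clock}} \, f_{\mathrm{data}}}
```

Here f_clock and f_data are the clock and data transition rates, T_0 and tau are constants that depend on the logic family, and t_r is the time the flip-flop's output is given to settle before anything uses it. In this model doubling either rate halves the mean time between failures, while each additional tau of settling time multiplies it by e.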
The classic metastable fix uses two flip-flops connected in
series. Data goes to the first; its output feeds the data input of the
second. Both use the same clock input. The second flop's output will be
"correct" after two clocks, since the odds of two metastable events
occurring back-to-back are almost nil. With two flip-flops, at reasonable data
rates errors occur millions or even billions of years apart. Good enough for
most systems.
But "correct" means the second stage's output will
not be metastable: it's not oscillating, nor is it at an illegal voltage
level. There's still an equal chance the value will be in either legal logic
state.
Firmware, not Hardware
To my knowledge there's no literature about how
metastability affects software, yet it poses very real threats to building a
reliable system.
Hardware designers smugly cure their metastability problem
using the two stage flops described. Their domain is that of a single bit, whose
input changed just about the same time the clock transitioned. Thinking in such
narrow terms it's indeed reasonable to accept the inherent random output the
flops generate.
But we software folks are reading parallel I/O ports, each
perhaps 8 bits wide. That means there are 8 flip-flops in the input capture
register, all driven by the same clock pulse.
Let's look at what might happen. The encoder changes from
0xff to 0x100. This small difference might represent just a tiny change in
angle. We request a read at just about the same time the data changes; our input
operation strobes the capture register's clock, creating a violation of set-up
or hold time. Nine of the ten input bits change; each of those flip-flops inside
the register goes metastable. After a short time the oscillations die out, but now
every one of those bits is random. Though the hardware folks might shrug and
complain that no one knows what the right value was, since everything changed as
clock arrived, in fact the data was around 0xff or 0x100. A random result of,
say, 0x12 is absurd and totally unacceptable, and may lead to crazy system
behavior.
The case where data goes from 0xff to 0x100 is pathological
since nine bits change at once. The system faces the same peril whenever lots
of bits change. 0x0f to 0x10. 0x1f to 0x20. The upper, unchanging data bits will
always latch correctly; but every changing bit is at risk.
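Which bits are at risk is just the exclusive-OR of the old and new values. The sketch below makes that concrete; the function names are hypothetical, and it models the worst case in which every changing bit resolves randomly.

```c
#include <stdint.h>

/* Bits that differ between the two values: these are the flip-flops
 * whose inputs are moving when the clock arrives. */
uint16_t bits_at_risk(uint16_t before, uint16_t after)
{
    return (uint16_t)(before ^ after);
}

/* Count the set bits in a mask. */
unsigned count_bits(uint16_t v)
{
    unsigned n = 0;
    while (v) {
        n += v & 1u;
        v >>= 1;
    }
    return n;
}

/* Could 'latched' come out of the register if every changing bit
 * resolved randomly while every unchanging bit latched correctly? */
int is_possible_latch(uint16_t before, uint16_t after, uint16_t latched)
{
    uint16_t stable = (uint16_t)~(before ^ after);
    return ((latched ^ before) & stable) == 0;
}
```

For 0x0ff to 0x100 the mask is 0x1ff, nine bits, so the register may hold any value from 0x000 to 0x1ff, including the absurd 0x012. For 0x0f to 0x10 the mask is 0x1f; for 0x1f to 0x20 it is 0x3f.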
Why not use the multiple flip-flop solution? Connect two
input capture registers in series, both driven by the same clock. Though this
will eliminate the illegal logic states and oscillations, the second stage's
output will be random as well.
One option is to ignore metastability and hope for the
best. Or use very fast logic with very narrow set-up/hold time windows to reduce
the odds of failure. If the code samples the inputs infrequently it's
possible to reduce metastability to one chance in millions or even billions.
Building a safety critical system? Feeling lucky?
It is possible to build a synchronizer circuit that takes a
request for a read from the processor, combines it with a data available bit
from the I/O device, and responds with a data-OK signal back to the CPU. This is
non-trivial and prone to errors.
An alternative is to use a different coding scheme for the
I/O device. Buy an encoder with Gray Code output, for example (if you can find
one). Gray Code is a counting scheme where only a single bit changes between
numbers, as follows:
0 000
1 001
2 011
3 010
4 110
5 111
6 101
7 100
Gray code makes sense if, and only if, your code reads the
device faster than it's likely to change, and if the changes happen in a
fairly predictable fashion - like counting up. Then there's no real chance
of more than a single bit changing between reads; if the inputs go metastable
only one bit will be wrong. The result will still be reasonable.
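If the encoder does deliver Gray Code, the firmware has to convert it back to ordinary binary before doing arithmetic. A minimal sketch of the standard conversions follows; the function names are my own, and the 16 bit width is an assumption chosen to cover a 10 bit encoder.

```c
#include <stdint.h>

/* Binary to Gray: each Gray bit is the XOR of two adjacent binary bits. */
uint16_t binary_to_gray(uint16_t b)
{
    return (uint16_t)(b ^ (b >> 1));
}

/* Gray to binary: fold the shifted value back in until nothing is left. */
uint16_t gray_to_binary(uint16_t g)
{
    uint16_t b = g;
    while (g >>= 1) {
        b ^= g;
    }
    return b;
}
```

These reproduce the table above: binary_to_gray(5) is binary 111, and gray_to_binary applied to binary 100 gives 7. The troublesome 0xff to 0x100 transition becomes 0x080 to 0x180 in Gray Code, a change in a single bit.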
Another solution is to compute a parity or checksum of the
input data before the capture register. Latch that, as well, into the register.
Have the code compute parity on the data it reads and compare it to the latched
parity; if they disagree, do another read.
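The read-and-retry loop might look like the sketch below. Everything hardware-specific is an assumption made for illustration: the layout (10 data bits with the parity bit latched in bit 10), the retry limit, and fake_read, which stands in for the real port reads and returns one corrupted sample followed by a clean one.

```c
#include <stdbool.h>
#include <stdint.h>

#define DATA_MASK   0x03ffu     /* assumed: 10 data bits           */
#define PARITY_BIT  0x0400u     /* assumed: parity latched in bit 10 */

/* Returns 1 if 'v' contains an odd number of one bits, else 0. */
static uint16_t parity_of(uint16_t v)
{
    uint16_t p = 0;
    while (v) {
        p ^= (uint16_t)(v & 1u);
        v >>= 1;
    }
    return p;
}

typedef uint16_t (*read_fn)(void);

/* Read until the data agrees with its latched parity, or give up. */
bool read_checked(read_fn read_latched, unsigned max_tries, uint16_t *out)
{
    for (unsigned i = 0; i < max_tries; i++) {
        uint16_t raw = read_latched();
        uint16_t data = (uint16_t)(raw & DATA_MASK);
        uint16_t stored = (raw & PARITY_BIT) ? 1u : 0u;

        if (parity_of(data) == stored) {
            *out = data;
            return true;
        }
    }
    return false;               /* caller decides how to handle this */
}

/* Stand-in for the hardware: the true value is 0x100 (parity 1). */
static unsigned fake_calls;

uint16_t fake_read(void)
{
    fake_calls++;
    /* First sample: bit 0 latched wrong, so parity no longer matches. */
    return (fake_calls == 1) ? 0x0501u : 0x0500u;
}
```

Calling read_checked(fake_read, 3, &value) rejects the first sample and returns 0x100 from the second. Keep in mind that a single parity bit only catches an odd number of wrong bits, and the parity flip-flop can itself latch the wrong state, so a wider checksum is the safer choice when many bits change at once.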
Though I've discussed adding an input capture register,
please don't think that this is the root cause of the problem. Without that
register - if you just feed the asynchronous inputs directly into the CPU -
it's quite possible to violate the processor's innate set-up/hold times.
There's no free lunch; all logic has physical constraints we must honor.
Don't Panic!
Some designs will never have a metastability problem.
Metastability always stems from violating set-up or hold times, which in turn
comes from either poor design or asynchronous inputs.
All of this discussion has revolved around asynchronous
inputs, when the clock and data are unrelated in time. Be wary of anything not
slaved to the processor's clock. Interrupts are a notorious source of
problems. If caused by, say, someone pressing a button, be sure that the
interrupt itself, and the vector-generating logic, don't violate the
processor's set-up and hold times.
But in computer systems most things do happen
synchronously. If you're reading a timer that operates from the CPU's clock,
it is inherently synchronous to the code. From a metastability standpoint it's
totally safe.
Bad design, though, can plague any electronic system. Every
logic component takes time to propagate data; when a signal traverses many
devices the delays can add up significantly. If the data then goes to a latch
it's quite possible that the delays may cause the input to transition at the
same time as the clock. Instant metastability.
Designers are pretty careful to avoid these situations,
though. Do be wary of FPGAs and other components where the delays vary depending
on how the software routes the device. And when latching data or clocking a
counter it's not hard to create a metastability problem by using the wrong
clock edge. Pick the edge that gives the device time to settle before it's
read.
What about analog inputs? Connect a 12 bit A/D converter to
two 8 bit ports and we'd seem to have a similar problem: the analog data can
wiggle all over, changing during the time we read the two ports. However,
there's no need for an input capture register because the converter itself
generally includes a "sample and hold" block, which stores the analog signal
while the A/D digitizes. Most A/Ds then store the digital value till we start
the next conversion.
There's a lot of information about metastability in
circuits. One of the best is a Texas Instruments report (number SDYA006) named
"Metastable Response in 5-V Logic Circuits". The formulas and empirical data
included will help you quantitatively calculate the risks in your designs.