Asynchronous Hardware/Firmware
What makes firmware reliable?
Certainly there are a lot of ingredients. One is that it
works properly at the boundary conditions, when parameters passed to functions
are at extreme values, or, in a real-time system when system loading skyrockets
to its max. Too many systems become brittle when pushed to the edge; worse, it's
almost impossible to create tests for these rare-but-possible conditions.
We work at that fuzzy interface between hardware and
software, which creates additional problems due to the interactions of our code
and the device. Some cause erratic and quite impossible-to-diagnose crashes that
infuriate our customers. The worst bugs of all are those that appear
infrequently, that can't be reproduced. Yet a reliable system just cannot
tolerate any sort of defect, especially the random one that passes our tests,
perhaps dismissed as "ah, it's just a glitch" behavior.
Potential evil lurks whenever hardware and software
interact asynchronously. That is, when some physical device runs at its own
rate, sampled by firmware running at some different speed.
I was poking through some open-source code recently and
came across a typical example of asynchronous interactions. The RTEMS real time
operating system provided by OAR Corporation (ftp://ftp.oarcorp.com/pub/rtems/releases/4.5.0/
for the code) is a nicely written, well organized product with a lot of neat
features. But the timer handling routines, at least for the 68302 distribution,
are flawed in a way that will fail infrequently but possibly catastrophically.
This is just one very public example of the problem; I constantly see it buried
in proprietary firmware.
The code is simple and straightforward, and looks much like
any other timer handler. There's an interrupt service routine invoked when the
16 bit hardware timer overflows. The ISR services the hardware, increments a
global variable named Timer_interrupts, and returns. So Timer_interrupts
maintains the number of times the hardware counted to 65536.
Function Read_timer returns the current "time"
(the elapsed time in microseconds as tracked by the ISR and the hardware timer).
It, too, is delightfully free of complications. Like most of these sorts of
routines it reads the current contents of the hardware's timer register,
shifts Timer_interrupts
left 16 bits, and adds in the value read from the timer. That is, the current
time is the concatenation of the timer's current value and the number of
overflows.
Suppose the hardware rolled over 5 times, creating five
interrupts. Timer_interrupts
equals 5. Perhaps the internal register is, when we call Read_timer,
0x1000. The routine returns a value of 0x51000. Simple enough and seemingly
devoid of problems.
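The routine described above might look roughly like the following. This is a minimal sketch, not the RTEMS source: the variable timer_register is an ordinary C variable standing in for the memory-mapped hardware counter, and timer_isr is a hypothetical name for the overflow handler.

```c
#include <stdint.h>

/* Simulated hardware: on a real 68302 this would be a memory-mapped
   16-bit timer register, not an ordinary variable (assumption). */
volatile uint16_t timer_register;

/* Number of 16-bit overflows, incremented by the timer ISR. */
volatile uint32_t Timer_interrupts;

/* Stand-in for the overflow ISR: count one more rollover. */
void timer_isr(void)
{
    Timer_interrupts++;
}

/* Naive reader: concatenate the overflow count and the hardware count. */
uint32_t Read_timer(void)
{
    uint16_t low  = timer_register;     /* current hardware count */
    uint32_t high = Timer_interrupts;   /* overflows seen so far  */
    return (high << 16) + low;
}
```

With five overflows recorded and 0x1000 in the register, this sketch returns 0x51000, matching the example in the text.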
Race Conditions
But let's think about this more carefully. There are
really two things going on at the same time. Not concurrently, which
means "apparently at the same time", as in a multitasking environment where
the RTOS doles out CPU resources so all tasks appear to be running
simultaneously. No, in this case the code in Read_timer executes
whenever called, and the clock-counting timer runs at its own rate. The two are
asynchronous.
A fundamental rule of hardware design is to panic whenever
asynchronous events suddenly synchronize. For instance, when two different
processors share a memory array there's quite a bit of convoluted logic
required to ensure that only one gets access at any time. If the CPUs use
different clocks the problem is much trickier, since the designer may find the
two requesting exclusive memory access within fractions of a nanosecond of each
other. This is called a "race" condition and is the source of many gray
hairs and dramatic failures.
One of Read_timer's race conditions might be:
It reads the hardware and gets, let's say, a value of 0xffff.
Before having a chance to retrieve the high part of the time from
variable Timer_interrupts, the hardware increments
again to 0x0000.
The overflow triggers an interrupt. The ISR runs. Timer_interrupts
is now 0x0001, not 0 as it was just nanoseconds before.
The ISR returns. Our fearless Read_timer
routine, with no idea an interrupt occurred, blithely concatenates the new
0x0001 with the previously-read timer value of 0xffff, and returns 0x1ffff - a
hugely incorrect value.
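The sequence above can be reproduced on a desktop by forcing the overflow to land between the two reads. This is only a simulation under stated assumptions: hardware_tick is a hypothetical helper that models one timer count plus the ISR, and Read_timer_racy is the naive reader with that tick deliberately wedged into the vulnerable window.

```c
#include <stdint.h>

uint16_t timer_register;      /* simulated 16-bit hardware counter */
uint32_t Timer_interrupts;    /* overflow count kept by the ISR     */

/* Simulated hardware tick: advance the counter; on rollover do what
   the overflow ISR would do. */
void hardware_tick(void)
{
    timer_register++;                 /* 0xffff wraps to 0x0000 */
    if (timer_register == 0)
        Timer_interrupts++;
}

/* Naive reader with the race forced: one tick lands between the reads. */
uint32_t Read_timer_racy(void)
{
    uint16_t low = timer_register;    /* reads the old count, 0xffff   */
    hardware_tick();                  /* overflow occurs, ISR runs     */
    uint32_t high = Timer_interrupts; /* reads the new overflow count  */
    return (high << 16) + low;
}
```

Starting from a count of 0xffff with no overflows, the sketch returns 0x1ffff, where either 0xffff or 0x10000 would have been acceptable.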
Or, suppose Read_timer is called
during a time when interrupts are disabled - say, if some other ISR needs the
time. One of the few perils of writing encapsulated code and drivers is that
you're never quite sure what state the system is in when the routine gets
called. In this case:
Read_timer starts. The timer is 0xffff with
no overflows.
Before much else happens it counts to 0x0000. With interrupts off
the pending interrupt gets deferred.
Read_timer returns a value of 0x0000 instead
of the correct 0x10000, or the reasonable 0xffff.
So the algorithm that seemed so simple has quite subtle
problems, necessitating a more sophisticated approach. The RTEMS RTOS, at least
in its 68k distribution, will likely create infrequent but serious errors.
Sure, the odds of getting a mis-read are small. In fact,
the chance of getting an error plummets as we call Read_timer less often.
How often will the race condition surface? Once a week? Monthly?
Many embedded systems run for years without rebooting.
Reliable products must never contain fragile code. Our challenge as
designers of robust systems is to identify these sorts of issues and create
alternative solutions that work correctly, every time.
Just weeks ago an engineer told me his team spent three
months tracking down this sort of race problem, also in a timer driver. The bug
appeared so infrequently it seemed a ghost, but their safety-critical product
could not crash, ever. Can you imagine the cost of three extra months of
debugging?
Options
Fortunately a number of solutions do exist. The easiest is
to stop the timer before attempting to read it. There will be no chance of an
overflow putting the upper and lower halves of the data out of sync. This is a
simple and guaranteed solution.
But we will lose time. Since
the hardware generally counts the processor's clock, or clock divided by a
small number, it may lose quite a few ticks during the handful of instructions
executed to do the reads. The problem will be much worse if an interrupt causes
a context switch after disabling the counting. Turning interrupts off during
this period will eliminate unwanted tasking, but increases both system latency
and complexity.
I just hate disabling interrupts; system latency
goes up and sometimes the debugging tools get a bit funky. When reading code a
red flag goes up if I see a lot of disable interrupt instructions sprinkled
about. Though not necessarily bad, it's often a sign that either the code was
beaten into submission (made to work by heroic debugging instead of careful
design), or there's something quite difficult and odd about the environment.
Another solution is to read the Timer_interrupts
variable, then the hardware timer, and then re-read Timer_interrupts.
An interrupt occurred if the two variable values aren't identical. Iterate
till the two variable reads are equal. The upside: correct data, interrupts stay
on, and the system doesn't lose counts.
The downside: in a heavily-loaded, multitasking
environment, it's possible that the routine could loop for rather a long time
before getting two identical reads. The function's execution time is
non-deterministic. We've gone from a very simple timer reader to somewhat more
complex code that could run for milliseconds instead of microseconds.
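A minimal sketch of this read-and-recheck approach follows. It assumes interrupts are enabled when the routine runs, so the ISR updates Timer_interrupts promptly after an overflow; timer_register is again an ordinary variable standing in for the hardware counter, and Read_timer_retry is a hypothetical name.

```c
#include <stdint.h>

volatile uint16_t timer_register;     /* stand-in for the hardware count */
volatile uint32_t Timer_interrupts;   /* overflow count kept by the ISR  */

/* Read high, low, then high again; retry if an overflow slipped in. */
uint32_t Read_timer_retry(void)
{
    uint32_t high, check;
    uint16_t low;
    do {
        high  = Timer_interrupts;
        low   = timer_register;
        check = Timer_interrupts;
    } while (high != check);          /* ISR ran mid-read: try again */
    return (high << 16) + low;
}
```

The loop normally runs once. Its worst-case iteration count depends on how often the reader is preempted, which is exactly the non-determinism noted above.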
Another alternative might be to simply disable interrupts
around the reads. This will prevent the ISR from gaining control and changing Timer_interrupts
after we've already read it, but creates another issue.
We enter Read_timer and immediately shut down
interrupts. Suppose the hardware timer is at our notoriously-problematic 0xffff,
and Timer_interrupts
is zero. Now, before the code has a chance to do anything else, the overflow
occurs. With interrupts shut down we miss the rollover. The code reads a
zero from both the timer register and from Timer_interrupts,
returning zero instead of the correct 0x10000, or even a reasonable 0x0ffff.
Yet disabling interrupts is probably indeed a good thing to
do, despite my rant against this practice. With them on there's always the
chance our reading routine will be suspended by higher priority tasks and other
ISRs for perhaps a very long time. Maybe long enough for the timer to roll over
several times. So let's try to fix the code. Consider the following:
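ulong Read_timer(void){
    unsigned int low, high;
    push_interrupt_state;             /* save the caller's interrupt state */
    disable_interrupts;
    low=inword(timer_register);       /* low 16 bits: hardware count */
    high=Timer_interrupts;            /* high part: overflows the ISR has seen */
    if(inword(timer_overflow)){       /* rollover pending, ISR held off */
        ++high;                       /* account for the missed overflow */
        low=inword(timer_register);}  /* reread the count after the rollover */
    pop_interrupt_state;              /* restore, don't blindly re-enable */
    /* parenthesize the shift: + binds tighter than << in C */
    return ((((ulong)high)<<16) + (ulong)low);
}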
We've made three changes to the RTEMS code. First,
interrupts are off, as described.
Second, you'll note that there's no explicit interrupt
re-enable. Two new pseudo-C statements have appeared which push and pop the
interrupt state. Trust me for a moment - this is just a more sophisticated way
to manage the state of system interrupts.
The third change is a new test that looks at something
called "timer_overflow",
an input port that is part of the hardware. Most timers have a testable bit that
signals an overflow took place. We check this to see if an overflow occurred
between turning interrupts off and reading the low part of the time from the
device. With an inactive ISR, the variable Timer_interrupts
won't properly reflect such an overflow.
We test the status bit and reread the hardware count if an
overflow had happened. Manually incrementing the high part corrects for the
suspended ISR. The code then concatenates the two fixed values and returns the
correct result. Every time.
With interrupts off we have increased latency. However,
there are no loops; the code's execution time is entirely deterministic.
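The overflow check can be exercised in a small simulation. This is a sketch under stated assumptions, not driver code: plain variables stand in for the count register and the overflow status bit, interrupts are taken to be off for the whole call (so a rollover only sets the status bit), and hardware_tick and read_timer_locked are hypothetical names.

```c
#include <stdint.h>
#include <stdbool.h>

uint16_t timer_register;       /* simulated hardware count             */
bool     timer_overflow;       /* simulated overflow status bit        */
uint32_t Timer_interrupts;     /* overflows already counted by the ISR */

/* One hardware tick. The ISR is not modeled: interrupts are assumed
   off, so a rollover only sets the status bit. */
void hardware_tick(void)
{
    if (++timer_register == 0)
        timer_overflow = true;
}

/* Body of the corrected reader, meant to run with interrupts disabled. */
uint32_t read_timer_locked(void)
{
    uint16_t low  = timer_register;
    uint32_t high = Timer_interrupts;
    if (timer_overflow) {          /* rollover the ISR has not seen yet */
        ++high;
        low = timer_register;      /* count from after the rollover     */
    }
    return (high << 16) + low;
}
```

If the count sits at 0xffff with no overflows recorded and one tick arrives while interrupts are off, the sketch returns 0x10000 rather than zero.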
Push State?
But what's all of this pushing and popping? And where's
the enable interrupts instruction?
Good software design encapsulates actions. We localize all
access to particular resources with driver routines. If you looked at firmware
from 15 or 20 years ago you'd be appalled at how so many developers casually
sprinkled I/O instructions throughout all of the code. Today most (not all,
unhappily) of us would use a single routine - like Read_timer - every time we wanted access to
the timer.
Encapsulation implies, though, that the one driver must be
quite generic and work properly regardless of the system's state. It
shouldn't corrupt an LED's status, for example.
What if, for some reason we can't anticipate when writing
this driver, someone calls it with interrupts already disabled? Using the
conventional DI/EI pair will cause the system state to change when it returns.
That could be catastrophic.
To safely disable/re-enable interrupts save the interrupt
status first, issue a disable instruction, and then pop the saved interrupt
state back into the processor status word. Use the pragma or similar construct offered by most
cross compilers to gain access to these low-level hardware functions. Build a
macro that generates a bit of in-line assembly if the compiler is so brain dead
there's no way to handle interrupts intrinsically.
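The save, disable, restore pattern can be sketched as follows. Everything here is simulated: interrupts_enabled is a plain variable standing in for the enable bit in the processor status word, and the function names are hypothetical. On real hardware the two helpers would be a compiler intrinsic, a pragma, or a line or two of in-line assembly.

```c
#include <stdbool.h>

/* Simulated interrupt-enable bit (assumption: on real hardware this
   lives in the processor status word). */
bool interrupts_enabled = true;

/* Save the current state, then disable. Returns the saved state. */
bool save_and_disable_interrupts(void)
{
    bool saved = interrupts_enabled;
    interrupts_enabled = false;
    return saved;
}

/* Put back whatever state the caller had on entry. */
void restore_interrupts(bool saved)
{
    interrupts_enabled = saved;
}

/* A driver routine that is safe to call in either state. */
void driver_critical_section(void)
{
    bool saved = save_and_disable_interrupts();
    /* ... touch the shared hardware here ... */
    restore_interrupts(saved);
}
```

Unlike a DI/EI pair, a caller that arrived with interrupts already off gets control back with them still off.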
Other RTOSes
Unhappily, race conditions occur anytime we need more
than one read to access data that's changing asynchronously to the software.
If you're reading X and Y coordinates, even with just 8 bits of resolution,
from a moving machine there's some peril they could be seriously out of sync
if two reads are required. A ten bit encoder managed through byte-wide ports
potentially could create a similar risk.
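To make the encoder case concrete, here is a small simulation of a torn read. The port layout is an assumption for illustration only: the low eight bits come from one byte-wide port and the top two bits from another, and the shaft is made to advance one count between the two reads.

```c
#include <stdint.h>

uint16_t encoder_position;    /* simulated 10-bit encoder count */

/* Hypothetical byte-wide ports exposing the two halves of the count. */
uint8_t read_low_port(void)  { return (uint8_t)(encoder_position & 0xff); }
uint8_t read_high_port(void) { return (uint8_t)((encoder_position >> 8) & 0x03); }

/* Two-port read with the encoder moving one count between the reads. */
uint16_t read_encoder_torn(void)
{
    uint8_t low = read_low_port();     /* 0xff while position is 0x0ff */
    encoder_position++;                /* shaft moves: now 0x100       */
    uint8_t high = read_high_port();   /* 0x01 from the new position   */
    return (uint16_t)((high << 8) | low);
}
```

Starting at 0x0ff the sketch returns 0x1ff, nearly double the true position, even though the shaft moved by a single count.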
Having dealt with this problem in a number of embedded
systems over the years, I wasn't too shocked to see it in the RTEMS RTOS.
It's a pretty obscure issue, after all, though terribly real and potentially
deadly. For fun I looked through the source of uC/OS, another very popular
operating system whose source is on the net (see www.ucos-ii.com).
uC/OS never reads the timer's hardware. It only counts overflows as detected
by the ISR, as there's no need for higher resolution. There's no chance of
an incorrect value.
Some of you, particularly those with hardware backgrounds,
may be clucking over an obvious solution I've yet to mention. Add an input
capture register between the timer and the system; the code sets a "lock the
value into the latch" bit, then reads this safely unchanging data.
That solution, too, is fraught with peril and in many
instances will not work. More next month!