Understand Your User's Needs
Understand your user's needs; only then can you be sure the code
is useful, as well as correct.
Published in Embedded Systems Programming, November 1991
Call me Ishmael. I'm writing this in mid-Atlantic, bound from
Baltimore to Plymouth, England aboard my 35 foot sailboat. Like
Melville, I find relief from the pressures of modern life by chasing
adventure at sea.
Amber II, though 30 years old, nevertheless is a child of the
microprocessor revolution. Over the years I've added a lot of
electronics to make sailing easier and safer; each addition brings
yet one more embedded system aboard. With the exception of the
beautifully designed digital VHF radio, every microprocessor-based
product on the boat suffers from one or more design defects that
erode the equipment's usefulness just a bit.
The environment on a small boat far offshore couldn't be worse
for electrical items. Salt spray and high humidity insidiously
find their way inside even the best enclosures, rapidly corroding
every soldered connection. Switch contacts are the first to go,
followed by connectors and the tracks on PC boards. A partial
solution is to use only the finest gold contacts, an obvious approach
all too few vendors employ.
Regular readers of this column know I've long been an advocate
of using software to solve system-wide and application-wide problems.
While the marine environment is perhaps a bit extreme, every system
is subject to mechanical and electrical failures. After all, even
in the most benign laboratory conditions contacts get dirty. It
makes sense to design code that will work in at least some fashion
if, say, a switch fails. In situations where failures are likely
or inevitable, a wise designer will devise software solutions,
even if the system cannot continue to run with complete functionality.
Given that dirty or corroded contacts are a perennial source of
trouble, embedded code that relies on functional switches should
always check input bits for validity. Obviously, if only a single
switch should ever be pressed at a time, then by all means don't
accept conditions where several are asserted. But be kind to your
users - a switch failure may erroneously create this condition.
Can the system ignore the extra bit (perhaps by seeing it always
asserted), and carry on?
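One way to do this is sketched below in C. It is only an
illustration, not the code of any real product: the names
(keypad_t, keypad_filter, STUCK_TICKS) and the 8-switch,
one-key-at-a-time layout are assumptions. A bit that stays asserted
far longer than any user would hold a key is treated as failed and
masked out, and any scan showing more than one live key is rejected.

```c
#include <stdint.h>

/* Hypothetical state for an 8-switch, one-key-at-a-time panel. */
typedef struct {
    uint8_t  stuck_mask;      /* bits judged to have failed "on"      */
    uint16_t held_ticks[8];   /* consecutive scans each bit was set   */
} keypad_t;

/* Assumed threshold: about 5 s at a 10 ms scan rate. */
#define STUCK_TICKS 500u

/* Returns the index (0..7) of the single valid key, or -1 when no
 * key, or an impossible combination of keys, is seen. */
int keypad_filter(keypad_t *kp, uint8_t raw)
{
    for (int i = 0; i < 8; i++) {
        uint8_t bit = (uint8_t)(1u << i);
        if (raw & bit) {
            if (kp->held_ticks[i] < STUCK_TICKS)
                kp->held_ticks[i]++;
            if (kp->held_ticks[i] >= STUCK_TICKS)
                kp->stuck_mask |= bit;          /* held too long: ignore it */
        } else {
            kp->held_ticks[i] = 0;
            kp->stuck_mask &= (uint8_t)~bit;    /* released: trust it again */
        }
    }

    uint8_t live = (uint8_t)(raw & (uint8_t)~kp->stuck_mask);
    if (live == 0 || (live & (live - 1)) != 0)
        return -1;                              /* none, or more than one */

    int idx = 0;
    while (!(live & 1u)) {
        live >>= 1;
        idx++;
    }
    return idx;
}
```

Called once per scan, this keeps the remaining switches usable even
with one contact permanently shorted.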
A lot of systems use a debouncing algorithm that will loop forever
if an input is shorted. Don't let a simple failure shut down the
entire product! Assume default inputs that make some sense where
possible.
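The classic offender is a "wait for key release" loop with no way
out. A minimal sketch of a bounded version follows; the function
name, the memory-mapped port pointer, and the poll limit are all
assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wait for the masked input bits to go inactive, but never forever.
 * Returns true if the input was released, false if the poll budget
 * ran out (the caller should then treat the input as stuck and fall
 * back to a default). */
bool wait_for_release(const volatile uint8_t *port, uint8_t mask,
                      uint32_t max_polls)
{
    for (uint32_t n = 0; n < max_polls; n++) {
        if ((*port & mask) == 0)
            return true;
    }
    return false;
}
```

A false return is the cue to log the fault, assume a sensible
default, and keep the rest of the product running.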
For example, on this trip while still 1000 miles from England
Amber's digital autopilot went insane. God knows how long we went
in circles till I woke up and realized there was a problem. After
an entire day of tracing the circuit and looking for the source
of the trouble I found that the front panel switches were wired
in parallel with an unused external connector. All were arranged
in a matrix scanned by the unit's 8051. These course setting switches
are used one at a time - never should more than one be pressed,
and a user would never hold a switch in for more than a few seconds.
The relentless sea found its way into the O-ring sealed computer
module and created a high-impedance short between scan lines.
The code was too simple-minded to reject the impossible signals
it received, and bizarrely steered us in circles.
Such poor code is inexcusable, as autopilots are famous for suffering
corrosion problems. Wealthier sailors usually carry three or four
units in hopes that one will survive a trip. Smarter software
could help keep customers a lot happier. The engineering costs
will be a bit higher, but the extra software costs nothing in
production if ROM space is available. If it isn't, the company
must weigh the cost of unhappy customers against a microcontroller
with more program space.
Still, the designers had addressed a similar problem, though perhaps
more to satisfy their own internal production requirements than
to deal with frantic mid-ocean repairs. While troubleshooting
the scan line short I disassembled the unit at Amber's chart table,
removed the circuit board, and clipped power to the computer so
I could trace out problems with a voltmeter. Unfortunately, with
the board removed an important rotary switch could not be connected
into the system. I feared that with no rotary switch input presented,
another "impossible" condition, the firmware would go haywire and
make tabletop diagnosis difficult. In fact it worked even without
this input, indicating
that the designers realized that during repair the unit's mechanical
construction was such that no input could be expected. The code
must have assumed some reasonable default value instead of looping
for an input.
Sure, in real life most embedded systems don't have to run partially
dismantled. But remember that during production test and repair,
to say nothing of field repair, your carefully engineered package
might be violated. Technicians needing access to the components
will try to run it in pieces. If the system runs when opened up,
they'll have a much easier time probing with scopes and meters
to find faults.
Can your system run with important cables removed? What happens
if a cable is not connected when power is applied? Technicians
will try to connect as little as possible when troubleshooting
failed components. If the code won't run without a cable, they
might have to build extension wiring harnesses just to gain access
to the circuit boards. Certainly no one in the field will have
these harnesses. Where possible, make sure the code continues
to run in some fashion with some or all of the cables removed.
Informative Beeps
I've written extensively about software diagnostics in the past.
In the case of Amber's autopilot, an off-course alarm beeped incessantly
till I woke up and realized something was wrong. The unit gave
no help in figuring out just what the problem was, though a trivial
amount of code could have produced beep codes indicating which
switches seemed to be on. As it was, I spent most of a day isolating
the problem, not much fun in 10 foot seas.
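As a rough idea of how little code this takes, the C sketch below
turns the scanned switch bits into a list of beep counts: N beeps
means switch N appears to be on. The function name and the 8-switch
layout are hypothetical; driving the actual beeper is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill counts[] with one entry per asserted switch, where an entry
 * of N means "sound N beeps" for switch N (numbered from 1).
 * Returns how many entries were written. */
size_t beep_code_for_switches(uint8_t asserted, uint8_t counts[8])
{
    size_t n = 0;
    for (uint8_t i = 0; i < 8; i++) {
        if (asserted & (1u << i))
            counts[n++] = (uint8_t)(i + 1);
    }
    return n;
}
```

With bits 0 and 2 asserted the unit would sound one beep, pause,
then three beeps, pointing straight at the suspect switches.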
With no feedback from the microcontroller, it's awfully hard to
differentiate between switch, electronics, actuator, or flux-gate
compass failures. In the July 1990 issue of Embedded Systems Programming
I wrote about using an LED to blink error codes. Your high tech
Ford with an under-dash computer has such a self-test mode: short
two wires together and it will produce a two digit code indicating
what sort of failures are where. This is embedded systems programming
with style!
Sure, sometimes embedded systems are essentially disposable in
the event of failure. Mission-critical applications, though, must
be repairable, and demand firmware that helps the user even when
things fail.
It's important to sit in your customer's shoes when deciding what
is truly mission critical. If we hadn't been able to fix Amber's
autopilot, the two of us aboard would have had to steer, 24 hours
a day, for almost two weeks!
Software Failures
Similarly, never assume that the software is entirely glitch-free.
Yes, even your meticulously maintained and painfully debugged
code could very well harbor a latent problem. Even small embedded
systems are now getting frightfully complicated, making proving
software correctness all but impossible. After fixing a hundred
bugs, are you really sure there's not one or two obscure ones
still left?
It would be nice to write code that can survive any sort of software
bug but surely this is impossible. However, with a little forethought
you can usually craft firmware that, by its design, is robust
enough to handle many sorts of faults.
Always write exception handlers. The 80x88 traps on divide overflows,
yet a staggering number of DOS applications exit to the operating
system on a divide fault. Can you really guarantee that your application
will never do a divide by zero? Spend the few minutes needed to
write a short routine to gracefully recover from unanticipated
division problems. Other processors trap on other sorts of errors.
Always fill these trap vectors with some sort of recovery routine.
If recovery is truly impossible (as with a memory error) it might
make sense to report the problem and at least restart the code.
Any sort of service is better than leaving the vector unused,
a sure way to turn a little software bug into a dramatic crash.
Fill unused ROM and even RAM locations with a single byte opcode
that traps to a particular address, and then put a handler there.
For instance, the Z80/64180 goes to location 0038h when executing
the RST 38h (FF) opcode, called RST 7 in 8080 notation; the 80x88
picks up a vector at 000Ch after executing INT 3 (CC). The handler
should try to recover gracefully, perhaps by re-entering the
program's main loop or even by restarting
the code. This approach gives the code a prayer of recovering despite
momentary hardware or software glitches that make the firmware
"wander off". Wandering code will likely wind up in
the middle of data or even in the middle of a multi-byte opcode.
There's not much we can do about this, but filling ROM and RAM
with a one-byte trap will improve the recovery odds quite a bit.
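The fill itself is usually done when the ROM image is built. A
minimal sketch in C of such a build-time step is shown here; the
function name and the idea of padding a linear image buffer are
assumptions, and the Z80 value is used only as an example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Z80 one-byte restart to address 0038h. */
#define Z80_RST_38H 0xFFu

/* Pad a ROM image: every byte from `used` up to `rom_size` becomes
 * the one-byte trap opcode, so stray execution in that area lands in
 * the trap handler. Bytes below `used` are left untouched. */
void fill_unused_rom(uint8_t *image, size_t used, size_t rom_size,
                     uint8_t trap_opcode)
{
    if (used < rom_size)
        memset(image + used, trap_opcode, rom_size - used);
}
```

On the Z80 this is nearly free, since an erased EPROM already reads
FF; the real work is putting a handler at 0038h.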
Be sure you can disable this extra-robust code during debugging.
You don't want these routines to mask real problems. Use a conditional
compile or runtime switch to vector error conditions to a breakpoint.
Similarly, during debugging always set your emulator, simulator,
or whatever to break on any access to unused locations. Otherwise,
how can you be sure the code isn't banging on locations it shouldn't
be? This is always a sign of a latent problem. I often hear from
folks whose software runs fine from system ROM but not from emulator
RAM, a sure sign of rogue code that is writing into code space.
Really complex loops always hold potential for locking up a system.
The world is indeed growing ever more complex, and our embedded
systems reflect this. Some equipment solves torturously difficult
series of equations before producing a result. Often, iterative
instead of deterministic algorithms are used to reduce matrices
or converge a series or integral. For example, Newton's method
involves solving the same equation repeatedly using the answer
from step "n" as the input to step "n+1",
continuing until the errors are below some arbitrary value. What
if the input data is such that a solution cannot be found within
specified precision? Sometimes iterative solutions can actually
start to diverge, rather than converge, making a solution impossible.
Iterative algorithms are fine as long as the software is smart
enough to detect that a solution is unlikely, and then give the
user some options. Locking up into an infinite loop is always
unacceptable.
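A sketch of what "smart enough" might look like for Newton's method
follows, in C. Nothing here comes from a real product: the status
names, the iteration cap, and the rule of giving up after three
successively larger steps are assumptions chosen for illustration.
The two demo functions exist only to exercise the solver.

```c
#include <math.h>

typedef enum { SOLVE_OK, SOLVE_NO_CONVERGENCE, SOLVE_DIVERGED } solve_status;
typedef double (*fn1)(double);

/* Newton's method with two escape hatches: a hard iteration cap, and
 * a divergence test that quits when the step size grows three times
 * in a row. On SOLVE_OK the answer is stored in *root. */
solve_status newton_bounded(fn1 f, fn1 dfdx, double x0, double tol,
                            int max_iter, double *root)
{
    double x = x0;
    double prev_step = INFINITY;
    int growing = 0;

    for (int i = 0; i < max_iter; i++) {
        double slope = dfdx(x);
        if (slope == 0.0 || !isfinite(slope))
            return SOLVE_DIVERGED;          /* cannot take a step */

        double step = f(x) / slope;
        x -= step;
        if (!isfinite(x))
            return SOLVE_DIVERGED;

        if (fabs(step) < tol) {
            *root = x;
            return SOLVE_OK;
        }
        if (fabs(step) > prev_step) {
            if (++growing >= 3)
                return SOLVE_DIVERGED;      /* steps keep getting bigger */
        } else {
            growing = 0;
        }
        prev_step = fabs(step);
    }
    return SOLVE_NO_CONVERGENCE;            /* ran out of iterations */
}

/* Demo: x*x - 2 converges to sqrt(2) from x0 = 1. */
double demo_parabola(double x)       { return x * x - 2.0; }
double demo_parabola_slope(double x) { return 2.0 * x; }

/* Demo: x/(1+x*x) sends Newton's method off to infinity from x0 = 2. */
double demo_flat(double x)       { return x / (1.0 + x * x); }
double demo_flat_slope(double x)
{
    double d = 1.0 + x * x;
    return (1.0 - x * x) / (d * d);
}
```

Either failure status gives the caller a chance to tell the user
what happened and offer a choice, instead of spinning forever.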
On this transatlantic voyage our GPS hung several times trying to reduce crummy
data from weak signals or marginal satellite geometry. Worse,
even the software-controlled power switch wouldn't work when stuck
in this loop. The designers left no option but to remove the unit's
batteries, wait 30 minutes (!) and then restart it from scratch.
Of course, after a half hour without batteries we had to reload
dozens of setup parameters. Ironically, the restart required us
to figure our position with the centuries old method of celestial
navigation and preload the position into the GPS. A much better
design would make the iterative loop read the keypad and exit
when a key is pressed.
An even better approach might have been to use a real time operating
system, with one task always reading keys in the background. An
OS that runs some sort of keypad task will inherently prevent
well-behaved code from getting into un-exitable infinite loops.
Far too many years ago I worked on an 8008 based instrument that
used a Gauss-Seidel iteration to produce an answer. We programmed
it to escape the loop if the iteration proceeded for 20 minutes
without a solution (computers were a lot slower then). In this
case 7 segment LEDs displayed "HELP" to let the user
know no solution was possible. Years passed and the code was obsoleted
by an algorithm that converged quickly, every time. Memories of
the earlier version faded. One day an ashen faced technician came
to me and explained that he was repairing a very old unit. While
fiddling with it, it started flashing "HELP HELP", confirming
his long-held belief in the supernatural.
Never, never shut the user down. He bought your product to do
something. Try to keep the widget at least partially operational
no matter what might go wrong.
Brownouts
Embedded systems often quietly compute in the background, day
in and day out. You might be willing to re-setup a lab instrument
if a power outage caused the unit to reset, but this just is not
acceptable in a lot of other applications. I often wonder why
we put up with resetting every digital clock in the house after
even a 1 second power failure - in this day and age of CMOS there
is no technical reason why they shouldn't keep track of time for
at least a few minutes.
With the grid getting ever more overloaded we must expect line
power based equipment to have to deal with regular power shortages.
While it might be unreasonable to expect an embedded system to
continue operating without power, I do feel that some equipment
should at least reset to a reasonable mode when power is re-applied.
For example, a remote data acquisition site should start acquiring
data as soon as power is restored, rather than enter some sort
of setup mode. There may be no user available to press the "start"
key.
Can your critical equipment come back up without human intervention?
If this is an important design criterion, be sure the code recognizes
that the unit was at one point alive. If important variables are
protected in battery-backed RAM, then in most cases it's easy
to resume operation automatically. Be sure to maintain a checksum
of the really important parameters so the code knows if the machine's
data is intact.
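A minimal sketch of this idea in C is shown below. The structure
layout, field names, magic number, and simple inverted byte sum are
all assumptions for illustration; a real design might prefer a CRC.
The inverted sum and the magic value together keep an all-zero
(dead battery) RAM from passing as valid.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block kept in battery-backed RAM. Fields are ordered
 * so the structure has no padding bytes before the checksum. */
typedef struct {
    uint32_t magic;           /* marks "this unit was alive"        */
    uint32_t log_tenths_nm;   /* distance run, in tenths of a mile  */
    int16_t  course_deg;      /* last commanded course              */
    uint16_t checksum;        /* covers everything above it         */
} saved_state_t;

#define STATE_MAGIC 0x4C495645u

/* Inverted byte sum over every field that precedes the checksum. */
static uint16_t state_sum(const saved_state_t *s)
{
    const uint8_t *p = (const uint8_t *)s;
    uint16_t sum = 0;
    for (size_t i = 0; i < offsetof(saved_state_t, checksum); i++)
        sum = (uint16_t)(sum + p[i]);
    return (uint16_t)~sum;
}

/* Call after every update to the saved parameters. */
void state_seal(saved_state_t *s)
{
    s->magic = STATE_MAGIC;
    s->checksum = state_sum(s);
}

/* Call at power-up: true means resume, false means start fresh. */
bool state_is_intact(const saved_state_t *s)
{
    return s->magic == STATE_MAGIC && s->checksum == state_sum(s);
}
```

At reset the startup code would check state_is_intact() first and
resume the saved course or log reading when it returns true.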
On our sail we ran all of the boat's equipment from a pair of
12 volt batteries. Once a day we'd fire up the diesel to recharge
the cells. If we weren't careful to switch a full battery on-line
before cranking the engine, then the tremendous amount of current
needed by the starter motor dragged the entire 12V system down
to 8V or so, which reset every piece of electronics with an embedded
computer. None of the equipment was smart enough to carry on without
our help. We would be forced to reenter a course into the autopilot,
restart the radar, etc. The Loran was especially frustrating,
as sometimes we had to re-enter our initial position. In short,
no piece of embedded electronics was smart enough to remember
that it had been on after a brownout (a not unusual condition
on a cruising sailboat).
Especially frustrating was the digital log, which uses a tiny
paddlewheel to track roughly how many miles we've sailed. Even
the shortest power glitch made this unit reset distance traveled
to zero, which really interfered with navigation. I got in the
habit of writing down its distance reading before starting the
engine, and then manually accumulating these offsets. Someday
I'll put a diode and big capacitor in its power line, but a little
better design would have given this vendor a much happier customer.
An old business adage advises one to "stick to one's knitting"
- develop and sell products to markets you truly understand. If
you don't comprehend what your user expects, and haven't got lots
of experience operating in the industry, then you cannot make
a product that will really satisfy him. Make sure your widget
is designed to satisfy the user's real needs, in his real operating
environment.
Non-embedded Blues
Despite various frustrations with the boat's processors, the only
true tragedy struck our single non-embedded system, a DOS laptop
that suffered a fatal attack of salt water corrosion early on.
After years of leaning on a word processor crutch, it was a shock
to revert to archaic pen and paper. They say Mozart wrote his
music essentially without corrections - I wonder how true this
would have been if he worked on a MIDI machine with graphic editing.
But wait! The white whale's to windward! No more of this dull
plodding - helm alee!