Understand your User's Needs
Call me Ishmael - Lessons from failures on a small boat at sea.
Published in Embedded Systems Programming
Call me Ishmael. I'm writing this in mid-Atlantic, bound
from Baltimore to Grand Turk aboard my 32 foot sailboat. Like Melville, I find
relief from the pressures of modern life by chasing adventure at sea.
Voyager, though 28 years old, nevertheless is a child of
the microprocessor revolution. Over the years I've added a lot of electronics to
make sailing easier and safer; each addition brings yet one more embedded system
aboard. With the exception of the beautifully designed digital VHF radio, every
microprocessor-based product on the boat suffers from one or more design defects
that erode the equipment's usefulness just a bit.
The environment on a small boat far offshore couldn't be
worse for electrical items. Salt spray and high humidity insidiously find their
way inside even the best enclosures, rapidly corroding every soldered
connection. Switch contacts are the first to go, followed by connectors and the
tracks on PC boards. A partial solution is to use only the finest gold contacts,
an obvious approach all too few vendors employ.
While the marine environment is perhaps a bit extreme,
every system we build is subject to some level of mechanical and electrical
failure. Even in the most benign laboratory conditions contacts get dirty. It
makes sense to design code that will work in at least some fashion if, say, a
switch fails. In situations where failures are likely or inevitable, a wise
designer will devise software solutions, even if the system cannot continue to
run with complete functionality.
Given that dirty or corroded contacts are a perennial
source of trouble, firmware should always check input bits for validity.
Obviously, if only a single switch should ever be pressed at a time, then by all
means don't accept conditions where several are asserted. But be kind to your
users and take some reasonable action faced with unreasonable inputs. Can the
system ignore the extra bit and carry on? If the software sees a switch
continuously pressed, it may make sense to assume there's a dufus user or failed
hardware.
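As a minimal sketch of that kind of sanity checking, assume the scanned
switches arrive as one bit per switch in a byte. The names (key_filter,
STUCK_SCANS and so on) and the 500-scan limit are made up for illustration;
they are not taken from any real product.

```c
#include <stdint.h>

/* Hypothetical limit: a key held for more scans than this is treated as stuck. */
#define STUCK_SCANS 500u

typedef enum { KEY_NONE, KEY_VALID, KEY_MULTIPLE, KEY_STUCK } key_status_t;

typedef struct {
    uint8_t  last_mask;   /* mask seen on the previous scan */
    uint16_t held_scans;  /* consecutive scans showing the same nonzero mask */
} key_filter_t;

/* Classify one raw scan. 'raw' has one bit per switch, 1 = pressed.
   *key_out is written only when the result is KEY_VALID. */
key_status_t key_filter(key_filter_t *f, uint8_t raw, uint8_t *key_out)
{
    if (raw != 0 && raw == f->last_mask) {
        if (f->held_scans < UINT16_MAX) {
            f->held_scans++;
        }
    } else {
        f->held_scans = (raw != 0) ? 1 : 0;
    }
    f->last_mask = raw;

    if (raw == 0) {
        return KEY_NONE;
    }
    if (raw & (raw - 1)) {
        return KEY_MULTIPLE;      /* more than one bit set: impossible input */
    }
    if (f->held_scans > STUCK_SCANS) {
        return KEY_STUCK;         /* held far longer than any user would */
    }
    *key_out = raw;
    return KEY_VALID;
}
```

The caller acts only on KEY_VALID. KEY_MULTIPLE and KEY_STUCK are the
"unreasonable inputs": ignore them, keep the last good setting, and perhaps
raise a fault indication.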
A lot of systems use a debouncing algorithm that loops
forever if an input is shorted. Don't let a simple failure shut down the entire
product!
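One way to avoid the trap is to debounce with a small state machine called
from a periodic tick, so nothing ever waits on the input. This is only a
sketch; debounce_step and STABLE_TICKS are hypothetical names, and the tick
count is an arbitrary choice.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical: number of consecutive differing samples needed to accept a change. */
#define STABLE_TICKS 4u

typedef struct {
    bool    state;   /* current debounced level */
    uint8_t count;   /* consecutive samples that disagree with 'state' */
} debouncer_t;

/* Call once per timer tick with the raw pin level. Never blocks:
   a shorted input simply reads as "pressed" forever while the rest
   of the system keeps running. */
bool debounce_step(debouncer_t *d, bool raw)
{
    if (raw == d->state) {
        d->count = 0;                        /* nothing pending */
    } else if (++d->count >= STABLE_TICKS) {
        d->state = raw;                      /* held long enough: accept it */
        d->count = 0;
    }
    return d->state;
}
```

Because the routine returns immediately on every call, a shorted contact
costs one bad input rather than the whole product.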
For example, on this trip Voyager's digital autopilot went
insane. God knows how long we went in circles till I woke up and realized there
was a problem. After an entire day of tracing the circuit and looking for the
source of the trouble I found that the front panel switches were wired in
parallel with an unused external connector. All were arranged in a matrix
scanned by the unit's 8051. These course setting switches are used one at a time
- never should more than one be pressed, and a user would never hold a switch in
for more than a few seconds. The relentless sea found its way into the O-ring
sealed computer module and created a high-impedance short between scan lines.
The code was too simple-minded to reject the impossible signals it received, and
bizarrely steered us in circles.
Such poor code is inexcusable, as autopilots are famous for
suffering corrosion problems. Wealthier sailors usually carry three or four
units in hopes that one will survive a trip. Smarter software could help keep
customers a lot happier. The engineering costs will be a bit higher, but the
extra software costs nothing in production if ROM space is available. If it
isn't, the company must weigh the cost of unhappy customers against a
microcontroller with more program space.
Still, the designers had addressed a similar problem,
though perhaps more to satisfy their own internal production requirements than
to deal with frantic mid-ocean repairs. While troubleshooting the scan line
short I disassembled the unit, removed the circuit board, and clipped power to
the computer so I could trace out problems with a voltmeter. Unfortunately, with
the board removed an important rotary switch could not be connected into the
system. I feared that with no rotary switch input present, another
"impossible" condition, the autopilot's firmware would go haywire,
making tabletop diagnosis difficult. In fact it worked even without this input,
indicating that the designers realized that during repair the unit's mechanical
construction was such that no input could be expected. The code must have
assumed some reasonable default value instead of looping for an input.
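I never saw the autopilot's source, so the following is only a guess at the
general idea, not the vendor's method. It assumes a one-hot rotary switch
(one bit per detent); rotary_decode, ROTARY_POSITIONS and ROTARY_DEFAULT are
invented for the sketch.

```c
#include <stdint.h>

#define ROTARY_POSITIONS 6u   /* hypothetical number of detents */
#define ROTARY_DEFAULT   0u   /* hypothetical "safe" position used at power-up */

/* 'raw' carries one bit per detent, 1 = wiper on that contact.
   Exactly one bit should be set. Anything else (no bits: switch
   unplugged; several bits: a short) returns the last good position
   instead of waiting for a legal reading. */
uint8_t rotary_decode(uint8_t raw, uint8_t last_good)
{
    for (uint8_t pos = 0; pos < ROTARY_POSITIONS; pos++) {
        if (raw == (uint8_t)(1u << pos)) {
            return pos;
        }
    }
    return last_good;
}
```

Seed last_good with ROTARY_DEFAULT at startup and the unit keeps running on
the bench even with the switch disconnected.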
Sure, in real life most embedded systems don't have to run
partially dismantled. Always remember that during production test and repair, to
say nothing of field repair, your carefully engineered package might be
violated. Technicians will run the system with boards hanging out and connectors
dangling. If the system runs when opened up, they'll have a much easier time
probing with scopes and meters to find faults.
Can your system run with important cables removed? What
happens if a cable isn't connected when power is applied? If the code won't run
without a cable, a technician might have to build extension wiring harnesses
just to gain access to the circuit boards. Certainly no one in the field will
have these harnesses. Where possible, make sure the code continues to run in
some fashion with some or all of the cables removed.
Informative Beeps
I've written extensively about software diagnostics in the
past. In the case of our autopilot, an off-course alarm beeped incessantly till
I woke up and realized something was wrong. The unit gave no help in figuring
out just what the problem was, though a trivial amount of code could have
produced beep codes indicating which switches seemed to be on. As it was, I
spent most of a day isolating the problem, not much fun in heavy seas.
With no feedback from the microcontroller, it's awfully
hard to differentiate between switch, electronics, actuator, or flux-gate
compass failures. Why not use an LED to blink error codes? Your Ford has such a
self-test mode: short two wires together and it will produce a two digit code
indicating what sort of failures are where. This is embedded systems programming
with style!
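The beep-code idea really is a trivial amount of code. Here is a sketch,
assuming the suspect switches are available as a bit mask; beep_groups is a
hypothetical helper, and actually driving the beeper or LED is left to the
caller.

```c
#include <stdint.h>
#include <stddef.h>

/* For each asserted bit in 'mask', append (bit index + 1) to 'beeps'.
   Returns the number of groups written, never more than 'cap'.
   The caller sounds each group as that many short beeps, then a pause. */
size_t beep_groups(uint8_t mask, uint8_t *beeps, size_t cap)
{
    size_t n = 0;
    for (uint8_t bit = 0; bit < 8 && n < cap; bit++) {
        if (mask & (1u << bit)) {
            beeps[n++] = (uint8_t)(bit + 1);
        }
    }
    return n;
}
```

With a mask of 0x12 (switches 2 and 5 reading closed) the result is two
groups, 2 and 5: beep-beep, pause, then five beeps. That is enough to tell a
sleepy sailor which scan lines to look at first.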
Sure, sometimes embedded systems are essentially disposable
in the event of failure. Mission-critical applications, though, must be repairable, and
demand firmware that helps the user even when things fail. It's important to sit
in your customer's shoes when deciding what is truly mission critical. If we
couldn't fix Voyager's autopilot the two of us aboard would have had to steer,
24 hours a day, for almost two weeks! That's too much like working.
Software Failures
Similarly, never assume that the software is entirely
glitch-free. Yes, even your meticulously maintained and painfully debugged code
could very well harbor a latent problem. Even small embedded systems are now
getting frightfully complicated. When programs were 4k long it was reasonable to
demand bug-free code. Today's multi-megaline systems will always have some
lurking bugs.
It would be nice to write code that can survive any sort of
software bug but surely this is impossible. However, with a little forethought
you can usually craft firmware that, by its design, is robust enough to handle
many sorts of faults.
Always write exception handlers. You might not expect a
divide overflow or a spurious interrupt, but strange stuff does happen. The
unexpected turns out to be more likely than one would think. Test those
routines carefully. I'm giving a talk at the Boston ESC this month about
embedded disasters, and barely-tested exception handlers that don't quite work
right are at least partially responsible for over half of the examples I'll
discuss.
If the error is one for which there's no decent recovery
strategy - like a memory error - it might make sense to report the problem and
at least restart the code. Any sort of service is better than a dramatic crash.
Fill unused ROM and even RAM locations with a single byte
opcode that traps to a particular address, and then put a handler there. For
instance, the Z80/64180 goes to location 0x38 when executing the RST7 (0xFF)
opcode; the 80x88 picks up a vector at 0x0C after executing INT3 (0xCC). The handler
should try to recover gracefully, perhaps by re-entering the program's main
loop or even by restarting the code. This approach gives the code a prayer of
recovering despite momentary hardware or software glitches that make the
firmware "wander off". Wandering code will likely wind up in the middle of data
or even in the middle of a multi-byte opcode. There's not much we can do about
this, but filling ROM and RAM with a one-byte trap will improve the recovery
odds quite a bit.
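Most linkers and PROM programmers can do the fill for you, but the idea is
simple enough to sketch as a host-side routine. build_rom_image is a
hypothetical name, and the 0xFF fill value assumes the Z80 case above.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define TRAP_OPCODE 0xFFu   /* RST7 on the Z80; pick your own CPU's one-byte trap */

/* Copy 'code' to the start of 'rom' and pad every remaining byte with
   the trap opcode. Returns 0 on success, -1 if the code does not fit. */
int build_rom_image(uint8_t *rom, size_t rom_size,
                    const uint8_t *code, size_t code_len)
{
    if (code_len > rom_size) {
        return -1;
    }
    memcpy(rom, code, code_len);
    memset(rom + code_len, TRAP_OPCODE, rom_size - code_len);
    return 0;
}
```

Conveniently, 0xFF is also the erased state of most EPROM and Flash parts,
so on a Z80 the unused space often ends up filled with traps even if you do
nothing. The handler at 0x38 is the part that still has to be written.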
Be sure you can disable this extra-robust code during
debugging. You don't want these routines to mask real problems. Use a
conditional compile or runtime switch to vector error conditions to a
breakpoint.
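A conditional compile along these lines would do it. Everything here is an
assumption made for illustration: the DEBUG_BUILD macro, the on_trap name,
and the policy of escalating to a full restart after three traps.

```c
#include <stdint.h>

typedef enum {
    TRAP_HALT_FOR_DEBUGGER,   /* stop dead so the bug can be examined */
    TRAP_REENTER_MAIN,        /* jump back to the top of the main loop */
    TRAP_RESTART              /* full restart of the firmware */
} trap_action_t;

/* Hypothetical policy: after this many traps, stop trying the gentle fix. */
#define TRAP_RESTART_THRESHOLD 3u

static uint8_t trap_count;

/* Decide what the trap vector should do. The vector code itself
   (stack reset, jump, breakpoint instruction) is CPU-specific and
   is not shown. */
trap_action_t on_trap(void)
{
#ifdef DEBUG_BUILD
    return TRAP_HALT_FOR_DEBUGGER;    /* never mask a bug while debugging */
#else
    if (++trap_count >= TRAP_RESTART_THRESHOLD) {
        trap_count = 0;
        return TRAP_RESTART;
    }
    return TRAP_REENTER_MAIN;
#endif
}
```

Build with DEBUG_BUILD defined on the bench and every stray jump stops at a
breakpoint. Ship without it and the same event becomes a quiet recovery.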
Similarly, during debugging always set your emulator,
simulator, or whatever to break on any access to unused locations. Otherwise,
how can you be sure the code isn't banging on locations it shouldn't be? This is
always a sign of a latent problem. I often hear from folks whose software runs
fine from system ROM but not from emulator RAM, a sure sign of rogue code that
is writing into code space. Over the last few years half of the systems I've
examined do spurious reads and writes, sure signs of a latent bug.
Really complex loops always hold potential for locking up a
system. The world is indeed growing ever more complex, and our embedded systems
reflect this. Some equipment solves torturously difficult series of equations
before producing a result. For instance, sometimes we use iterative rather than
closed-form algorithms
deterministic algorithms to reduce matrices or converge a series. Newton's
method involves solving the same equation repeatedly using the answer from step
"n" as the input to step "n+1", continuing until the errors are below some
arbitrary value. What if the input data is such that a solution cannot be found
within specified precision? Sometimes iterative solutions can actually start to
diverge, rather than converge, making a solution impossible. Iterative
algorithms are fine as long as the software is smart enough to detect that a
solution is unlikely, and then give the user some options. Locking up into an
infinite loop is always unacceptable.
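Here is a sketch of Newton's method with both safeguards: an iteration
budget, and a crude divergence test that gives up when the correction grows
for several steps in a row. The names, the DIVERGE_LIMIT value and the two
demo functions are illustrative assumptions, not a recipe from any
particular product.

```c
#include <math.h>   /* isfinite and HUGE_VAL only; no libm calls needed */

typedef enum { SOLVE_OK, SOLVE_NO_CONVERGENCE, SOLVE_DIVERGED } solve_status_t;
typedef double (*fn_t)(double);

#define DIVERGE_LIMIT 3   /* consecutive growing steps before giving up */

/* Newton's method: x[n+1] = x[n] - f(x[n]) / f'(x[n]).
   Stops when the step is smaller than 'tol', when the budget of
   'max_iter' steps is spent, or when the steps keep growing. */
solve_status_t newton(fn_t f, fn_t df, double x0, double tol,
                      int max_iter, double *root)
{
    double x = x0;
    double prev_mag = HUGE_VAL;
    int growing = 0;

    for (int i = 0; i < max_iter; i++) {
        double slope = df(x);
        if (slope == 0.0 || !isfinite(slope)) {
            return SOLVE_DIVERGED;            /* no usable step from here */
        }
        double step = f(x) / slope;
        double mag = (step < 0.0) ? -step : step;
        x -= step;
        if (!isfinite(x)) {
            return SOLVE_DIVERGED;
        }
        if (mag < tol) {
            *root = x;
            return SOLVE_OK;
        }
        growing = (mag > prev_mag) ? growing + 1 : 0;
        if (growing >= DIVERGE_LIMIT) {
            return SOLVE_DIVERGED;            /* getting worse, not better */
        }
        prev_mag = mag;
    }
    return SOLVE_NO_CONVERGENCE;              /* budget spent */
}

/* Demo functions: x*x - 2 converges to sqrt(2); x / (1 + x*x) started
   at x = 2 sends Newton's method running off toward infinity. */
double f_sqrt2(double x)  { return x * x - 2.0; }
double df_sqrt2(double x) { return 2.0 * x; }
double f_hump(double x)   { return x / (1.0 + x * x); }
double df_hump(double x)  { double d = 1.0 + x * x; return (1.0 - x * x) / (d * d); }
```

Whatever the status, the function returns. What the product does with
SOLVE_DIVERGED (retry with a different starting guess, fall back to the last
good answer, tell the user) is a design decision, but at least there is a
decision to make.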
On this voyage our GPS hung several times trying to reduce
crummy data from weak signals or marginal satellite geometry. Worse, even the
software-controlled power switch wouldn't work when stuck in this loop. The
designers left no option but to remove the unit's batteries, wait 30 minutes (!)
and then restart it from scratch. Of course, after a half hour without batteries
we had to reload dozens of setup parameters. Ironically, the restart required us
to figure our position with the centuries old method of celestial navigation and
preload the position into the GPS. A much better design would make the iterative
loop read the keypad and exit when a key is pressed.
An even better approach might have been to use a real time
operating system, with one task always reading keys in the background. An OS
that runs some sort of keypad task will inherently prevent well-behaved code
from getting into un-exitable infinite loops.
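The simpler of the two fixes, checking for a keypress inside the loop, might
look like the sketch below. I have no idea what the GPS actually computes, so
a small Gauss-Seidel solver stands in for it. The abort flag is assumed to be
set by a keypad interrupt or background task, and all names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define GS_N 3   /* fixed system size, just for the sketch */

typedef enum { FIX_OK, FIX_ABORTED, FIX_TIMED_OUT } fix_status_t;

/* Gauss-Seidel iteration on a*x = b, refining x in place.
   Checks *abort_flag once per sweep and never runs more than
   'max_sweeps' sweeps, so the loop always has a way out. */
fix_status_t gauss_seidel(const double a[GS_N][GS_N], const double b[GS_N],
                          double x[GS_N], double tol, unsigned max_sweeps,
                          const volatile bool *abort_flag)
{
    for (unsigned sweep = 0; sweep < max_sweeps; sweep++) {
        if (*abort_flag) {
            return FIX_ABORTED;               /* user pressed a key */
        }
        double worst = 0.0;
        for (size_t i = 0; i < GS_N; i++) {
            double sum = b[i];
            for (size_t j = 0; j < GS_N; j++) {
                if (j != i) {
                    sum -= a[i][j] * x[j];
                }
            }
            double xi = sum / a[i][i];
            double delta = (xi > x[i]) ? xi - x[i] : x[i] - xi;
            if (delta > worst) {
                worst = delta;
            }
            x[i] = xi;
        }
        if (worst < tol) {
            return FIX_OK;                    /* every unknown has settled */
        }
    }
    return FIX_TIMED_OUT;                     /* sweep budget exhausted */
}
```

On FIX_ABORTED or FIX_TIMED_OUT the unit can go back to its menu with all
setup parameters intact, instead of demanding a battery pull.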
Far too many years ago I worked on an 8008 based instrument
that used a Gauss-Seidel iteration to produce an answer. We programmed it to
escape the loop if the iteration proceeded for 20 minutes without a solution
(computers were a lot slower then). In this case 7 segment LEDs displayed "HELP"
to let the user know no solution was possible. Years passed and the code was
obsoleted by an algorithm that converged quickly, every time. Memories of the
earlier version faded. One day an ashen faced technician came to me and
explained that he was repairing a very old unit. While he fiddled with it, it
started flashing "HELP HELP", confirming his long-held belief in the
supernatural.
Never, never shut the user down. He bought your product to
do something. Try to keep the widget at least partially operational no matter
what might go wrong.
Brownouts
Embedded systems often quietly compute in the background,
day in and day out. You might be willing to re-setup a lab instrument if a power
outage caused the unit to reset, but this just is not acceptable in a lot of
other applications. I often wonder why we put up with resetting every digital
clock in the house after even a 1 second power failure - in this day and age
there is no technical reason why they shouldn't keep track of time for at least a
few minutes.
With the grid getting ever more overloaded we must expect
line power based equipment to have to deal with regular power shortages. While
it might be unreasonable to expect an embedded system to continue operating
without power, I do feel that some equipment should at least reset to a
reasonable mode when power is re-applied. For example, a remote data acquisition
site should start acquiring data as soon as power is restored, rather than enter
some sort of setup mode. There may be no user available to press the "start"
key.
Can your critical equipment come back up without human
intervention? If this is an important design criterion, be sure the code
recognizes that the unit was at one point alive. If important variables are
protected in Flash or battery-backed RAM, then in most cases it's easy to resume
operation automatically. Be sure to maintain a checksum of the really important
parameters so the code knows if the machine's data is intact.
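A sketch of the idea, using an autopilot-flavored state block. The field
names, the magic number and the choice of a CRC-16 (the common CCITT
polynomial 0x1021) are my assumptions; any decent checksum will do.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define STATE_MAGIC 0xA55Au   /* hypothetical marker for "block was written" */

/* Lives in battery-backed RAM or Flash. Fields are illustrative. */
typedef struct {
    uint16_t magic;
    uint16_t heading_deg;   /* commanded course */
    uint8_t  mode;
    uint8_t  was_running;   /* nonzero if the unit was engaged when power died */
    uint16_t checksum;      /* CRC of every byte before this field */
} saved_state_t;

/* Bitwise CRC-16, polynomial 0x1021, initial value 0xFFFF. */
static uint16_t crc16(const uint8_t *p, size_t n)
{
    uint16_t crc = 0xFFFF;
    while (n--) {
        crc ^= (uint16_t)((uint16_t)(*p++) << 8);
        for (int i = 0; i < 8; i++) {
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Call after every change to the saved parameters. */
void state_seal(saved_state_t *s)
{
    s->magic = STATE_MAGIC;
    s->checksum = crc16((const uint8_t *)s, offsetof(saved_state_t, checksum));
}

/* True only if the block was sealed and has not been corrupted since. */
bool state_valid(const saved_state_t *s)
{
    return s->magic == STATE_MAGIC &&
           s->checksum == crc16((const uint8_t *)s,
                                offsetof(saved_state_t, checksum));
}

/* At reset: resume on our own only if the data is intact and we were running. */
bool should_auto_resume(const saved_state_t *s)
{
    return state_valid(s) && s->was_running != 0;
}
```

At power-up the startup code calls should_auto_resume() first. If it returns
true the unit picks up where it left off; if not, it falls back to defaults
and the normal setup sequence.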
On our sail we run all of the boat's equipment from a pair
of 12V batteries, recharged daily by the diesel. If we were not careful to
switch a full battery on-line before cranking the engine, then the tremendous
amount of current needed by the starter motor dragged the entire 12V system down
to 8V or so, which reset every piece of electronics with an embedded computer.
None of the equipment was smart enough to carry on without our help. We would be
forced to reenter a course into the autopilot, restart the radar, etc. No piece
of embedded electronics was smart enough to remember that it had been on after a
brownout (a not unusual condition on a cruising sailboat).
An old business adage advises one to "stick to one's
knitting" - develop and sell products to markets you truly understand. If you
don't deeply understand what your user expects, and haven't got lots of
experience operating in the industry, then you cannot make a product that will
really satisfy him or her. Make sure your widget is designed to satisfy the
user's real needs, in the real gritty world, not just under lab conditions.
But wait! The white whale's to windward! No more of this
dull plodding - helm alee!