As Good As It Gets
How good does firmware have to be? How good can it
be? Is our search for perfection, or near-perfection, an exercise in futility?
Complex systems are a new thing in this world. Many of us
remember the early transistor radios which sported a half dozen active devices,
max. Vacuum tube televisions, common into the 70s, used 15 to 20 tubes, more or
less equivalent to about the same number of transistors. The 1940s-era ENIAC
computer required 18,000 tubes, so many that technicians wheeled shopping carts
of spares through the room, constantly replacing those that burned out. Though
that sounds like a lot of active elements, even the 25-year-old Z80 chip used a
quarter of that many transistors, in a die smaller than just one of the hundreds
of thousands of resistors in the ENIAC.
Now the Pentium IV, merely one component of a computer, has
45 million transistors. A big memory chip might require a third of a billion.
Intel predicts that later this decade their processors will have a billion
transistors. I'd guess that the very simplest of embedded systems, like an
electronic greeting card, requires thousands of active elements.
Software has grown even faster, especially in embedded
applications. In 1975, 10,000 lines of assembly code was considered huge. Given
the development tools of the day - paper tape, cassettes for mass storage, and
crude teletypes for consoles - working on projects of this size was very
difficult. Today 10,000 lines of C - representing perhaps three to five times as
much assembly - is a small program. A cell phone might contain a million lines
of C or C++, astonishing considering the device's small form factor and
minuscule power requirements.
Another measure of software size is memory usage. The 256
byte (that's not a typo) EPROMs of 1975 meant even a measly 4k program used 16
devices. Clearly, even small embedded systems were quite pricey.
Today? 128k of Flash is nothing, even for a tiny app. The switch from 8
to 16 bit processors, and then from 16 to 32 bitters, is driven more by
addressing space requirements than raw horsepower.
In the late 70s Seagate introduced the first small
Winchester hard disk, a 5 MB, 10-pound beauty that cost $1500. 5 MB was more disk
space than almost anyone needed. Now 20 GB fits into a shirt pocket, is almost
free, and fills in the blink of an eye.
So, our systems are growing rapidly in both size and
complexity. And, I contend, in failure modes. Are we smart enough to build these
huge applications correctly?
It's hard to make even a simple application perfect; big
ones will possibly never be faultless. As the software grows it inevitably
becomes more intertwined; a change
in one area impacts other sections, often profoundly. Sometimes this is due to
poor design; often, it's a necessary effect of system growth.
The hardware, too, is certainly a long way from perfect.
Even mature processors usually come with an errata sheet, one that can rival the
datasheet in size. The infamous Pentium divide bug was just one of many bugs -
even today the Pentium 3's errata sheet (renamed "specification update")
contains 83 issues. Motorola documents nearly a hundred problems in the MPC555.
I salute the vendors for making these mistakes public. Too
many companies frustrate users by burying their mistakes.
What is the current state of the reliability of embedded
systems? No one knows. It's an area devoid of research. Yet a lot of raw data
is available, some of which suggests we're not doing well.
The Mars Pathfinder mission succeeded beyond anyone's
dreams, despite a significant error that repeatedly reset the software after the
lander reached the surface. A priority inversion problem - noticed on Earth but
attributed to a glitch and ignored - caused numerous crashes. A well-designed
watchdog timer recovery strategy saved the mission. This was a very instructive
failure as it shows the importance of adding external hardware and/or software
to deal with unanticipated software errors.
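The recovery idea can be sketched in a few lines of C. This is a hypothetical software model of a watchdog, not Pathfinder's actual design: the names watchdog_t, watchdog_init, watchdog_kick and watchdog_tick are invented for illustration, and a real hardware watchdog is set up through device-specific registers. Healthy code restarts the countdown at regular intervals; if the code hangs and the kicks stop, the countdown expires and the system is forced to reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a watchdog timer. */
typedef struct {
    uint32_t timeout_ticks;   /* ticks allowed between kicks */
    uint32_t remaining;       /* ticks left before expiry */
} watchdog_t;

void watchdog_init(watchdog_t *wd, uint32_t timeout_ticks)
{
    assert(wd != NULL && timeout_ticks > 0);
    wd->timeout_ticks = timeout_ticks;
    wd->remaining = timeout_ticks;
}

/* Called by healthy application code to restart the countdown. */
void watchdog_kick(watchdog_t *wd)
{
    assert(wd != NULL);
    wd->remaining = wd->timeout_ticks;
}

/* Called once per timer tick. Returns true once the countdown has
 * expired, meaning the caller should force a system reset. */
bool watchdog_tick(watchdog_t *wd)
{
    assert(wd != NULL);
    if (wd->remaining > 0) {
        wd->remaining--;
    }
    return wd->remaining == 0;
}
```

In a real system watchdog_tick would run from a timer the application cannot disable, and the kick would be issued only after the main loop has confirmed that its critical tasks actually ran.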
The August 15, 2001 issue of the Journal of the American
Medical Association contained a study of recalls of pacemakers and implantable
cardioverter-defibrillators. (Since these devices are implanted subcutaneously, I
can't imagine how a recall works.) Surely designers of these devices are on
the cutting edge of building the very best software. I hope. Yet between 1990
and 2000 firmware errors accounted for about 40% of the 523,000 devices
recalled.
Over the ten years of the study, of course, we've learned
a lot about building better code. Tools have improved and the amount of real
software engineering that takes place is much greater. Or so I thought. Turns
out that the annual number of recalls between 1995 and 2000 increased.
In defense of the pacemaker developers, no doubt they solve
very complex problems. Interestingly, heart rhythms can be mathematically
chaotic. A slight change in stimulus can cause the heartbeat to burst into quite
unexpected randomness. And surely there's a wide distribution of heart
behavior in different patients.
Perhaps a QA strategy for these sorts of life-critical
devices should change. What if the responsible
person were one with heart disease, who had to use the latest widget before
release to the general public?
A pilot friend tells me the 747 operator's manual is a
massive tome that describes everything one needs to know about the aircraft and
its systems. He says that fully half of the book documents avionics (read:
software) errors and workarounds.
The Space Shuttle's software is a glass
half-empty/half-full story. It's probably the best code ever written, with an
average error rate of about one per 400,000 lines of code. The cost: $1000 per
line. So, it is possible to write great code, but despite paying vast sums
perfection is still elusive. Like the 747, though, the stuff works "good
enough", which is perhaps all we can ever expect.
Is this as good as it gets?
The Human Factor
Let's remember we're not building systems that live in
isolation. They're all part of a much more complex interacting web of other
systems, not the least of which is the human operator or user. When tools were
simple - like a hammer or a screwdriver - there weren't a lot of complex
failure modes. That's not true anymore. Do you remember the USS Vincennes? She
is a US Navy battle cruiser, equipped with the incredibly sophisticated Aegis
radar system. In July 1988 the cruiser shot down an Iranian airliner over the
Persian Gulf, killing all 290 people on board. Apparently the system knew that
the target wasn't an incoming enemy warplane, but the data was displayed on a
number of terminals that weren't easy to see. So here's a case where the
system worked as designed, but the human element created a terrible failure. Was
the software perfect since it met the requirements?
Unfortunately, airliners have become common targets for
warplanes. This past October a Ukrainian missile apparently shot down a Sibir
Tu-154 commercial jet, killing all 78 passengers and crew. As I write the cause
is unknown, or unpublished, but local officials claim the missile had been
targeted on a close-by drone. It missed, flying 150 miles before hitting the
jet. Software error? Human error?
The war in Afghanistan shows the perils of mixing men and
machines. At least one smart bomb missed its target and landed on civilians. US
military sources say wrong target data was entered. Maybe that means someone
keyed in wrong GPS coordinates. It's easy to blame an individual for
mistyping, but doesn't it make more sense to look at the entire system as a
whole, including bomb and operator? Bombs have pretty serious safety-critical
aspects. Perhaps a better design would accept targeting parameters in a string
that includes a checksum, rather like credit card numbers. A mis-keyed entry
would be immediately detected by the machine.
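As a minimal sketch of that idea, here is the Luhn check, the scheme credit card numbers use, written in C. The function name luhn_valid and the assumption that a targeting entry is keyed in as a plain string of digits ending in a check digit are illustrative only; a real system would define its own entry format and might choose a stronger code.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if the digit string passes the Luhn check. The last
 * character is the check digit. Any non-digit character, or a string
 * shorter than two digits, is rejected. */
bool luhn_valid(const char *entry)
{
    assert(entry != NULL);

    size_t len = strlen(entry);
    if (len < 2) {
        return false;
    }

    unsigned sum = 0;
    bool double_it = false;      /* the check digit itself is not doubled */

    /* Walk from the rightmost digit to the leftmost. */
    for (size_t i = len; i-- > 0; ) {
        if (!isdigit((unsigned char)entry[i])) {
            return false;
        }
        unsigned d = (unsigned)(entry[i] - '0');
        if (double_it) {
            d *= 2;
            if (d > 9) {
                d -= 9;          /* same as adding the two digits of d */
            }
        }
        sum += d;
        double_it = !double_it;
    }
    return (sum % 10) == 0;
}
```

The Luhn scheme catches every single mis-keyed digit and most swaps of adjacent digits (it misses 09 keyed as 90), so the machine can reject a bad entry before it is ever used.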
It's well-known that airplanes are so automated that on
occasion both pilots have slipped off into sleep as the craft flies itself.
Actually, that doesn't really bother me much, since the autopilot beeps when
at the destination, presumably waking the crew. But before leaving, the fliers
enter the destination in latitude/longitude format into the computers. What if
they make a mistake (as has happened)? Current practice requires pilot and
co-pilot to check each other's entries, which will certainly reduce the chance
of failure. Why not use checksummed data instead and let the machine validate
the data?
Another US vessel, the Yorktown, is part of the Navy's
"Smart Ship" initiative. Hugely automating the engineering (propulsion)
department reduces crew needs by 10% and saves some $2.8 million per year on
this one ship. Yet the computers create new vulnerabilities. Reports suggest
that an operator entered an incorrect parameter which resulted in a
divide-by-zero error. The entire network of Windows NT machines crashed. The
Navy claims the ship was dead in the water for about three hours; other sources
(http://www.gcn.com/archives/gcn/1998/july13/cov2.htm)
claim it was towed into port for two days of system maintenance. Users are now
trained to check their parameters more carefully. I can't help but wonder what
happens in the heat of battle, when these young sailors may be terrified, with
smoke and fire perhaps raging. How careful will the checks be?
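One alternative to relying on careful operators is to have the code itself refuse a bad parameter. The sketch below is hypothetical: the function name checked_divide and the limits DIVISOR_MIN and DIVISOR_MAX are invented for illustration, and nothing here reflects the Yorktown's actual software. It shows only the general pattern of range-checking an operator-entered value before it reaches a division.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limits for an operator-entered divisor; real limits
 * would come from the system's requirements. */
#define DIVISOR_MIN 1
#define DIVISOR_MAX 10000

/* Divides numerator by an operator-entered divisor. Returns false and
 * leaves *result untouched if the divisor is outside the allowed range,
 * so a zero entry is rejected instead of reaching the division. */
bool checked_divide(long numerator, long divisor, long *result)
{
    assert(result != NULL);
    if (divisor < DIVISOR_MIN || divisor > DIVISOR_MAX) {
        return false;
    }
    *result = numerator / divisor;
    return true;
}
```

A caller that gets false back can re-prompt the operator or fall back to the last known good value, rather than letting one bad entry take down the rest of the system.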
Some readers may also shudder at the thought of NT
controlling a safety-critical system. I admire the Navy's resolve to use a
commercial, off-the-shelf product, but wonder if Windows, which is the target of
every hacker's wrath, might not itself create other vulnerabilities. Will the
next war be won by the nation with the best hackers?
A plane crash in Florida, in which software did not
contribute to the disaster, was a classic demonstration of how difficult it is
to put complicated machines in the hands of less-than-perfect people. An
instrument lamp burned out. It wasn't an important problem, but both pilots
became so obsessed with tapping on the device they failed to notice that the
autopilot was off. The plane very gently descended till it crashed, killing
everyone.
People will always behave in unpredictable ways, leading to
failures and disasters with even the best system designs. As our devices grow
more complex their human engineering becomes ever more important. Yet all too
often this is neglected in our pursuit of technical solutions.
Solutions?
I'm a passionate believer in the value of firmware
standards, code inspections, and a number of other activities characteristic of
disciplined development. It's my observation that an ad hoc or a non-existent
process generally leads to crummy products. Smaller systems can succeed from the
dedication of a couple of overworked experts, but as things scale up in size
heroics become less and less successful.
Yet it seems an awful lot of us don't know about basic
software engineering rules. When talking to groups I usually ask how many
participants have (and use) rules about the maximum size of a function. A basic
rule of software engineering is to limit routines to a page or less. Yet only
rarely does anyone raise their hand. Most admit to huge blocks of code,
sometimes thousands of lines. Often this is a result of changes and revisions,
of the code evolving over the course of time. Yet it's a practice that
inevitably leads to problems.
By and large methodologies have failed. Most are too big,
too complex, or too easy to thwart and subvert. I hold great hopes for UML,
which seems to offer a way to build products that integrates hardware and
software, and that is an intrinsic part of development from design to
implementation. But UML will fail if management won't pay for quite extensive
training, or tosses the approach when panic reigns.
The FDA, FAA, and other agencies are slowly becoming aware
of the perils of poor software, and have guidelines that can improve
development. Britain's MISRA (Motor Industry Software Reliability Association)
has guidelines for the safer use of C. They feel that we need to avoid certain
constructs and use others in controlled ways to eliminate potential error
sources. I agree. Encouragingly, some tool vendors (notably Tasking) offer
compilers that can check code against the MISRA standard. This is a powerful aid
to building better code.
I doubt, though, that any methodology or set of practices
can, in the real world of schedule pressures and capricious management, lead to
perfect products. The numbers tell the story. The very best use of code
inspections, for example, will detect about 70% of the mistakes before testing
begins. (However, inspections will find those errors very cheaply.) That
suggests that testing must pick up the other 30%. Yet studies show that often
testing checks only about 50% of the software!
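A rough worked version of these numbers, under two assumptions that are not part of the figures above: the mistakes that survive inspection are spread evenly through the code, and testing catches every mistake in the half of the code it exercises. Both are optimistic, so the result is a best case.

```latex
\begin{align*}
\text{found by inspection} &= 0.70 \\
\text{found by testing}    &\le (1 - 0.70) \times 0.50 = 0.15 \\
\text{shipped to customers} &\ge 1 - 0.70 - 0.15 = 0.15
\end{align*}
```

Even on those generous terms, roughly one mistake in seven leaves the building.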
Sure, we can (and must) design better tests. We can, and
should, use code coverage tools to ensure every execution path runs. These all
lead to much better products, but not to perfection. Just because all of the code is
known to have run doesn't mean that complex interactions between inputs
won't lead to bizarre outputs. As the number of decision paths increases -
as the code grows - the difficulty of creating comprehensive tests skyrockets.
When time to market dominates development, quality
naturally slips. If low cost is the most important parameter, we can expect more
problems to slip into the product.
Software is astonishingly fragile. One wrong bit out of a
hundred million can bring a massive system down. It's amazing that things work
as well as they do!
Perhaps the nature of engineering is that perfection itself
is not really a goal. Products are as good as they have to be. Competition is a
form of evolution that often does lead to better quality. In the 70s Japanese
automakers, who had practically no US market share, started shipping cars that
were reliable and cheap. They stunned Detroit, which was used to making a shoddy
product which dealers improved and customers tolerated. Now the playing field
has leveled, but at an unprecedented level of reliability.
Perfection may elude us, but we must be on a continual
quest to find better ways to build our products. Wise developers will spend
their entire careers engaged in the search.