Is Hardware Reliable
Safety-critical systems that rely on firmware are
inherently problematic. Outside of some academic work that hasn't made it into
the real world, there's no way to "prove" that software is correct. We can
do a careful design, work out failure modes, and then perform exhaustive
testing, but the odds are bugs will still lurk in any sizeable chunk of code.
DO-178B and other standards exist that help ensure reliability, but guarantees
are elusive at best. No one really knows how to make a perfect program when
sizes reach into the hundreds of thousands of lines and beyond.
Folks building avionics, medical instruments, power plant
controllers, and other critical systems must respond to customers' cries for
more features and capabilities, while providing ever-better reliability. The
standard seems to be perfection, which is unattainable with the current state of
the art.
It seems that in a capitalistic economy the edge between
increasing complexity and needed correctness is mediated - not well of course
- by litigation. I've no doubt that some horrible accident is coming, one
that will be attributed to a flawed embedded system. Inevitably it will be
followed by millions in lawyers' fees and more to the victims. This will be a
wake-up call to high level corporate executives, who will suddenly understand
the risks of building computer-based products.
Perhaps there's a parallel to the green movement, which
has shown that the "real" cost of many products, once you factor in
environmental degradation and clean-up bills, greatly exceeds the sell-price. I
suspect that it won't be till the looming embedded disaster arrives, and sadly
only then, that management will start to understand the true costs of building
complex systems. Today too many view software quality as a nice feature if
there's time, but really figure the company can always reflash the system when
the usual bugs surface. As embedded products control ever more of our world, in
ever more interconnected and complex ways - maybe in ways that exceed our
understanding - quality will have to become the foundation upon which all
other decisions are made.
Since the dawn of the embedded world there's been an
implicit distrust of the software. Controlling a dangerous mechanical device?
There's always a hardware-based safety interlock that will take over in case
of a crash. The hardware will save the day.
We generally assume the hardware is perfect. Yes, wise
developers create designs that fail
into a safe state, since components do die. But mostly we believe in
catastrophic problems, where an IC just stops functioning correctly, a
mechanical shock breaks something, or a poor solder joint, stressed by thermal
cycles, lets a chip lead pop free.
Hardware is deterministic. If it's working, well, it
works. Until it breaks.
What about glitches? You know, that erratic and
inexplicable thing the system does, that we can never reproduce. Perhaps after a
struggle we give up. A wizened old technician taught me "if it ain't broke,
I can't fix it", the very model for accepting rare non-repeatable events.
An example was the Mars Pathfinder, the spectacularly
successful mission that landed a rover on the Red Planet. After the lander
touched down the software crashed, repeatedly, the mission saved by a watchdog
timer and a beautiful recovery strategy. A priority inversion error caused the
crashes. The developers saw this crash on Earth, pre-launch, twice, and
attributed it to a glitch. Unable to reproduce the effect, they shipped the unit
to Mars.
But of course that was a software problem. The hardware was
perfect.
Years ago I learned to never allow wire-wrapped boards in
prototypes. Wire-wrap circuits often exhibit squirrelly
problems due to long lead lengths and crosstalk between wires. Much worse was a
human problem. Engineers invariably attributed every strange and irregular
behavior to the lousy wire-wrapped prototype. "Once we go to PCB everything
will be fine. This is just a glitch."
Oh yeah? Are you sure? Are you making an easy assumption or
working with real data? All too often the same problem surfaced in the circuit
boards, necessitating yet another design respin. "It's just a glitch" is
equivalent to "something is terribly wrong and we don't have a clue what's
going on." All too often the
culprit was a timing error, one that created observable symptoms only rarely.
When I was in the emulator business it was shocking how
many target systems we evaluated exhibited erratic memory problems. Our
product's built-in tests pounded hard on memory. All too often they reported
erratic, unrepeatable, yet undeniably serious problems, especially with big RAM
arrays. The tests were designed to create worst-case situations of heavy bus
switching. Tiny amounts of power supply droop, exacerbated by the massive
transitions, created occasional failures. Impedance issues were even more
common: the test patterns tossed out at the highly capacitive RAMs created so
much ringing the logic sometimes couldn't differentiate between zeroes and
ones.
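To see why such patterns are hard on a bus, count how many data lines change state from one write to the next. The sketch below is only an illustration, not the emulator's actual test: the 16-bit bus width, the function names, and the two sample patterns are all assumptions.

```python
def toggles(prev: int, curr: int) -> int:
    """Number of bus lines that change state between two successive words."""
    return bin(prev ^ curr).count("1")


def total_toggles(words) -> int:
    """Sum of line transitions across a sequence of bus writes."""
    return sum(toggles(a, b) for a, b in zip(words, words[1:]))


# Assumed 16-bit data bus. Alternating checkerboard words flip
# every one of the 16 lines on every write.
checkerboard = [0x5555, 0xAAAA] * 4

# A walking-one pattern changes only two lines per write.
walking_one = [1 << i for i in range(8)]

print(total_toggles(checkerboard), total_toggles(walking_one))
```

Over the same number of writes the checkerboard sequence produces eight times as many transitions as the walking one, which is the kind of simultaneous switching that exposes supply droop and ringing.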
Wise managers won't tell the engineers they distrust
their judgment. Wiser ones actually trust their people. The whole point of
hiring is to delegate both work and responsibility to another person. But I do
think it makes sense to institute a "no glitch" policy. Until a prototype
works reliably, or until the cause of a problem is well understood, we keep
improving the system design in the pursuit of a high quality product.
Unreliable Chips
So sometimes the hardware can be less than perfect, though
in these examples the problems all stem from design flaws. More careful design,
better reviews and additional testing should bring such systems towards
perfection. It's the software that's going to be the problem, since the
hardware is but a platform that runs those instructions faithfully, no matter
how absurd they may be.
The times may be changing. According to Intel, any
assumption that processors just do what they are told is now wrong. Of course
the latest CPUs will run your program - and pretty darn fast, at that. But
once in a while you can expect a bit in the instruction stream to flip at
random. No doubt something awful will result. Bummer, that, but it's the price
of using high technology. We can have high speed, just not predictability.
Scared yet? I was, when reading the September 3, 2001 issue
of EE Times about Intel's McKinley processor. One of its neat features is
an L3 cache located on-board the CPU. This about doubles some performance
figures.
The cache is big and the geometry of the part quite tiny.
It's fabricated with 0.18 micron line widths, which is astonishingly small but
no longer bleeding edge. That's 180 nanometers, or something like a third of
the wavelength of red light. When light wavelengths look big, surely we've
fallen through the looking glass into the realm of the fantastic.
In the article Intel acknowledged that this small size
coupled with the large surface area of the four megabyte L3 cache creates a
serious target for incoming cosmic rays. There's very little energy difference
in such small parts between a zero and a one, so occasionally the cache may get
struck by an incoming particle of just the right energy to flip a bit. What
happens next is hard to predict. Perhaps the processor will execute a
cache-miss, causing a flush and reload. But maybe not. A cache with wrong data,
even when it's just a single bit in error, is a disaster waiting to happen.
How bad is the problem? Apparently they've looked at this
statistically, and feel that odds are the average user will see fewer than one
bit flip per thousand years. That's a very long time. I can't help but
wonder if Intel can justify a 1000-year MTBF partly because no one expects their
Windows-based PC to run for more than a day or two between crashes anyway.
Can embedded systems live with a 1000-year MTBF? Perhaps.
But in high-volume production, cache bits will flip rather often somewhere in
the world. Build a million of a widget with these reliability
parameters, and statistics suggest 1000 bit flips per year, divided among all of
the deployed products. That is, maybe 1000 crashes/year due just to this one
infrequently encountered hardware flaw.
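The arithmetic behind that estimate is simple enough to sketch. The function names are hypothetical, and the model assumes every unit runs continuously with a constant failure rate, which real products may not.

```python
import math


def fleet_events_per_year(units: int, mtbf_years: float) -> float:
    """Expected events per year across a fleet of continuously running units."""
    return units / mtbf_years


def prob_at_least_one(years: float, mtbf_years: float) -> float:
    """Chance that one unit sees at least one event within `years`.

    Assumes a constant (exponential) failure rate.
    """
    return 1.0 - math.exp(-years / mtbf_years)


print(fleet_events_per_year(1_000_000, 1000))   # the million-widget case
print(prob_at_least_one(10, 1000))              # one unit over a decade
```

Under these assumptions a single unit has only about a one percent chance of a flip in ten years, yet the fleet as a whole sees one roughly every nine hours.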
Suppose the airlines retrofit all commercial planes with a
super-gee-whiz new kind of avionics that includes just a single McKinley CPU. I
read somewhere that, on average, there are 7000 airliners flying at any time.
Will this cause 7 crashes per year? Or maybe even more, considering that cosmic
radiation is more intense at high altitudes.
How ironic that a huge plane might be downed by a collision
with a single elementary particle!
This is not a rant against Intel or the McKinley part. I
think the company has courage to admit there's an issue. The problem stems
from the evolution of our technology. All vendors pushing into the deep
submicron will have to grapple with erratic hardware as transistor sizes
approach quantum limits. The PC industry generally leads embedded by a few years
as they pioneer high density parts, but we do follow not far behind. We embedded
developers can expect to see these sorts of problems in our systems before too
long.
If 180 nanometer line widths are subject to cosmic rays
then what does the future hold? Most roadmaps suggest that leading-edge parts
will use 50 nanometer geometries within 10 years. To put this in perspective, a
hydrogen atom is about an angstrom across. In a decade our line widths will span
a mere 500 atoms. It's hard to conceive of building anything so small,
especially when a single chip will have a billion transistors made from these
tiny building blocks.
Intel predicts these near-future parts will run at 20 GHz.
Even CISC parts are evolving towards the RISC ideal of one clock per
instruction, so that processor will execute an instruction in 50 picoseconds! So
not only are the transistors minuscule, they operate at speeds that give RF
designers fits.
Heat, too, conspires against chip designers. The McKinley
reportedly dissipates 130 watts. A few of these CPUs would make a fine toaster,
and no doubt that much embedded intelligence will ensure the perfect piece of
toast, every time.
I can't find a McKinley datasheet, but the previous
generation Itanium shows a max current requirement (processor plus cache) of
about 90 amps. This is not a misprint: 90 amps.
That's not much less than a Honda's engine-cranking current. It will drain
a DieHard in half an hour.
The part comes with quite an elaborate thermal management
subsystem that automatically degrades performance as temperature rises, to avoid
self-destruction.
Heat has always been the enemy of electronics. High
temperatures destroy semiconductors. Thermal stresses can cause leads to pop
free of circuit boards. Connectors unseat themselves, PCBs change shape, and all
sorts of mechanical issues stemming from high temps reduce system reliability. I
can't help but wonder if powering systems up and down, with the resulting
heating/cooling cycles, will create failures. Once we thought semiconductors
lasted forever. Will this be true in the future?
It's really quite marvelous that a $1000 PC running at a
GHz or so works as well as it does. Think about the Sagan-like billions
of bits - maybe closer to hundreds of billions - running around your machine
each second. Even when the machine is "idle" (whatever that means) it's
still slinging data at rates undreamt of a decade ago. Yet if just one of those
bits flips, or is misinterpreted, or goes unstable for a nanosecond or two, the
machine will crash.
Our traditional assumption of reliable hardware will not be
valid in the high density, very fast, and quite hot world that's just around
the corner. Intel's disclosure of cosmic ray vulnerabilities may actually be a
wake-up call to the computer community that it's time to rethink our designs.
Perhaps the vendors will counter with redundant hardware.
In some cases this makes a lot of sense. It's pretty easy - though expensive
- to widen memory arrays and add error correction codes that correct
single-bit errors and detect two-bit errors. Much more difficult is
creating reliable CPU redundancies for a reasonable price. Today the
cache is at risk, but in the 50 nanometer 1 billion transistor processors of the
near future, surely much of the part's internal logic will be susceptible to
random bit flips as well. The space shuttle manages hardware failures using five
"voting" computers, so we know it's possible to create such redundancy.
But the costs are astronomical. The shuttle is an economic anomaly funded by
taxpayer money - there's plenty more cash where that came from. Most
embedded systems must adhere to a much less generous financial model.
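Both forms of redundancy can be sketched in a few lines. What follows is a toy model under stated assumptions, not any vendor's design: the error correction is an extended Hamming code over a 4-bit nibble (real memory arrays use much wider words), the voter takes three inputs rather than the shuttle's five computers, and every function name is made up for illustration.

```python
def secded_encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit extended Hamming codeword."""
    bits = [0] * 8                       # list index = bit position
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]    # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]    # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]    # covers positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) & 1              # overall parity made even
    return sum(b << i for i, b in enumerate(bits))


def secded_decode(word: int):
    """Return (nibble, status): 'ok', 'corrected', or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos              # XOR of the positions holding a 1
    odd = sum(bits) & 1                  # 1 if overall parity is now odd
    if syndrome == 0 and not odd:
        status = "ok"
    elif odd:
        bits[syndrome] ^= 1              # single flip; syndrome names the bit
        status = "corrected"
    else:
        return None, "uncorrectable"     # two flips: detected, not fixed
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return nibble, status


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise two-of-three vote across three redundant results."""
    return (a & b) | (a & c) | (b & c)
```

The cost shows up even in this toy: four data bits need four check bits, and the voter needs three copies of whatever computes its inputs.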
Nothing New
But the hardware has always been unreliable. Older
engineers will remember the transition to 16K (16384 bits, that is) DRAMs in the
70s. New process techniques produced memories subject to cosmic rays. Here we
are a quarter century later with similar problems! Most of us at the time fought
with flaky DRAM arrays, blaming our designs when the chips themselves were at
fault. New kinds of plastic packages were invented, until the vendors discovered
that some plastics generated alpha particles, which also flipped bits. After a
year or two the problems were resolved.
With the dawn of the PC age DRAM demands skyrocketed. The
parts were much less perfect than those we use today, so all PCs used 9 bit wide
memory, the 9th bit being devoted to parity. Today's reliable parts
have mostly eliminated this - when was the last time you saw a parity error on a
PC?
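For readers who never met a parity error, here is roughly what that ninth bit buys. This is a minimal sketch with hypothetical function names, and it assumes even parity for simplicity; actual machines may have used the opposite sense.

```python
def ones(value: int) -> int:
    """Count the 1 bits in a non-negative integer."""
    return bin(value).count("1")


def add_parity(byte: int) -> int:
    """Return a 9-bit word: the data byte with a parity bit in bit 8.

    Even parity is assumed: the total number of ones comes out even.
    """
    byte &= 0xFF
    return ((ones(byte) & 1) << 8) | byte


def parity_ok(word9: int) -> bool:
    """True if the 9-bit word still has an even number of ones."""
    return ones(word9 & 0x1FF) % 2 == 0


stored = add_parity(0x07)
print(parity_ok(stored))            # untouched word
print(parity_ok(stored ^ 0x10))     # one bit flipped
print(parity_ok(stored ^ 0x11))     # two bits flipped
```

Parity catches any single flipped bit but cannot say which one it was, and a double flip passes unnoticed.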
In the embedded world we design products that might be used
for decades. Yet some components won't last that long. Many vendors only
guarantee that their EPROMs, EEPROMs and Flash devices will retain programmed
data for ten years (some, like AMD, offer parts that remember much longer).
After a decade, the program might flit away like a puff of smoke in the wind.
Many vendors don't spec a data retention time, leaving designers to speculate
just when the firmware will become even less than the ghost in the machine.
So we've had some issues with hardware reliability
already, but have mostly dealt with the problem by ignoring it, assuming we'll
be safely employed elsewhere when the day of reckoning comes.
I suspect we're going to see new kinds of firmware
development techniques, designed to manage erratic hardware behavior. One
obvious approach is a memory management unit that wraps protected areas around
each task. A crash brings just a single task down, though what action the
software takes at that point becomes very problematic. Perhaps future reliable
systems will need a failure mode analysis that includes recovery from such
crashes.
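One possible recovery policy, and only one of many, is to restart the failed task a limited number of times and then fall back to a safe state. The sketch below stands in for the real thing: a Python exception plays the part of a fault trapped by the memory management unit, and TaskFault, supervise and enter_safe_state are hypothetical names.

```python
class TaskFault(Exception):
    """Stand-in for a fault trapped by the memory management unit."""


def supervise(task, max_restarts=3, enter_safe_state=lambda: None):
    """Run task(); after a fault, restart it up to max_restarts times.

    Returns the task's result, or None once the restart budget is spent
    and the system has been put into its safe state.
    """
    for _ in range(max_restarts + 1):
        try:
            return task()
        except TaskFault:
            continue                 # only this task died; try it again
    enter_safe_state()               # restarts exhausted: stop trying
    return None
```

Even this small policy raises the questions a failure mode analysis would have to answer: how many restarts are tolerable, and what the safe state actually is for the device being controlled.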
Firmware complexity will soar.