Is Hardware Reliable?

By Jack Ganssle

Published in Embedded Systems Programming, January 2002

Safety-critical systems that rely on firmware are inherently problematic. Outside of some academic work that hasn't made it into the real world, there's no way to "prove" that software is correct. We can do a careful design, work out failure modes, and then perform exhaustive testing, but the odds are bugs will still lurk in any sizeable chunk of code. DO-178B and other standards exist that help ensure reliability, but guarantees are elusive at best. No one really knows how to make a perfect program when sizes reach into the hundreds of thousands of lines and beyond.

Folks building avionics, medical instruments, power plant controllers, and other critical systems must respond to customers' cries for more features and capabilities, while providing ever-better reliability. The standard seems to be perfection, which is unattainable with the current state of the art.

It seems that in a capitalistic economy the edge between increasing complexity and needed correctness is mediated - not well of course - by litigation. I've no doubt that some horrible accident is coming, one that will be attributed to a flawed embedded system. Inevitably it will be followed by millions in lawyers' fees and more to the victims. This will be a wake-up call to high level corporate executives, who will suddenly understand the risks of building computer-based products.

Perhaps there's a parallel to the green movement, which has shown that the "real" cost of many products, once you factor in environmental degradation and clean-up bills, greatly exceeds the sell-price. I suspect that it won't be till the looming embedded disaster arrives, and sadly only then, that management will start to understand the true costs of building complex systems. Today too many view software quality as a nice feature if there's time, but really figure the company can always reflash the system when the usual bugs surface. As embedded products control ever more of our world, in ever more interconnected and complex ways - maybe in ways that exceed our understanding - quality will have to become the foundation upon which all other decisions are made.

Since the dawn of the embedded world there's been an implicit distrust of the software. Controlling a dangerous mechanical device? There's always a hardware-based safety interlock that will take over in case of a crash. The hardware will save the day.

We generally assume the hardware is perfect. Yes, wise developers create designs that fail into a safe state, since components do die. But mostly we believe in catastrophic problems, where an IC just stops functioning correctly, a mechanical shock breaks something, or a poor solder joint, stressed by thermal cycles, lets a chip lead pop free.

Hardware is deterministic. If it's working, well, it works. Until it breaks.

What about glitches? You know, that erratic and inexplicable thing the system does, that we can never reproduce. Perhaps after a struggle we give up. A wizened old technician taught me "if it ain't broke, I can't fix it", the very model for accepting rare non-repeatable events.

An example was the Mars Pathfinder, the spectacularly successful mission that landed a rover on the Red Planet. Once the lander was on the surface the software crashed, repeatedly, the mission saved by a watchdog timer and a beautiful recovery strategy. A priority inversion error caused the crashes. The developers saw this crash on Earth, pre-launch, twice, and attributed it to a glitch. Unable to reproduce the effect, they shipped the unit to Mars.

But of course that was a software problem. The hardware was perfect.

Years ago I learned to never allow wire-wrapped boards as prototypes. Wire-wrap circuits often exhibit squirrelly problems due to long lead lengths and crosstalk between wires. Much worse was a human problem. Engineers invariably attributed every strange and irregular behavior to the lousy wire-wrapped prototype. "Once we go to PCB everything will be fine. This is just a glitch."

Oh yeah? Are you sure? Are you making an easy assumption or working with real data? All too often the same problem surfaced in the circuit boards, necessitating yet another design respin. "It's just a glitch" is equivalent to "something is terribly wrong and we don't have a clue what's going on." All too often the culprit was a timing error, one that created observable symptoms only rarely.

When I was in the emulator business it was shocking how many target systems we evaluated exhibited erratic memory problems. Our product's built-in tests pounded hard on memory. All too often they reported erratic, unrepeatable, yet undeniably serious problems, especially with big RAM arrays. The tests were designed to create worst-case situations of heavy bus switching. Tiny amounts of power supply droop, exacerbated by the massive transitions, created occasional failures. Impedance issues were even more common: the test patterns tossed out at the highly capacitive RAMs created so much ringing the logic sometimes couldn't differentiate between zeroes and ones.
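A rough sketch of the sort of pattern test described above, written in Python purely for illustration (a real test runs on the target, usually in C or assembly, and ours did far more than this). Adjacent cells hold complementary values so that consecutive accesses toggle every data line. The names checkerboard_test and StuckBitRAM are hypothetical, and the stuck bit is only a stand-in fault: the failures we actually saw were erratic, which a simulation like this can't reproduce.

```python
class StuckBitRAM:
    """Simulated RAM with one bit stuck at zero (a hypothetical fault model)."""

    def __init__(self, size, bad_addr, bad_mask):
        self._cells = bytearray(size)
        self._bad_addr = bad_addr
        self._bad_mask = bad_mask

    def __len__(self):
        return len(self._cells)

    def __getitem__(self, addr):
        return self._cells[addr]

    def __setitem__(self, addr, value):
        if addr == self._bad_addr:
            value &= ~self._bad_mask & 0xFF  # the faulty bit never takes a one
        self._cells[addr] = value


def checkerboard_test(mem):
    """Write alternating 0x55/0xAA, verify, then repeat with the pattern inverted.

    Returns a sorted list of addresses that read back wrong.
    """
    bad = set()
    for first in (0x55, 0xAA):
        second = first ^ 0xFF
        # Fill pass: neighbors are complements, so every data line toggles.
        for addr in range(len(mem)):
            mem[addr] = first if addr % 2 == 0 else second
        # Verify pass: compare each cell against what was written.
        for addr in range(len(mem)):
            expected = first if addr % 2 == 0 else second
            if mem[addr] != expected:
                bad.add(addr)
    return sorted(bad)


if __name__ == "__main__":
    print(checkerboard_test(bytearray(256)))             # healthy memory
    print(checkerboard_test(StuckBitRAM(256, 10, 0x01)))  # bit 0 stuck at address 10
```

Running both the pattern and its inverse matters: a bit stuck at zero is invisible on any pass that happens to write a zero there.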

Wise managers won't tell the engineers they distrust their judgment. Wiser ones actually trust their people. The whole point of hiring is to delegate both work and responsibility to another person. But I do think it makes sense to institute a "no glitch" policy. Until a prototype works reliably, or until the cause of a problem is well understood, we keep improving the system design in the pursuit of a high quality product.

Unreliable Chips

So sometimes the hardware can be less than perfect, though in these examples the problems all stem from design flaws. More careful design, better reviews and additional testing should bring such systems towards perfection. It's the software that's going to be the problem, since the hardware is but a platform that runs those instructions faithfully, no matter how absurd they may be.

The times may be changing. According to Intel, any assumption that processors just do what they are told is now wrong. Of course the latest CPUs will run your program - and pretty darn fast, at that. But once in a while you can expect a bit in the instruction stream to flip at random. No doubt something awful will result. Bummer, that, but it's the price of using high technology. We can have high speed, just not predictability.

Scared yet? I was, when reading the September 3, 2001 issue of EE Times about Intel's McKinley processor. One of its neat features is an L3 cache located on-board the CPU. This about doubles some performance figures.

The cache is big and the geometry of the part quite tiny. It's fabricated with 0.18 micron line widths, which is astonishingly small: only about a quarter of the wavelength of red light. When light wavelengths look big, surely we've fallen through the looking glass into the realm of the fantastic.

In the article Intel acknowledged that this small size coupled with the large surface area of the four megabyte L3 cache creates a serious target for incoming cosmic rays. There's very little energy difference in such small parts between a zero and a one, so occasionally the cache may get struck by an incoming particle of just the right energy to flip a bit. What happens next is hard to predict. Perhaps the processor will execute a cache-miss, causing a flush and reload. But maybe not. A cache with wrong data, even when it's just a single bit in error, is a disaster waiting to happen.

How bad is the problem? Apparently they've looked at this statistically, and feel that odds are the average user will see less than one bit flip per thousand years. That's a very long time. I can't help but wonder if Intel can justify a 1000 year MTBF partly because no one expects their Windows-based PC to run for more than a day or two between crashes anyway.

Can embedded systems live with a 1000 year MTBF? Perhaps. But high volume production means that, worldwide, cache bits will flip rather often. Build a million of a widget with these reliability parameters, and statistics suggest 1000 bit flips per year, divided among all of the deployed products. That is, maybe 1000 crashes/year due just to this one infrequently encountered hardware flaw.

Suppose the airlines retrofit all commercial planes with a super-gee-whiz new kind of avionics that includes just a single McKinley CPU. I read somewhere that, on average, there are 7000 airliners flying at any time. Will this cause 7 crashes per year? Or maybe even more, considering that cosmic radiation is far more intense at high altitudes.
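The fleet arithmetic in the last two paragraphs is simple enough to sketch. This assumes a constant failure rate, independent units, and parts that are powered all the time; none of that comes from Intel's figures, and the function names are made up for the example. The second function adds one assumption more, a Poisson model, to estimate the chance of seeing at least one event in a year.

```python
import math


def expected_events_per_year(units_in_service, mtbf_years):
    """Expected events per year across a fleet, assuming a constant failure rate."""
    return units_in_service / mtbf_years


def chance_of_at_least_one(units_in_service, mtbf_years):
    """Probability of one or more events in a year, assuming Poisson arrivals."""
    rate = expected_events_per_year(units_in_service, mtbf_years)
    return 1.0 - math.exp(-rate)


if __name__ == "__main__":
    print(expected_events_per_year(1_000_000, 1000))  # a million widgets
    print(expected_events_per_year(7000, 1000))       # 7000 airliners aloft
    print(chance_of_at_least_one(7000, 1000))
```

A single unit almost never fails, yet across the fleet an event somewhere each year is close to a sure thing.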

How ironic that a huge plane might be downed by a collision with a single elementary particle!

This is not a rant against Intel or the McKinley part. I think the company has courage to admit there's an issue. The problem stems from the evolution of our technology. All vendors pushing into the deep submicron will have to grapple with erratic hardware as transistor sizes approach quantum limits. The PC industry generally leads embedded by a few years as they pioneer high density parts, but we do follow not far behind. We embedded developers can expect to see these sorts of problems in our systems before too long.

If 180 nanometer line widths are subject to cosmic rays then what does the future hold? Most roadmaps suggest that leading-edge parts will use 50 nanometer geometries within 10 years. To put this in perspective, a hydrogen atom is about an angstrom across. In a decade our line widths will span a mere 500 atoms. It's hard to conceive of building anything so small, especially when a single chip will have a billion transistors made from these tiny building blocks.

Intel predicts these near-future parts will run at 20 GHz. Even CISC parts are evolving towards the RISC ideal of one clock per instruction, so that processor will execute an instruction in 50 picoseconds! So not only are the transistors minuscule, they operate at speeds that give RF designers fits.
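For anyone who wants to check the scale figures, here is the arithmetic as a short sketch. It takes the one-angstrom hydrogen atom from the text at face value and works in picometers to keep the numbers integral; the function names are just for illustration.

```python
HYDROGEN_DIAMETER_PM = 100  # about one angstrom, as stated in the text


def atoms_across(line_width_nm):
    """How many hydrogen-sized atoms span a line of the given width."""
    return (line_width_nm * 1000) // HYDROGEN_DIAMETER_PM


def clock_period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000 / freq_ghz


if __name__ == "__main__":
    print(atoms_across(180))    # today's 0.18 micron process
    print(atoms_across(50))     # the 50 nanometer roadmap figure
    print(clock_period_ps(20))  # a 20 GHz clock
```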

Heat, too, conspires against chip designers. The McKinley reportedly dissipates 130 watts. A few of these CPUs would make a fine toaster, and no doubt that much embedded intelligence will ensure the perfect piece of toast, every time.

I can't find a McKinley datasheet, but the previous generation Itanium shows a max current requirement (processor plus cache) of about 90 amps. This is not a misprint: 90 amps. That's not much less than a Honda's engine cranking current. It will drain a DieHard in half an hour.

The part comes with quite an elaborate thermal management subsystem that automatically degrades performance as temperature rises, to avoid self-destruction.

Heat has always been the enemy of electronics. High temperatures destroy semiconductors. Thermal stresses can cause leads to pop free of circuit boards. Connectors unseat themselves, PCBs change shape, and all sorts of mechanical issues stemming from high temps reduce system reliability. I can't help but wonder if powering systems up and down, with the resulting heating/cooling cycles, will create failures. Once we thought semiconductors lasted forever. Will this be true in the future?

It's really quite marvelous that a $1000 PC running at a GHz or so works as well as it does. Think about the Sagan-like billions of bits - maybe closer to hundreds of billions - running around your machine each second. Even when the machine is "idle" (whatever that means) it's still slinging data at rates undreamt of a decade ago. Yet if just one of those bits flips, or is misinterpreted, or goes unstable for a nanosecond or two, the machine will crash.

Our traditional assumption of reliable hardware will not be valid in the high density, very fast, and quite hot world that's just around the corner. Intel's disclosure of cosmic ray vulnerabilities may actually be a wake-up call to the computer community that it's time to rethink our designs.

Perhaps the vendors will counter with redundant hardware. In some cases this makes a lot of sense. It's pretty easy - though expensive - to widen memory arrays and add error correction codes that detect and correct one and two bit errors. Much more difficult is creating reliable CPU redundancies for a reasonable price. Today the cache is at risk, but in the 50 nanometer 1 billion transistor processors of the near future, surely much of the part's internal logic will be susceptible to random bit flips as well. The space shuttle manages hardware failures using five "voting" computers, so we know it's possible to create such redundancy. But the costs are astronomical. The shuttle is an economic anomaly funded by taxpayer money - there's plenty more cash where that came from. Most embedded systems must adhere to a much less generous financial model.
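To make the error correction idea concrete, here is a minimal sketch of one classic scheme, an extended Hamming code over a 4-bit value. The choice of code is my assumption, not anything a particular memory vendor has said they use, and real memory arrays protect much wider words. This particular code corrects any single flipped bit and detects, but cannot correct, a double flip.

```python
def encode(nibble):
    """Encode a 4-bit value as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d     # data sits at non-power-of-two positions
    code[1] = code[3] ^ code[5] ^ code[7]      # check bit for positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]      # check bit for positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]      # check bit for positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2                # overall parity: total count of ones is even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                    # XOR of the positions holding a one
    odd_parity = sum(code) % 2                 # 1 means an odd number of bits flipped
    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        code[syndrome] ^= 1                    # single flip; syndrome 0 is the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"           # even number of flips, nonzero syndrome
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                               # a cosmic ray strikes
    print(decode(word))
```

Note the cost: eight stored bits for every four of data. Wider words need proportionally fewer check bits, which is why the overhead is tolerable in big memory arrays and painful almost everywhere else.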

Nothing New

But the hardware has always been unreliable. Older engineers will remember the transition to 16K (16384 bits, that is) DRAMs in the 70s. New process techniques produced memories subject to cosmic rays. Here we are a quarter century later with similar problems! Most of us at the time fought with flaky DRAM arrays, blaming our designs when the chips themselves were at fault. New kinds of plastic packages were invented, until the vendors discovered that some plastics generated alpha particles, which also flipped bits. After a year or two the problems were resolved.

With the dawn of the PC age DRAM demands skyrocketed. The parts were much less perfect than those we use today, so all PCs used 9 bit wide memory, the ninth bit being devoted to parity. Today's reliable parts have mostly eliminated this - when was the last time you saw a parity error on a PC?
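A ninth-bit parity scheme is easy to sketch. Even parity is used here only for illustration (I'm not claiming that is the polarity the PC designers chose), and the function names are hypothetical.

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 if the byte has an odd number of ones."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte):
    """Return the 9-bit word that would be written: parity in bit 8, data below."""
    return (byte & 0xFF) | (parity_bit(byte) << 8)


def check(word):
    """True if the stored parity bit still matches the data bits."""
    return parity_bit(word & 0xFF) == ((word >> 8) & 1)


if __name__ == "__main__":
    word = store(0xA5)
    print(check(word))                # intact word
    print(check(word ^ 0x04))         # one bit flipped
    print(check(word ^ 0x04 ^ 0x40))  # two bits flipped
```

Parity only detects; it can't say which bit went bad, and a double flip slips through unnoticed. All the machine can do on a parity error is stop.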

In the embedded world we design products that might be used for decades. Yet some components won't last that long. Many vendors only guarantee that their EPROMs, EEROMs and Flash devices will retain programmed data for ten years (some, like AMD, offer parts that remember much longer). After a decade, the program might flit away like a puff of smoke in the wind. Many vendors don't spec a data retention time, leaving designers to speculate just when the firmware will become even less than the ghost in the machine.

So we've had some issues with hardware reliability already, but have mostly dealt with the problem by ignoring it, assuming we'll be safely employed elsewhere when the day of reckoning comes.

I suspect we're going to see new kinds of firmware development techniques, designed to manage erratic hardware behavior. One obvious approach is a memory management unit that wraps protected areas around each task. A crash brings just a single task down, though what action the software takes at that point becomes very problematic. Perhaps future reliable systems will need a failure mode analysis that includes recovery from such crashes.

Firmware complexity will soar.